ddRADSeq Demultiplexing

Setting up the Ecosystem for running a de novo assembly for SNP identification.

Rodney Dyer

04 Oct 2022 • 5 min read

Your sequencing facility will let you provide to you a ftp link to download compressed fasta files (ending in .fa.gz in my case). We are getting these data in raw form and will do own own demultiplexing and creating a de novo assembly from which to call SNPs.

Setting up SSH

One pain point in doing this is that you’ll have to do all the work on a remote server and logging in each time can lead to a pain point (because you are using unique, long, and complicated passwords for each of your accounts, right?).

To overcome this, set up your public/private SSH key pair. First, check if you have one setup.

rodney@Rodneys-Mac-Studio ~ % ls .ssh/id*.pub .ssh/id_rsa.pub

If there is nothing there, then you need to generate a new public key.rodney@Rodneys-Mac-Studio ~ % ssh-keygen -t rsa -b 4096 -C "your_email@domain.com"

and follow the online instructions. If you do use a passphrase MAKE SURE you have it written down somewhere (like in your password manager).

After this, you should see both the public and private versions in your local repository.

rodney@Rodneys-Mac-Studio ~ % ls .ssh/id* .ssh/id_rsa .ssh/id_rsa.pub

Now, what you need to do is to copy your identity over to the new server.

ssh-copy-id remoteUserID@remote.server.com

You should get some message like:

remoteUserID@remote.server.com's password: Number of key(s) added: 1 Now try logging into the machine, with: "ssh 'remoteUserID@remote.server.com'" and check to make sure that only the key(s) you wanted were added.

Now you should be able to log into the remote server without needing to copy over a password all the time.

One last thing that I find to help out. If your remote server has some long name to it, you can make a shortcut by adding an entry to the /etc/hosts file.

sudo vim /etc/hosts

To this file, add a line that has the full name of the server, a tab, and then a nickname you give it.

rodney@Rodneys-Mac-Studio ~ % cat /etc/hosts  
##  
# Host Database  
#  
# localhost is used to configure the loopback interface  
# when the system is booting. Do not change this entry.  
##  
127.0.0.1 localhost  
255.255.255.255 broadcasthost  
::1 localhost  
my.server.with.long.name nickname

Now, you should be able to log into your remote server as:

ssh user@nickname

much easier.

Getting STACKS

Stacks is a set of programs that are used to assemble sequence data. I use it commonly for demultiplexing. Go grab the link to the latest version here.

When I did it for this document it was version 2.62. Grab it from your server. I typically put all the source code into a src folder.

$ mkdir src $ cd src $ wget http://catchenlab.life.illinois.edu/stacks/source/stacks-2.62.tar.gz $ tar zxvf stacks-2.62.tar.gz $ cd stacks-2.62

Then you need to install it.

On my current server, they use a Module system for loading various development environments. You can find the documentation here. To load up the most recent version of the gcc compiler set, you enter.

$ module load gcc  
$ module list  
Currently Loaded Modules:  
  1) gcc/10.2.1

Then you can configure the compiliation and environment.

$ ./configure --prefix=/home/user/

And if all the stuff is there, you should get a long list of stuff ending with:

checking that generated files are newer than configure... done configure: creating ./config.status  
config.status: creating Makefile  
config.status: creating htslib/Makefile  
config.status: creating config.h  
config.status: executing depfiles commands

Now, compile the components (and go grab a cup of coffee)

If it died for some reason, you’ll have to do some spelunking and figure out why it died (probably due to a missing dependency) and then grab the components that you were missing.

After that, you can install it (to your local directories).

$ make install

And everything should be good.

Screen

Doing anything with this large amount of data will take time. It is good to remember that the vast majority of what you’ll be working with is actually stuff you are going to throw away in the end anyways. That being said, our first main tool will be screen. This is a program that allows you to create a "sub shell" for a process, do something that is going to take a long time, log out of the shell and continue on your daily routine. Then, at various times in the near (or distant) future, you can log back in and see how things are going.

You can run screen on your machine or on the remote server where things are happening. I’m going to run it on the other end.

$ screen --help

To start a new screen session, invoke it as:

$ screen

And you will be entered into a new session. To exit, hit CTRL-A D and you will be pooped out of the screen session.

To reattach to an existing session, you can list the current active ones

$ screen -list  
There are screens on:  
  12892.pts-10.server (Detached)  
  12775.pts-10.server (Detached)  
  12658.pts-10.server (Detached)  
3 Sockets in /var/run/screen/S-user.

Then reattach to one, you need to specify it by the id.

$ screen -r 12892

To exit a session you are in (and no longer retain it), type

Process Radtags

Last time I tried this, I ran it with this configuration on a single linux instance. I am using a single barcode on these tags that was cut with EcoRI. The -r -e options are recommended by dDoscent.

$ process_radtags -p ./raw/ -o ./samples/run_13_2/ -b ./barcodes/all/barcodes_all.csv --inline-null --disable-rad-check -i gzfastq -r -e ecoRI

Now I am on a cluster and we need to submit our jobs through SLURM. Here is a config file to run a batch run.

#!/bin/bash  
#SBATCH --time=72:00:00  
#SBATCH --chdir=./  
#SBATCH --mem=16G  
#SBATCH --mail-user=rjdyer@vcu.edu  
#SBATCH --mail-type=BEGIN  
#SBATCH --mail-type=END  
#SBATCH --mail-type=FAIL  
#SBATCH --mail-type=REQUEUE  
#SBATCH --mail-type=ALL #SBATCH -J "radtags"  
#SBATCH --ntasks=1

START=$(date +%s.%N)  
echo "Starting process radtags"

# for the run, you need to change the lane process_radtags -p ./raw/lane3/ -o ./samples/land3/ -b ./barcodes/all/barcodes_all.csv --inline-null --disable-rad-check -i gzfastq -r -D -e ecoRI

END=$(date +%s.%N)  
DIFF=$(echo "($END - $START)/60" | bc)  
echo "Processing radtags took ${DIFF} minutes to complete."

To run it, I saved the script as 01-process_radtags_slurm.sh, made it executable as

$ chmod +x 01-process_radtags_slurm.sh

and ran it as:

$ sbatch ./01-process_radtags_slurm.sh

Actually I was not able to get sbatch to run, so I just did a quick bash srun job.

$ srun -A rjdyer --nodes=1 --time=48:00:00 --mem=16G --pty /bin/bash $ process_radtags -p ./raw/lane3/ -o ./samples/land3/ -b ./barcodes/all/barcodes_all.csv --inline-null --disable-rad-check -i gzfastq -r -e ecoRI

This was a bit of a more direct approach, just invoking a distributed session for bash and then running it that way. Seemed to work fine.

Some pages of interest

Cover Photo by Sangharsh Lohakare on Unsplash

Originally published at https://dyerlab.org on October 5, 2022.