If one were to try, hypothetically that is, to create a new assembly from 1500 individuals run through a NovaSeq, one may find that one's computational resources are sorely inadequate!
So, the first pass through these data focuses on a subset of the sequences. I'm starting with a 10% random sample of sequences from each of the 1500 individuals. At first, I was just going to whip up a quick R script and let it run in the background, but then I ran into a few problems. First, on my desktop, where I write code against a subset of the data set, it appears that the zcat that ships with macOS does not handle files with a .gz extension. I kept getting odd errors and tried updating through brew in case it was a bad version of zcat, but no. That is how it is intended to work. Stooopid.
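For anyone hitting the same wall, here is a minimal workaround sketch. As far as I can tell, the stock macOS zcat goes looking for a .Z suffix, whereas gunzip -c reads .gz files on both macOS and Linux. The function names read_gz and first_record are just ones I made up for illustration.

```shell
#!/bin/bash
# Stream a gzipped file to stdout without relying on zcat.
# gunzip -c (same as gzip -dc) accepts .gz file names on macOS and Linux.
read_gz() {
    gunzip -c "$1"
}

# Print the first FASTQ record (4 lines) of a gzipped file.
# (first_record is a hypothetical helper, not part of any tool.)
first_record() {
    read_gz "$1" | head -n 4
}
```

Calling first_record on one of the .fq.gz files should print a single read, which is a quick way to confirm the file is readable before doing anything heavier.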
So, I just wrote a quick BASH script instead. Here it is.
#!/bin/bash
# Draw a 10% random sample of reads from each gzipped FASTQ file
# and append it to one pooled file.
for FILE in *.fq.gz; do
    seqtk sample "$FILE" 0.10 >> POPULATION_01.fq
    echo "$FILE"
done
I found this to be helpful.
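A quick sanity check on the pooled file might look something like the sketch below. It counts reads (4 lines per FASTQ record) in the pooled output and in the gzipped inputs, then prints the ratio, which should land close to 0.10. The function names are hypothetical, and it assumes plain four-line FASTQ records with no wrapped sequence lines.

```shell
#!/bin/bash
# Number of FASTQ records in a gzipped file (assumes 4 lines per record).
reads_in_gz() {
    gunzip -c "$1" | awk 'END { print int(NR / 4) }'
}

# Number of FASTQ records in an uncompressed file.
reads_in_fq() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print pooled reads divided by total input reads, to 3 decimals.
# Usage: sampled_fraction POOLED.fq file1.fq.gz file2.fq.gz ...
sampled_fraction() {
    local pooled=$1
    shift
    local total=0
    local f
    for f in "$@"; do
        total=$(( total + $(reads_in_gz "$f") ))
    done
    if [ "$total" -eq 0 ]; then
        echo "no input reads found" >&2
        return 1
    fi
    awk -v s="$(reads_in_fq "$pooled")" -v t="$total" \
        'BEGIN { printf "%.3f\n", s / t }'
}
```

Something like sampled_fraction POPULATION_01.fq *.fq.gz would then report the realized sampling fraction across all of the inputs.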