Dereplicate long sequences
Dereplicate looooooong sequences!
If you want to get rid of duplicate long sequences (i.e., contigs that are exact substrings of other contigs), derep_seqs is the tool for you!
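To make "exact substring" concrete, here is a minimal awk sketch of the idea. It is an illustration only, not how derep_seqs is implemented. naive_derep is a hypothetical helper, it compares sequences as written (forward strand only), and it assumes the input is already sorted by increasing sequence length, so that any containing sequence comes later in the file. When several identical copies exist it keeps the last one, which is a choice of this sketch rather than documented derep_seqs behavior.

```shell
# Hypothetical helper, NOT part of derep_seqs: a naive O(n^2) illustration
# of substring dereplication. Assumes the fasta file is sorted by
# increasing sequence length.
naive_derep() {
  awk '
    # Header line: start a new record.
    /^>/ { n++; hdr[n] = $0; next }
    # Sequence line: append to the current record (handles wrapped fasta).
    { seq[n] = seq[n] $0 }
    END {
      for (i = 1; i <= n; i++) {
        dup = 0
        # Only later records can contain record i, because they are
        # at least as long.
        for (j = i + 1; j <= n; j++) {
          if (index(seq[j], seq[i]) > 0) { dup = 1; break }
        }
        if (!dup) print hdr[i] "\n" seq[i]
      }
    }
  ' "$1"
}
```

For example, given three records with sequences CGT, TTT, and ACGTA in that order, the sketch drops CGT (it occurs inside ACGTA) and keeps the other two.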
Download the source code (either with git clone or by downloading a release), cd into the source directory, and then use make to build it.
git clone https://github.com/mooreryan/derep_seqs.git
cd derep_seqs
make
This will install derep_seqs to the bin directory in the source directory. You can now move derep_seqs and sort_fasta to somewhere on your path if you’d like.
derep_seqs <num worker threads> <seqs.fasta> > seqs.derep.fa
The fasta file must be sorted by increasing sequence length. The program sort_fasta (included in the bin directory) will do this for you.
$ bin/derep_seqs 10 <(bin/sort_fasta contigs.fasta) > contigs.derep.fa
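If you would like to sanity-check a run like the one above, the following sketch may help. Both functions are hypothetical helpers built from standard awk and grep, not part of derep_seqs. is_length_sorted assumes that ties in length are acceptable, and count_seqs simply counts header lines.

```shell
# Hypothetical helpers for sanity-checking a run; NOT part of derep_seqs.

# Succeeds (exit status 0) only if the records appear in non-decreasing
# order of sequence length. Assumes ties are acceptable.
is_length_sorted() {
  awk '
    BEGIN { bad = 0; seen = 0; prev = 0; len = 0 }
    /^>/ {
      # A new header closes the previous record: compare its length
      # with the record before it.
      if (seen) {
        if (len < prev) bad = 1
        prev = len
      }
      seen = 1
      len = 0
      next
    }
    # Sequence line: accumulate length (handles wrapped fasta).
    { len += length($0) }
    END {
      # Check the final record, then report via the exit status.
      if (seen && len < prev) bad = 1
      exit bad
    }
  ' "$1"
}

# Prints the number of records (header lines) in a fasta file.
count_seqs() {
  grep -c '^>' "$1"
}
```

For instance, after writing the sorted input to a file with bin/sort_fasta contigs.fasta > contigs.sorted.fa, is_length_sorted contigs.sorted.fa should succeed, and count_seqs contigs.derep.fa should print a number no larger than count_seqs contigs.fasta.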
That’s it!