High-speed and high-ratio referential genome compression
A high-performance referential genome compression algorithm (termed HiRGC)
There are typos in Example 1. We correct them in the following.
We present a high-performance referential genome compression algorithm named HiRGC. It is based on a 2-bit encoding scheme and an advanced greedy-matching search on a hash table. We compare the performance of HiRGC with four state-of-the-art compression methods on a benchmark data set of eight human genomes. HiRGC takes less than 30 minutes to compress about 21 gigabytes of each set of the seven target genomes into 96 to 260 megabytes, achieving 217-fold to 82-fold compression. This performance is at least 1.9 times better than that of the best competing algorithm in its best case. Our compression speed is also at least 2.9 times faster. HiRGC is stable and robust across different reference genomes. In contrast, the competing methods’ performance varies widely across reference genomes. Further experiments on 100 human genomes from the 1000 Genomes Project and on genomes of several other species again demonstrate that HiRGC’s performance is consistently excellent.
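To make the two ingredients named above more concrete (2-bit encoding and greedy matching on a hash table), here is a toy Python sketch. It is not HiRGC's actual code: the function names, the k-mer length, the output format of matches and literals, and the restriction to uppercase A/C/G/T input are all assumptions made for illustration.

```python
# Illustrative sketch only; names and formats are hypothetical, not HiRGC's.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_2bit(seq):
    """Pack an A/C/G/T string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def build_index(reference, k):
    """Hash table: 2-bit-encoded k-mer -> list of start positions in the reference."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(encode_2bit(reference[i:i + k]), []).append(i)
    return index


def greedy_match(reference, target, k):
    """Describe target as (ref_pos, length) matches plus literal strings."""
    index = build_index(reference, k)
    ops, literal, i = [], [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        if i + k <= len(target):
            # Try every reference position sharing this k-mer; keep the longest extension.
            for pos in index.get(encode_2bit(target[i:i + k]), []):
                length = k
                while (pos + length < len(reference)
                       and i + length < len(target)
                       and reference[pos + length] == target[i + length]):
                    length += 1
                if length > best_len:
                    best_pos, best_len = pos, length
        if best_len:
            if literal:
                ops.append("".join(literal))
                literal = []
            ops.append((best_pos, best_len))
            i += best_len
        else:
            # No k-mer hit: emit this base as a literal and move on.
            literal.append(target[i])
            i += 1
    if literal:
        ops.append("".join(literal))
    return ops


def decode(reference, ops):
    """Rebuild the target from the reference and the match/literal list."""
    parts = []
    for op in ops:
        if isinstance(op, tuple):
            pos, length = op
            parts.append(reference[pos:pos + length])
        else:
            parts.append(op)
    return "".join(parts)
```

In this sketch a target that is identical to the reference collapses to a single (position, length) pair, which is the intuition behind referential compression: only the differences from the reference need to be stored.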
make hirgc
make de_hirgc
HiRGC can be run with the following three commands:
(1) ./hirgc -m file -r YH_chr1.fa -t HG18_chr1.fa
(2) ./hirgc -m genome -r YH -t HG18 -n chr_name.txt/default
(3) ./hirgc -m set -r YH -t genome_set.txt -n chr_name.txt/default
Some other explanation
./de_hirgc -m file -r YH_chr1.fa -t HG18_chr1.fa_ref_YH_chr1.fa.7z
./de_hirgc -m genome -r YH -t HG18_ref_YH.7z -n chr_name.txt/default
./de_hirgc -m set -r YH -t de_genome_set.txt -n chr_name.txt/default
Published in Bioinformatics. DOI: https://doi.org/10.1093/bioinformatics/btx412
Yuansheng Liu, Hui Peng, Limsoon Wong, Jinyan Li; High-speed and high-ratio referential genome compression, Bioinformatics, btx412, 2017, https://doi.org/10.1093/bioinformatics/btx412
If you encounter any bugs while running our code, please email yyuanshengliu@gmail.com