项目作者: felipelouza

项目描述 :
gsufsort: building suffix arrays, LCP-arrays and BWTs for string collections [AMB 2020]
高级语言: C
项目地址: git://github.com/felipelouza/gsufsort.git
创建时间: 2019-03-11T10:50:02Z
项目社区:https://github.com/felipelouza/gsufsort

开源协议:GNU General Public License v3.0

下载


gsufsort

gsufsort [1] is a fast, portable and lightweight tool for constructing the
suffix array and related data structures for string collections.

gsufsort runs in internal memory and data structures are written to disk.

For a string collection, gsufsort can compute the following data structures:

  • Suffix array (SA)
  • Inverse suffix array (ISA)
  • LCP-array (LCP)
  • k-truncated LCP-array (k-LCP)
  • Document array (DA)
  • Burrows-Wheeler transform (BWT)
  • Inverse BWT
  • Generalized suffix array (GSA)
  • Generalized enhanced suffix array (GESA)

Compilation and installation

gsufsort will compile in systems with a standard C compiler (like gcc) and make.

  1. git clone https://github.com/felipelouza/gsufsort.git
  2. cd gsufsort
  3. make

Issuing these commands will build executables gsufsort and gsufsort-64.

For inputs larger than 2GB, gsufsort-64 must be used.

To enable support to compressed files, zlib is required. If zlib is
not installed in you system, build with option make GZ=0.

Execution

  1. ./gsufsort INPUT [options]

where INPUT is a single file or directory with a string collection.

Construction options:

  1. --build (default)
  2. --sa [w] compute the SA using w bytes (default 4), write to INPUT.w.sa
  3. --isa [w] compute the ISA, write to INPUT.w.isa
  4. --lcp [w] compute the LCP, write to INPUT.w.lcp
  5. --trlcp k compute the k-truncated LCP array, write to INPUT.w.lcp
  6. --da [w] compute the DA, write to INPUT.w.da
  7. --gsa [w1][w2] compute the GSA=(string,suffix) as pairs of w1+w2 bytes, write to INPUT.w1.w2.gsa
  8. --gesa [w1][w2][w3] compute the GESA=(GSA,LCP,BWT), write to INPUT.w1.w2.w3.1.gesa
  9. --bwt compute the BWT using 1 byte per symbol, write to INPUT.bwt
  10. --docs d process only the first d strings in the collection
  11. --light run the lightweight algorithm to compute DA, GSA and GESA
  12. --output DIR/NAME write output files to DIR and use NAME as a prefix for file names

Loading options:

  1. --load load data-structures from disk INPUT[.sa][.da][.lcp][.gsa][.str]
  2. --ibwt invert the BWT, given INPUT.bwt, write output to INPUT.bwt.ibwt

Input options:

  1. --txt handle input as raw text files (one string per line)
  2. --fasta handle input as fasta files
  3. --fastq handle input as fastq files
  4. --dir include all files at the input directory
  5. --lower convert input to lowercase before data structures construction
  6. --upper convert input to uppercase before data structures construction

Output options:

  1. --str write the collection concatenation (T^{cat}) to INPUT.1.str
  2. --print [p] print the first p elements of arrays to stdout, defaults to the collection length
  3. --qs write QS sequences in fastq permuted according to the BWT to INPUT.bwt.qs
  4. --lcp_max write the maximum LCP value, write to INPUT.w.lcp_max
  5. --lcp_max_text write the maximum LCP value text, write to INPUT.lcp_max.txt
  6. --lcp_avg write the average LCP value, write to INPUT.w.lcp_avg
  7. --time print the running time in seconds
  8. --verbose verbose output
  9. --help this help message

Input files

  • File types (text, fasta or fastq) will be selected by extensions:
    .txt, .fasta (or .fa, .fna.) and .fastq (or .fq).

  • Options --txt, --fasta and --fastq enable loading file disregarding extensions.

  • In txt files, each line is taken as a strings in the collection.
    In fasta and fastq files, each sequence is taken as a string in the
    collection.

  • gsufsort supports the ASCII alphabet, but values 0 and 255 are
    reserved and must not occur in the input.

  • IUPAC symbols and ‘N’ are not handled as special symbols in fasta
    or fastq files.

  • gzipped input files (with .gz extension) are supported using
    zlib
    and
    kseq
    libraries. If zlib is not installed in your system, build
    gsufsort with the option make GZ=0. If zlib is not available
    and a gzipped file is given as input, a runtime error will occur.

  • A directory may be given as input, selecting option --dir.
    Every file with expected extensions in the directory will be
    processed to compose the collection, and the default output file
    prefix will be all.
    See also
    Wiki.

Output files

  • Output files are written by default in the current directory.

  • If option --output DIR/ is set, files are written to directory
    DIR. Setting --output DIR/NAME will make files be written
    to directory DIR with suffix NAME.

  • Output files format is discussed below.

quick test

To run a test with all strings from dataset/example.txt, type:

  1. ./gsufsort dataset/example.txt --sa --bwt
  1. ## gsufsort ##
  2. ## store_to_disk ##
  3. example.txt.4.sa 76 bytes (n = 19)
  4. example.txt.bwt 19 bytes (n = 19)

To see the result (option --print) stored in disk INPUT.4.sa and INPUT.bwt, use --load option:

  1. ./gsufsort example.txt --sa --bwt --load --print
  1. ## load_from_disk ##
  2. example.txt.4.sa 76 bytes (n = 19)
  3. example.txt.bwt 19 bytes (n = 19)
  4. i SA BWT suffixes
  5. 0 18 $ #
  6. 1 6 a $
  7. 2 12 a $
  8. 3 17 n $
  9. 4 5 n a$
  10. 5 11 b a$
  11. 6 9 n aba$
  12. 7 15 n an$
  13. 8 3 n ana$
  14. 9 7 $ anaba$
  15. 10 13 $ anan$
  16. 11 1 b anana$
  17. 12 10 a ba$
  18. 13 0 # banana$
  19. 14 16 a n$
  20. 15 4 a na$
  21. 16 8 a naba$
  22. 17 14 a nan$
  23. 18 2 a nana$

output format

  • SA, ISA, LCP, k-LCP and DA are each written sequentially to a binary
    file. The file has no header and every integer takes w
    bytes. The default value of w is 4.

  • BWT and iBWT are written in ASCII format, using 1 byte per input symbol.

  • The GSA is written sequentially to a binary file, with no headers.
    The values of DA and SA are intercalated throughout the file with w1 and w2 bytes respectively:
    DA[0],SA[0],DA[1],SA[1],…

  • The GESA is written sequentially to a binary file, with no headers.
    The values of DA and SA, LCP and BWT are intercalated throughout the file with w1, w2, w3 and 1 bytes respectively:
    DA[0],SA[0],LCP[0],BWT[0],DA[1],SA[1],LCP[1],BWT[1]…

wiki

See more details and additional features in Wiki.

authors

thanks

We thank to Antonis Maronikolakis for his helpful suggestions.

references

[1] Louza, F.A., Telles, G.P., Gog, S., Prezza, N., Rosone, G.. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms Mol Biol 15, 18 (2020). https://doi.org/10.1186/s13015-020-00177-y


Supported by the project Italian MIUR-SIR CMACBioSeq (“Combinatorial methods for analysis and compression of biological sequences“) grant n.~RBSI146R5L. P.I. Giovanna Rosone