gsufsort

gsufsort [1] is a fast, portable and lightweight tool for constructing the
suffix array and related data structures for string collections.

gsufsort runs in internal memory and data structures are written to disk.

For a string collection, gsufsort can compute the following data structures:

Suffix array (SA)
Inverse suffix array (ISA)
LCP-array (LCP)
k-truncated LCP-array (k-LCP)
Document array (DA)
Burrows-Wheeler transform (BWT)
Inverse BWT
Generalized suffix array (GSA)
Generalized enhanced suffix array (GESA)

Compilation and installation

gsufsort will compile in systems with a standard C compiler (like gcc) and make.

git clone https://github.com/felipelouza/gsufsort.git
cd gsufsort
make

Issuing these commands will build executables gsufsort and gsufsort-64.

For inputs larger than 2GB, gsufsort-64 must be used.

To enable support to compressed files, zlib is required. If zlib is
not installed in you system, build with option make GZ=0.

Execution

./gsufsort INPUT [options]

where INPUT is a single file or directory with a string collection.

Construction options:

--build                  (default)
--sa    [w]           compute the SA using w bytes (default 4), write to INPUT.w.sa
--isa   [w]           compute the ISA, write to INPUT.w.isa
--lcp   [w]           compute the LCP, write to INPUT.w.lcp
--trlcp k             compute the k-truncated LCP array, write to INPUT.w.lcp
--da    [w]           compute the DA, write to INPUT.w.da
--gsa   [w1][w2]      compute the GSA=(string,suffix) as pairs of w1+w2 bytes, write to INPUT.w1.w2.gsa
--gesa  [w1][w2][w3]  compute the GESA=(GSA,LCP,BWT), write to INPUT.w1.w2.w3.1.gesa
--bwt                 compute the BWT using 1 byte per symbol, write to INPUT.bwt
--docs  d             process only the first d strings in the collection
--light               run the lightweight algorithm to compute DA, GSA and GESA
--output DIR/NAME     write output files to DIR and use NAME as a prefix for file names

Loading options:

--load                load data-structures from disk INPUT[.sa][.da][.lcp][.gsa][.str]
--ibwt                invert the BWT, given INPUT.bwt, write output to INPUT.bwt.ibwt

Input options:

--txt                 handle input as raw text files (one string per line)
--fasta               handle input as fasta files 
--fastq               handle input as fastq files
--dir                 include all files at the input directory
--lower               convert input to lowercase before data structures construction
--upper               convert input to uppercase before data structures construction

Output options:

--str                 write the collection concatenation (T^{cat}) to INPUT.1.str
--print [p]           print the first p elements of arrays to stdout, defaults to the collection length
--qs                  write QS sequences in fastq permuted according to the BWT to INPUT.bwt.qs
--lcp_max             write the maximum LCP value, write to INPUT.w.lcp_max
--lcp_max_text        write the maximum LCP value text, write to INPUT.lcp_max.txt
--lcp_avg             write the average LCP value, write to INPUT.w.lcp_avg
--time                print the running time in seconds
--verbose             verbose output
--help                this help message

Input files

File types (text, fasta or fastq) will be selected by extensions:
.txt, .fasta (or .fa, .fna.) and .fastq (or .fq).
Options --txt, --fasta and --fastq enable loading file disregarding extensions.
In txt files, each line is taken as a strings in the collection.
In fasta and fastq files, each sequence is taken as a string in the
collection.
gsufsort supports the ASCII alphabet, but values 0 and 255 are
reserved and must not occur in the input.
IUPAC symbols and ‘N’ are not handled as special symbols in fasta
or fastq files.
gzipped input files (with .gz extension) are supported using
zlib
and
kseq
libraries. If zlib is not installed in your system, build
gsufsort with the option make GZ=0. If zlib is not available
and a gzipped file is given as input, a runtime error will occur.
A directory may be given as input, selecting option --dir.
Every file with expected extensions in the directory will be
processed to compose the collection, and the default output file
prefix will be all.
See also
Wiki.

Output files

Output files are written by default in the current directory.
If option --output DIR/ is set, files are written to directory
DIR. Setting --output DIR/NAME will make files be written
to directory DIR with suffix NAME.
Output files format is discussed below.

quick test

To run a test with all strings from dataset/example.txt, type:

./gsufsort dataset/example.txt --sa --bwt

## gsufsort ##
## store_to_disk ##
example.txt.4.sa    76 bytes (n = 19)
example.txt.bwt    19 bytes (n = 19)

To see the result (option --print) stored in disk INPUT.4.sa and INPUT.bwt, use --load option:

./gsufsort example.txt --sa --bwt --load --print

## load_from_disk ##
example.txt.4.sa    76 bytes (n = 19)
example.txt.bwt    19 bytes (n = 19)
i    SA    BWT    suffixes
0    18    $    #
1    6    a    $
2    12    a    $
3    17    n    $
4    5    n    a$
5    11    b    a$
6    9    n    aba$
7    15    n    an$
8    3    n    ana$
9    7    $    anaba$
10    13    $    anan$
11    1    b    anana$
12    10    a    ba$
13    0    #    banana$
14    16    a    n$
15    4    a    na$
16    8    a    naba$
17    14    a    nan$
18    2    a    nana$

output format

SA, ISA, LCP, k-LCP and DA are each written sequentially to a binary
file. The file has no header and every integer takes w
bytes. The default value of w is 4.
BWT and iBWT are written in ASCII format, using 1 byte per input symbol.
The GSA is written sequentially to a binary file, with no headers.
The values of DA and SA are intercalated throughout the file with w1 and w2 bytes respectively:
DA[0],SA[0],DA[1],SA[1],…
The GESA is written sequentially to a binary file, with no headers.
The values of DA and SA, LCP and BWT are intercalated throughout the file with w1, w2, w3 and 1 bytes respectively:
DA[0],SA[0],LCP[0],BWT[0],DA[1],SA[1],LCP[1],BWT[1]…

wiki

See more details and additional features in Wiki.

authors

thanks

We thank to Antonis Maronikolakis for his helpful suggestions.

references

[1] Louza, F.A., Telles, G.P., Gog, S., Prezza, N., Rosone, G.. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms Mol Biol 15, 18 (2020). https://doi.org/10.1186/s13015-020-00177-y

Supported by the project Italian MIUR-SIR CMACBioSeq (“Combinatorial methods for analysis and compression of biological sequences“) grant n.~RBSI146R5L. P.I. Giovanna Rosone ↩