项目作者: eth-sri

项目描述 :
AStarix: Fast and Optimal Sequence-to-Graph Aligner
高级语言: C++
项目地址: git://github.com/eth-sri/astarix.git
创建时间: 2019-11-12T14:03:44Z
项目社区:https://github.com/eth-sri/astarix

开源协议:Mozilla Public License 2.0

下载


AStarix

SRILAB


License: MPL 2.0

AStarix is a sequence-to-graph aligner based on the A* shortest path
algorithm. It finds alignments that are optimal according to edit-distance with
non-negative weights. The motivation of AStarix is to speed up optimal
alignment in the average case. A trie index allows to scale with reference size
[1], while seed heuristic allows to scale with query length [2].

Publications

[1]
Ivanov, Bichsel, Mustafa, Kahles, Rätsch, Vechev (RECOMB 2020)
AStarix: Fast and Optimal Sequence-to-Graph Alignment (2020) — Introduces AStarix tool, the trie index which allows scaling to bigger references, and Prefix heuristic which outperforms Dijkstra. Evaluations here.

[2]
Ivanov, Bichsel, Vechev (RECOMB 2022)
Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds — Introduces Seed heuristic which outperforms other aproaches on Illumina reads and allows to scale to long HiFi reads.

Installation

Prerequisites of AStarix are:

  • argp
    argument parsing library
  • pandas (optional) – dataframe library used for testing and evalutaions

To compile AStarix and run a test:

  1. make
  2. make test

Third-party libraries are located in the /ext directory and their own licenses
apply. Tested on Ubuntu 20.04.

Example run

This is an example run of AStarix on a toy data: aligning 100 simulated Illumina reads to a linear graph from the first 10000bp of Escherichia coli. You can expect a similar summary printed to standard output:

  1. $ make test
  2. release/astarix align-optimal -a astar-seeds -t 1 -v 0 -g data/ecoli_head10000_linear/graph.gfa -q data/ecoli_head10000_linear/illumina.fq -o tmp/ecoli_head10000_linear/astar-seeds --fixed_trie_depth 1 --seeds_len 25
  3. ----
  4. Assert() checks: OFF
  5. verbose: 0
  6. ----
  7. Loading reference graph... Added reverse complement... done in 0.001078s.
  8. Loading queries... done in 0.000216s.
  9. Contructing trie... done in 0.007017s.
  10. Initializing A* heuristic... done in 1.6e-05s.
  11. == General parameters and optimizations ==
  12. Alignment algo: astar-seeds
  13. Edit costs: 0, 1, 5, 5 (match, subst, ins, del)
  14. Greedy match?: true
  15. Threads: 1
  16. == A* parameters ==
  17. seed length: 25 bp
  18. skip near crumbs: 1
  19. == Data ==
  20. Original reference: 9826 nodes, 9824 edges
  21. Trie: 5186 nodes, 24822 edges, depth=7
  22. Reference+ReverseRef+Trie: 24838 nodes, 44470 edges, density: 0
  23. Trie: 5186 nodes, 24822 edges, depth=7
  24. Reference+ReverseRef+Trie: 24838 nodes, 44470 edges, density: 0
  25. Reads: 100 x 99bp, coverage: 1.00774x
  26. Avg phred value: 0.13640%
  27. == Aligning statistics ==
  28. Explored rate (avg): 1.70707 states/read_bp
  29. States with crumbs: 0.02992%
  30. Explored states: 0.01737%
  31. Skipped states: 99.95271%
  32. Pushed rate (avg, max): 0.77130, 0.01400 [states/bp] (states normalized by query length)
  33. Popped rate (avg, max): 0.09310, 0.00140
  34. Average popped: 7.16000 from trie (76.90655%) vs 2.15000 from ref (per read)
  35. Total cost of aligned reads: 15, 0.15000 per read, 0.15152% per letter
  36. Alignments: 100 unique (100.00000%), 0 ambiguous (0.00000%) and 0 overcost (0.00000%)
  37. == Heuristic stats (astar-seeds) ==
  38. For all reads:
  39. Seeds: 400 (4.00000 per read)
  40. Seed matches: 385 (3.85000 per read, 0.96250 per seed)
  41. States with crumbs: 29105 [+0.00000% repeated], (291.05000 per read)
  42. Heuristic (avg): 0.10000 of potential 4.00000
  43. == Performance ==
  44. Memory: measured | estimated
  45. total: 0.00510gb, 100% | -
  46. reference: 0.00267gb, 52.39521% | 5.74505%
  47. reads: 0.00000gb, 0.00000% | 0.18091%
  48. trie: 0.00100gb, 19.53593% | 5.82224%
  49. equiv. classes opt.: 0.00000gb, 0.00000%
  50. A*-memoization: 0.00000gb, 0.00000
  51. Total wall runtime: 0.04086s
  52. reference loading: 0.00108s
  53. queries loading: 0.00022s
  54. construct trie: 0.00702s
  55. precompute: 0.00002s
  56. align (wall time): 0.03040s = 3289.04470 reads/s = 325.61543 Kbp/s
  57. Total align cpu time: 0.02998s = 3335.11206 reads/s = 330.17609 Kbp/s
  58. | Preprocessing: 31.61686%
  59. | A* query: 18.47652%
  60. | greedy_match: 11.42609%
  61. | other: 38.48052%
  62. DONE

The directory tmp/ecoli_head10000_linear/astar-seeds/ will contain a file with execution logs and a file with alignment statistics.
Short aggregated statistics are print to standard output (to redirect, you can add >summary.txt).

Usage

AStarix only finds optimal alignments (specified by argument align-optimal). Currently supported formats are .gfa without overlapping nodes (for a graph reference) and .fa/.fasta (for a linear reference). The queries should be in .fq/.fastq format (the phred values are ignored).

  1. $ astarix --help
  2. ----
  3. Usage: astarix [OPTION...] align-optimal -g GRAPH.gfa -q READS.fq -o OUT_DIR/
  4. Optimal sequence-to-graph aligner based on A* shortest path.
  5. -a, --algorithm={dijkstra, astar-prefix, astar-seeds}
  6. Shortest path algorithm
  7. -c, --prefix_cost_cap=A*_COST_CAP
  8. The maximum prefix cost for the A* heuristic
  9. -d, --prefix_len_cap=A*_PREFIX_CAP
  10. The upcoming sequence length cap for the A*
  11. heuristic
  12. -D, --tree_depth=TREE_DEPTH Suffix tree depth
  13. -e, --prefix_equivalence_classes=A*_EQ_CLASSES
  14. Whether to partition all nodes to equivalence
  15. classes in order not to reuse the heuristic
  16. --fixed_trie_depth=FIXED_TRIE_DEPTH
  17. Some leafs depth can be less than tree_depth
  18. (variable=0, fixed=1)
  19. -f, --greedy_match=GREEDY_MATCH
  20. Proceed greedily forward if there is a unique
  21. matching outgoing edge
  22. -g, --graph=GRAPH Input graph (.gfa)
  23. -G, --gap=GAP_COST Gap (Insertion or Deletion) penalty [5]
  24. -k, --k_best_alignments=TOP_K Output at most k optimal alignments per read
  25. [1]
  26. -M, --match=MATCH_COST Match penalty [0]
  27. -o, --outdir=OUTDIR Output directory
  28. -q, --query=QUERY Input queries/reads (.fq, .fastq)
  29. --seeds_len=A*_SEED_LEN The length of the A* seeds.
  30. --seeds_skip_near_crumbs={0,1}
  31. -S, --subst=SUBST_COST Substitution penalty [1]
  32. -t, --threads=THREADS Number of threads [1]
  33. -v, --verbose=THREADS Verbosity (silent=0, info=1, debug=2), [0]
  34. -?, --help Give this help list
  35. --usage Give a short usage message
  36. -V, --version Print program version