Project author: enormandeau

Project description: Environmental DNA metabarcoding analysis
Language: Python
Repository: git://github.com/enormandeau/barque.git
Created: 2017-02-21T15:11:38Z
Project home: https://github.com/enormandeau/barque


Barque v1.8.5

Environmental DNA metabarcoding analysis

Developed by Eric Normandeau in Louis Bernatchez's laboratory.

Licence information at the end of this file.

Description

Barque is a fast eDNA metabarcoding analysis pipeline that first denoises
and then annotates ASVs or OTUs, using high-quality barcoding databases.

Barque can produce denoised OTUs and annotate them using a custom database.
These annotated OTUs can then be used as a database themselves to find read
counts per OTU per sample, effectively annotating the reads with the OTUs that
were previously found. In this process, some of the OTUs are annotated to the
species level, some to the genus or higher levels.

Citation

Barque is described as an accurate and efficient eDNA analysis pipeline in:

Mathon L, Guérin P-E, Normandeau E, Valentini A, Noel C, Lionnet C, Linard B,
Thuiller W, Bernatchez L, Mouillot D, Dejean T, Manel S. 2021. Benchmarking
bioinformatic tools for fast and accurate eDNA metabarcoding species
identification. Molecular Ecology Resources.

https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13430

It is also presented in:

Hakimzadeh A et al. 2023. A pile of pipelines: An overview of the
bioinformatics software for metabarcoding data analyses. Molecular Ecology
Resources.

https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13847

Use cases

  • Monitoring invasive species
  • Confirming the presence of specific species
  • Characterizing meta-communities in varied environments
  • Improving species distribution knowledge of cryptic taxa
  • Following loss of species over medium to long-term monitoring

Since Barque depends on high-quality barcoding databases, it is especially
useful for amplicons that already have large databases, such as COI amplicons
from the Barcode of Life Database (BOLD) or 12S amplicons from the mitofish
database. However, it can use any database once it is formatted for Barque,
for example the Silva database for the 18S gene or any other custom database.
If species annotations are not possible, Barque can be used in OTU mode.

Installation

To use Barque, you will need a local copy of its repository. Different
releases can be found here. It is recommended to always use the latest
release, or even the development version. You can either download an archive
of the latest release at the above link or get the latest commit (recommended)
with the following git command:

    git clone https://github.com/enormandeau/barque

Dependencies

To run Barque, you will also need to have the following programs installed
on your computer.

  • Barque will only work on GNU/Linux or macOS
  • bash 4+
  • python 3.5+ (you can use miniconda3 to install python)
  • python distutils package
  • R 3+ (ubuntu/mint: sudo apt-get install r-base-core)
  • java (ubuntu/mint: sudo apt-get install default-jre)
  • gnu parallel
  • flash (read merger) v1.2.11+
  • vsearch v2.14.2+
    • /!\ v2.14.2+ required /!\
    • Barque will not work with older versions of vsearch

Preparation

  • Install dependencies
  • Download a copy of the Barque repository (see Installation above)
  • Edit 02_info/primers.csv to provide information describing your primers
  • Get or prepare the database(s) (see Formatting database section below) and
    deposit the fasta.gz file in the 03_databases folder and give it a name
    that matches the information of the 02_info/primers.csv file.
  • Modify the parameters in 02_info/barque_config.sh for your run
  • Launch Barque, for example with ./barque 02_info/barque_config.sh

Overview of Barque steps

During the analyses, the following steps are performed:

  • Filter and trim raw reads (trimmomatic)
  • Merge paired-end reads (flash)
  • Split merged reads by amplicon (Python script)
  • Look for chimeras and denoise reads (vsearch and unoise3 algorithm)
  • Merge unique reads (Python script)
  • Find species or OTUs associated with unique, denoised reads (vsearch)
  • Summarize results (Python script)
    • Tables of phylum, genus, and species counts per sample, including multiple hits
    • Number of retained reads per sample at each analysis step, with a figure
    • Most frequent non-annotated sequences to blast on NCBI nt/nr
    • Species counts for these non-annotated sequences
    • Sequence groups for cases of multiple hits
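The "Merge unique reads" (dereplication) step can be illustrated with a minimal sketch. This is not Barque's actual script, just the idea: identical sequences are collapsed and counted.

```python
from collections import Counter

def dereplicate(seqs):
    """Collapse identical sequences into (sequence, read_count) pairs,
    sorted by decreasing abundance, then alphabetically for ties."""
    counts = Counter(seqs)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

reads = ["ACGT", "ACGT", "TTGA", "ACGT", "TTGA"]
print(dereplicate(reads))  # [('ACGT', 3), ('TTGA', 2)]
```

In the real pipeline, the counts per unique sequence are what vsearch and the downstream annotation steps operate on.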

Running the pipeline

For each new project, get a new copy of Barque from the source listed in
the Installation section. In this case, you do not need to modify the
primer and config files.

Running on the test dataset

If you want to test Barque, jump straight to the Test dataset section at
the end of this file. Later, be sure to read through the README to understand
the program and its outputs.

Preparing samples

Copy your demultiplexed paired-end sample files in the 04_data folder. You
need one pair of files per sample. The sequences in these files must contain
the sequences of the primer that you used during the PCR. Depending on the
format in which you received your sequences from the sequencing facility, you
may have to proceed to demultiplexing before you can use Barque.

IMPORTANT: The file names must follow this format:

    SampleID_*_R1_001.fastq.gz
    SampleID_*_R2_001.fastq.gz

Notes: Each sample name, or SampleID, must contain no underscore (_) and must
be followed by an underscore (_). The star (*) can be any string of text that
does not contain space characters. For example, you can use dashes (-) to
separate parts of your sample names, e.g.:

PopA-sample001_ANYTHING_R1_001.fastq.gz.
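The naming convention above can be checked with a small sketch. The function name and regular expression here are illustrative assumptions, not part of Barque:

```python
import re

# SampleID (no underscores or spaces) + "_" + any text without spaces
# + "_R1_001.fastq.gz" or "_R2_001.fastq.gz"
NAME_RE = re.compile(
    r"^(?P<sample>[^_\s]+)_(?P<middle>\S*)_R(?P<read>[12])_001\.fastq\.gz$"
)

def parse_sample_name(filename):
    """Return (sample_id, read_number) if the name follows the convention,
    otherwise None."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    return m.group("sample"), int(m.group("read"))

print(parse_sample_name("PopA-sample001_ANYTHING_R1_001.fastq.gz"))
# ('PopA-sample001', 1)
print(parse_sample_name("sample001.fastq.gz"))  # None (no _R1_001 suffix)
```

Running a check like this over the contents of 04_data before launching the pipeline can catch naming problems early.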

Formatting database

You need to put a database in gzip-compressed Fasta format, or .fasta.gz, in
the 03_databases folder.

An augmented version of the mitofish 12S database, as well as 16S and cytb
databases, is already included with Barque.

The pre-formatted BOLD databases ready for Barque can be downloaded below.
Note that you will need to rename the downloaded file to bold.fasta.gz

https://www.ibis.ulaval.ca/services/bioinformatique/barque_databases/

If you want to use a newer version of the BOLD database, you will need to
download all the animal BINs from
this page,
put the downloaded Fasta files in 03_databases/bold_bins (you will need to create
that folder), and run the commands to format the bold database:

    # Format each BIN individually (~10 minutes)
    # Note: the `species_to_remove.txt` file is optional
    ls -1 03_databases/bold_bins/*.fas.gz |
        parallel ./01_scripts/util/format_databases/format_bold_database.py \
        {} {.}_prepared.fasta.gz

    # Concatenate the resulting formatted bins into one file
    gunzip -c 03_databases/bold_bins/*_prepared.fasta.gz | gzip - > 03_databases/bold.fasta.gz
  • For other databases, get the database and format it:
    • Name lines must contain three information fields separated by underscores (_)
    • Ex: >Phylum_Genus_species
    • Ex: >Family_Genus_species
    • Ex: >Mammal_rattus_norvegicus
    • gzip-compressed Fasta format (DATABASE_NAME.fasta.gz)
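A minimal sketch of a helper that builds name lines in this format. The function name and the replacement of spaces and underscores inside fields are illustrative assumptions, not part of Barque:

```python
def make_barque_header(field1, genus, species):
    """Build a database name line with the three required
    underscore-separated fields, e.g. '>Actinopteri_Salmo_salar'.
    Underscores or spaces inside a field would break the three-field
    format, so they are replaced with dashes."""
    fields = [
        f.strip().replace("_", "-").replace(" ", "-")
        for f in (field1, genus, species)
    ]
    return ">" + "_".join(fields)

print(make_barque_header("Actinopteri", "Salmo", "salar"))
# >Actinopteri_Salmo_salar
```

Applying a cleanup like this when converting a custom database avoids headers that Barque cannot split into its three fields.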

Configuration file

Copy and modify the parameters in 02_info/barque_config.sh as needed.

Launching Barque

Launch the barque executable with the name of your configuration file as an
argument, like this:

    ./barque 02_info/<YOUR_CONFIG_FILE>

Reducing false positives

Two of the parameters in the config file can help reduce the presence of false
positive annotations in the results: MIN_HITS_EXPERIMENT and
MIN_HITS_SAMPLE. The defaults for both are very permissive and should be
modified if false positives are problematic in the results. Additionally, the
script filter_sites_by_proportion.py is provided to filter out species
annotations that fall below a minimum proportion of reads in each sample. This
filter is especially useful if the different samples have very unequal numbers
of reads. Having a high-quality database will also help reduce false
annotations. Finally, manual curation of the results is recommended with any
eDNA analysis, regardless of the software used.
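The idea behind a proportion filter can be sketched as follows. This mimics the intent of filter_sites_by_proportion.py; the threshold, data layout, and exact behavior of the real script may differ:

```python
def filter_by_proportion(counts, min_proportion=0.001):
    """Drop species whose reads make up less than `min_proportion` of a
    sample's total. `counts` maps sample -> {species: read_count}."""
    filtered = {}
    for sample, species_counts in counts.items():
        total = sum(species_counts.values())
        filtered[sample] = {
            sp: n for sp, n in species_counts.items()
            if total > 0 and n / total >= min_proportion
        }
    return filtered

table = {"site1": {"Salmo_salar": 9990, "Esox_lucius": 10}}
print(filter_by_proportion(table, min_proportion=0.01))
# Esox_lucius (0.1% of the sample's reads) is dropped
```

Because the threshold is a proportion rather than a raw count, deeply and shallowly sequenced samples are filtered comparably.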

Results

Once the pipeline has finished running, all result files are found in the
12_results folder.

After a run, it is recommended to make a copy of this folder and name it with
the current date, e.g.:

    cp -r 12_results 12_results_PROJECT_NAME_2024-02-29_SOME_ADDITIONAL_INFO

Taxa count tables, named after the primer names

  • PRIMER_genus_table.csv
  • PRIMER_phylum_table.csv
  • PRIMER_species_table.csv

Sequence dropout report and figure

  • sequence_dropout.csv: Lists how many sequences were present in each sample
    at every analysis step. Depending on library and sequencing quality, as well
    as the biological diversity found at the sample site, more or fewer sequences
    are lost at each analysis step. The figure sequence_dropout_figure.png
    shows how many sequences are retained for each sample at each step of the pipeline.

Most frequent non-annotated sequences

  • most_frequent_non_annotated_sequences.fasta: Sequences that are frequent in
    the samples but were not annotated by the pipeline. This Fasta file should be
    used to query the NCBI nt/nr database using the online portal
    found here
    to see what species may have been missed. Use blastn with default parameters.
    Once the NCBI blastn search is finished, download the results as a text file
    and use the command from the Summarize species found in non-annotated
    sequences section below (adjusting the input and output file names) to
    generate a report of the most frequently found species in the
    non-annotated sequences.

Fasta files with sequences from multiple hit groups

  • 12_results/01_multihits contains Fasta files with database and sample
    sequences to help understand why some of the sequences cannot be unambiguously
    assigned to one species. For example, sometimes two different species can have
    identical reads in the database. At other times, sample sequences can have the
    same distance to sequences of two different species in the database.

Summarize species found in non-annotated sequences

    ./01_scripts/10_report_species_for_non_annotated_sequences.py \
        12_results/NCBI-Alignment.txt \
        12_results/most_frequent_non_annotated_sequences_species_ncbi.csv 97 |
        sort -u -k 2,3 | cut -c 2- | perl -pe 's/ /\t/' > missing_species_97_percent.txt

The first result file will contain one line per identified taxon and the number
of sequences for each taxon, sorted in decreasing order. For any species of
interest found in this file, it is a good idea to download the representative
sequences from NCBI, add them to the database, and rerun the analysis.

You can modify the percentage value, here 97. The
missing_species_97_percent.txt file will list the sequence identifiers from
NCBI so that you can download them from the online database and add them to
your own database as needed.

One way to do this automatically is to make a file with only the first column,
that is: one NCBI sequence identifier per line, and load it on this page:

https://www.ncbi.nlm.nih.gov/sites/batchentrez
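Extracting that first column can be done with a short sketch, assuming the tab-separated layout produced by the command above (the identifiers below are made-up examples):

```python
def extract_identifiers(report_text):
    """Return the first tab-separated column of a report, one NCBI
    sequence identifier per line, ready to upload to Batch Entrez."""
    return "\n".join(
        line.split("\t")[0]
        for line in report_text.splitlines()
        if line.strip()
    )

example = "MN122874.1\tSalmo salar\nKY798496.1\tEsox lucius\n"
print(extract_identifiers(example))  # prints the two identifiers, one per line
```

The same result can be obtained on the command line with cut -f 1; the sketch just shows what the uploaded file should contain.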

You will need to rename the sequences to follow the database name format
described in the Formatting database section and add them to your current
database.

Hit similarity values per species and site

  • Use 12_results/similarity_by_species_graph.R to explore hit similarity
    values per species per site.

Log files and parameters

For each Barque run, three files are written in the 99_logfiles folder.
Each filename contains a timestamp of the run:

  1. The exact barque config file that has been used
  2. The exact primer file as it was used
  3. The full log of the run

Lather, Rinse, Repeat

Once the pipeline has been run, it is normal to find that unexpected species
have been found or that a proportion of the reads have not been identified,
either because the sequenced species are absent from the database or because
the sequences have the exact same distance from two or more sequences in the
database. In these cases, you may need to remove unwanted species from the
database or download additional sequences for the non-annotated species from
NCBI to add them to the database. Once the database has been improved, simply
run the pipeline again with this new database. You can put SKIP_DATA_PREP=1 in
your config file if you wish to avoid repeating the initial data preparation
steps of Barque. You may need to repeat this procedure until you are satisfied
with the completeness of the results.

NOTE: You should provide justifications in your publications if you decide to
remove some species from the database or results based on available knowledge
about species distribution.

Test dataset

A test dataset is available as a
sister repository on GitHub.
It is composed of 10 mitofish-12S metabarcoding samples, each with 10,000
forward and reverse sequence pairs.

Download the repository and then move the data from
barque_test_dataset/04_data to Barque's 04_data folder.

If you have git and Barque's dependencies installed, the following commands
will download the Barque repository and the test data and put them in the
appropriate folder.

    git clone https://github.com/enormandeau/barque
    git clone https://github.com/enormandeau/barque_test_dataset
    cp barque_test_dataset/04_data/* barque/04_data/

To run the analysis, move to the barque folder and launch:

    cd barque
    ./barque 02_info/barque_config.sh

The analysis of this test dataset should take less than a minute.

License

CC share-alike

Barque by Eric Normandeau is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.