Project author: ShawHahnLab

Project description:
Antibody analysis umbrella package
Language: Python
Project URL: git://github.com/ShawHahnLab/igseq.git
Created: 2020-04-06T17:59:59Z
Project community: https://github.com/ShawHahnLab/igseq

License: GNU Affero General Public License v3.0

IgSeq Utilities

A command-line tool for various common Ig-Seq tasks. These are heavily biased
toward the peculiarities of our protocol and for rhesus macaque antibody
sequences. Your mileage may vary.

Install

First, install mamba via Miniforge.

Then install the latest release from https://anaconda.org/ShawHahnLab/igseq:

    mamba create --name igseq -c conda-forge -c bioconda -c ShawHahnLab igseq
    mamba activate igseq

Or, install from the latest source here:

    git clone https://github.com/ShawHahnLab/igseq.git
    cd igseq
    mamba env update --file igseq/data/environment.yml
    mamba activate igseq
    pip install .

Some Instructions

The igseq command is organized into subcommands, grouped into two
categories: early read processing tasks (demultiplex, trim, merge, etc.), and
various convenience tools (IgBLAST this against that, what database has the
closest V gene, etc.).

Read Processing

Read processing subcommands:

  • getreads: Run bcl2fastq with some customized settings to write
    Undetermined I1/R1/R2 fastq.gz files.
  • demux: Demultiplex the I1/R1/R2 files according to per-sample barcodes.
  • phix: Map reads left unassigned post-demux to the PhiX genome for
    troubleshooting.
  • trim: Run Cutadapt to remove adapter/primer/barcode and low-quality
    sequences on a per-sample basis.
  • merge: Merge R1/R2 for each sample with PEAR.

Each read processing step writes a read-count summary table,
<step>.counts.csv, and derives its default output paths from its inputs. Most
steps can work per-sample or per-directory, so it's easy to chain the steps
together for a given run:

    igseq getreads /seq/runs/211105_M05588_0469_000000000-JWV49
    igseq demux -s samples.csv analysis/reads/211105_M05588_0469_000000000-JWV49
    igseq phix analysis/demux/211105_M05588_0469_000000000-JWV49
    igseq trim -s samples.csv -S rhesus analysis/demux/211105_M05588_0469_000000000-JWV49
    igseq merge analysis/trim/211105_M05588_0469_000000000-JWV49
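
The chain above can also be assembled from a script. The following is a minimal sketch, not part of igseq itself: the helper name readproc_commands and its keyword arguments are hypothetical, and the analysis/reads, analysis/demux, and analysis/trim paths are simply copied from the example commands rather than taken from any documented rule.

```python
import shlex
from pathlib import PurePosixPath


def readproc_commands(run_dir, samples_csv="samples.csv", species="rhesus",
                      root="analysis"):
    """Build the five read processing commands for one run directory.

    Hypothetical helper: the output layout (<root>/reads, <root>/demux,
    <root>/trim) mirrors the example commands and is an assumption here.
    """
    run_id = PurePosixPath(run_dir).name
    return [
        ["igseq", "getreads", run_dir],
        ["igseq", "demux", "-s", samples_csv, f"{root}/reads/{run_id}"],
        ["igseq", "phix", f"{root}/demux/{run_id}"],
        ["igseq", "trim", "-s", samples_csv, "-S", species,
         f"{root}/demux/{run_id}"],
        ["igseq", "merge", f"{root}/trim/{run_id}"],
    ]


commands = readproc_commands("/seq/runs/211105_M05588_0469_000000000-JWV49")
for cmd in commands:
    # Print each command; to actually run the steps in order, replace this
    # with subprocess.run(cmd, check=True) so a failed step stops the chain.
    print(shlex.join(cmd))
```

Keeping each command as an argument list avoids shell quoting problems if a run directory or sample sheet path ever contains spaces.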

The samples.csv file is a table matching sample names to run IDs, barcode
IDs, and antibody chain types, like:

    Sample,Run,BarcodeFwd,BarcodeRev,Type
    wk12H,211105_M05588_0469_000000000-JWV49,1,1,gamma
    wk12K,211105_M05588_0469_000000000-JWV49,2,2,kappa
    wk24H,211105_M05588_0469_000000000-JWV49,3,3,gamma
    wk24K,211105_M05588_0469_000000000-JWV49,4,4,kappa
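
To check a sample sheet before starting a run, it can help to load it independently of igseq. Here is a minimal sketch using only the Python standard library; the function name load_samples and the duplicate-name check are illustrative assumptions and do not reflect how igseq parses the file.

```python
import csv
import io

# The example table from above, inlined so the sketch is self-contained.
SAMPLES_CSV = """\
Sample,Run,BarcodeFwd,BarcodeRev,Type
wk12H,211105_M05588_0469_000000000-JWV49,1,1,gamma
wk12K,211105_M05588_0469_000000000-JWV49,2,2,kappa
wk24H,211105_M05588_0469_000000000-JWV49,3,3,gamma
wk24K,211105_M05588_0469_000000000-JWV49,4,4,kappa
"""


def load_samples(handle):
    """Read samples.csv rows into a dict keyed by sample name.

    Hypothetical helper: rejects repeated sample names and converts the
    barcode IDs to integers.
    """
    samples = {}
    for row in csv.DictReader(handle):
        name = row["Sample"]
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        samples[name] = {
            "run": row["Run"],
            "barcode_fwd": int(row["BarcodeFwd"]),
            "barcode_rev": int(row["BarcodeRev"]),
            "type": row["Type"],
        }
    return samples


samples = load_samples(io.StringIO(SAMPLES_CSV))
# Names of the samples sequenced with the gamma (heavy chain) primer.
heavy = [name for name, info in samples.items() if info["type"] == "gamma"]
print(heavy)
```

For a file on disk, pass open("samples.csv", newline="") in place of the io.StringIO object.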

The barcode IDs refer to the protocol's numbered barcodes. The forward
barcodes carry a varying-length randomized prefix:

    $ igseq show barcodes
    Direction BC Seq
    F         1  NNNNAACCACTA
    F         2  NNNNNAACTCTAA
    F         3  NNNNNNAAGGCCCT
    F         4  NNNNNNNAATATGTC
    F         5  NNNNNNNNAATCGTCA
    ...
    R         1  TAGTGGTT
    R         2  TTAGAGTT
    R         3  AGGGCCTT
    R         4  GACATATT
    R         5  TGACGATT
    ...
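
The structure of these barcodes can be inspected with a few lines of Python. The sketch below only covers the five rows shown above (the elided rows are not reproduced), and the helper names split_prefix and revcmp are hypothetical. For these five rows, each reverse barcode matches the reverse complement of the same-numbered forward barcode once the N prefix is removed; that is an observation about the listed sequences, not a documented guarantee for the full set.

```python
# Barcodes copied from the `igseq show barcodes` output above (IDs 1-5 only).
FWD = {
    1: "NNNNAACCACTA",
    2: "NNNNNAACTCTAA",
    3: "NNNNNNAAGGCCCT",
    4: "NNNNNNNAATATGTC",
    5: "NNNNNNNNAATCGTCA",
}
REV = {
    1: "TAGTGGTT",
    2: "TTAGAGTT",
    3: "AGGGCCTT",
    4: "GACATATT",
    5: "TGACGATT",
}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def split_prefix(barcode):
    """Return (length of the leading run of N, remaining fixed sequence)."""
    core = barcode.lstrip("N")
    return len(barcode) - len(core), core


def revcmp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


matches = {}
for bc_id, seq in FWD.items():
    n_len, core = split_prefix(seq)
    matches[bc_id] = revcmp(core) == REV[bc_id]
    print(bc_id, n_len, core, matches[bc_id])
```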

The chain type is used to select the appropriate constant region primer:

    $ igseq show primers
    Species Type    Seq
    human   gamma   GCCAGGGGGAAGACCGATGGGCCCTTGGTGGA
    human   alpha   GAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGG
    human   mu      AGGAGACGAGGGGGAAAAGGGTTGGGGCGGATG
    human   epsilon GCGGGTCAAGGGGAAGACGGATGGGCTCTGTGT
    human   delta   CTGATATGATGGGGAACACATCCGGAGCCTTGG
    human   kappa   GCGGGAAGATGAAGACAGATGGTGCAGCCACAG
    human   lambda  GGCCTTGTTGGCTTGAAGCTCCTCAGAGGAGGG
    rhesus  gamma   GCCAGGGGGAAGACCGATGGGCCCTTGGTGGA
    rhesus  alpha   GAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGG
    rhesus  mu      GAGACGAGGGGGAAAAGGGTTGGGGCGGATGCA
    rhesus  epsilon CGGGTCAAGGGGAAGACGGATGGGCTCTGTGTG
    rhesus  delta   CTGATATGATGGGGAACACATCCGGAGCCTTGG
    rhesus  kappa   GCGGGAAGATGAAGACAGATGGTGCAGCCACAG
    rhesus  lambda  GGCCTTGTTGGCTTGAAGCTCCTCAGAGGAGGG
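
The primer selection can be pictured as a lookup keyed on species and chain type. The sketch below is an illustration built from the table above, not igseq's own code: the function name primer_for and the dictionary layout are assumptions. It also lists the chain types whose human and rhesus primers differ in this table.

```python
# Whitespace-separated rows copied from the `igseq show primers` output above.
PRIMER_TABLE = """\
human gamma GCCAGGGGGAAGACCGATGGGCCCTTGGTGGA
human alpha GAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGG
human mu AGGAGACGAGGGGGAAAAGGGTTGGGGCGGATG
human epsilon GCGGGTCAAGGGGAAGACGGATGGGCTCTGTGT
human delta CTGATATGATGGGGAACACATCCGGAGCCTTGG
human kappa GCGGGAAGATGAAGACAGATGGTGCAGCCACAG
human lambda GGCCTTGTTGGCTTGAAGCTCCTCAGAGGAGGG
rhesus gamma GCCAGGGGGAAGACCGATGGGCCCTTGGTGGA
rhesus alpha GAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGG
rhesus mu GAGACGAGGGGGAAAAGGGTTGGGGCGGATGCA
rhesus epsilon CGGGTCAAGGGGAAGACGGATGGGCTCTGTGTG
rhesus delta CTGATATGATGGGGAACACATCCGGAGCCTTGG
rhesus kappa GCGGGAAGATGAAGACAGATGGTGCAGCCACAG
rhesus lambda GGCCTTGTTGGCTTGAAGCTCCTCAGAGGAGGG
"""

# Map (species, chain type) to the constant region primer sequence.
PRIMERS = {}
for line in PRIMER_TABLE.splitlines():
    species, chain_type, seq = line.split()
    PRIMERS[(species, chain_type)] = seq


def primer_for(species, chain_type):
    """Hypothetical lookup helper; raises KeyError for unknown combinations."""
    return PRIMERS[(species, chain_type)]


# Chain types whose primer differs between human and rhesus in this table.
chain_types = [ct for (sp, ct) in PRIMERS if sp == "human"]
differing = [ct for ct in chain_types
             if primer_for("human", ct) != primer_for("rhesus", ct)]

print(primer_for("rhesus", "kappa"))
print(differing)
```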

See igseq/data/examples/readproc.sh for an example read processing workflow
from start to finish with a small set of reads.

Convenience Tools

Various convenience subcommands:

  • igblast: Run IgBLAST with a streamlined interface. This can handle
    transparent database and auxiliary data file creation from rhesus or human
    germline V(D)J references.
  • summarize: Summarize attributes of antibody sequences in a table via
    IgBLAST.
  • vdj-gather: Gather VDJ sequences into one directory.
  • vdj-match: Find closest-matching germline VDJ sequences.
  • convert: Convert between FASTA/FASTQ/CSV/TSV formats.
  • identity: Calculate pairwise identities.
  • msa: Create multiple sequence alignments (using MUSCLE).
  • tree: Create and format phylogenetic trees (using FastTree).
  • list, show: List built-in reference data files, and show file contents with
    support for pretty-printing some common formats.