IgSeq Utilities

A command-line tool for common Ig-Seq tasks. The subcommands are heavily biased
toward the peculiarities of our protocol and toward rhesus macaque antibody
sequences. Your mileage may vary.

Install

First, install mamba via Miniforge.

Then install the latest release from https://anaconda.org/ShawHahnLab/igseq:

mamba create --name igseq -c conda-forge -c bioconda -c ShawHahnLab igseq
mamba activate igseq

Or, install from the latest source here:

git clone https://github.com/ShawHahnLab/igseq.git
cd igseq
mamba env update --file igseq/data/environment.yml
mamba activate igseq
pip install .

Some Instructions

The igseq command is organized into subcommands, grouped into two
categories: early read processing tasks (demultiplex, trim, merge, etc.), and
various convenience tools (IgBLAST this against that, what database has the
closest V gene, etc.).

Read Processing

Read processing subcommands:

getreads: Run bcl2fastq with some customized settings to write
Undetermined I1/R1/R2 fastq.gz files.
demux: Demultiplex the I1/R1/R2 files according to per-sample barcodes.
phix: Map reads left unassigned post-demux to the PhiX genome for
troubleshooting.
trim: Run Cutadapt to remove adapter/primer/barcode and low-quality
sequences on a per-sample basis.
merge: Merge R1/R2 for each sample with PEAR.

Each read processing step writes a read counts summary table,
<step>.counts.csv, and has default output paths derived from its inputs. Most
steps can work per-sample or per-directory, so it’s easy to chain the steps
together for a given run:

igseq getreads /seq/runs/211105_M05588_0469_000000000-JWV49
igseq demux -s samples.csv analysis/reads/211105_M05588_0469_000000000-JWV49
igseq phix analysis/demux/211105_M05588_0469_000000000-JWV49
igseq trim -s samples.csv -S rhesus analysis/demux/211105_M05588_0469_000000000-JWV49
igseq merge analysis/trim/211105_M05588_0469_000000000-JWV49
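
The chain above follows a regular pattern: each step reads the previous step's
output directory for the same run ID. The following is a minimal Python sketch
of that pattern. It only builds the command lines shown above and does not run
them. The function name readproc_commands is hypothetical, and the
analysis/reads, analysis/demux, and analysis/trim paths are copied from the
example above rather than from igseq's documentation, so treat the layout as an
assumption.

```python
from pathlib import PurePosixPath


def readproc_commands(run_dir, samples_csv="samples.csv", species="rhesus",
                      analysis_dir="analysis"):
    """Build the five read processing commands for one run as argument lists.

    Hypothetical helper: the output directory layout is assumed to match the
    example commands in this README.
    """
    run_id = PurePosixPath(run_dir).name  # last path component is the run ID
    base = PurePosixPath(analysis_dir)
    return [
        ["igseq", "getreads", str(run_dir)],
        ["igseq", "demux", "-s", samples_csv, str(base / "reads" / run_id)],
        ["igseq", "phix", str(base / "demux" / run_id)],
        ["igseq", "trim", "-s", samples_csv, "-S", species,
         str(base / "demux" / run_id)],
        ["igseq", "merge", str(base / "trim" / run_id)],
    ]


# Dry run: print the commands instead of executing them.
for cmd in readproc_commands("/seq/runs/211105_M05588_0469_000000000-JWV49"):
    print(" ".join(cmd))
```

To actually run the steps, each list could be passed to
subprocess.run(cmd, check=True) so that a failing step stops the chain.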

The samples.csv file is a table matching sample names to run IDs, barcode
IDs, and antibody chain types, like:

Sample,Run,BarcodeFwd,BarcodeRev,Type
wk12H,211105_M05588_0469_000000000-JWV49,1,1,gamma
wk12K,211105_M05588_0469_000000000-JWV49,2,2,kappa
wk24H,211105_M05588_0469_000000000-JWV49,3,3,gamma
wk24K,211105_M05588_0469_000000000-JWV49,4,4,kappa
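
Before starting a run it can help to sanity-check the samples table. The sketch
below reads the example above with Python's csv module, confirms the expected
header columns are present, and groups the rows by run ID. The function name
samples_by_run is hypothetical, and the column check reflects only the columns
shown in this example, not necessarily everything igseq itself validates.

```python
import csv
import io
from collections import defaultdict

# Columns taken from the example table in this README.
REQUIRED_COLUMNS = ("Sample", "Run", "BarcodeFwd", "BarcodeRev", "Type")


def samples_by_run(handle):
    """Group the rows of a samples table by run ID (hypothetical helper)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"samples table is missing columns: {missing}")
    runs = defaultdict(list)
    for row in reader:
        runs[row["Run"]].append(row)
    return dict(runs)


EXAMPLE = """\
Sample,Run,BarcodeFwd,BarcodeRev,Type
wk12H,211105_M05588_0469_000000000-JWV49,1,1,gamma
wk12K,211105_M05588_0469_000000000-JWV49,2,2,kappa
wk24H,211105_M05588_0469_000000000-JWV49,3,3,gamma
wk24K,211105_M05588_0469_000000000-JWV49,4,4,kappa
"""

runs = samples_by_run(io.StringIO(EXAMPLE))
for run_id, rows in runs.items():
    print(run_id, [row["Sample"] for row in rows])
```

For a file on disk, open it with open("samples.csv", newline="") and pass the
handle to the same function.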

The barcode IDs refer to the protocol's numbered barcodes. Each forward barcode
starts with a varying-length randomized prefix, shown as N:

$ igseq show barcodes
 Direction BC              Seq
         F  1     NNNNAACCACTA
         F  2    NNNNNAACTCTAA
         F  3   NNNNNNAAGGCCCT
         F  4  NNNNNNNAATATGTC
         F  5 NNNNNNNNAATCGTCA
...
         R  1         TAGTGGTT
         R  2         TTAGAGTT
         R  3         AGGGCCTT
         R  4         GACATATT
         R  5         TGACGATT
...
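
For the five barcode pairs listed above, each reverse barcode is the reverse
complement of the fixed part of the forward barcode with the same number, and
the N prefix grows by one base per barcode. The Python sketch below checks that
relationship and shows one way to match a forward barcode at the start of a
read by treating each N as a wildcard. The helper names are hypothetical, and
the exact-match logic is only an illustration, not a description of how
igseq demux assigns reads.

```python
import re

# Barcodes copied from the "igseq show barcodes" output above (first five only).
FWD = {1: "NNNNAACCACTA", 2: "NNNNNAACTCTAA", 3: "NNNNNNAAGGCCCT",
       4: "NNNNNNNAATATGTC", 5: "NNNNNNNNAATCGTCA"}
REV = {1: "TAGTGGTT", 2: "TTAGAGTT", 3: "AGGGCCTT",
       4: "GACATATT", 5: "TGACGATT"}


def revcmp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def barcode_regex(barcode):
    """Pattern anchored at the read start, with each N matching any base."""
    return re.compile("^" + barcode.replace("N", "[ACGTN]"))


def match_forward(read):
    """Return the IDs of forward barcodes that exactly match the read start."""
    return [bc for bc, seq in FWD.items() if barcode_regex(seq).match(read)]


for bc, seq in FWD.items():
    fixed = seq.lstrip("N")  # drop the randomized prefix
    print(bc, len(seq) - len(fixed), fixed, revcmp(fixed) == REV[bc])

# Made-up read: four random bases, then the fixed part of barcode 1.
print(match_forward("ACGTAACCACTAGGGTTT"))
```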

The chain type is used to select the appropriate constant region primer:

$ igseq show primers
 Species    Type                               Seq
   human   gamma  GCCAGGGGGAAGACCGATGGGCCCTTGGTGGA
   human   alpha GAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGG
   human      mu AGGAGACGAGGGGGAAAAGGGTTGGGGCGGATG
   human epsilon GCGGGTCAAGGGGAAGACGGATGGGCTCTGTGT
   human   delta CTGATATGATGGGGAACACATCCGGAGCCTTGG
   human   kappa GCGGGAAGATGAAGACAGATGGTGCAGCCACAG
   human  lambda GGCCTTGTTGGCTTGAAGCTCCTCAGAGGAGGG
  rhesus   gamma  GCCAGGGGGAAGACCGATGGGCCCTTGGTGGA
  rhesus   alpha GAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGG
  rhesus      mu GAGACGAGGGGGAAAAGGGTTGGGGCGGATGCA
  rhesus epsilon CGGGTCAAGGGGAAGACGGATGGGCTCTGTGTG
  rhesus   delta CTGATATGATGGGGAACACATCCGGAGCCTTGG
  rhesus   kappa GCGGGAAGATGAAGACAGATGGTGCAGCCACAG
  rhesus  lambda GGCCTTGTTGGCTTGAAGCTCCTCAGAGGAGGG
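
The selection amounts to a lookup keyed on species and chain type. The sketch
below parses the table above into a Python dictionary and also lists which
chain types share an identical primer between human and rhesus in that table.
The names load_primers and primer_for are hypothetical and are not part of
igseq.

```python
# Rows copied from the "igseq show primers" output above.
PRIMER_TABLE = """\
human gamma GCCAGGGGGAAGACCGATGGGCCCTTGGTGGA
human alpha GAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGG
human mu AGGAGACGAGGGGGAAAAGGGTTGGGGCGGATG
human epsilon GCGGGTCAAGGGGAAGACGGATGGGCTCTGTGT
human delta CTGATATGATGGGGAACACATCCGGAGCCTTGG
human kappa GCGGGAAGATGAAGACAGATGGTGCAGCCACAG
human lambda GGCCTTGTTGGCTTGAAGCTCCTCAGAGGAGGG
rhesus gamma GCCAGGGGGAAGACCGATGGGCCCTTGGTGGA
rhesus alpha GAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGG
rhesus mu GAGACGAGGGGGAAAAGGGTTGGGGCGGATGCA
rhesus epsilon CGGGTCAAGGGGAAGACGGATGGGCTCTGTGTG
rhesus delta CTGATATGATGGGGAACACATCCGGAGCCTTGG
rhesus kappa GCGGGAAGATGAAGACAGATGGTGCAGCCACAG
rhesus lambda GGCCTTGTTGGCTTGAAGCTCCTCAGAGGAGGG
"""


def load_primers(text):
    """Parse whitespace-delimited rows into {(species, type): sequence}."""
    primers = {}
    for line in text.splitlines():
        species, chain_type, seq = line.split()
        primers[(species, chain_type)] = seq
    return primers


PRIMERS = load_primers(PRIMER_TABLE)


def primer_for(species, chain_type):
    """Look up the constant region primer for a species and chain type."""
    try:
        return PRIMERS[(species, chain_type)]
    except KeyError:
        raise ValueError(f"no primer for {species} {chain_type}") from None


# Chain types whose primer is identical for human and rhesus in this table.
shared = sorted(
    chain for (species, chain) in PRIMERS
    if species == "human" and PRIMERS[("rhesus", chain)] == PRIMERS[("human", chain)]
)
print(primer_for("rhesus", "kappa"))
print(shared)
```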

See igseq/data/examples/readproc.sh for an example read processing workflow
from start to finish with a small set of reads.

Convenience Tools

Various convenience subcommands:

igblast: Run IgBLAST with a streamlined interface. This transparently handles
creating the database and auxiliary data files from rhesus or human
germline V(D)J references.
summarize: Summarize attributes of antibody sequences in a table via
IgBLAST.
vdj-gather: Gather VDJ sequences into one directory.
vdj-match: Find closest-matching germline VDJ sequences.
convert: Convert between FASTA/FASTQ/CSV/TSV formats.
identity: Calculate pairwise identities.
msa: Create multiple sequence alignments (using MUSCLE).
tree: Create and format phylogenetic trees (using FastTree).
list, show: List built-in reference data files, and show file contents with
support for pretty-printing some common formats.