population structural variant calling with smoove
Step by step guide on how to use my pipelines
Click here for an introduction to Snakemake
This is a pipeline to perform structural variant calling in a population using Smoove. It also runs VEP and performs PCA.
In addition to the VCF with the SVs, you also get a .tsv file with some summarized information on the SVs: it includes allele frequency per population, as well as VEP annotation and depth fold change as described in duphold:
DHBFC: fold-change for the variant depth relative to bins in the genome with similar GC-content.
DHFFC: fold-change for the variant depth relative to Flanking regions.
![]() |
---|
Pipeline workflow |
OUTDIR: /path/to/output
READS_DIR: /path/to/reads/ # don't add the reads files, just the directory where they are
SAMPLE_LIST: /path/to/file
REFERENCE: /path/to/assembly
CONTIGS_IGNORE: /path/to/file
SPECIES: <species_name>
PREFIX: <output name>
NUM_CHRS: <number of chromosomes>
BWA_MEM_M: Y/N
sample1,sample1.bam,Pop1
sample2,sample2.bam,Pop1
sample3,sample3.bam,Pop2
sample4,sample4.bam,Pop2
Tip: use the name of the bam file without the .bam extension as the sample name. Ex: from sample1.bam to sample1
bwa mem
using the -M
parameter and you want split read support in your VCF you need to run an extra step. For this write Y
.BWA_MEM_M: Y
an extra step will be done: split_disc_reads to create the split and disc bam files before smoove call runsBWA_MEM_M: N
split_disc_reads will not run and the pipeline will start with the smoove_call stepIf you want the results to be written to this directory (not to a new directory), comment out or remove
OUTDIR: /path/to/outdir
This pipeline uses VEP in offline mode, which increases performance. In order to use it in this mode, the cache for the species used needs to be installed:
Check if the cache file for your species already exist in /lustre/nobackup/SHARED/cache/
. If it doesn’t, create it with
/usr/bin/perl /cm/shared/apps/SHARED/ensembl-vep/INSTALL.pl --CACHEDIR /lustre/nobackup/SHARED/cache/ --AUTO c -n --SPECIES <species>
When multiple assemblies are found you need to run it again with --ASSEMBLY <assembly name>
, where “assembly name” is the name of the assembly you want to use.
You can install VEP with
conda install -c bioconda ensembl-vep
and install the cache with
vep_install --CACHEDIR <where/to/install/cache> --AUTO c -n --SPECIES <species>
When multiple assemblies are found you need to run it again with --ASSEMBLY <assembly name>
, where “assembly name” is the name of the assembly you want to use.
In the Snakefile, in rule run_vep
, replace /cm/shared/apps/SHARED/ensembl-vep/vep
with vep
First load R:module load R/3.6.2
Enter the R environment by writing R
and clicking enter. Install the packages:
list.of.packages <- c("optparse", "data.table", "ggplot2")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
If you get an error like this:
Warning in install.packages(new.packages) :
'lib = "/cm/shared/apps/R/3.6.2/lib64/R/library"' is not writable
Follow the instructions on how to install R packages locally here and try to install the packages again.
What you do with the results from this structural variant calling pipeline depends on your research question: a possible next step would be to explore the {prefix}_DUP_DEL_INV_table.tsv file and look at the largest SVs found (sort by SVLEN) or at a specific effect in the ANNOTATION column, such as “frameshift_variant”.
See VEP effect descriptions for a short description of the effects annotated by VEP
The *DEPTH field in the vcf has six fields, corresponding to the average depth across all samples.
DEPTH=(DHBFC_1/1, DHBFC_0/1, DHBFC_0/0, DHFFC_1/1, DHFFC_0/1, DHFFC_0/0)
Depth fold change as described in duphold:
DHBFC: fold-change for the variant depth relative to bins in the genome with similar GC-content.
DHFFC: fold-change for the variant depth relative to Flanking regions.
These fields are also in the {prefix}_DUP_DEL_INV_table.tsv
file