Pipeline overview
Dependencies
-
Conda is an open-source package and environment management system. V-pipe uses it to obtain reproducible environments automatically and to simplify installation of the individual pipeline components through the Bioconda channel, a distribution of bioinformatics software.
See the conda documentation for installation instructions.
-
Snakemake is the central workflow and dependency manager of V-pipe. It determines the order in which individual tools are invoked and checks that programs do not exit unexpectedly.
Once you have conda installed, you can use it to obtain Snakemake (this is the recommended way to install it). Snakemake will subsequently obtain all the components that V-pipe requires.
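As a rough illustration of what "determining the order in which individual tools are invoked" means, the sketch below orders a handful of stages so that each one runs only after its prerequisites. The stage names and the dependency map are hypothetical and are not V-pipe's actual rule names. Snakemake itself infers this graph from the input and output files declared by each rule, whereas this sketch is given the dependencies explicitly.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stage dependencies: stage -> stages that must finish first.
# Illustrative only; these are not V-pipe's real rule names.
DEPENDENCIES = {
    "quality_control": set(),
    "trim_reads": set(),
    "align_reads": {"trim_reads"},
    "call_snvs": {"align_reads"},
    "reconstruct_haplotypes": {"align_reads"},
}


def execution_order(dependencies):
    """Return one valid order in which every stage follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
```

Several orders can be valid at once (for example, `call_snvs` and `reconstruct_haplotypes` are independent of each other), which is what allows a workflow manager to run such steps in parallel.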
-
FastQC gives an overview of the raw sequencing data. Flowcells that were overloaded or otherwise failed during sequencing can easily be identified with FastQC.
-
Trimming and clipping of reads is performed by PRINSEQ, a versatile raw-read processor with many customization options.
-
Vicuna is a de novo assembler designed for generating rough reference contigs from viral NGS data. It can handle the heterogeneity inherent in such data, including high single-base variation and structural variants.
-
InDelFixer is a sensitive aligner that performs a full Smith-Waterman alignment against a reference. It is used to polish the consensus sequence.
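For readers unfamiliar with Smith-Waterman, the following is a minimal sketch of the local alignment score computation. It is not InDelFixer's implementation: the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration, and a simple linear gap penalty is used.

```python
def smith_waterman_score(query, reference, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, so no traceback is available.
    """
    prev = [0] * (len(reference) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, r in enumerate(reference, start=1):
            diag = prev[j - 1] + (match if q == r else mismatch)
            up = prev[j] + gap        # gap in the reference
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)  # 0 lets a new local alignment start
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

The clamp at zero is what makes the alignment local: a poorly matching prefix never drags down the score of a well-matching region later in the read.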
-
ConsensusFixer is also used to polish the consensus sequence. It computes a consensus sequence with wobbles, ambiguous bases, and in-frame insertions from an NGS read alignment.
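To make the idea of a consensus "with wobbles" concrete, here is a minimal sketch that turns per-position base observations into IUPAC ambiguity codes. It is not ConsensusFixer's algorithm: the `min_fraction` threshold and the input format (one string of observed bases per alignment column) are assumptions for illustration, and insertions are ignored.

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_with_wobbles(columns, min_fraction=0.25):
    """Call one IUPAC symbol per alignment column.

    columns: list of strings, each holding the bases observed at one position.
    A base contributes to the call if its frequency is at least min_fraction.
    """
    consensus = []
    for column in columns:
        counts = Counter(base for base in column.upper() if base in "ACGT")
        total = sum(counts.values())
        if total == 0:
            consensus.append("N")  # no coverage at this position
            continue
        kept = frozenset(b for b, n in counts.items() if n / total >= min_fraction)
        if not kept:
            # Threshold excluded every base: fall back to the most frequent one.
            kept = frozenset(counts.most_common(1)[0][0])
        consensus.append(IUPAC[kept])
    return "".join(consensus)
```

For example, a column with equal support for `A` and `G` is reported as `R` rather than forcing a majority call, which preserves evidence of within-sample diversity.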
-
We align the curated NGS data using our custom aligner ngshmmalign, which takes structural variants into account. It produces multiple consensus sequences that include either majority bases or ambiguous bases.
-
To detect specific cross-contamination with other probes, the Burrows-Wheeler Aligner (BWA) is used. It quickly yields estimates of foreign genomic material in an experiment. Additionally, it can be used as an alternative aligner to ngshmmalign.
-
To standardise multiple samples to the same reference genome (say HXB2 for HIV-1), the multiple sequence aligner MAFFT is employed. The multiple sequence alignment helps in determining regions of low conservation and thus makes standardisation of alignments more robust.
-
Samtools and bcftools are the Swiss Army knife of alignment postprocessing and diagnostics. bcftools is also used to generate consensus sequences with indels. Samtools can also be used to trim primers.
-
iVar is used to trim primers.
-
The Short Reads Assembly into Haplotypes (ShoRAH) program for inferring viral haplotypes from NGS data is used to perform local haplotype reconstruction for heterogeneous viral populations by using a Gibbs sampler.
-
LoFreq (version 2) is an SNV and indel caller for next-generation sequencing data, and can be used as an alternative engine for SNV calling.
-
PredictHaplo performs global haplotype inference using a propagating Dirichlet process mixture model.
-
SAVAGE is a tool for viral haplotype reconstruction. It can be executed in two modes: (1) using a reference sequence, or (2) assembling viral haplotypes de novo. We employ the latter.
-
HaploClique, which performs viral quasispecies assembly via maximal clique finding, is used as another selectable engine for global haplotype reconstruction of heterogeneous viral populations.
-
QuasiRecomb
QuasiRecomb performs local and global haplotype reconstruction for heterogeneous viral populations by using a hidden Markov model.
-
We perform genomic liftovers to standardised reference genomes using our in-house Python library of utilities for rewriting alignments.
-
Picard provides Java tools for working with NGS data in the BAM format.