This tutorial shows the basics of how to interact with V-pipe. A recording of our webinar covering the subject is available at the bottom of the current page.

For the purpose of this tutorial, we will work with the sars-cov2 branch, which is adapted for the SARS-CoV-2 virus.

Organizing Data:

V-pipe expects the input samples to be organized in a two-level hierarchy:

  • At the first level, input files are grouped by sample (e.g., patients or biological replicates of an experiment).
  • The second level distinguishes datasets belonging to the same sample (e.g., sampling dates).
  • Inside that second-level directory, the sub-directory raw_data holds the sequencing data in FASTQ format (optionally compressed with GZip).
  • Paired-end reads need to be in split files with _R1 and _R2 suffixes.
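The layout described above can be sketched in shell. This is only an illustration: the sample name patient1, the dataset name 20200102 and the file prefix reads are hypothetical placeholders, the FASTQ files are empty stand-ins, and the demo runs in a throwaway directory so that no existing data is touched.

```shell
# Work in a scratch directory so the demo does not touch real data.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Hypothetical names: replace with your own sample and dataset identifiers.
sample=patient1      # first level: the sample
dataset=20200102     # second level: e.g. the sampling date

# Create sample/dataset/raw_data in one go.
mkdir -p "samples/$sample/$dataset/raw_data"

# Paired-end reads go into split files with _R1 and _R2 suffixes
# (empty placeholder files here, real FASTQ files in practice).
touch "samples/$sample/$dataset/raw_data/reads_R1.fastq" \
      "samples/$sample/$dataset/raw_data/reads_R2.fastq"

# List the resulting hierarchy.
find samples | sort
```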

Preparing a small dataset

You can run the first test on your workstation or a good laptop.

First, you need to prepare the data so that it follows this structure:

samples
├── SRR10903401
│   └── 20200102
│       └── raw_data
│           ├── wuhan2_R1.fastq
│           └── wuhan2_R2.fastq
└── SRR10903402
    └── 20200102
        └── raw_data
            ├── wuhan1_R1.fastq
            └── wuhan1_R2.fastq
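Before going further, it can help to verify that every _R1 file has its _R2 mate. The helper below is a minimal sketch, not part of V-pipe: the function name check_pairs is made up for this tutorial, and it only looks for _R1 files that lack an _R2 partner (not the reverse).

```shell
# check_pairs ROOT
# Report every *_R1.fastq* file below a raw_data directory whose *_R2*
# mate is missing. Returns 0 when all mates exist, 1 otherwise.
check_pairs() {
    find "$1" -path '*/raw_data/*' -name '*_R1.fastq*' | sort | (
        status=0
        while IFS= read -r r1; do
            base=${r1%_R1.fastq*}        # path up to the _R1 suffix
            ext=${r1#"${base}_R1"}       # .fastq or .fastq.gz
            r2="${base}_R2${ext}"
            if [ ! -e "$r2" ]; then
                echo "missing mate for $r1" >&2
                status=1
            fi
        done
        exit "$status"
    )
}

# Only run the check when a samples directory is present.
if [ -d samples ]; then
    check_pairs samples && echo "all read pairs are complete"
fi
```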

Install V-pipe

V-pipe uses the Bioconda [1] bioinformatics software repository for all its pipeline components. The pipeline itself is written using snakemake [2].

For advanced users: If you are fluent with these tools, you can:

The present tutorial will show simplified commands that automate much of this process.

To deploy V-pipe, you can use the installation script with the following parameters:

curl -O 'https://raw.githubusercontent.com/cbg-ethz/V-pipe/master/utils/quick_install.sh'
bash quick_install.sh -b sars-cov2 -p testing -w work
cd ./testing/work/

Running V-pipe

Copy the samples directory you created in the step "Preparing a small dataset" to this working directory. You can display the directory structure with tree samples or find samples [3].

Check the parameters of the vpipe.config file with your favorite editor (vim, emacs, nano, butterflies, etc.). For SNVs and local (windowed) haplotype reconstruction, you will need at least the following options:

[input]
reference = references/NC_045512.2.fasta

[output]
snv = True
local = True
global = False

[general]
aligner = bwa

Check what will be executed:

./vpipe --dryrun

As it is your first run of V-pipe, this will also generate the sample collection table. Check samples.tsv in your editor.

Note that the demo files you downloaded contain reads of length 150 only. V-pipe's default parameters are optimized for reads of length 250, so add the read length as a third column in the tab-separated file:

SRR10903401	20200102	150
SRR10903402	20200102	150
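If you prefer not to edit the file by hand, the third column can be added with awk. This is a minimal sketch: the function name add_read_length is made up for this tutorial, and it assumes that every row of the table should receive the same read length.

```shell
# add_read_length FILE LENGTH
# Print the tab-separated table FILE (sample, dataset) with LENGTH written
# as the third column of every row. Rows with fewer than two columns are
# skipped; an existing third column is overwritten.
add_read_length() {
    awk -v len="$2" 'BEGIN { FS = OFS = "\t" } NF >= 2 { print $1, $2, len }' "$1"
}

# Demo on a temporary copy of the two rows used in this tutorial.
demo_table=$(mktemp)
printf 'SRR10903401\t20200102\nSRR10903402\t20200102\n' > "$demo_table"
add_read_length "$demo_table" 150
rm -f "$demo_table"
```

To apply it to the real table, write the result to a new file first and then replace the original, for example add_read_length samples.tsv 150 > samples.tsv.new followed by mv samples.tsv.new samples.tsv.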

Tips: Always check the content of the samples.tsv file.

  • If you didn’t use the correct structure, this file might end up empty or some entries might be missing.
  • You can safely delete it and re-run the --dryrun to regenerate it.
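One quick way to spot missing entries is to compare the number of raw_data directories with the number of rows in the table. The helper below is a rough sketch under the assumption that samples.tsv holds exactly one row per dataset; the name check_sample_table is made up for this tutorial.

```shell
# check_sample_table ROOT TABLE
# Compare the number of raw_data directories under ROOT with the number of
# non-empty rows in TABLE; print both counts and return 1 if they differ.
check_sample_table() {
    n_dirs=$(( $(find "$1" -type d -name raw_data | wc -l) ))
    n_rows=$(( $(grep -c . "$2") ))
    echo "datasets on disk: $n_dirs, rows in table: $n_rows"
    [ "$n_dirs" -eq "$n_rows" ]
}

# Only run the check when both inputs are present in the current directory.
if [ -d samples ] && [ -f samples.tsv ]; then
    check_sample_table samples samples.tsv || echo "samples.tsv looks incomplete" >&2
fi
```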

Run the V-pipe analysis (the necessary dependencies will be downloaded and installed in conda environments managed by snakemake):

./vpipe -p --cores 2

Tips: you can learn more about running V-pipe in the following wiki sections:

Output

The Wiki contains an overview of the output files. The output of the SNV calling is aggregated in a standard VCF file, located in samples/{hierarchy}/variants/SNVs/snvs.vcf. You can open it with your favorite VCF tools for visualisation or downstream processing. It is also available in a tabular format in samples/{hierarchy}/variants/SNVs/snvs.csv.

Note: The visualization and reporting features are still being continuously updated.

Expected output

The small dataset that we used in this tutorial section was analyzed in doi:10.1093/nsr/nwaa036. The results of the original analysis (using bwa, samtools mpileup, and bcftools) are displayed in Table 2 of the article:

Accession number  Genomic position  Ref allele  Alt allele  Ref reads  Alt reads  Location_date    GISAID ID
SRR10903401       1821              G           A           52         5          WH_2020/01/02.a  EPI_ISL_406716
SRR10903401       19164             C           T           40         12         WH_2020/01/02.a  EPI_ISL_406716
SRR10903401       24323             A           C           102        67         WH_2020/01/02.a  EPI_ISL_406716
SRR10903401       26314             G           A           15         2          WH_2020/01/02.a  EPI_ISL_406716
SRR10903401       26590             T           C           10         2          WH_2020/01/02.a  EPI_ISL_406716
SRR10903402       11563             C           T           164        26         WH_2020/01/02.b  EPI_ISL_406717

Using either the VCF or CSV files, compare with the results produced by V-pipe (with bwa and ShoRAH).
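To make the comparison easier, the read counts from the table above can be turned into alternative-allele frequencies. The following is a minimal sketch: the function names expected_snvs and alt_frequencies are made up for this tutorial, and the frequency is simply computed as alt reads divided by the sum of ref and alt reads.

```shell
# Read counts copied from the table above:
# accession, position, ref allele, alt allele, ref reads, alt reads.
expected_snvs() {
    cat <<'EOF'
SRR10903401 1821 G A 52 5
SRR10903401 19164 C T 40 12
SRR10903401 24323 A C 102 67
SRR10903401 26314 G A 15 2
SRR10903401 26590 T C 10 2
SRR10903402 11563 C T 164 26
EOF
}

# alt_frequencies: read the six columns above on standard input and print
# accession, position, substitution and alt reads / (ref reads + alt reads).
alt_frequencies() {
    awk '{ printf "%s\t%s\t%s>%s\t%.3f\n", $1, $2, $3, $4, $6 / ($5 + $6) }'
}

expected_snvs | alt_frequencies
```

The positions called by V-pipe can then be listed from the VCF, for instance with grep -v '^#' samples/{hierarchy}/variants/SNVs/snvs.vcf | cut -f 2,4,5, assuming the standard VCF column order (CHROM, POS, ID, REF, ALT).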

Swapping components

The default configuration uses ShoRAH to call the SNVs and to reconstruct the local (windowed) haplotypes.

Components can be swapped simply by changing the vpipe.config file. For example, to call SNVs using lofreq:

[output]
snv = True
local = False

[general]
snv_caller = lofreq

Cluster deployment

It is possible to ask snakemake to submit jobs on a cluster using the batch submission command-line interface of your cluster.

Platform LSF by IBM is one of the popular systems you might find (others include SLURM and Grid Engine).

To deploy on the cluster:

wget 'https://raw.githubusercontent.com/cbg-ethz/V-pipe/master/utils/quick_install.sh'
bash quick_install.sh -b sars-cov2 -p $SCRATCH -w working
cd $SCRATCH/working/

Tips: As V-pipe for SARS-CoV-2 matures, it will be possible to download snapshots frozen at a specific version. This enables more reproducible results. To specify a release, use the -r option:

wget 'https://raw.githubusercontent.com/cbg-ethz/V-pipe/master/utils/quick_install.sh'
bash quick_install.sh -r sars-cov2-snapshot-20200406 -p $SCRATCH -w working
cd $SCRATCH/working/

This will download the tarball sars-cov2-snapshot-20200406.tar.gz and uncompress it into a directory called V-pipe-sars-cov2-snapshot-20200406.

Running V-pipe on the cluster

In the working directory, create a samples sub-directory and populate it. Check its structure with tree or find. Perform the necessary adjustments to vpipe.config.

To run V-pipe on a cluster:

Tips: There are snakemake parameters for conda that can help management of dependencies:

  • using --conda-create-envs-only downloads the dependencies only, without running the pipeline itself.
  • using --conda-prefix {DIR} stores the conda environments of dependencies in a common directory (so they can be shared and reused between multiple instances of V-pipe).

When using V-pipe in production environments, plan the -p prefix, -w working and --conda-prefix environments directories according to the cluster quotas and time limits.

# Download everything in advance
./vpipe --conda-prefix $SCRATCH/snake-envs --cores 1 --conda-create-envs-only

# Cluster LSF submitting
./vpipe --conda-prefix $SCRATCH/snake-envs -p --cluster 'bsub' --jobs 2

# Using bsub on the master job too, instead of running it on the login node
bsub ./vpipe --conda-prefix $SCRATCH/snake-envs -p --cluster 'bsub' --jobs 2

# Alternative for running everything from a single interactive SSH node
bsub -I <<<"./vpipe --conda-prefix $SCRATCH/snake-envs -p --cores 2"
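On clusters that do not run LSF, the submission command passed to --cluster has to be changed. The sketch below only prints a suggested command line and does not submit anything. It rests on assumptions: the helper name pick_submit_cmd is made up for this tutorial, sbatch and qsub are taken as the usual submission commands of SLURM and Grid Engine, and real clusters typically need extra resource options (queue, memory, run time) added to the submission command.

```shell
# pick_submit_cmd
# Print the first batch-submission command found in PATH
# (bsub for LSF, sbatch for SLURM, qsub for Grid Engine).
# Returns 1 when none of them is available.
pick_submit_cmd() {
    for cmd in bsub sbatch qsub; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            return 0
        fi
    done
    return 1
}

# Print (do not run) the command line that would be used on this machine.
if submit=$(pick_submit_cmd); then
    echo "./vpipe --conda-prefix \$SCRATCH/snake-envs -p --cluster '$submit' --jobs 2"
else
    echo "no batch scheduler found; run locally with: ./vpipe -p --cores 2"
fi
```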

Tips: See the V-pipe documentation for more cluster commands.

Check the other options for running snakemake on clusters if you need more advanced uses.

Webinar: Applying V-pipe to SARS-Coronavirus-2 data

This webinar was recorded by the SIB on June 22nd [4].


  1. Grüning, Björn, Ryan Dale, Andreas Sjödin, Brad A. Chapman, Jillian Rowe, Christopher H. Tomkins-Tinch, Renan Valieris, the Bioconda Team, and Johannes Köster. 2018. "Bioconda: Sustainable and Comprehensive Software Distribution for the Life Sciences". Nature Methods. doi:10.1038/s41592-018-0046-7

  2. Köster, Johannes, and Sven Rahmann. 2012. "Snakemake – a scalable bioinformatics workflow engine". Bioinformatics 28(19):2520–2522. doi:10.1093/bioinformatics/bts480

  3. find samples | sed -e "s/[^-][^\/]*\// |/g" -e "s/|\([^ ]\)/|-\1/" is overly long but produces slightly prettier output than find samples 

  4. Image of an Illumina MiSeq sequencer used under license from Illumina, Inc. All Rights Reserved.