sr2silo
Wrangele BAM nucleotide alignments to cleartext alignments
General Use: Convert Nucleotide Alignment Reads - CIGAR in .BAM to Cleartext JSON
sr2silo can convert millions of Short-Read nucleotide reads in the form of .bam CIGAR
alignments to cleartext alignments compatible with LAPIS-SILO v0.8.0+. It gracefully extracts insertions
and deletions. Optionally, sr2silo can translate and align each read using diamond / blastX, handling insertions and deletions in amino acid sequences as well.
Your input .bam/.sam with one line as:
sr2silo outputs per read a JSON (compatible with LAPIS-SILO v0.8.0+):
{
"readId": "AV233803:AV044:2411515907:1:10805:5199:3294",
"sampleId": "A1_05_2024_10_08",
"batchId": "20241024_2411515907",
"samplingDate": "2024-10-08",
"locationName": "Lugano (TI)",
"locationCode": "5",
"sr2siloVersion": "1.3.0",
"main": {
"sequence": "CGGTTTCGTCCGTGTTGCAGCCG...GTGTCAACATCTTAAAGATGGCACTTGTG",
"insertions": ["10:ACTG", "456:TACG"],
"offset": 4545
},
"S": {
"sequence": "MESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGV",
"insertions": ["23:A", "145:KLM"],
"offset": 78
},
"ORF1a": {
"sequence": "XXXMESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLV",
"insertions": ["2323:TG", "2389:CA"],
"offset": 678
},
"E": null,
"M": null,
"N": null,
"ORF1b": null,
"ORF3a": null,
"ORF6": null,
"ORF7a": null,
"ORF7b": null,
"ORF8": null,
"ORF10": null
}
The total output is handled in an .ndjson.zst.
Resource Requirements
When running sr2silo, particularly the process-from-vpipe command, be aware of memory and storage requirements:
- Standard configuration uses 8GB RAM and one CPU core
- Processing batches of 100k reads requires ~3GB RAM plus ~3GB for Diamond
- Temporary storage needs (especially on clusters) can reach 30-50GB
For detailed information about resource requirements, especially for cluster environments, please refer to the Resource Requirements documentation.
Wrangling Short-Read Genomic Alignments for SILO Database
Originally this was started for wrangling short-read genomic alignments from wastewater-sampling, into a format for easy import into Loculus and its sequence database SILO.
sr2silo is designed to process nucleotide alignments from .bam files with metadata, translate and align reads in amino acids, gracefully handling all insertions and deletions and upload the results to the backend LAPIS-SILO v0.8.0+.
Output Format for LAPIS-SILO v0.8.0+:
- Metadata fields use camelCase naming (e.g., readId, sampleId, batchId) to align with Loculus standards
- Metadata fields are at the root level (no nested "metadata" object)
- Genomic segments use a structured format with sequence, insertions, and offset fields
- The main nucleotide segment is required and contains the primary alignment
- Gene segments (S, ORF1a, etc.) contain amino acid sequences or null if empty
- Insertions use the format "position:sequence" (e.g., "123:ACGT")
Output Schema Configuration:
The output schema is defined in src/sr2silo/silo_read_schema.py using Pydantic models with field aliases for camelCase output. To modify the metadata fields:
- Edit
src/sr2silo/silo_read_schema.py- Add/modify fields inReadMetadataclass - Update
resources/silo/database_config.yaml- Ensure field names match the Pydantic aliases - Run validation:
python tests/test_database_config_validation.py
The validation ensures your Pydantic schema matches the SILO database configuration.
For the V-Pipe to Silo implementation we include the following metadata fields at the root level:
{
"readId": "AV233803:AV044:2411515907:1:10805:5199:3294",
"sampleId": "A1_05_2024_10_08",
"batchId": "20241024_2411515907",
"samplingDate": "2024-10-08",
"locationName": "Lugano (TI)",
"locationCode": "5",
"sr2siloVersion": "1.3.0"
}
Setting up the repository
To build the package and maintain dependencies, we use Poetry. In particular, it's good to install it and become familiar with its basic functionalities by reading the documentation.
Installation
sr2silo can be installed either from Bioconda or from source.
Install from Bioconda
The easiest way to install sr2silo is through the Bioconda channel:
# Add necessary channels if you haven't already
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# Install sr2silo
conda install sr2silo
Install from Source
For development purposes or to install the latest version, you can install from source using Poetry:
The project uses a modular environment system to separate core functionality, development requirements, and workflow dependencies. Environment files are located in the environments/ directory:
Core Environment Setup
For basic usage of sr2silo:
This creates the core conda environment with essential dependencies and installs the package using Poetry.Development Environment
For development work:
This command sets up the development environment with Poetry.Workflow Environment
For working with the snakemake workflow:
This creates an environment specifically configured for running the sr2silo in snakemake workflows.All Environments
You can set up all environments at once:
Additional Setup for Development
After setting up the development environment:
Run Tests
orUsage
sr2silo follows a two-step workflow:
- Process data:
sr2silo process-from-vpipe --help - Submit to Loculus:
sr2silo submit-to-loculus --help
# Example: Process V-Pipe data
sr2silo process-from-vpipe \
--input-file input.bam \
--sample-id SAMPLE_001 \
--timeline-file timeline.tsv \
--output-fp output.ndjson
# Example: Submit to Loculus (use environment variables for credentials)
export KEYCLOAK_TOKEN_URL=https://auth.example.com/token
export BACKEND_URL=https://api.example.com/submit
export GROUP_ID=123
export USERNAME=your-username
export PASSWORD=your-password
sr2silo submit-to-loculus --processed-file output.ndjson.zst
Note: Use environment variables for credentials to avoid exposing sensitive information in command history.
Note: The --lapis-url parameter is optional. If not provided, sr2silo uses default SARS-CoV-2 references (NC_045512.2). See sr2silo process-from-vpipe --help for details.
Environment Variable Configuration
sr2silo supports flexible configuration through environment variables, making it easy to use in different deployment scenarios including conda packages and pip installations.
Note: CLI parameters override environment variables
Common configuration via environment variables:
# Authentication credentials (recommended approach for security)
export KEYCLOAK_TOKEN_URL=https://auth.example.com/token
export BACKEND_URL=https://backend.example.com/api
export GROUP_ID=123
export USERNAME=your-username
export PASSWORD=your-password
# Run with environment variables set
sr2silo process-from-vpipe \
--input-file input.bam \
--sample-id SAMPLE_001 \
--timeline-file /path/to/timeline.tsv \
--output-fp output.ndjson
# Submission using environment variables for credentials
sr2silo submit-to-loculus \
--processed-file output.ndjson.zst