Processing Module
Core functions for BAM processing, translation, and alignment.
Data Structures
AlignedRead
sr2silo.process.AlignedRead
Class to represent an aligned read.
__init__(read_id, unaligned_nucleotide_sequence, aligned_nucleotide_sequence, aligned_nucleotide_sequence_offset, nucleotide_insertions, amino_acid_insertions, aligned_amino_acid_sequences, metadata=None)
Initialize with a AlignedRead object.
__str__()
toString method as pretty JSON string.
get_amino_acid_insertions()
Return the amino acid insertions.
get_metadata()
Return the metadata.
set_metadata(metadata)
Set the metadata. Keep as ReadMetadata object to preserve alias information.
set_nuc_insertion(nuc_insertion)
Append a nucleotide insertion to the list of nucleotide insertions.
to_dict()
Return a dictionary / json representation of the object.
Uses camelCase field names (via Pydantic aliases) for metadata fields to match SILO database schema.
to_silo_json(indent=False)
Validate the aligned read dict using the new SILO schema and return a nicely formatted JSON string conforming to the DB requirements.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
indent
|
bool
|
Whether to indent the JSON string, True for pretty print. |
False
|
Gene
sr2silo.process.Gene
Class to represent a gene with a name and a length.
__init__(gene_name, gene_length)
Initialize with a gene name and a gene length.
to_dict()
Return a dictionary with gene name and gene length.
GeneSet
sr2silo.process.GeneSet
Class to represent a set of genes for a pathogen
__init__(genes)
Initialize with a list of genes.
get_gene(gene_name)
Return a gene by name.
get_gene_length(gene_name)
Return the length of a gene.
get_gene_name_list()
Return a list of genes.
set_gene_length(gene_name, gene_length)
Set the length of a gene.
to_dict()
Return a dictionary with gene names as keys and gene length as values.
NucInsertion
sr2silo.process.NucInsertion
Bases: Insertion
A nuclotide insertion.
AAInsertion
sr2silo.process.AAInsertion
Bases: Insertion
An amino acid insertion.
AAInsertionSet
sr2silo.process.AAInsertionSet
Class to represent the set of amino acid insertions for a full set of genes.
__eq__(other)
Compare two AAInsertionSet objects by their values.
This compares the dictionaries generated by to_dict() method, ensuring a value-based comparison rather than identity-based.
__init__(genes)
Initialize with an empty set of insertions for each gene.
from_dict(data)
staticmethod
Create an AAInsertionSet object from a dictionary.
set_insertions_for_gene(gene_name, aa_insertions)
Set the amino acid insertions for a particular gene.
to_dict()
Return a dictionary with gene names as keys.
BAM Conversion
sr2silo.process.bam_to_sam(bam_file, sam_file)
Converts a BAM file to SAM format and writes it to the specified output file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bam_file
|
Path
|
Path to the input BAM file. |
required |
sam_file
|
Path
|
Path to the output SAM file. |
required |
sam_file
|
Path
|
Path to the output SAM file. |
required |
sr2silo.process.sam_to_bam(sam_file, bam_file)
Convert a SAM file to a BAM file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
sam_file
|
Path
|
Path to the input SAM file. |
required |
bam_file
|
Path
|
Path to the output BAM file. |
required |
sr2silo.process.bam_to_fasta_query(bam_file, fasta_file)
Convert a BAM file to a FASTA file. Bluntly resolved the sam to fasta, removing soft clippings, keeping insertions and deletions, ignoring skipped regions and paddings.
Outputs the sequence as it is read from the molecule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bam_file
|
Path
|
Path to the input BAM file. |
required |
fasta_file
|
Path
|
Path to the output FASTQ file. |
required |
sr2silo.process.sort_bam_file(input_bam_path, output_bam_path, sort_by_qname=False)
Sorts a BAM file using pysam.sort by alignment positions by default, but can also sort by query name if specified.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input_bam_path
|
Path
|
Path to the input BAM file. |
required |
output_bam_path
|
Path
|
Path to the output sorted BAM file. |
required |
sort_by_qname
|
bool
|
If True, sorts by query name. Defaults to False. |
False
|
sr2silo.process.sort_and_index_bam(input_bam_fp, output_bam_fp)
Function to sort and index the input BAM file,
implements checks to see if the input BAM file is already sorted and indexed.
If not sorted and indexed, the function will sort and index the input BAM file.
sr2silo.process.sort_sam_by_qname(input_sam_path, output_sam_path)
Sorts a sam file using pysam.sort command by query name. Args: input_bam_path (Path): Path to the input BAM file. output_bam_path (Path): Path to the output sorted BAM file.
sr2silo.process.get_gene_set_from_ref(reference_fp)
Load the gene ref fasta and create a GeneSet with gene short names and lengths.
Translation and Alignment
sr2silo.process.nuc_to_aa_alignment(in_nuc_alignment_fp, in_aa_db_fp, out_aa_alignment_fp)
Function to convert files and translate and align with Diamond / blastx.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
in_nuc_alignment_fp
|
Path
|
Path to the input nucleotide alignment file. |
required |
in_aa_db_fp
|
Path
|
Path to the input amino acid diamond database file. |
required |
out_aa_alignment_fp
|
Path
|
Path to the output amino acid alignment file. |
required |
Returns:
| Type | Description |
|---|---|
None
|
None |
Description
Uses Diamond with the settings: --evalue 1 --gapopen 6 --gapextend 2 --outfmt 101 --matrix BLOSUM62 --unal 0 --max-hsps 1 --block-size 0.5
sr2silo.process.parse_translate_align(nuc_reference_fp, aa_reference_fp, nuc_alignment_fp, aa_db_fp, target_reference=None)
Parse nucleotides, translate and align amino acids the input files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
nuc_reference_fp
|
Path
|
Path to nucleotide reference FASTA file. |
required |
aa_reference_fp
|
Path
|
Path to amino acid reference FASTA file. |
required |
nuc_alignment_fp
|
Path
|
Path to nucleotide alignment BAM file. |
required |
aa_db_fp
|
Path
|
Path to Diamond database file. |
required |
target_reference
|
str | None
|
Filter reads to only include those aligned to this reference accession. If None, all reads are processed. |
None
|
Returns:
| Type | Description |
|---|---|
Dict[str, AlignedRead]
|
Dictionary mapping read IDs to AlignedRead objects. |
sr2silo.process.parse_translate_align_in_batches(nuc_reference_fp, aa_reference_fp, nuc_alignment_fp, metadata_fp, output_fp, chunk_size=100000, write_chunk_size=20, target_reference=None)
Parse nucleotides, translate and align amino acids in batches.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
nuc_reference_fp
|
Path
|
Path to the nucleotide reference genome - .fasta |
required |
aa_reference_fp
|
Path
|
Path to the amino acid reference genome - .fasta |
required |
nuc_alignment_fp
|
Path
|
Path to the nucleotide alignment file - .bam |
required |
metadata_fp
|
Path
|
Path to the metadata file - .json |
required |
output_fp
|
Path
|
Path to the output file - .ndjson |
required |
chunk_size
|
int
|
Size of each batch, in number of reads. |
100000
|
write_chunk_size
|
int
|
Size of each write batch. |
20
|
target_reference
|
str | None
|
Filter reads to only include those aligned to this reference accession. Should match @SQ SN field in BAM header. If None, all reads are processed. |
None
|
Returns:
| Name | Type | Description |
|---|---|---|
Path |
Path
|
The path to the output file with the correct suffix. |
Resources
A chunk_size of 100000 reads is a good starting point for most cases. This will take about 3.7 GB ram for Covid Genomes and 1-2 minutes to process.
Constraints
No sorting constraints are enforced on the input files. The output file is compressed with zstd.
Constraints
No sorting constraints are enforced on the input files. The output file is compressed with zstd.
Logs
All logs of INFO and below are suppressed.
sr2silo.process.curry_read_with_metadata(metadata_fp)
Returns a function that enriches an AlignedRead with metadata.
Reads metadata from a JSON file and returns a function that enriches an AlignedRead with the loaded metadata.
Read Merging
sr2silo.process.paired_end_read_merger(nuc_align_sam_fp, ref_genome_fasta_fp, output_merged_sam_fp)
Merge paired-end reads using smallgenomutilitiestool paired_end_read_merger.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
nuc_align_sam_fp
|
Path
|
Path to the nucleotide alignment SAM file. |
required |
ref_genome_fasta_fp
|
Path
|
Path to the reference genome FASTA file. |
required |
output_merged_sam_fp
|
Path
|
Path to the output merged SAM file. |
required |
Returns:
| Type | Description |
|---|---|
None
|
None |
Exceptions
sr2silo.process.ZeroFilteredReadsError
Bases: Exception
Raised when reference filtering results in zero reads.
This error indicates that a target reference was specified for filtering, but no reads in the BAM file aligned to that reference. This typically means the wrong reference was specified or the data doesn't contain reads for the expected reference.