Processing Module

Core functions for BAM processing, translation, and alignment.

Data Structures

AlignedRead

`sr2silo.process.AlignedRead`

Class to represent an aligned read.

`init(read_id, unaligned_nucleotide_sequence, aligned_nucleotide_sequence, aligned_nucleotide_sequence_offset, nucleotide_insertions, amino_acid_insertions, aligned_amino_acid_sequences, metadata=None)`

Initialize with a AlignedRead object.

`str()`

toString method as pretty JSON string.

`get_amino_acid_insertions()`

Return the amino acid insertions.

`get_metadata()`

Return the metadata.

`set_metadata(metadata)`

Set the metadata. Keep as ReadMetadata object to preserve alias information.

`set_nuc_insertion(nuc_insertion)`

Append a nucleotide insertion to the list of nucleotide insertions.

`to_dict()`

Return a dictionary / json representation of the object.

Uses camelCase field names (via Pydantic aliases) for metadata fields to match SILO database schema.

`to_silo_json(indent=False)`

Validate the aligned read dict using the new SILO schema and return a nicely formatted JSON string conforming to the DB requirements.

Parameters:

Name	Type	Description	Default
`indent`	`bool`	Whether to indent the JSON string, True for pretty print.	`False`

Gene

`sr2silo.process.Gene`

Class to represent a gene with a name and a length.

`init(gene_name, gene_length)`

Initialize with a gene name and a gene length.

`to_dict()`

Return a dictionary with gene name and gene length.

GeneSet

`sr2silo.process.GeneSet`

Class to represent a set of genes for a pathogen

`init(genes)`

Initialize with a list of genes.

`get_gene(gene_name)`

Return a gene by name.

`get_gene_length(gene_name)`

Return the length of a gene.

`get_gene_name_list()`

Return a list of genes.

`set_gene_length(gene_name, gene_length)`

Set the length of a gene.

`to_dict()`

Return a dictionary with gene names as keys and gene length as values.

NucInsertion

`sr2silo.process.NucInsertion`

Bases: Insertion

A nuclotide insertion.

AAInsertion

`sr2silo.process.AAInsertion`

Bases: Insertion

An amino acid insertion.

AAInsertionSet

`sr2silo.process.AAInsertionSet`

Class to represent the set of amino acid insertions for a full set of genes.

`eq(other)`

Compare two AAInsertionSet objects by their values.

This compares the dictionaries generated by to_dict() method, ensuring a value-based comparison rather than identity-based.

`init(genes)`

Initialize with an empty set of insertions for each gene.

`from_dict(data)` `staticmethod`

Create an AAInsertionSet object from a dictionary.

`set_insertions_for_gene(gene_name, aa_insertions)`

Set the amino acid insertions for a particular gene.

`to_dict()`

Return a dictionary with gene names as keys.

BAM Conversion

`sr2silo.process.bam_to_sam(bam_file, sam_file)`

Converts a BAM file to SAM format and writes it to the specified output file.

Parameters:

Name	Type	Description	Default
`bam_file`	`Path`	Path to the input BAM file.	required
`sam_file`	`Path`	Path to the output SAM file.	required
`sam_file`	`Path`	Path to the output SAM file.	required

`sr2silo.process.sam_to_bam(sam_file, bam_file)`

Convert a SAM file to a BAM file.

Parameters:

Name	Type	Description	Default
`sam_file`	`Path`	Path to the input SAM file.	required
`bam_file`	`Path`	Path to the output BAM file.	required

`sr2silo.process.bam_to_fasta_query(bam_file, fasta_file)`

Convert a BAM file to a FASTA file. Bluntly resolved the sam to fasta, removing soft clippings, keeping insertions and deletions, ignoring skipped regions and paddings.

Outputs the sequence as it is read from the molecule.

Parameters:

Name	Type	Description	Default
`bam_file`	`Path`	Path to the input BAM file.	required
`fasta_file`	`Path`	Path to the output FASTQ file.	required

`sr2silo.process.sort_bam_file(input_bam_path, output_bam_path, sort_by_qname=False)`

Sorts a BAM file using pysam.sort by alignment positions by default, but can also sort by query name if specified.

Parameters:

Name	Type	Description	Default
`input_bam_path`	`Path`	Path to the input BAM file.	required
`output_bam_path`	`Path`	Path to the output sorted BAM file.	required
`sort_by_qname`	`bool`	If True, sorts by query name. Defaults to False.	`False`

`sr2silo.process.sort_and_index_bam(input_bam_fp, output_bam_fp)`

Function to sort and index the input BAM file,

implements checks to see if the input BAM file is already sorted and indexed.

If not sorted and indexed, the function will sort and index the input BAM file.

`sr2silo.process.sort_sam_by_qname(input_sam_path, output_sam_path)`

Sorts a sam file using pysam.sort command by query name. Args: input_bam_path (Path): Path to the input BAM file. output_bam_path (Path): Path to the output sorted BAM file.

`sr2silo.process.get_gene_set_from_ref(reference_fp)`

Load the gene ref fasta and create a GeneSet with gene short names and lengths.

Translation and Alignment

`sr2silo.process.nuc_to_aa_alignment(in_nuc_alignment_fp, in_aa_db_fp, out_aa_alignment_fp)`

Function to convert files and translate and align with Diamond / blastx.

Parameters:

Name	Type	Description	Default
`in_nuc_alignment_fp`	`Path`	Path to the input nucleotide alignment file.	required
`in_aa_db_fp`	`Path`	Path to the input amino acid diamond database file.	required
`out_aa_alignment_fp`	`Path`	Path to the output amino acid alignment file.	required

Returns:

Type	Description
`None`	None

Description

Uses Diamond with the settings: --evalue 1 --gapopen 6 --gapextend 2 --outfmt 101 --matrix BLOSUM62 --unal 0 --max-hsps 1 --block-size 0.5

`sr2silo.process.parse_translate_align(nuc_reference_fp, aa_reference_fp, nuc_alignment_fp, aa_db_fp, target_reference=None)`

Parse nucleotides, translate and align amino acids the input files.

Parameters:

Name	Type	Description	Default
`nuc_reference_fp`	`Path`	Path to nucleotide reference FASTA file.	required
`aa_reference_fp`	`Path`	Path to amino acid reference FASTA file.	required
`nuc_alignment_fp`	`Path`	Path to nucleotide alignment BAM file.	required
`aa_db_fp`	`Path`	Path to Diamond database file.	required
`target_reference`	`str \| None`	Filter reads to only include those aligned to this reference accession. If None, all reads are processed.	`None`

Returns:

Type	Description
`Dict[str, AlignedRead]`	Dictionary mapping read IDs to AlignedRead objects.

`sr2silo.process.parse_translate_align_in_batches(nuc_reference_fp, aa_reference_fp, nuc_alignment_fp, metadata_fp, output_fp, chunk_size=100000, write_chunk_size=20, target_reference=None)`

Parse nucleotides, translate and align amino acids in batches.

Parameters:

Name	Type	Description	Default
`nuc_reference_fp`	`Path`	Path to the nucleotide reference genome - .fasta	required
`aa_reference_fp`	`Path`	Path to the amino acid reference genome - .fasta	required
`nuc_alignment_fp`	`Path`	Path to the nucleotide alignment file - .bam	required
`metadata_fp`	`Path`	Path to the metadata file - .json	required
`output_fp`	`Path`	Path to the output file - .ndjson	required
`chunk_size`	`int`	Size of each batch, in number of reads.	`100000`
`write_chunk_size`	`int`	Size of each write batch.	`20`
`target_reference`	`str \| None`	Filter reads to only include those aligned to this reference accession. Should match @SQ SN field in BAM header. If None, all reads are processed.	`None`

Returns:

Name	Type	Description
`Path`	`Path`	The path to the output file with the correct suffix.

Resources

A chunk_size of 100000 reads is a good starting point for most cases. This will take about 3.7 GB ram for Covid Genomes and 1-2 minutes to process.

Constraints

No sorting constraints are enforced on the input files. The output file is compressed with zstd.

Constraints

No sorting constraints are enforced on the input files. The output file is compressed with zstd.

Logs

All logs of INFO and below are suppressed.

`sr2silo.process.curry_read_with_metadata(metadata_fp)`

Returns a function that enriches an AlignedRead with metadata.

Reads metadata from a JSON file and returns a function that enriches an AlignedRead with the loaded metadata.

Read Merging

`sr2silo.process.paired_end_read_merger(nuc_align_sam_fp, ref_genome_fasta_fp, output_merged_sam_fp)`

Merge paired-end reads using smallgenomutilitiestool paired_end_read_merger.

Parameters:

Name	Type	Description	Default
`nuc_align_sam_fp`	`Path`	Path to the nucleotide alignment SAM file.	required
`ref_genome_fasta_fp`	`Path`	Path to the reference genome FASTA file.	required
`output_merged_sam_fp`	`Path`	Path to the output merged SAM file.	required

Returns:

Type	Description
`None`	None

Exceptions

`sr2silo.process.ZeroFilteredReadsError`

Bases: Exception

Raised when reference filtering results in zero reads.

This error indicates that a target reference was specified for filtering, but no reads in the BAM file aligned to that reference. This typically means the wrong reference was specified or the data doesn't contain reads for the expected reference.

Processing Module

Data Structures

AlignedRead

sr2silo.process.AlignedRead

__init__(read_id, unaligned_nucleotide_sequence, aligned_nucleotide_sequence, aligned_nucleotide_sequence_offset, nucleotide_insertions, amino_acid_insertions, aligned_amino_acid_sequences, metadata=None)

__str__()

get_amino_acid_insertions()

get_metadata()

set_metadata(metadata)

set_nuc_insertion(nuc_insertion)

to_dict()

to_silo_json(indent=False)

Gene

sr2silo.process.Gene

__init__(gene_name, gene_length)

to_dict()

GeneSet

sr2silo.process.GeneSet

__init__(genes)

get_gene(gene_name)

get_gene_length(gene_name)

get_gene_name_list()

set_gene_length(gene_name, gene_length)

to_dict()

NucInsertion

sr2silo.process.NucInsertion

AAInsertion

sr2silo.process.AAInsertion

AAInsertionSet

sr2silo.process.AAInsertionSet

__eq__(other)

__init__(genes)

from_dict(data) staticmethod

set_insertions_for_gene(gene_name, aa_insertions)

to_dict()

BAM Conversion

sr2silo.process.bam_to_sam(bam_file, sam_file)

sr2silo.process.sam_to_bam(sam_file, bam_file)

sr2silo.process.bam_to_fasta_query(bam_file, fasta_file)

sr2silo.process.sort_bam_file(input_bam_path, output_bam_path, sort_by_qname=False)

sr2silo.process.sort_and_index_bam(input_bam_fp, output_bam_fp)

sr2silo.process.sort_sam_by_qname(input_sam_path, output_sam_path)

sr2silo.process.get_gene_set_from_ref(reference_fp)

Translation and Alignment

sr2silo.process.nuc_to_aa_alignment(in_nuc_alignment_fp, in_aa_db_fp, out_aa_alignment_fp)

sr2silo.process.parse_translate_align(nuc_reference_fp, aa_reference_fp, nuc_alignment_fp, aa_db_fp, target_reference=None)

sr2silo.process.parse_translate_align_in_batches(nuc_reference_fp, aa_reference_fp, nuc_alignment_fp, metadata_fp, output_fp, chunk_size=100000, write_chunk_size=20, target_reference=None)

sr2silo.process.curry_read_with_metadata(metadata_fp)

Read Merging

sr2silo.process.paired_end_read_merger(nuc_align_sam_fp, ref_genome_fasta_fp, output_merged_sam_fp)

Exceptions

sr2silo.process.ZeroFilteredReadsError

`sr2silo.process.AlignedRead`

`init(read_id, unaligned_nucleotide_sequence, aligned_nucleotide_sequence, aligned_nucleotide_sequence_offset, nucleotide_insertions, amino_acid_insertions, aligned_amino_acid_sequences, metadata=None)`

`str()`

`get_amino_acid_insertions()`

`get_metadata()`

`set_metadata(metadata)`

`set_nuc_insertion(nuc_insertion)`

`to_dict()`

`to_silo_json(indent=False)`

`sr2silo.process.Gene`

`init(gene_name, gene_length)`

`to_dict()`

`sr2silo.process.GeneSet`

`init(genes)`

`get_gene(gene_name)`

`get_gene_length(gene_name)`

`get_gene_name_list()`

`set_gene_length(gene_name, gene_length)`

`to_dict()`

`sr2silo.process.NucInsertion`

`sr2silo.process.AAInsertion`

`sr2silo.process.AAInsertionSet`

`eq(other)`

`init(genes)`

`from_dict(data)` `staticmethod`

`set_insertions_for_gene(gene_name, aa_insertions)`

`to_dict()`

`sr2silo.process.bam_to_sam(bam_file, sam_file)`

`sr2silo.process.sam_to_bam(sam_file, bam_file)`

`sr2silo.process.bam_to_fasta_query(bam_file, fasta_file)`

`sr2silo.process.sort_bam_file(input_bam_path, output_bam_path, sort_by_qname=False)`

`sr2silo.process.sort_and_index_bam(input_bam_fp, output_bam_fp)`

`sr2silo.process.sort_sam_by_qname(input_sam_path, output_sam_path)`

`sr2silo.process.get_gene_set_from_ref(reference_fp)`

`sr2silo.process.nuc_to_aa_alignment(in_nuc_alignment_fp, in_aa_db_fp, out_aa_alignment_fp)`

`sr2silo.process.parse_translate_align(nuc_reference_fp, aa_reference_fp, nuc_alignment_fp, aa_db_fp, target_reference=None)`

`sr2silo.process.parse_translate_align_in_batches(nuc_reference_fp, aa_reference_fp, nuc_alignment_fp, metadata_fp, output_fp, chunk_size=100000, write_chunk_size=20, target_reference=None)`

`sr2silo.process.curry_read_with_metadata(metadata_fp)`

`sr2silo.process.paired_end_read_merger(nuc_align_sam_fp, ref_genome_fasta_fp, output_merged_sam_fp)`

`sr2silo.process.ZeroFilteredReadsError`