API Reference
siRNAforge: a comprehensive toolkit for siRNA design and gene silencing analysis.
This module exposes package metadata (author/email/version) in a single place. The version is resolved from installed package metadata (importlib.metadata). When running from a source checkout (not installed), it falls back to reading pyproject.toml if available, otherwise uses a conservative placeholder.
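The resolution order described above can be sketched as follows. This is a minimal illustration and not the package's actual code: the function names, the regex-based pyproject.toml lookup, and the placeholder string are all assumptions made for the example.

```python
import re
from importlib import metadata
from pathlib import Path
from typing import Optional

# Hypothetical placeholder; the real fallback value is not documented here.
PLACEHOLDER_VERSION = "0.0.0+unknown"


def parse_pyproject_version(text: str) -> Optional[str]:
    """Return the first `version = "..."` value found in pyproject.toml text."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', text, re.MULTILINE)
    return match.group(1) if match else None


def resolve_version(dist_name: str, pyproject: Path) -> str:
    """Installed metadata first, then pyproject.toml, then a placeholder."""
    try:
        # Preferred source: metadata recorded when the package was installed.
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        pass
    if pyproject.is_file():
        # Source checkout: read the version straight from pyproject.toml.
        found = parse_pyproject_version(pyproject.read_text())
        if found:
            return found
    return PLACEHOLDER_VERSION
```

A regex keeps the sketch free of version-specific TOML parsers; a real implementation might prefer a proper TOML reader.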
Command Line Interface
Modern CLI for siRNAforge using Typer and Rich.
- sirnaforge.cli.patched_init(self, *args, **kwargs)[source]
Force simplified terminal capabilities for deterministic CI output.
- sirnaforge.cli.filter_transcripts(transcripts, include_types=None, exclude_types=None, canonical_only=False)[source]
Filter transcript records by type and canonical status.
- Parameters:
transcripts – Iterable of transcript-like objects that expose transcript_type and is_canonical attributes.
include_types – Optional iterable of transcript types to keep.
exclude_types – Optional iterable of transcript types to drop.
canonical_only – When True, keep only canonical isoforms.
- Returns:
A list of transcripts that match the requested filters.
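The filtering rules for filter_transcripts can be illustrated with a standalone sketch. The Transcript dataclass and the function name below are hypothetical stand-ins, and the choice that an excluded type wins over an included one is an assumption, not documented behavior.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical record exposing only the attributes the filter reads."""
    transcript_id: str
    transcript_type: str
    is_canonical: bool = False


def filter_transcripts_sketch(transcripts, include_types=None,
                              exclude_types=None, canonical_only=False):
    """Keep transcripts that pass the type and canonical-status filters."""
    include = set(include_types) if include_types is not None else None
    exclude = set(exclude_types or ())
    kept = []
    for t in transcripts:
        if include is not None and t.transcript_type not in include:
            continue  # not in the requested types
        if t.transcript_type in exclude:
            continue  # explicitly dropped (assumed to override include)
        if canonical_only and not t.is_canonical:
            continue  # non-canonical isoform
        kept.append(t)
    return kept


records = [
    Transcript("T1", "protein_coding", is_canonical=True),
    Transcript("T2", "protein_coding"),
    Transcript("T3", "lncRNA"),
]
kept = filter_transcripts_sketch(records, include_types=["protein_coding"],
                                 canonical_only=True)
print([t.transcript_id for t in kept])  # ['T1']
```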
- sirnaforge.cli.extract_canonical_transcripts(transcripts, gene_name, output_dir=None)[source]
Write canonical isoforms to a separate FASTA file.
- Parameters:
transcripts – Iterable of transcript-like objects (must expose is_canonical and sequence attributes used by the underlying save routine).
gene_name – Name used to derive the output FASTA filename.
output_dir – Directory to write the FASTA file into (defaults to CWD).
- Returns:
A tuple of (canonical_fasta_path, count) where the path is None when no canonical isoforms are available.
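The return contract of extract_canonical_transcripts, a path plus a count with the path set to None when nothing is canonical, might look like the sketch below. The filename pattern, the transcript_id and sequence attribute names, and returning a count of 0 in the empty case are assumptions for illustration only.

```python
from pathlib import Path
from types import SimpleNamespace


def extract_canonical_sketch(transcripts, gene_name, output_dir=None):
    """Write canonical records to FASTA; return (path, count) or (None, 0)."""
    canonical = [t for t in transcripts if t.is_canonical]
    if not canonical:
        return None, 0
    out_dir = Path(output_dir) if output_dir is not None else Path.cwd()
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: the real filename pattern is not documented here.
    fasta_path = out_dir / f"{gene_name}_canonical.fasta"
    with fasta_path.open("w") as handle:
        for t in canonical:
            handle.write(f">{t.transcript_id}\n{t.sequence}\n")
    return fasta_path, len(canonical)


# A record without the canonical flag yields no file at all.
non_canonical = SimpleNamespace(transcript_id="T2", sequence="AUGC", is_canonical=False)
print(extract_canonical_sketch([non_canonical], "TP53"))  # (None, 0)
```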
- sirnaforge.cli.search(query=<typer.models.ArgumentInfo object>, output=<typer.models.OptionInfo object>, database=<typer.models.OptionInfo object>, all_databases=<typer.models.OptionInfo object>, fallback=<typer.models.OptionInfo object>, no_sequence=<typer.models.OptionInfo object>, canonical_only=<typer.models.OptionInfo object>, extract_canonical=<typer.models.OptionInfo object>, transcript_types=<typer.models.OptionInfo object>, exclude_types=<typer.models.OptionInfo object>, verbose=<typer.models.OptionInfo object>)[source]
Search transcript references and optionally fetch sequences.
This command queries Ensembl/RefSeq/Gencode (depending on flags) for a gene or transcript identifier. When sequences are fetched, it writes them to a FASTA file and can optionally also emit a canonical-only FASTA.
- sirnaforge.cli.workflow(gene_query=<typer.models.ArgumentInfo object>, input_fasta=<typer.models.OptionInfo object>, output_dir=<typer.models.OptionInfo object>, database=<typer.models.OptionInfo object>, design_mode=<typer.models.OptionInfo object>, top_n_candidates=<typer.models.OptionInfo object>, species=<typer.models.OptionInfo object>, mirna_db=<typer.models.OptionInfo object>, mirna_species=<typer.models.OptionInfo object>, transcriptome_fasta=<typer.models.OptionInfo object>, transcriptome_filter=<typer.models.OptionInfo object>, offtarget_indices=<typer.models.OptionInfo object>, gc_min=<typer.models.OptionInfo object>, gc_max=<typer.models.OptionInfo object>, sirna_length=<typer.models.OptionInfo object>, modification_pattern=<typer.models.OptionInfo object>, overhang=<typer.models.OptionInfo object>, skip_off_targets=<typer.models.OptionInfo object>, snp=<typer.models.OptionInfo object>, snp_file=<typer.models.OptionInfo object>, variant_mode=<typer.models.OptionInfo object>, min_af=<typer.models.OptionInfo object>, clinvar_filter_levels=<typer.models.OptionInfo object>, variant_assembly=<typer.models.OptionInfo object>, verbose=<typer.models.OptionInfo object>, log_file=<typer.models.OptionInfo object>, nextflow_docker_image=<typer.models.OptionInfo object>, json_summary=<typer.models.OptionInfo object>)[source]
Run the end-to-end workflow: transcripts → siRNA design → off-target.
This is the main orchestration command. It resolves transcriptome and miRNA reference policies, designs candidates, and then runs off-target analysis on the selected top candidates.
- Parameters:
gene_query (str)
output_dir (Path)
database (str)
design_mode (str)
top_n_candidates (int)
species (str)
mirna_db (str)
gc_min (float)
gc_max (float)
sirna_length (int)
modification_pattern (str)
overhang (str)
skip_off_targets (bool)
variant_mode (VariantMode)
min_af (float)
clinvar_filter_levels (str)
variant_assembly (str)
verbose (bool)
json_summary (bool)
- sirnaforge.cli.offtarget(input_candidates_fasta=<typer.models.OptionInfo object>, output_dir=<typer.models.OptionInfo object>, species=<typer.models.OptionInfo object>, mirna_db=<typer.models.OptionInfo object>, mirna_species=<typer.models.OptionInfo object>, transcriptome_fasta=<typer.models.OptionInfo object>, transcriptome_filter=<typer.models.OptionInfo object>, offtarget_indices=<typer.models.OptionInfo object>, verbose=<typer.models.OptionInfo object>, log_file=<typer.models.OptionInfo object>, nextflow_docker_image=<typer.models.OptionInfo object>)[source]
Run off-target analysis on pre-designed siRNA candidates.
This command accepts a FASTA file containing pre-designed siRNA guide sequences of any length and runs comprehensive off-target analysis including:
- Transcriptome alignment (BWA-MEM2)
- miRNA seed match analysis
- Off-target hit classification and scoring
The embedded Nextflow pipeline is used for parallel processing across species.
Notes
--species drives transcriptome fetching and miRNA lookup.
--offtarget-indices can override the indices used for alignment using species:/abs/path/index_prefix entries.
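The species:/abs/path/index_prefix entry format accepted by --offtarget-indices could be parsed along these lines. This is a sketch only: the function name is hypothetical, and the assumption that multiple entries arrive comma-separated in one string is not confirmed by the documentation above.

```python
def parse_offtarget_indices(spec: str) -> dict[str, str]:
    """Map species name to index prefix from 'species:/abs/path' entries."""
    overrides: dict[str, str] = {}
    for entry in spec.split(","):  # comma separation is an assumption
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first colon only, so later colons stay in the path.
        species, sep, prefix = entry.partition(":")
        if not sep or not species or not prefix.startswith("/"):
            raise ValueError(f"expected species:/abs/path/index_prefix, got {entry!r}")
        overrides[species] = prefix
    return overrides


print(parse_offtarget_indices("human:/refs/GRCh38/tx,mouse:/refs/GRCm39/tx"))
# {'human': '/refs/GRCh38/tx', 'mouse': '/refs/GRCm39/tx'}
```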
- sirnaforge.cli.design(input_file=<typer.models.ArgumentInfo object>, output=<typer.models.OptionInfo object>, design_mode=<typer.models.OptionInfo object>, length=<typer.models.OptionInfo object>, top_n=<typer.models.OptionInfo object>, gc_min=<typer.models.OptionInfo object>, gc_max=<typer.models.OptionInfo object>, max_poly_runs=<typer.models.OptionInfo object>, genome_index=<typer.models.OptionInfo object>, snp_file=<typer.models.OptionInfo object>, skip_structure=<typer.models.OptionInfo object>, skip_off_targets=<typer.models.OptionInfo object>, modification_pattern=<typer.models.OptionInfo object>, overhang=<typer.models.OptionInfo object>, verbose=<typer.models.OptionInfo object>)[source]
Design siRNA candidates from a transcript FASTA file.
Outputs a TSV/CSV-like table of candidates, optionally including secondary structure scoring, off-target checks, and chemical modification annotations.
- sirnaforge.cli.validate(input_file=<typer.models.ArgumentInfo object>)[source]
Validate a FASTA file and report basic statistics.
This performs lightweight validation (parseable FASTA, presence of sequences, and common issues like short/ambiguous sequences).
- sirnaforge.cli.cache(clear=<typer.models.OptionInfo object>, clear_mirna=<typer.models.OptionInfo object>, clear_transcriptome=<typer.models.OptionInfo object>, dry_run=<typer.models.OptionInfo object>, info=<typer.models.OptionInfo object>)[source]
Inspect and clear the unified reference cache.
This command can display cache statistics and/or delete cached assets for miRNA databases and transcriptomes.
- exception sirnaforge.cli.SequencesShowError[source]
Bases: RuntimeError
Raised when sequence display/formatting input is invalid.
- sirnaforge.cli.sequences_show(input_file=<typer.models.ArgumentInfo object>, sequence_id=<typer.models.OptionInfo object>, format=<typer.models.OptionInfo object>)[source]
Show sequences from a FASTA file in table, JSON, or FASTA format.
Use --id to select a single record. --format controls output: table (default), json (header metadata only), or fasta.
- sirnaforge.cli.sequences_annotate(input_fasta=<typer.models.ArgumentInfo object>, metadata_json=<typer.models.ArgumentInfo object>, output=<typer.models.OptionInfo object>, verbose=<typer.models.OptionInfo object>)[source]
Merge metadata from a JSON file into FASTA headers.
The JSON is expected to conform to the project metadata schema used by the modification/annotation utilities.
Core Modules
Design Engine
Core siRNA design algorithms and functionality.
- class sirnaforge.core.design.SiRNADesigner(parameters)[source]
Bases: object
Main siRNA design engine following the algorithm specification.
- Parameters:
parameters (DesignParameters)
- __init__(parameters)[source]
Initialize designer with given parameters.
- Parameters:
parameters (DesignParameters)
- design_from_file(input_file)[source]
Design siRNAs from input FASTA file.
- Parameters:
input_file (str)
- class sirnaforge.core.design.MiRNADesigner(parameters)[source]
Bases: SiRNADesigner
miRNA-biogenesis-aware siRNA designer with specialized scoring.
Extends SiRNADesigner with scoring rules optimized for miRNA-like processing:
- Argonaute selection preferences (pos1 A/U, mismatch at pos1)
- 3' supplementary pairing analysis (positions 13-16)
- Conservative thermodynamic thresholds
- Seed region quality assessment
- Parameters:
parameters (DesignParameters)
- __init__(parameters)[source]
Initialize miRNA designer with miRNA-specific config validation.
- Parameters:
parameters (DesignParameters)
Thermodynamics Analysis
Thermodynamic calculations for siRNA design using ViennaRNA.
- class sirnaforge.core.thermodynamics.ThermodynamicCalculator(temperature=37.0)[source]
Bases: object
Calculate thermodynamic properties for siRNA candidates using ViennaRNA.
- Parameters:
temperature (float)
- __init__(temperature=37.0)[source]
Initialize thermodynamic calculator.
- Parameters:
temperature (float) – Temperature in Celsius for calculations
- calculate_duplex_stability(guide, passenger)[source]
Calculate duplex stability (deltaG) using ViennaRNA.
- calculate_asymmetry_score(candidate)[source]
Calculate thermodynamic asymmetry score using ViennaRNA.
- Parameters:
candidate (SiRNACandidate)
- Returns:
Tuple of (5' end stability, 3' end stability, asymmetry score)
- calculate_target_accessibility(target_sequence, start_pos, sirna_length)[source]
Calculate target site accessibility using ViennaRNA.
- calculate_melting_temperature(guide, passenger)[source]
Calculate melting temperature using ViennaRNA thermodynamics.
- is_thermodynamically_favorable(candidate, threshold=0.5)[source]
Check if candidate meets thermodynamic asymmetry threshold.
- Parameters:
candidate (SiRNACandidate)
threshold (float)
Off-Target Prediction
Off-target analysis for siRNA design.
This module provides comprehensive off-target analysis functionality for siRNA design, including both miRNA seed match analysis and transcriptome off-target detection. Uses BWA-MEM2 for both short and long sequence alignments. Optimized for both standalone use and parallelized Nextflow workflows.
- class sirnaforge.core.off_target.BwaAnalyzer(index_prefix, mode='transcriptome', seed_length=12, min_score=15, max_hits=10000, seed_start=2, seed_end=8)[source]
Bases: object
BWA-MEM2 based analyzer for both transcriptome and miRNA seed off-target search.
- __init__(index_prefix, mode='transcriptome', seed_length=12, min_score=15, max_hits=10000, seed_start=2, seed_end=8)[source]
Initialize BWA-MEM2 analyzer.
- Parameters:
mode (str) – Analysis mode: 'transcriptome' for long targets, 'mirna_seed' for short targets
seed_length (int) – BWA seed length parameter
min_score (int) – Minimum alignment score
max_hits (int) – Maximum hits to return
seed_start (int) – Seed region start (1-based)
seed_end (int) – Seed region end (1-based)
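Because seed_start and seed_end are 1-based and inclusive, converting them to a Python slice is an easy place for an off-by-one error. The helper below is a hypothetical illustration of that conversion, not a function from the package; the guide used is the well-known let-7a sequence.

```python
def seed_region(guide: str, seed_start: int = 2, seed_end: int = 8) -> str:
    """Return the seed using 1-based, inclusive coordinates."""
    if not 1 <= seed_start <= seed_end <= len(guide):
        raise ValueError("seed coordinates fall outside the guide sequence")
    # 1-based inclusive [start, end] maps to the 0-based slice [start - 1, end).
    return guide[seed_start - 1:seed_end]


print(seed_region("UGAGGUAGUAGGUUGUAUAGUU"))  # GAGGUAG
```

With the defaults of 2 and 8 the result is always a 7-nucleotide seed.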
- class sirnaforge.core.off_target.OffTargetAnalysisManager(species, transcriptome_path=None, mirna_path=None, transcriptome_index=None, mirna_index=None)[source]
Bases: object
Manager class for comprehensive off-target analysis using BWA-MEM2.
- __init__(species, transcriptome_path=None, mirna_path=None, transcriptome_index=None, mirna_index=None)[source]
Initialize the off-target analysis manager.
- analyze_mirna_off_targets(sequences, output_prefix)[source]
Analyze miRNA off-targets using BWA-MEM2 in miRNA seed mode.
- sirnaforge.core.off_target.create_temp_fasta(sequences)[source]
Create temporary FASTA file from sequences.
- sirnaforge.core.off_target.validate_and_write_sequences(input_file, output_file, expected_length=21)[source]
Validate siRNA sequences and write valid ones to output file.
- sirnaforge.core.off_target.build_bwa_index(fasta_file, index_prefix)[source]
Build BWA-MEM2 index for both transcriptome and miRNA off-target analysis.
- sirnaforge.core.off_target.validate_sirna_sequences(sequences, expected_length=21)[source]
Validate siRNA sequences using existing FastaUtils.
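A length-and-alphabet check of the kind validate_sirna_sequences implies might look like the sketch below. The accepted alphabet, the name-to-sequence dict input, and the two-dict return value are assumptions for illustration; the real function delegates to FastaUtils, whose rules are not shown here.

```python
VALID_BASES = set("ACGUT")  # assumed alphabet: RNA plus DNA-style T


def validate_sequences_sketch(sequences: dict[str, str], expected_length: int = 21):
    """Split sequences into (valid, rejected-with-reason) dictionaries."""
    valid: dict[str, str] = {}
    rejected: dict[str, str] = {}
    for name, seq in sequences.items():
        seq = seq.strip().upper()
        if len(seq) != expected_length:
            rejected[name] = f"length {len(seq)} != {expected_length}"
        elif set(seq) - VALID_BASES:
            rejected[name] = "non-nucleotide characters"
        else:
            valid[name] = seq
    return valid, rejected


valid, rejected = validate_sequences_sketch({
    "ok": "acgu" * 5 + "a",       # 21 nt, normalized to upper case
    "short": "ACGU",              # wrong length
    "ambig": "ACGN" * 5 + "A",    # 21 nt but contains N
})
print(sorted(valid), sorted(rejected))  # ['ok'] ['ambig', 'short']
```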
- sirnaforge.core.off_target.parse_fasta_file(fasta_file)[source]
Parse FASTA file using existing FastaUtils.
- sirnaforge.core.off_target.write_fasta_file(sequences, output_file)[source]
Write sequences to FASTA file using existing FastaUtils.
- sirnaforge.core.off_target.check_tool_availability(tool)[source]
Check if external tool is available.
- sirnaforge.core.off_target.validate_index_files(index_prefix, tool='bwa')[source]
Validate that index files exist for given tool.
- sirnaforge.core.off_target.run_bwa_alignment_analysis(candidates_file, index_prefix, species, output_dir, max_hits=10000, bwa_k=12, bwa_T=15, seed_start=2, seed_end=8)[source]
Run BWA-MEM2 alignment analysis for candidate sequences using Pydantic models.
This is the main function called by the OFFTARGET_ANALYSIS Nextflow module.
- Parameters:
candidates_file (str | Path) – Path to FASTA file with candidate sequences
species (str) – Species identifier
max_hits (int) – Maximum hits to report per candidate
bwa_k (int) – BWA seed length parameter
bwa_T (int) – BWA minimum score threshold
seed_start (int) – Seed region start position (1-based)
seed_end (int) – Seed region end position (1-based)
- Returns:
Path to output directory containing results
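The bwa_k and bwa_T parameters presumably correspond to the -k (minimum seed length) and -T (minimum score to output) options of bwa-mem2 mem. The sketch below only assembles such a command line; how siRNAforge actually invokes the aligner, and any further flags it passes, are not documented here and are assumed away. The function name is hypothetical.

```python
from pathlib import Path


def bwa_mem2_command(index_prefix, candidates_fasta, bwa_k=12, bwa_T=15):
    """Build (but do not run) a bwa-mem2 alignment command as an argv list."""
    return [
        "bwa-mem2", "mem",
        "-k", str(bwa_k),   # minimum seed length
        "-T", str(bwa_T),   # do not output alignments scoring below this
        str(index_prefix),
        str(candidates_fasta),
    ]


cmd = bwa_mem2_command(Path("/refs/human/tx"), Path("candidates.fasta"))
print(" ".join(cmd))
# bwa-mem2 mem -k 12 -T 15 /refs/human/tx candidates.fasta
```

The list form can be handed to subprocess.run without shell quoting concerns.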
- sirnaforge.core.off_target.aggregate_offtarget_results(results_dir, output_dir, genome_species)[source]
Aggregate transcriptome off-target analysis results using Pandera.
Uses pandas + Pandera for efficient bulk reading and validation instead of manual line-by-line parsing with Pydantic models.
NOTE: This function ONLY aggregates genome/transcriptome hits. miRNA results are aggregated separately by aggregate_mirna_results() to keep output files distinct and properly typed.
- Returns:
Path to output directory containing aggregated results
- sirnaforge.core.off_target.run_mirna_seed_analysis(candidates_file, candidate_id, mirna_db, mirna_species, output_dir)[source]
Run miRNA seed match analysis for candidate sequences.
This function uses the MiRNADatabaseManager to download and cache miRNA databases, builds BWA indices if needed, and performs seed match analysis.
- Returns:
Path to output directory containing results
- sirnaforge.core.off_target.aggregate_mirna_results(results_dir, output_dir, mirna_db, mirna_species)[source]
Aggregate miRNA seed analysis results from multiple candidates using pandas.
Uses pandas + Pandera for efficient bulk reading and validation instead of manual line-by-line parsing with Pydantic models.
- Returns:
Path to output directory containing aggregated results
Data Models
SiRNA Models
Pydantic models for siRNA design data structures.
- class sirnaforge.models.sirna.FilterCriteria(**data)[source]
Bases: BaseModel
Quality filters for siRNA candidate selection based on thermodynamic and empirical criteria.
- Parameters:
data (Any)
- classmethod gc_max_greater_than_min(v, info)[source]
Validate that gc_max is greater than or equal to gc_min.
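The cross-field rule enforced by gc_max_greater_than_min can be illustrated without Pydantic. The dataclass below is a standard-library stand-in, not the real FilterCriteria; the default bounds and the use of percentages rather than fractions are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GCWindow:
    """Stand-in for the GC bounds of a filter model (values in percent)."""
    gc_min: float = 30.0  # hypothetical default
    gc_max: float = 60.0  # hypothetical default

    def __post_init__(self):
        for name in ("gc_min", "gc_max"):
            if not 0.0 <= getattr(self, name) <= 100.0:
                raise ValueError(f"{name} must be within 0-100")
        # The cross-field rule: equality is allowed, inversion is not.
        if self.gc_max < self.gc_min:
            raise ValueError("gc_max must be greater than or equal to gc_min")

    def accepts(self, sequence: str) -> bool:
        """True when the sequence's GC percentage lies inside the window."""
        if not sequence:
            return False
        gc = 100.0 * sum(base in "GC" for base in sequence.upper()) / len(sequence)
        return self.gc_min <= gc <= self.gc_max


print(GCWindow().accepts("GCAU"))  # True (50% GC)
```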
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.sirna.OffTargetFilterCriteria(**data)[source]
Bases: BaseModel
Filtering criteria for off-target analysis results.
Controls which siRNA candidates fail due to excessive off-target potential.
- Parameters:
data (Any)
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.sirna.ScoringWeights(**data)[source]
Bases: BaseModel
Relative weights for composite siRNA scoring components.
- Parameters:
data (Any)
- classmethod weights_sum_to_one(v, info)[source]
Validate that scoring weights sum to approximately 1.0.
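An "approximately 1.0" check is usually written with a tolerance rather than exact equality, since floating-point sums rarely land on 1.0 exactly. The sketch below shows one way to do that; the tolerance value and the component names are hypothetical, not the ones ScoringWeights uses.

```python
import math


def weights_sum_to_one(weights: dict[str, float], tolerance: float = 0.01) -> bool:
    """True when the weights add up to 1.0 within an absolute tolerance."""
    return math.isclose(sum(weights.values()), 1.0, abs_tol=tolerance)


# Hypothetical component names, chosen only for the example.
balanced = {"asymmetry": 0.25, "gc_content": 0.2, "accessibility": 0.25,
            "off_target": 0.2, "empirical": 0.1}
print(weights_sum_to_one(balanced))              # True
print(weights_sum_to_one({"a": 0.5, "b": 0.4}))  # False
```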
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.sirna.DesignMode(*values)[source]
Design mode for siRNA/miRNA-biogenesis-aware workflows.
- SIRNA = 'sirna'
- MIRNA = 'mirna'
- class sirnaforge.models.sirna.MiRNADesignConfig(**data)[source]
Bases: BaseModel
Configuration preset for miRNA-biogenesis-aware siRNA design.
This config encapsulates thresholds, defaults, and scoring weights optimized for miRNA-like processing (Drosha/Dicer recognition, Argonaute loading preferences, seed-based off-target analysis).
- Parameters:
data (Any)
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.sirna.DesignParameters(**data)[source]
Bases: BaseModel
Complete configuration parameters for siRNA design workflow.
- Parameters:
data (Any)
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- design_mode: DesignMode
- filters: FilterCriteria
- scoring: ScoringWeights
- class sirnaforge.models.sirna.SequenceType(*values)[source]
Categories of input sequence types for siRNA design.
- TRANSCRIPT = 'transcript'
- GENOMIC = 'genomic'
- CDS = 'cds'
- UTR = 'utr'
- class sirnaforge.models.sirna.SiRNACandidate(**data)[source]
Bases: BaseModel
Individual siRNA candidate with computed thermodynamic and efficacy properties.
- Parameters:
data (Any)
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class FilterStatus(*values)[source]
Filter status codes for quality control.
- PASS = 'PASS'
- GC_OUT_OF_RANGE = 'GC_OUT_OF_RANGE'
- POLY_RUNS = 'POLY_RUNS'
- EXCESS_PAIRING = 'EXCESS_PAIRING'
- LOW_ASYMMETRY = 'LOW_ASYMMETRY'
- DIRTY_CONTROL = 'DIRTY_CONTROL'
- passes_filters: bool | FilterStatus
- guide_metadata: StrandMetadata | None
- passenger_metadata: StrandMetadata | None
- classmethod validate_nucleotide_sequence(v)[source]
Validate that sequence contains only valid nucleotides.
- classmethod sequences_same_length(v, info)[source]
Validate that passenger sequence is same length as guide sequence.
- class sirnaforge.models.sirna.DesignResult(**data)[source]
Bases: BaseModel
Complete results from siRNA design workflow with metadata and statistics.
- Parameters:
data (Any)
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- parameters: DesignParameters
- candidates: list[SiRNACandidate]
- top_candidates: list[SiRNACandidate]
- rejected_candidates: list[SiRNACandidate]
- save_csv(filepath)[source]
Save siRNA candidates to CSV file with comprehensive validation.
Exports all candidates to CSV format with full thermodynamic metrics. The DataFrame is validated against SiRNACandidateSchema before saving to ensure data integrity and proper column types.
- Parameters:
filepath (str) – Output CSV file path
- Return type:
DataFrame[SiRNACandidateSchema]
- Returns:
Validated DataFrame conforming to SiRNACandidateSchema
- Raises:
pandera.errors.SchemaError – If data validation fails
Chemical Modifications
Data models for siRNA chemical modifications and metadata.
This module provides structured representations for chemical modifications, overhangs, and provenance metadata associated with siRNA strands.
- class sirnaforge.models.modifications.ConfirmationStatus(*values)[source]
Confirmation status for siRNA sequence data.
- PENDING = 'pending'
- CONFIRMED = 'confirmed'
- class sirnaforge.models.modifications.SourceType(*values)[source]
Source type for siRNA provenance.
- PATENT = 'patent'
- PUBLICATION = 'publication'
- CLINICAL_TRIAL = 'clinical_trial'
- DATABASE = 'database'
- DESIGNED = 'designed'
- OTHER = 'other'
- class sirnaforge.models.modifications.Provenance(**data)[source]
Bases: BaseModel
Provenance information for siRNA sequences.
Tracks the origin and validation status of siRNA sequences.
- Parameters:
data (Any)
- source_type: SourceType
- to_header_string()[source]
Convert provenance to FASTA header format.
- Returns:
Formatted string like "Patent: US10060921B2"
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.modifications.ChemicalModification(**data)[source]
Bases: BaseModel
Chemical modification annotation for siRNA strands.
Represents a specific type of chemical modification and the positions where it occurs in the sequence.
- Parameters:
data (Any)
- to_header_string()[source]
Convert modification to FASTA header format.
- Returns:
Formatted string like "2OMe(1,4,6,11,13,16,19)" or "2F()" for no positions
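The two example strings given for ChemicalModification.to_header_string can be reproduced with a one-line formatter. The standalone function below is a hypothetical equivalent written from those examples alone; whether the real method sorts or de-duplicates positions is unknown, so this sketch leaves them as given.

```python
def modification_header(mod_type: str, positions: list[int]) -> str:
    """Format a modification as TYPE(p1,p2,...), with empty parentheses if none."""
    return f"{mod_type}({','.join(str(p) for p in positions)})"


print(modification_header("2OMe", [1, 4, 6, 11, 13, 16, 19]))  # 2OMe(1,4,6,11,13,16,19)
print(modification_header("2F", []))                           # 2F()
```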
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.modifications.StrandRole(*values)[source]
Role of the siRNA strand in the duplex.
- GUIDE = 'guide'
- SENSE = 'sense'
- ANTISENSE = 'antisense'
- PASSENGER = 'passenger'
- class sirnaforge.models.modifications.StrandMetadata(**data)[source]
Bases: BaseModel
Complete metadata for a single siRNA strand.
This model captures all relevant information about a siRNA strand including sequence, modifications, overhangs, and provenance.
- Parameters:
data (Any)
- chem_mods: list[ChemicalModification]
- provenance: Provenance | None
- confirmation_status: ConfirmationStatus
- validate_modification_positions()[source]
Validate that modification positions don't exceed sequence length.
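The position check performed by validate_modification_positions amounts to comparing each position with the sequence length. The helper below is a hypothetical sketch that reports offending positions; treating positions as 1-based is an assumption, suggested only by the example header 2OMe(1,4,6,11,13,16,19).

```python
def out_of_range_positions(sequence: str, positions: list[int]) -> list[int]:
    """Return positions that fall outside 1..len(sequence) (1-based, assumed)."""
    return [p for p in positions if not 1 <= p <= len(sequence)]


guide = "ACGUACGUACGUACGUACGUA"  # 21 nt
print(out_of_range_positions(guide, [1, 4, 21]))   # []
print(out_of_range_positions(guide, [0, 19, 22]))  # [0, 22]
```

A validator built on this would raise an error whenever the returned list is non-empty.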
- to_fasta_header(target_gene=None, strand_role=None)[source]
Generate FASTA header with embedded metadata.
- Parameters:
strand_role (StrandRole | None) – Role of this strand in the duplex
- Returns:
FASTA header string with key-value pairs
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.modifications.SequenceRecord(**data)[source]
Bases: BaseModel
Complete sequence record with strand metadata.
Associates a strand with its target and role information.
- Parameters:
data (Any)
- strand_role: StrandRole
- metadata: StrandMetadata
- to_fasta()[source]
Generate complete FASTA record.
- Returns:
Multi-line FASTA string with header and sequence
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Off-Target Models
Pydantic models for off-target analysis data structures.
This module provides validated data models for:
- BWA alignment results (both genome and miRNA)
- Aggregated off-target summaries
- Analysis metadata and statistics
Using Pydantic ensures type safety, automatic validation, and clean serialization to JSON/TSV formats.
- class sirnaforge.models.off_target.AlignmentStrand(*values)[source]
Genomic strand orientation.
- FORWARD = '+'
- REVERSE = '-'
- class sirnaforge.models.off_target.AnalysisMode(*values)[source]
BWA alignment analysis mode.
- MIRNA_SEED = 'mirna_seed'
- TRANSCRIPTOME = 'transcriptome'
- class sirnaforge.models.off_target.MiRNADatabase(*values)[source]
Supported miRNA database sources.
Values correspond to database identifiers used by MiRNADatabaseManager. Using str enum allows seamless string comparison while providing validation.
- MIRGENEDB = 'mirgenedb'
- MIRBASE = 'mirbase'
- MIRBASE_HIGH_CONF = 'mirbase_high_conf'
- MIRBASE_HAIRPIN = 'mirbase_hairpin'
- TARGETSCAN = 'targetscan'
- class sirnaforge.models.off_target.BaseAlignmentHit(**data)[source]
Bases: BaseModel, ABC
Base class for alignment hits with common fields and validators.
This abstract base class contains all shared fields and validation logic for both off-target and miRNA alignment hits.
- Parameters:
data (Any)
- model_config: ClassVar[ConfigDict] = {'frozen': False, 'validate_assignment': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- strand: AlignmentStrand
- classmethod validate_sequence(v)[source]
Ensure sequence contains only valid nucleotide characters.
- class sirnaforge.models.off_target.OffTargetHit(**data)[source]
Bases: BaseAlignmentHit
Single off-target alignment hit from BWA analysis.
Represents one potential off-target binding site identified by sequence alignment against a reference genome or transcriptome.
- Parameters:
data (Any)
- model_config: ClassVar[ConfigDict] = {'frozen': False, 'validate_assignment': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.off_target.MiRNAHit(**data)[source]
Bases: BaseAlignmentHit
Single miRNA seed match hit from BWA analysis.
Represents a potential miRNA-like seed match identified by alignment against miRNA databases.
- Parameters:
data (Any)
- database: MiRNADatabase | str
- model_config: ClassVar[ConfigDict] = {'frozen': False, 'validate_assignment': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.off_target.BaseSummary(**data)[source]
Bases: BaseModel
Base class for analysis summary statistics with common metadata fields.
- Parameters:
data (Any)
- model_config: ClassVar[ConfigDict] = {'frozen': False}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.off_target.AnalysisSummary(**data)[source]
Bases: BaseSummary
Summary statistics for a single candidate's off-target analysis.
- Parameters:
data (Any)
- mode: AnalysisMode
- model_config: ClassVar[ConfigDict] = {'frozen': False}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.off_target.MiRNASummary(**data)[source]
Bases: BaseSummary
Summary statistics for miRNA seed match analysis.
Note: total_hits represents validated, high-quality seed region matches. hits_per_species represents raw alignment counts (may include low-quality matches).
- Parameters:
data (Any)
- mirna_database: MiRNADatabase | str
- model_config: ClassVar[ConfigDict] = {'frozen': False}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.off_target.BaseAggregatedSummary(**data)[source]
Bases: BaseModel
Base class for aggregated analysis summaries with common fields.
- Parameters:
data (Any)
- model_config: ClassVar[ConfigDict] = {'frozen': False}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.off_target.AggregatedOffTargetSummary(**data)[source]
Bases: BaseAggregatedSummary
Summary of aggregated off-target results across multiple candidates and genomes.
- Parameters:
data (Any)
- model_config: ClassVar[ConfigDict] = {'frozen': False}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.off_target.AggregatedMiRNASummary(**data)[source]
Bases: BaseAggregatedSummary
Summary of aggregated miRNA results across multiple candidates.
- Parameters:
data (Any)
- mirna_database: MiRNADatabase | str
- model_config: ClassVar[ConfigDict] = {'frozen': False}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Transcript Annotation Models
Pydantic models for transcript annotation data structures.
- class sirnaforge.models.transcript_annotation.Interval(**data)[source]
Bases: BaseModel
Genomic interval with start, end, and optional strand information.
- Parameters:
data (Any)
- model_config: ClassVar[ConfigDict] = {'frozen': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.transcript_annotation.TranscriptAnnotation(**data)[source]
Bases: BaseModel
Comprehensive transcript annotation from genomic databases.
Contains transcript metadata, genomic coordinates, exon/CDS structure, and source provenance for reproducibility.
- Parameters:
data (Any)
- model_config: ClassVar[ConfigDict] = {'use_enum_values': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.models.transcript_annotation.TranscriptAnnotationBundle(**data)[source]
Bases: BaseModel
Collection of transcript annotations with resolution tracking.
Bundles multiple transcript annotations from a single query, tracks which IDs were successfully resolved, and maintains reference provenance.
- Parameters:
data (Any)
- transcripts: dict[str, TranscriptAnnotation]
- reference_choice: ReferenceChoice
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'use_enum_values': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Validation Schemasο
Pandera schemas for siRNAforge data validation.
This module defines pandera schemas for validating the structure and content of various table-like outputs from the siRNAforge pipeline.
Modern schemas using class-based approach with type annotations for improved type safety, error reporting, and maintainability.
Use schemas: MySchema.validate(df) - validation errors provide detailed feedback.
- class sirnaforge.models.schemas.SchemaConfig[source]ο
Bases:
objectCommon configuration settings for all pandera schemas.
Provides consistent validation behavior across all siRNAforge data schemas with type coercion, strict column checking, and flexible column ordering.
- coerce = Trueο
- strict = Trueο
- ordered = Falseο
- class sirnaforge.models.schemas.SiRNACandidateSchema(*args, **kwargs)[source]ο
Bases:
DataFrameModelValidation schema for siRNA candidate results (CSV output).
Ensures data integrity and biological validity of siRNA design results with comprehensive checks for sequence composition, thermodynamic parameters, and scoring metrics. Includes optimal value ranges for key metrics based on research-backed thermodynamic principles.
Expected columns include sequences, thermodynamic scores (asymmetry, MFE, duplex stability), off-target counts, and composite quality scores.
- class Config[source]ο
Bases:
SchemaConfigSchema configuration with improved error reporting.
- description = 'siRNA candidate validation schema'ο
- title = 'SiRNA Design Results'ο
- add_missing_columns = Trueο
- strict = Falseο
- name = 'SiRNACandidateSchema'ο
-
seed_7mer_hits:
Series[Int64Dtype] = 'seed_7mer_hits'ο
-
seed_8mer_hits:
Series[Int64Dtype] = 'seed_8mer_hits'ο
- classmethod check_passes_filters_values(df)[source]ο
Ensure passes_filters contains allowed filter status values.
- class sirnaforge.models.schemas.ORFValidationSchema(*args, **kwargs)[source]
Bases: DataFrameModel
Validation schema for open reading frame analysis results (tab-delimited output).
Validates ORF detection and characterization results with proper handling of nullable fields for cases where no valid ORF is found. Includes metrics for transcript composition, ORF boundaries, codon usage, and GC content within coding regions.
Used to validate outputs from ORF analysis tools and ensure data consistency for downstream siRNA target validation.
- class sirnaforge.models.schemas.OffTargetHitsSchema(*args, **kwargs)[source]
Bases: DataFrameModel
DEPRECATED: Use MiRNAAlignmentSchema or GenomeAlignmentSchema instead.
Legacy validation schema for off-target analysis results (TSV output). This schema is too generic and doesn't match the actual BWA output format.
Migration Guide:
- For miRNA seed analysis → Use MiRNAAlignmentSchema
- For genome/transcriptome → Use GenomeAlignmentSchema
Will be removed in v0.3.0.
- class sirnaforge.models.schemas.MiRNAAlignmentSchema(*args, **kwargs)[source]
Bases: DataFrameModel
Pandera schema for miRNA seed match alignment results (TSV/DataFrame).
Validates tabular data from BWA-MEM2 miRNA seed analysis. Each row represents one alignment between an siRNA candidate and a miRNA seed region.
Use this for:
- Reading *_mirna_analysis.tsv files
- Validating pandas DataFrames from miRNA analysis
- Bulk operations on miRNA alignment results
Corresponding Pydantic model: models.off_target.MiRNAHit (for single rows)
- class Config[source]
Bases: SchemaConfig
Schema configuration.
- description = 'miRNA seed match alignment results'
- title = 'miRNA Alignment DataFrame'
- strict = True
- coerce = True
- name = 'MiRNAAlignmentSchema'
- as_score: Series[Int64Dtype] = 'as_score'
- class sirnaforge.models.schemas.GenomeAlignmentSchema(*args, **kwargs)[source]
Bases: DataFrameModel
Pandera schema for genome/transcriptome off-target alignment results (TSV/DataFrame).
Validates tabular data from BWA-MEM2 genome/transcriptome analysis. Each row represents one potential off-target alignment in the genome.
Use this for:
- Reading *_analysis.tsv files from genome alignment
- Validating pandas DataFrames from transcriptome off-target analysis
- Bulk operations on genome alignment results
Corresponding Pydantic model: models.off_target.OffTargetHit (for single rows)
- class Config[source]
Bases: SchemaConfig
Schema configuration.
- description = 'Genome/transcriptome off-target alignment results'
- title = 'Genome Alignment DataFrame'
- strict = True
- coerce = True
- name = 'GenomeAlignmentSchema'
- as_score: Series[Int64Dtype] = 'as_score'
Data Access
Base Data Classes
Shared base classes and utilities for genomic data analysis.
- exception sirnaforge.data.base.DatabaseError(message, database=None)[source]
Bases: Exception
Base exception for database-related errors.
- exception sirnaforge.data.base.DatabaseAccessError(message, database=None)[source]
Bases: DatabaseError
Exception for network/access issues (firewall, timeout, server down).
- exception sirnaforge.data.base.GeneNotFoundError(query, database=None)[source]
Bases: DatabaseError
Exception raised when a gene is not found in the database.
- class sirnaforge.data.base.DatabaseType(*values)[source]
-
Supported genomic databases.
- ENSEMBL = 'ensembl'
- REFSEQ = 'refseq'
- GENCODE = 'gencode'
- class sirnaforge.data.base.SequenceType(*values)[source]
-
Types of sequence data that can be retrieved.
- CDNA = 'cdna'
- CDS = 'cds'
- PROTEIN = 'protein'
- GENOMIC = 'genomic'
- class sirnaforge.data.base.GeneInfo(**data)[source]
Bases: BaseModel
Gene information model.
- Parameters: data (Any)
- gene_id: str
- gene_name: str | None
- gene_type: str | None
- chromosome: str | None
- start: int | None
- end: int | None
- strand: int | None
- description: str | None
- database: DatabaseType
- model_config: ClassVar[ConfigDict] = {'use_enum_values': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.data.base.TranscriptInfo(**data)[source]
Bases: BaseModel
Transcript information model.
- Parameters: data (Any)
- transcript_id: str
- transcript_name: str | None
- transcript_type: str | None
- gene_id: str
- gene_name: str | None
- sequence: str | None
- length: int | None
- database: DatabaseType
- is_canonical: bool
- model_config: ClassVar[ConfigDict] = {'use_enum_values': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.data.base.AbstractDatabaseClient(timeout=30)[source]
Bases: ABC
Abstract base class for database clients.
- Parameters: timeout (int)
- abstractmethod async search_gene(query, include_sequence=True)[source]
Search for a gene and return gene info and transcripts.
- Parameters:
- Return type: tuple[GeneInfo | None, list[TranscriptInfo]]
- Returns: Tuple of (gene_info, transcripts)
- Raises:
DatabaseAccessError – For network/server access issues
GeneNotFoundError – When gene is not found in database
- abstractmethod async get_sequence(identifier, sequence_type=SequenceType.CDNA)[source]
Get sequence for a specific identifier.
- Parameters:
identifier (str) – Gene ID, transcript ID, etc.
sequence_type (SequenceType) – Type of sequence to retrieve
- Return type:
- Returns: Sequence string
- Raises:
DatabaseAccessError – For network/server access issues
GeneNotFoundError – When identifier is not found in database
- abstract property database_type: DatabaseType
Return the database type this client handles.
- class sirnaforge.data.base.AbstractTranscriptAnnotationClient(timeout=30)[source]
Bases: ABC
Abstract base class for transcript annotation clients.
Purpose and Scope: Provides genomic annotation metadata (exon/CDS structure, coordinates, biotype) WITHOUT fetching full transcript sequences. This complements, rather than overlaps with, AbstractDatabaseClient, which focuses on sequence retrieval.
Key Differences from GeneSearcher/AbstractDatabaseClient:
Focus: Structural annotations (exons, CDS intervals, genomic coordinates) vs. sequence data (cDNA, CDS, protein sequences)
Use Case: Enriching existing transcript metadata with genomic context vs. discovering and retrieving transcripts with sequences
Query Patterns:
- By stable IDs: fetch_by_ids(["ENST00000269305"])
- By genomic regions: fetch_by_regions(["17:7661779-7687550"])
vs. GeneSearcher, which queries by gene name/symbol
Caching Strategy: In-memory LRU cache with TTL for transient annotation data vs. ReferenceManager's persistent file cache for large sequence datasets
When to Use:
- Need exon/CDS boundaries for visualization or analysis
- Need genomic coordinates for variant mapping
- Need biotype information without full sequence download
- Need to query multiple transcripts in a genomic region
When to Use GeneSearcher Instead:
- Need transcript sequences for siRNA design
- Need to discover transcripts by gene name/symbol
- Need protein sequences or translations
- Parameters: timeout (int)
- __init__(timeout=30)[source]
Initialize transcript annotation client.
- Parameters: timeout (int) – Request timeout in seconds
- abstractmethod async fetch_by_ids(ids, *, species, reference)[source]
Fetch transcript annotations by stable IDs.
- Parameters:
- Return type:
- Returns: TranscriptAnnotationBundle containing resolved annotations
- Raises: DatabaseAccessError – For network/server access issues
- abstractmethod async fetch_by_regions(regions, *, species, reference)[source]
Fetch transcript annotations by genomic regions.
- Parameters:
- Return type:
- Returns: TranscriptAnnotationBundle containing all transcripts overlapping the regions
- Raises: DatabaseAccessError – For network/server access issues
- class sirnaforge.data.base.EnsemblClient(timeout=30, base_url='https://rest.ensembl.org')[source]
Bases: AbstractDatabaseClient
Client for Ensembl REST API interactions.
- property database_type: DatabaseType
Return the database type this client handles.
- async search_gene(query, include_sequence=True)[source]
Search for a gene and return gene info and transcripts.
- async get_sequence(identifier, sequence_type=SequenceType.CDNA, headers=None)[source]
Get sequence from Ensembl REST API.
- Parameters:
identifier (str) – Gene ID, transcript ID, etc.
sequence_type (SequenceType) – Type of sequence to retrieve
- Return type:
- Returns: Sequence string
- Raises:
DatabaseAccessError – For network/server access issues
GeneNotFoundError – When identifier is not found in database
- class sirnaforge.data.base.RefSeqClient(timeout=30, base_url='https://eutils.ncbi.nlm.nih.gov/entrez/eutils')[source]
Bases: AbstractDatabaseClient
Client for RefSeq database via NCBI E-utilities API.
- __init__(timeout=30, base_url='https://eutils.ncbi.nlm.nih.gov/entrez/eutils')[source]
Initialize RefSeq client.
- property database_type: DatabaseType
Return the database type this client handles.
- async search_gene(query, include_sequence=True)[source]
Search for a gene and return gene info and transcripts.
- async get_sequence(identifier, _sequence_type=SequenceType.CDNA)[source]
Get sequence for a specific identifier from NCBI.
- Parameters:
identifier (str)
_sequence_type (SequenceType)
- Return type:
- class sirnaforge.data.base.GencodeClient(timeout=30)[source]
Bases: AbstractDatabaseClient
Client for GENCODE database.
- Parameters: timeout (int)
- property database_type: DatabaseType
Return the database type this client handles.
- async search_gene(query, include_sequence=True)[source]
Search for a gene and return gene info and transcripts.
- async get_sequence(_identifier, _sequence_type=SequenceType.CDNA)[source]
Get sequence for a specific identifier from GENCODE.
- Parameters:
_identifier (str)
_sequence_type (SequenceType)
- Return type:
- class sirnaforge.data.base.SequenceUtils[source]
Bases: object
Utility functions for sequence analysis.
- class sirnaforge.data.base.FastaUtils[source]
Bases: object
Utility functions for FASTA file operations.
- static save_sequences_fasta(sequences, output_path, line_length=80)[source]
Save sequences to FASTA format.
- static write_dict_to_fasta(sequences, output_path)[source]
Write sequences dictionary to FASTA format.
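As a rough illustration of what a `line_length` parameter typically controls when writing FASTA, the sketch below wraps each sequence at a fixed width. The helper name `format_fasta` and the `(header, sequence)` input shape are assumptions for this example; the actual input type accepted by `save_sequences_fasta` is not documented here.

```python
def format_fasta(records, line_length=80):
    """Render (header, sequence) pairs as FASTA text, wrapping sequence lines.

    Hypothetical helper for illustration; not part of sirnaforge.
    """
    lines = []
    for header, sequence in records:
        lines.append(f">{header}")
        # Slice the sequence into fixed-width chunks, one per output line.
        for start in range(0, len(sequence), line_length):
            lines.append(sequence[start:start + line_length])
    return "\n".join(lines) + "\n"


print(format_fasta([("tx1", "ACGTACGTAC")], line_length=4))
```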
- sirnaforge.data.base.get_database_display_name(database)[source]
Get display name for database, handling both enum and string values.
- Parameters: database (DatabaseType)
- Return type:
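Because the models above set `use_enum_values`, a `database` field may hold either an enum member or its plain string value. A minimal sketch of a helper that tolerates both is shown below; the local `DatabaseType` stand-in, the `DISPLAY_NAMES` mapping, and the display strings themselves are assumptions, not the library's actual values.

```python
from enum import Enum


class DatabaseType(Enum):
    """Local stand-in mirroring the documented enum values."""
    ENSEMBL = "ensembl"
    REFSEQ = "refseq"
    GENCODE = "gencode"


# Hypothetical display strings, chosen for illustration only.
DISPLAY_NAMES = {"ensembl": "Ensembl", "refseq": "RefSeq", "gencode": "GENCODE"}


def display_name(database):
    """Accept an enum member or its string value and return a label."""
    value = database.value if isinstance(database, Enum) else str(database)
    # Fall back to the raw value when no label is registered.
    return DISPLAY_NAMES.get(value, value)


print(display_name(DatabaseType.REFSEQ), display_name("ensembl"))
```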
Gene Search
Gene search and sequence retrieval from multiple databases.
- class sirnaforge.data.gene_search.GeneSearchResult(**data)[source]
Bases: BaseModel
Complete gene search result.
- Parameters: data (Any)
- database: DatabaseType
- transcripts: list[TranscriptInfo]
- model_config: ClassVar[ConfigDict] = {'use_enum_values': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.data.gene_search.GeneSearcher(timeout=30, max_retries=3)[source]
Bases: object
Search genes and retrieve sequences from genomic databases using multiple clients.
- get_client(database)[source]
Get the client for a specific database.
- Parameters: database (DatabaseType)
- Return type:
- async search_gene_with_fallback(query, include_sequence=True)[source]
Search for a gene with automatic fallback to other databases.
Tries databases in order: Ensembl -> RefSeq -> GENCODE. Falls back to the next database only if access is blocked (not if the gene is not found).
- Parameters:
- Return type:
- Returns: GeneSearchResult from the first accessible database
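The fallback rule described for search_gene_with_fallback (move on only when access fails, not when the gene is missing) can be sketched as follows. The exception classes are local stand-ins that mirror the documented hierarchy, and `search_with_fallback` plus the stub coroutines are hypothetical; how the real method reports a missing gene is not stated here, so this sketch simply lets the error propagate.

```python
import asyncio


class DatabaseError(Exception):
    """Local stand-in for the documented base exception."""


class DatabaseAccessError(DatabaseError):
    """Local stand-in: network or access failure."""


class GeneNotFoundError(DatabaseError):
    """Local stand-in: the database answered, but has no such gene."""


async def search_with_fallback(query, clients):
    """Try each (name, coroutine function) in order; skip only on access errors."""
    last_error = None
    for name, search in clients:
        try:
            return name, await search(query)
        except DatabaseAccessError as exc:
            last_error = exc  # database unreachable: try the next one
        # GeneNotFoundError is deliberately not caught, so it propagates.
    raise DatabaseAccessError("all databases were inaccessible") from last_error


async def blocked(query):  # stub: simulates a firewall or timeout
    raise DatabaseAccessError("blocked")


async def found(query):  # stub: simulates a successful lookup
    return {"gene": query}


print(asyncio.run(search_with_fallback("TP53", [("ensembl", blocked), ("refseq", found)])))
```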
- async search_gene(query, database=None, include_sequence=True)[source]
Search for a gene and retrieve its isoforms.
- Parameters:
query (str) – Gene ID, gene name, or transcript ID
database (DatabaseType | None) – Database to search (defaults to Ensembl)
include_sequence (bool) – Whether to fetch transcript sequences
- Return type:
- Returns: GeneSearchResult with gene info and transcripts
- async search_multiple_databases(query, databases=None, include_sequence=True)[source]
Search across multiple databases.
- Parameters:
query (str) – Gene ID, gene name, or transcript ID
databases (list[DatabaseType] | None) – List of databases to search
include_sequence (bool) – Whether to fetch sequences
- Return type:
- Returns: List of search results from each database
- sirnaforge.data.gene_search.search_gene_sync(query, database=DatabaseType.ENSEMBL, include_sequence=True)[source]
Synchronous wrapper for gene search.
- Parameters:
query (str)
database (DatabaseType)
include_sequence (bool)
- Return type:
- sirnaforge.data.gene_search.search_gene_with_fallback_sync(query, include_sequence=True)[source]
Synchronous wrapper for gene search with fallback.
- Parameters:
- Return type:
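A synchronous wrapper around a coroutine is commonly built on `asyncio.run`. The sketch below shows that general pattern with a stub coroutine; whether the wrappers above are implemented this way is an assumption, and both function names are hypothetical.

```python
import asyncio


async def search_gene_stub(query):
    """Stand-in coroutine; a real client would perform network I/O here."""
    await asyncio.sleep(0)
    return {"query": query}


def search_gene_sync_sketch(query):
    """Run the coroutine to completion from ordinary synchronous code."""
    # asyncio.run creates a fresh event loop and closes it afterwards; it
    # raises RuntimeError if called while another loop is already running.
    return asyncio.run(search_gene_stub(query))


print(search_gene_sync_sketch("TP53"))
```

A wrapper of this kind suits scripts and CLI entry points; inside an already running event loop (for example a notebook), awaiting the async method directly is the usual alternative.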
ORF Analysis
ORF analysis and sequence validation for transcript sequences.
- class sirnaforge.data.orf_analysis.ORFInfo(**data)[source]
Bases: BaseModel
Information about an Open Reading Frame.
- Parameters: data (Any)
- model_config: ClassVar[ConfigDict] = {'frozen': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.data.orf_analysis.SequenceAnalysis(**data)[source]
Bases: BaseModel
Complete sequence analysis including ORF information.
- Parameters: data (Any)
- sequence_type: SequenceType
- model_config: ClassVar[ConfigDict] = {'use_enum_values': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class sirnaforge.data.orf_analysis.ORFAnalyzer(database_client=None)[source]
Bases: object
Analyze ORFs in transcript sequences and validate sequence types.
- Parameters: database_client (AbstractDatabaseClient | None)
- __init__(database_client=None)[source]
Initialize ORF analyzer.
- Parameters: database_client (AbstractDatabaseClient | None) – Optional database client for retrieving additional sequence types
- async get_additional_sequence(transcript_id, sequence_type)[source]
Retrieve specific sequence type using the database client if available.
- Parameters:
transcript_id (str) – Transcript identifier
sequence_type (SequenceType) – Type of sequence to retrieve
- Return type:
- Returns: Sequence string or None if not available or client not provided
- async analyze_transcript(transcript)[source]
Perform complete ORF analysis of a transcript.
- Parameters: transcript (TranscriptInfo)
- Return type:
- async analyze_transcripts(transcripts)[source]
Analyze multiple transcripts.
- Parameters: transcripts (list[TranscriptInfo])
- Return type:
- sirnaforge.data.orf_analysis.create_orf_analyzer(database_client=None)[source]
Create an ORF analyzer with optional database client.
- Parameters: database_client (AbstractDatabaseClient | None) – Optional database client for retrieving additional sequence types
- Return type:
- Returns: ORFAnalyzer instance
- async sirnaforge.data.orf_analysis.analyze_multiple_transcript_orfs(transcripts, database_client=None)[source]
Analyze ORFs in multiple transcripts.
- Parameters:
transcripts (list[TranscriptInfo]) – List of transcripts to analyze
database_client (AbstractDatabaseClient | None) – Optional database client for additional sequence retrieval
- Return type:
- Returns: Dictionary mapping transcript IDs to SequenceAnalysis results
Transcript Annotations
Transcript annotation providers using Ensembl REST and optional VEP enrichment.
This module provides clients for fetching genomic transcript annotations (exon/CDS structure, coordinates, biotype) separate from sequence retrieval.
Architecture Overview:
EnsemblTranscriptModelClient: Primary implementation using Ensembl REST API
VepConsequenceClient: Optional enrichment client (placeholder for future development)
Caching Strategy:
Uses in-memory LRU cache with TTL rather than ReferenceManager's persistent file cache. This design choice is intentional because:
Data Size: Annotation JSON responses are small (KB) vs. sequence files (GB)
Volatility: Annotations may update with new releases; TTL provides freshness
Access Pattern: High frequency, low latency requirements during workflow execution
Scope: Transient metadata enrichment vs. permanent reference datasets
The cache automatically evicts oldest entries when reaching max_cache_entries, and entries expire after cache_ttl seconds.
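The caching behavior described above (bounded size with LRU eviction, plus per-entry expiry) can be sketched with an `OrderedDict`. This is a minimal illustration under stated assumptions, not the module's implementation: the class name `TTLCache`, its method names, and the injectable `clock` parameter are all hypothetical.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Tiny in-memory LRU cache with a time-to-live per entry (illustrative)."""

    def __init__(self, max_entries=1000, ttl=3600.0, clock=time.monotonic):
        self._data = OrderedDict()  # key -> (stored_at, value), oldest first
        self._max_entries = max_entries
        self._ttl = ttl
        self._clock = clock

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self._clock() - stored_at > self._ttl:
            del self._data[key]  # expired: drop and report a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._data[key] = (self._clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_entries:
            self._data.popitem(last=False)  # evict least recently used


cache = TTLCache(max_entries=2, ttl=3600)
cache.set("id:human:ENST00000269305:GRCh38", {"biotype": "protein_coding"})
print(cache.get("id:human:ENST00000269305:GRCh38"))
```

Passing the clock in as a parameter keeps expiry testable without sleeping; the key shown follows the documented "id:{species}:{identifier}:{reference}" format with a placeholder payload.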
Relationship to GeneSearcher:
GeneSearcher: Discovers transcripts by gene name, fetches cDNA/protein sequences
This module: Enriches known transcript IDs with genomic structural metadata
Both can use Ensembl, but query different API endpoints for different purposes
No redundancy: complementary data types that don't overlap
- class sirnaforge.data.transcript_annotation.EnsemblTranscriptModelClient(timeout=30, base_url='https://rest.ensembl.org', cache_ttl=3600, max_cache_entries=1000)[source]
Bases: AbstractTranscriptAnnotationClient
Ensembl REST-based transcript annotation client.
Retrieves transcript metadata including genomic coordinates, exon/CDS structure, and biotype information using Ensembl's public REST API.
API Endpoints Used:
Lookup by ID (/lookup/id/:id?expand=1):
- Fetches detailed annotation for a single transcript/gene ID
- Returns exon coordinates, CDS intervals, biotype
- Example: /lookup/id/ENST00000269305?expand=1
Overlap by Region (/overlap/region/:species/:region):
- Fetches all transcripts overlapping a genomic region
- Useful for region-based queries
- Example: /overlap/region/human/17:7661779-7687550?feature=transcript
Caching Implementation:
Cache key format: "id:{species}:{identifier}:{reference}" or "region:{species}:{region}:{reference}"
TTL: Configurable, default 1 hour (3600 seconds)
Eviction: LRU when max_cache_entries reached (default 1000)
Thread safety: Intended for single-process use only (workflow orchestration context)
Error Handling:
404: ID not found → added to unresolved list, no exception raised
403/503: Server unavailable → DatabaseAccessError raised
Network errors: Wrapped in DatabaseAccessError with context
Timeout: Configurable via timeout parameter
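The status-code rules listed under Error Handling can be sketched as a small dispatch over already-received responses. `collect_lookups` and its `(identifier, status, payload)` input are hypothetical, and raising on any other non-200 status is an assumption made for this sketch; the documentation above only specifies 404 and 403/503.

```python
class DatabaseAccessError(Exception):
    """Local stand-in for the documented exception."""


def collect_lookups(responses):
    """Split (identifier, status_code, payload) triples into resolved/unresolved."""
    resolved, unresolved = {}, []
    for identifier, status, payload in responses:
        if status == 200:
            resolved[identifier] = payload
        elif status == 404:
            unresolved.append(identifier)  # unknown ID: record it, do not raise
        elif status in (403, 503):
            raise DatabaseAccessError(f"server unavailable (HTTP {status})")
        else:
            # Assumption for this sketch: treat other statuses as access errors.
            raise DatabaseAccessError(f"unexpected HTTP status {status}")
    return resolved, unresolved


print(collect_lookups([("ENST00000269305", 200, {"biotype": "protein_coding"}),
                       ("ENST_UNKNOWN", 404, None)]))
```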
Example Usage:
>>> client = EnsemblTranscriptModelClient()
>>> reference = ReferenceChoice.explicit("GRCh38", reason="user-specified")
>>> bundle = await client.fetch_by_ids(
...     ids=["ENST00000269305"],
...     species="human",
...     reference=reference
... )
>>> print(f"Resolved: {bundle.resolved_count}, Unresolved: {bundle.unresolved_count}")
- __init__(timeout=30, base_url='https://rest.ensembl.org', cache_ttl=3600, max_cache_entries=1000)[source]
Initialize Ensembl transcript annotation client.
- async fetch_by_ids(ids, *, species, reference)[source]
Fetch transcript annotations by stable IDs using Ensembl lookup endpoint.
- Parameters:
- Return type:
- Returns: TranscriptAnnotationBundle with resolved annotations
- class sirnaforge.data.transcript_annotation.VepConsequenceClient(timeout=30, base_url='https://rest.ensembl.org')[source]
Bases: object
Optional VEP (Variant Effect Predictor) consequence enrichment client.
Provides additional functional annotation for transcript variants. This is an optional enhancement and not required for base functionality.
Current Status: PLACEHOLDER
This client exists as a stub for future VEP integration. The enrich_annotations method currently returns the input bundle unchanged.
Future Implementation:
When activated (via config flag), this client will:
Query Ensembl VEP REST API for consequence predictions
Enrich TranscriptAnnotation objects with variant consequence types (missense, nonsense, etc.), conservation scores, regulatory feature overlaps, and population frequency data
Maintain consistent caching strategy with EnsemblTranscriptModelClient
Design Rationale:
Separated from EnsemblTranscriptModelClient because:
VEP queries are expensive (rate-limited, slower)
Not all workflows need consequence predictions
Allows independent caching strategies
Can be enabled/disabled via configuration
- async enrich_annotations(bundle, _species='homo_sapiens')[source]
Enrich transcript annotations with VEP consequence data.
- Parameters:
bundle (TranscriptAnnotationBundle) – Existing transcript annotation bundle
species – Species name for VEP queries
_species (str)
- Return type:
- Returns: Enriched bundle (currently returns input unchanged; placeholder for future VEP integration)
miRNA Management
miRNA Database Manager with multi-species support.
This module provides a clean interface for downloading, caching, and managing miRNA databases from multiple sources (MirGeneDB, miRBase, TargetScan) with automatic cache management and species-specific organization.
- class sirnaforge.data.mirna_manager.MiRNASource(name, url, species, format='fasta', compressed=False, description='')[source]
Bases: ReferenceSource
miRNA-specific database source configuration.
Inherits from ReferenceSource and can add miRNA-specific fields if needed.
- class sirnaforge.data.mirna_manager.MiRNADatabaseManager(cache_dir=None, cache_ttl_days=30)[source]
Bases: ReferenceManager[MiRNASource]
Elegant miRNA database manager with caching and multi-species support.
- SOURCES = {'mirbase': {'human': MiRNASource(name='mirbase_mature', url='https://www.mirbase.org/download/CURRENT/mature.fa', species='human', format='fasta', compressed=False, description='miRBase mature miRNA sequences (all species, filtered for Homo sapiens - hsa)'), 'mouse': MiRNASource(name='mirbase_mature', url='https://www.mirbase.org/download/CURRENT/mature.fa', species='mouse', format='fasta', compressed=False, description='miRBase mature miRNA sequences (all species, filtered for Mus musculus - mmu)'), 'rat': MiRNASource(name='mirbase_mature', url='https://www.mirbase.org/download/CURRENT/mature.fa', species='rat', format='fasta', compressed=False, description='miRBase mature miRNA sequences (all species, filtered for Rattus norvegicus - rno)')}, 'mirbase_hairpin': {'human': MiRNASource(name='mirbase_hairpin', url='https://www.mirbase.org/download/CURRENT/hairpin.fa', species='human', format='fasta', compressed=False, description='miRBase hairpin precursor miRNA sequences (Homo sapiens - hsa)'), 'mouse': MiRNASource(name='mirbase_hairpin', url='https://www.mirbase.org/download/CURRENT/hairpin.fa', species='mouse', format='fasta', compressed=False, description='miRBase hairpin precursor miRNA sequences (Mus musculus - mmu)'), 'rat': MiRNASource(name='mirbase_hairpin', url='https://www.mirbase.org/download/CURRENT/hairpin.fa', species='rat', format='fasta', compressed=False, description='miRBase hairpin precursor miRNA sequences (Rattus norvegicus - rno)')}, 'mirbase_high_conf': {'human': MiRNASource(name='mirbase_mature_hc', url='https://www.mirbase.org/download/CURRENT/mature_high_conf.fa', species='human', format='fasta', compressed=False, description='miRBase high-confidence mature miRNA sequences (Homo sapiens - hsa)'), 'mouse': MiRNASource(name='mirbase_mature_hc', url='https://www.mirbase.org/download/CURRENT/mature_high_conf.fa', species='mouse', format='fasta', compressed=False, description='miRBase high-confidence mature miRNA sequences (Mus 
musculus - mmu)'), 'rat': MiRNASource(name='mirbase_mature_hc', url='https://www.mirbase.org/download/CURRENT/mature_high_conf.fa', species='rat', format='fasta', compressed=False, description='miRBase high-confidence mature miRNA sequences (Rattus norvegicus - rno)')}, 'mirgenedb': {'aga': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/aga?mat=1', species='aga', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Anopheles gambiae, NCBI:7165) [Mosquito]'), 'bta': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/bta?mat=1', species='bta', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Bos taurus, NCBI:9913) [Cow]'), 'cel': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/cel?mat=1', species='cel', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Caenorhabditis elegans, NCBI:6239) [C. elegans]'), 'cfa': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/cfa?mat=1', species='cfa', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Canis lupus familiaris, NCBI:9615) [Dog]'), 'dme': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/dme?mat=1', species='dme', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Drosophila melanogaster, NCBI:7227) [Fruit fly]'), 'dre': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/dre?mat=1', species='dre', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Danio rerio, NCBI:7955) [Zebrafish]'), 'eca': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/eca?mat=1', species='eca', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Equus caballus, NCBI:9796) [Horse]'), 'fca': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/fca?mat=1', species='fca', format='fasta', compressed=False, 
description='MirGeneDB high-confidence miRNAs (Felis catus, NCBI:9685)'), 'gac': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/gac?mat=1', species='gac', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Gasterosteus aculeatus, NCBI:69293) [Stickleback]'), 'gga': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/gga?mat=1', species='gga', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Gallus gallus, NCBI:9031) [Chicken]'), 'ggo': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/ggo?mat=1', species='ggo', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Gorilla gorilla, NCBI:9593)'), 'hsa': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/hsa?mat=1', species='hsa', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Homo sapiens, NCBI:9606) [Human]'), 'mml': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/mml?mat=1', species='mml', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Macaca mulatta, NCBI:9544) [Rhesus macaque]'), 'mmu': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/mmu?mat=1', species='mmu', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Mus musculus, NCBI:10090) [Mouse]'), 'oar': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/oar?mat=1', species='oar', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Ovis aries, NCBI:9940) [Sheep]'), 'ola': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/ola?mat=1', species='ola', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Oryzias latipes, NCBI:8090) [Medaka]'), 'pma': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/pma?mat=1', species='pma', format='fasta', compressed=False, 
description='MirGeneDB high-confidence miRNAs (Petromyzon marinus, NCBI:7757) [Sea lamprey]'), 'ptr': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/ptr?mat=1', species='ptr', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Pan troglodytes, NCBI:9598) [Chimpanzee]'), 'rno': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/rno?mat=1', species='rno', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Rattus norvegicus, NCBI:10116)'), 'spur': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/spur?mat=1', species='spur', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Strongylocentrotus purpuratus, NCBI:7668) [Purple sea urchin]'), 'ssc': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/ssc?mat=1', species='ssc', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Sus scrofa, NCBI:9823) [Pig]'), 'tgu': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/tgu?mat=1', species='tgu', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Meleagris gallopavo, NCBI:9103) [Turkey]'), 'xla': MiRNASource(name='mirgenedb', url='https://www.mirgenedb.org/fasta/xla?mat=1', species='xla', format='fasta', compressed=False, description='MirGeneDB high-confidence miRNAs (Xenopus laevis, NCBI:8355) [African clawed frog]')}, 'targetscan': {'human': MiRNASource(name='targetscan', url='https://www.targetscan.org/vert_80/vert_80_data_download/miR_Family_Info.txt.zip', species='human', format='tsv', compressed=True, description='TargetScan miRNA family data')}}ο
- classmethod get_species_for_source(source_name)[source]
Return sorted list of species supported by a given source.
- classmethod get_species_aliases(source_name)[source]
Return mapping of canonical species identifiers to their known aliases.
- classmethod canonicalize_species_name(species)[source]
Normalize a raw species identifier to a canonical key.
- classmethod canonicalize_species_list(species_list)[source]
Normalize a list of species identifiers to canonical keys, preserving order.
- classmethod get_genome_species_for_canonical(canonical_species)[source]
Return genome species identifiers for canonical species keys.
Note: These are used for miRNA database lookups, not genomic DNA alignment. The term "genome" here refers to the organism's miRNA annotation set.
- classmethod get_mirna_slugs_for_canonical(canonical_species, source_name)[source]
Return normalized miRNA identifiers for canonical species.
- classmethod get_supported_canonical_species_for_source(source_name)[source]
Return canonical species supported by a given source.
- classmethod resolve_species_selection(requested_species, source_name, mirna_overrides=None)[source]
Resolve canonical, genome, and miRNA identifiers for the requested species.
- classmethod normalize_species(source_name, species)[source]
Normalize user-provided species identifiers to canonical keys.
- classmethod get_source_configuration(source_name, species)[source]
Retrieve the MiRNASource configuration for a given source/species.
- Parameters:
- Return type:
- classmethod is_supported_species(source_name, species)[source]
Check if a species is supported for the given source.
- classmethod get_mirgenedb_species_metadata()[source]
Expose the MirGeneDB species metadata table.
- get_database(source_name, species, force_refresh=False)[source]
Get miRNA database, downloading and filtering if needed.
Simplified caching: each species+source combination gets its own cache file.
- Parameters:
- Return type:
- Returns: Path to cached FASTA file, or None if failed
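The "one cache file per species+source" idea, combined with the manager's `cache_ttl_days` setting, can be sketched as below. The file-naming scheme, the helper names `cache_file_for` and `is_fresh`, and the use of the file modification time as the freshness signal are all assumptions for illustration; the manager's real layout and expiry check may differ.

```python
import time
from pathlib import Path


def cache_file_for(cache_dir, source_name, species):
    """Hypothetical naming scheme: one FASTA file per source/species pair."""
    return Path(cache_dir) / f"{source_name}_{species}.fa"


def is_fresh(path, ttl_days=30, now=None):
    """Return True when the file exists and is younger than ttl_days."""
    if not path.exists():
        return False
    now = time.time() if now is None else now
    age_days = (now - path.stat().st_mtime) / 86400  # seconds per day
    return age_days <= ttl_days


path = cache_file_for("/tmp/sirnaforge-cache-example", "mirgenedb", "hsa")
print(path.name, is_fresh(path))
```

Under this scheme a caller would re-download when `is_fresh` returns False or when a refresh is forced, and otherwise reuse the cached path.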
- get_combined_database(sources, species, output_name=None)[source]
Combine multiple databases into a single file.
Species Registry
Canonical species registry and metadata for miRNA and genome mappings.
- sirnaforge.data.species_registry.normalize_species_name(species)[source]
Normalize species name to canonical form.
- Parameters: species (str) – Species name in any recognized form (e.g., "hsa", "human", "Homo sapiens")
- Return type:
- Returns: Canonical species name (e.g., "human"), or original string if not recognized
Examples
>>> normalize_species_name('hsa')
'human'
>>> normalize_species_name('Mus musculus')
'mouse'
>>> normalize_species_name('macaque')
'macaque'
>>> normalize_species_name('unknown')
'unknown'
Pipeline Integration
Nextflow CLI
Command-line entry points used by embedded Nextflow modules.
- sirnaforge.pipeline.nextflow_cli.build_bwa_index_cli(fasta_file, species, output_dir='.')
Build BWA-MEM2 index for genome/transcriptome.
- sirnaforge.pipeline.nextflow_cli.aggregate_results_cli(genome_species, output_dir='.', mirna_db=None, mirna_species=None)
Aggregate off-target analysis results from multiple candidates and genomes.
- Parameters:
- Return type:
- Returns:
Dictionary with aggregation statistics
Nextflow Configuration
Nextflow Configuration Management.
This module handles configuration for Nextflow workflows, including Docker settings, resource management, and parameter validation.
Simple Usage Examples:
# Auto-configure based on environment (easiest)
config = NextflowConfig.auto_configure()
# Production settings
config = NextflowConfig.for_production()
# Testing settings
config = NextflowConfig.for_testing()
# Local Docker testing (uses local image built by 'make docker')
config = NextflowConfig.for_local_docker_testing()
- class sirnaforge.pipeline.nextflow.config.EnvironmentInfo(**data)
Bases: BaseModel
Information about the current execution environment.
- Parameters:
data (Any)
- is_profile_overridden()
Check if the recommended profile differs from the requested profile.
- Return type:
- get_execution_summary()
Get a human-readable summary of the execution environment.
- Return type:
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- class sirnaforge.pipeline.nextflow.config.NextflowConfig(docker_image='ghcr.io/austin-s-h/sirnaforge:latest', profile='docker', work_dir=None, nxf_home=None, max_cpus=16, max_memory='128.GB', max_time='240.h', **kwargs)
Bases: object
Configuration manager for Nextflow workflows.
- Parameters:
- DEFAULT_SIRNAFORGE_DOCKER_IMAGE = 'ghcr.io/austin-s-h/sirnaforge:latest'
- MEMORY_BUFFER_GB = 0.5
- MIN_MEMORY_GB = 1
- __init__(docker_image='ghcr.io/austin-s-h/sirnaforge:latest', profile='docker', work_dir=None, nxf_home=None, max_cpus=16, max_memory='128.GB', max_time='240.h', **kwargs)
Initialize Nextflow configuration.
- Parameters:
docker_image (str) – Docker container image to use
profile (str) – Nextflow profile (docker, singularity, conda, local)
work_dir (Path | None) – Working directory for Nextflow execution
max_cpus (int) – Maximum CPU cores
max_memory (str) – Maximum memory allocation
max_time (str) – Maximum execution time
**kwargs (Any) – Additional configuration parameters
- get_nextflow_args(input_file, output_dir, genome_species, additional_params=None, include_test_profile=False)
Generate Nextflow command arguments.
- Parameters:
input_file (Path) – Input FASTA file path
output_dir (Path) – Output directory
genome_species (list[str]) – List of species for miRNA genome lookups (not genomic DNA)
additional_params (dict[str, Any] | None) – Additional parameters to pass
include_test_profile (bool) – Whether to include 'test' profile for integration testing
- Return type:
- Returns:
List of command arguments for Nextflow
- validate_docker_available()
Check if Docker is available for Nextflow execution.
This checks if Docker can be used by Nextflow to run containers. Note: This is different from running tests inside Docker containers.
- Return type:
- Returns:
True if Docker is available and accessible for Nextflow
- is_running_in_docker()
Check if we're currently running inside a Docker container.
This is useful for determining the appropriate execution profile when running tests or workflows.
- Return type:
- Returns:
True if running inside a Docker container
- get_execution_profile()
Get the appropriate execution profile based on available tools and environment.
This method considers:
1. Environment variables (SIRNAFORGE_USE_LOCAL_EXECUTION)
2. Whether we're running inside a Docker container (for testing)
3. Whether Docker is available for Nextflow execution
4. Availability of Singularity or Conda as fallbacks
5. The requested profile
- Return type:
- Returns:
Recommended execution profile
- get_environment_info()
Get information about the current execution environment.
This provides structured information about Docker availability, profile selection, and environment detection.
- Return type:
- Returns:
EnvironmentInfo model with environment details
- classmethod for_testing()
Create a configuration optimized for testing.
This automatically detects if we're running in Docker and adjusts accordingly. Uses uv/conda for environment management when available.
- Return type:
- Returns:
NextflowConfig instance with test-friendly settings
- classmethod for_production(**kwargs)
Create a configuration optimized for production use.
This uses Docker by default for reproducible execution with full resources.
- Parameters:
**kwargs (Any) – Additional configuration parameters to override defaults
- Return type:
- Returns:
NextflowConfig instance with production settings
- classmethod auto_configure(**kwargs)
Auto-configure Nextflow settings based on environment detection.
This method automatically detects available tools and selects the best profile.
- Parameters:
**kwargs (Any) – Additional configuration parameters to override defaults
- Return type:
- Returns:
NextflowConfig instance with auto-detected settings
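The factors that get_execution_profile() is documented to consider can be sketched as a small decision function. The precedence order, the accepted values of SIRNAFORGE_USE_LOCAL_EXECUTION, and the helper name pick_profile are all assumptions made for illustration; the reference does not spell out how the real method weighs them.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def pick_profile(
    requested: str = "docker",
    env: Optional[Mapping[str, str]] = None,
    in_container: bool = False,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Pick an execution profile; the precedence below is assumed, not confirmed."""
    env = os.environ if env is None else env
    # 1. An explicit environment override wins (accepted values are a guess).
    if env.get("SIRNAFORGE_USE_LOCAL_EXECUTION", "").lower() in {"1", "true", "yes"}:
        return "local"
    # 2. Inside a container, run tools directly instead of nesting Docker.
    if in_container:
        return "local"
    # 3. Honor the requested profile when its tool is on PATH.
    tool_for_profile = {"docker": "docker", "singularity": "singularity", "conda": "conda"}
    tool = tool_for_profile.get(requested)
    if tool is None or which(tool):
        return requested
    # 4. Otherwise fall back to the first alternative that is installed.
    for profile, candidate in tool_for_profile.items():
        if which(candidate):
            return profile
    return "local"
```

Passing `env` and `which` as arguments keeps the decision testable without touching the real environment or PATH.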
Nextflow Runner
Nextflow Pipeline Runner.
This module provides a Python interface to execute Nextflow workflows for siRNA off-target analysis with proper Docker integration.
- class sirnaforge.pipeline.nextflow.runner.NextflowRunner(config=None)
Bases: object
Execute Nextflow workflows from Python with proper error handling.
- Parameters:
config (NextflowConfig | None)
- __init__(config=None)
Initialize Nextflow runner.
- Parameters:
config (NextflowConfig | None) – NextflowConfig instance, creates auto-configured if None
- async run(input_file, output_dir, genome_species=None, **kwargs)
Simple method to run Nextflow workflow with auto-validation and defaults.
- Parameters:
input_file (Path) – Path to input FASTA file
output_dir (Path) – Output directory for results
genome_species (list[str] | None) – List of species for miRNA genome lookups (defaults to ['human', 'rat', 'rhesus'])
**kwargs (Any) – Additional parameters passed to run_offtarget_analysis
- Return type:
- Returns:
Dictionary containing execution results and metadata
- Raises:
NextflowExecutionError – If workflow execution fails
- run_sync(input_file, output_dir, genome_species=None, **kwargs)
Synchronous version of run() for simpler usage without async/await.
- Parameters:
input_file (Path) – Path to input FASTA file
output_dir (Path) – Output directory for results
genome_species (list[str] | None) – List of species for miRNA genome lookups (defaults to ['human', 'rat', 'rhesus'])
**kwargs (Any) – Additional parameters passed to run_offtarget_analysis
- Return type:
- Returns:
Dictionary containing execution results and metadata
- async run_offtarget_analysis(input_file, output_dir, genome_species, additional_params=None, show_progress=True)
Run the off-target analysis Nextflow workflow.
- Parameters:
input_file (Path) – Path to siRNA candidates FASTA file
output_dir (Path) – Output directory for results
genome_species (list[str]) – List of species for miRNA genome lookups
additional_params (dict[str, Any] | None) – Additional parameters for the workflow
show_progress (bool) – Whether to show progress indicators
- Return type:
- Returns:
Dictionary containing execution results and metadata
- Raises:
NextflowExecutionError – If workflow execution fails
- classmethod for_testing()
Create a runner configured for testing.
This uses the test configuration which automatically detects if we're running in Docker and adjusts accordingly.
- Return type:
- Returns:
NextflowRunner configured for testing
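The relationship between run() and run_sync() follows a common pattern: the synchronous method drives the coroutine to completion on a fresh event loop. The sketch below uses a stand-in class (RunnerSketch is hypothetical and does not launch Nextflow); whether the real run_sync() uses asyncio.run is an assumption.

```python
import asyncio
from pathlib import Path
from typing import Any


class RunnerSketch:
    """Stand-in for a runner with an async entry point and a sync wrapper."""

    async def run(self, input_file: Path, output_dir: Path, **kwargs: Any) -> dict[str, Any]:
        await asyncio.sleep(0)  # placeholder for launching and awaiting the workflow
        return {
            "status": "completed",
            "input": str(input_file),
            "output_dir": str(output_dir),
        }

    def run_sync(self, input_file: Path, output_dir: Path, **kwargs: Any) -> dict[str, Any]:
        # asyncio.run creates a new event loop and raises RuntimeError
        # when called from code that is already running inside one.
        return asyncio.run(self.run(input_file, output_dir, **kwargs))


result = RunnerSketch().run_sync(Path("candidates.fasta"), Path("results"))
```

Under this pattern, code already inside an event loop (a notebook, an async web handler) should await run() directly rather than call the synchronous wrapper.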
Workflow Orchestration
siRNAforge Workflow Orchestrator.
Coordinates the complete siRNA design pipeline:
1. Transcript retrieval and validation
2. ORF validation and reporting
3. siRNA candidate generation and scoring
4. Top-N candidate selection and reporting
5. Off-target analysis with Nextflow pipeline
- class sirnaforge.workflow.WorkflowConfig(output_dir, gene_query, input_fasta=None, database=DatabaseType.ENSEMBL, design_params=None, nextflow_config=None, genome_indices_override=None, genome_species=None, mirna_database='mirgenedb', mirna_species=None, transcriptome_fasta=None, transcriptome_filter=None, transcriptome_selection=None, validation_config=None, log_file=None, write_json_summary=True, num_threads=None, input_source=None, keep_nextflow_work=False, variant_config=None)
Bases: object
Configuration for the complete siRNA design workflow.
- Parameters:
output_dir (Path)
gene_query (str)
database (DatabaseType)
design_params (DesignParameters | None)
mirna_database (str)
transcriptome_selection (ReferenceSelection | None)
validation_config (ValidationConfig | None)
write_json_summary (bool)
input_source (InputSource | None)
keep_nextflow_work (bool)
variant_config (VariantWorkflowConfig | None)
- __init__(output_dir, gene_query, input_fasta=None, database=DatabaseType.ENSEMBL, design_params=None, nextflow_config=None, genome_indices_override=None, genome_species=None, mirna_database='mirgenedb', mirna_species=None, transcriptome_fasta=None, transcriptome_filter=None, transcriptome_selection=None, validation_config=None, log_file=None, write_json_summary=True, num_threads=None, input_source=None, keep_nextflow_work=False, variant_config=None)
Initialize workflow configuration.
- Parameters:
output_dir (Path)
gene_query (str)
database (DatabaseType)
design_params (DesignParameters | None)
mirna_database (str)
transcriptome_selection (ReferenceSelection | None)
validation_config (ValidationConfig | None)
write_json_summary (bool)
input_source (InputSource | None)
keep_nextflow_work (bool)
variant_config (VariantWorkflowConfig | None)
- class sirnaforge.workflow.SiRNAWorkflow(config)
Bases: object
Main workflow orchestrator for siRNA design pipeline.
- Parameters:
config (WorkflowConfig)
- __init__(config)
Initialize the siRNA workflow orchestrator.
- Parameters:
config (WorkflowConfig)
- async step1_retrieve_transcripts(progress)
Step 1: Retrieve and validate transcript sequences.
- Parameters:
progress (Progress)
- Return type:
- async resolve_variants_step(progress)
Resolve variants for targeting or avoidance (optional workflow step).
This step runs after transcript retrieval and before ORF validation and siRNA design, resolving and filtering variants based on the workflow configuration. Variants are resolved using ClinVar, Ensembl Variation, and/or VCF files.
- Parameters:
progress (Progress) – Rich progress tracker
- Return type:
list[VariantRecord]
- Returns:
List of resolved VariantRecords that passed filters
- async step2_validate_orfs(transcripts, progress)
Step 2: Validate ORFs and generate validation report.
- Parameters:
transcripts (list[TranscriptInfo])
progress (Progress)
- Return type:
- async step3_design_sirnas(transcripts, progress)
Step 3: Design siRNA candidates for valid transcripts.
Per-transcript design is parallelized unless the run uses a user-provided input FASTA; that path is left unparallelized to preserve backward compatibility with tests and monkeypatching of design_from_file. Set the environment variable SIRNAFORGE_PARALLEL_DESIGN=1 to force parallel mode.
- Parameters:
transcripts (list[TranscriptInfo])
progress (Progress)
- Return type:
- async step4_generate_reports(design_results)
Step 4: Generate comprehensive reports.
- Parameters:
design_results (DesignResult)
- Return type:
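The gating described for step3_design_sirnas (sequential for input-FASTA runs unless SIRNAFORGE_PARALLEL_DESIGN=1 is set) can be sketched as follows. The function design_all, its arguments, and the use of a thread pool are assumptions for illustration; the reference does not say which executor the workflow uses.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def design_all(
    transcripts: Sequence[T],
    design_one: Callable[[T], R],
    from_input_fasta: bool,
    num_threads: Optional[int] = None,
) -> list[R]:
    """Run design_one over every transcript, in parallel when permitted."""
    force_parallel = os.environ.get("SIRNAFORGE_PARALLEL_DESIGN") == "1"
    if from_input_fasta and not force_parallel:
        # Sequential path: keeps call order and patched functions predictable.
        return [design_one(t) for t in transcripts]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Executor.map yields results in input order, whatever the completion order.
        return list(pool.map(design_one, transcripts))
```

Both paths return results in input order, so downstream ranking does not depend on which mode was taken.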
- async sirnaforge.workflow.run_sirna_workflow(gene_query, output_dir, input_fasta=None, database='ensembl', design_mode='sirna', top_n_candidates=20, genome_species=None, genome_indices_override=None, mirna_database='mirgenedb', mirna_species=None, transcriptome_fasta=None, transcriptome_filter=None, transcriptome_selection=None, gc_min=30.0, gc_max=52.0, sirna_length=21, modification_pattern='standard_2ome', overhang='dTdT', check_off_targets=True, variant_ids=None, variant_vcf_file=None, variant_mode='avoid', variant_min_af=0.01, variant_clinvar_filters='Pathogenic,Likely pathogenic', variant_assembly='GRCh38', log_file=None, write_json_summary=True, num_threads=None, allow_transcriptome_with_input_fasta=False, default_transcriptome_sources=('ensembl_human_cdna', 'ensembl_mouse_cdna', 'ensembl_rat_cdna', 'ensembl_macaque_cdna'), keep_nextflow_work=False, nextflow_docker_image=None)
Run complete siRNA design workflow.
- Parameters:
gene_query (str) – Gene name or ID to search for
output_dir (str) – Directory for output files
input_fasta (str | None) – Local path or remote URI to an input FASTA file
database (str) – Database to search (ensembl, refseq, gencode)
design_mode (str) – Design mode (sirna or mirna)
top_n_candidates (int) – Number of top candidates to generate
genome_species (list[str] | None) – Species genomes for off-target analysis
genome_indices_override (str | None) – Comma-separated species:/index_prefix overrides for off-target analysis
mirna_database (str) – miRNA reference database identifier
mirna_species (Sequence[str] | None) – miRNA reference species identifiers
transcriptome_fasta (str | None) – Path or URL to transcriptome FASTA for off-target analysis
transcriptome_filter (str | None) – Comma-separated filter names (protein_coding, canonical_only)
transcriptome_selection (ReferenceSelection | None) – Pre-resolved transcriptome selection metadata
gc_min (float) – Minimum GC content percentage
gc_max (float) – Maximum GC content percentage
sirna_length (int) – siRNA length in nucleotides
modification_pattern (str) – Chemical modification pattern
overhang (str) – Overhang sequence (dTdT for DNA, UU for RNA)
check_off_targets (bool) – Perform off-target analysis stage (default: True)
variant_ids (list[str] | None) – List of variant identifiers (rsID, chr:pos:ref:alt, or HGVS) to target or avoid
variant_vcf_file (Path | None) – Path to VCF file containing variants to target or avoid
variant_mode (str) – How to handle variants (avoid/target/both); default is avoid
variant_min_af (float) – Minimum allele frequency threshold for variant filtering (default: 0.01)
variant_clinvar_filters (str) – Comma-separated ClinVar significance levels to include (default: Pathogenic,Likely pathogenic)
variant_assembly (str) – Reference genome assembly for variants (only GRCh38 supported)
write_json_summary (bool) – Write logs/workflow_summary.json
num_threads (int | None) – Optional override for design parallelism
allow_transcriptome_with_input_fasta (bool) – Force transcriptome analysis even when using input FASTA
default_transcriptome_sources (Sequence[str]) – Ordered list of transcriptome identifiers evaluated by default
keep_nextflow_work (bool) – Keep Nextflow work directory symlink in output
nextflow_docker_image (str | None) – Override Docker image used by the embedded Nextflow pipeline
- Return type:
- Returns:
Dictionary with complete workflow results
- async sirnaforge.workflow.run_offtarget_only_workflow(input_candidates_fasta, output_dir, genome_species=None, genome_indices_override=None, mirna_database='mirgenedb', mirna_species=None, transcriptome_fasta=None, transcriptome_filter=None, transcriptome_selection=None, log_file=None, nextflow_docker_image=None)
Run off-target-only workflow for pre-designed siRNA candidates.
This is a simplified workflow that only runs the off-target analysis stage without transcript retrieval, ORF validation, or siRNA design. It accepts pre-designed 21-nt siRNA guide sequences and runs comprehensive off-target analysis using the embedded Nextflow pipeline.
- Parameters:
input_candidates_fasta (str) – Path to FASTA file with 21-nt siRNA guide sequences
output_dir (str) – Directory for output files
genome_species (list[str] | None) – Species genomes for off-target analysis
genome_indices_override (str | None) – Comma-separated species:/index_prefix overrides
mirna_database (str) – miRNA reference database identifier
mirna_species (Sequence[str] | None) – miRNA reference species identifiers
transcriptome_fasta (str | None) – Path or URL to transcriptome FASTA for off-target analysis
transcriptome_filter (str | None) – Comma-separated filter names (protein_coding, canonical_only)
transcriptome_selection (ReferenceSelection | None) – Pre-resolved transcriptome selection metadata
nextflow_docker_image (str | None) – Override Docker image used by the embedded Nextflow pipeline
- Return type:
- Returns:
Dictionary with off-target analysis results
Validation
Validation Configuration
Validation configuration and settings.
- class sirnaforge.validation.config.ValidationLevel(*values)
Validation strictness levels.
- STRICT = 'strict'
- WARNING = 'warning'
- DISABLED = 'disabled'
- class sirnaforge.validation.config.ValidationStage(*values)
Pipeline stages where validation can be applied.
- INPUT = 'input'
- TRANSCRIPT_RETRIEVAL = 'transcript_retrieval'
- ORF_ANALYSIS = 'orf_analysis'
- DESIGN = 'design'
- FILTERING = 'filtering'
- SCORING = 'scoring'
- OFF_TARGET = 'off_target'
- OUTPUT = 'output'
- class sirnaforge.validation.config.ValidationConfig(**data)
Bases: BaseModel
Configuration for validation system.
- Parameters:
data (Any)
- default_level: ValidationLevel
- stage_levels: dict[ValidationStage, ValidationLevel]
- get_level_for_stage(stage)
Get validation level for a specific stage.
- Parameters:
stage (ValidationStage)
- Return type:
- is_enabled_for_stage(stage)
Check if validation is enabled for a stage.
- Parameters:
stage (ValidationStage)
- Return type:
- should_fail_on_error(stage)
Check if validation errors should cause failures.
- Parameters:
stage (ValidationStage)
- Return type:
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
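How default_level and stage_levels might combine is easiest to see in a reduced model. The sketch below uses plain dataclasses and a subset of the stages; the fallback rule (per-stage override, else default) and the meanings assigned to STRICT and DISABLED are assumptions inferred from the method names, not confirmed behaviour of ValidationConfig.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    STRICT = "strict"
    WARNING = "warning"
    DISABLED = "disabled"


class Stage(Enum):
    INPUT = "input"
    DESIGN = "design"
    OUTPUT = "output"


@dataclass
class ConfigSketch:
    """Reduced stand-in for a per-stage validation configuration."""

    default_level: Level = Level.WARNING
    stage_levels: dict = field(default_factory=dict)

    def get_level_for_stage(self, stage: Stage) -> Level:
        # A per-stage entry overrides the default (assumed rule).
        return self.stage_levels.get(stage, self.default_level)

    def is_enabled_for_stage(self, stage: Stage) -> bool:
        return self.get_level_for_stage(stage) is not Level.DISABLED

    def should_fail_on_error(self, stage: Stage) -> bool:
        # Assumed: only STRICT turns validation errors into failures.
        return self.get_level_for_stage(stage) is Level.STRICT


config = ConfigSketch(stage_levels={Stage.INPUT: Level.STRICT, Stage.OUTPUT: Level.DISABLED})
```

With this configuration, input validation fails hard, design validation only warns, and output validation is skipped.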
- class sirnaforge.validation.config.ValidationPresets
Bases: object
Predefined validation configurations.
Validation Middleware
Validation middleware for integrating validation into the siRNA design workflow.
- class sirnaforge.validation.middleware.ValidationReport(stage)
Bases: object
Comprehensive validation report for a workflow stage.
- Parameters:
stage (ValidationStage)
- __init__(stage)
Initialize validation report.
- Parameters:
stage (ValidationStage)
- add_item_result(result)
Add validation result for an individual item.
- Parameters:
result (ValidationResult)
- Return type:
- class sirnaforge.validation.middleware.ValidationMiddleware(config)
Bases: object
Middleware for integrating validation throughout the workflow.
- Parameters:
config (ValidationConfig)
- __init__(config)
Initialize validation middleware.
- Parameters:
config (ValidationConfig)
- validate_input_parameters(params)
Validate input design parameters.
- Parameters:
params (DesignParameters)
- Return type:
- validate_transcripts(transcripts)
Validate transcript data after retrieval.
- Parameters:
transcripts (list[TranscriptInfo])
- Return type:
- validate_design_results(design_result)
Validate siRNA design results.
- Parameters:
design_result (DesignResult)
- Return type:
- validate_dataframe_output(df, schema_type)
Validate DataFrame output against pandera schemas.
- Parameters:
- Return type:
- validate_transcript_id_consistency(transcripts, candidates, orf_data=None)
Validate consistency of transcript IDs across datasets.
- Parameters:
transcripts (list[TranscriptInfo])
candidates (list[SiRNACandidate])
- Return type:
Validation Utilities
Validation utilities for data consistency and cross-validation.
- class sirnaforge.validation.utils.ValidationResult(is_valid=True)
Bases: object
Container for validation results.
- Parameters:
is_valid (bool)
- __init__(is_valid=True)
Initialize validation result container.
- Parameters:
is_valid (bool)
- merge(other)
Merge another validation result into this one.
- Parameters:
other (ValidationResult)
- Return type:
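A mergeable result container typically combines validity with a logical AND and concatenates collected messages. The sketch below shows that pattern only; the errors and warnings fields and the add_error helper are hypothetical, since this reference documents just the is_valid argument and merge().

```python
from dataclasses import dataclass, field


@dataclass
class ResultSketch:
    """Hypothetical result container; field names are illustrative."""

    is_valid: bool = True
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    def add_error(self, message: str) -> None:
        self.errors.append(message)
        self.is_valid = False

    def merge(self, other: "ResultSketch") -> None:
        # The merged result is valid only if both parts are valid.
        self.is_valid = self.is_valid and other.is_valid
        self.errors.extend(other.errors)
        self.warnings.extend(other.warnings)


overall = ResultSketch()
length_check = ResultSketch()
length_check.add_error("sequence shorter than expected")
overall.merge(length_check)
```

Merging in place lets a workflow stage accumulate many per-item checks into one report object.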
- class sirnaforge.validation.utils.ValidationUtils
Bases: object
Utility functions for data validation.
- static validate_nucleotide_sequence(sequence, allow_ambiguous=True)
Validate nucleotide sequence composition.
- Parameters:
- Return type:
- static validate_sirna_length(sequence)
Validate siRNA sequence length.
- Parameters:
sequence (str)
- Return type:
- static validate_parameter_consistency(params)
Validate design parameter consistency.
- Parameters:
params (DesignParameters)
- Return type:
- static validate_candidate_consistency(candidate)
Validate siRNA candidate internal consistency.
- Parameters:
candidate (SiRNACandidate)
- Return type:
- static validate_dataframe_schema(df, schema_type)
Validate DataFrame against appropriate pandera schema.
- Parameters:
- Return type:
- static validate_transcript_ids_consistency(candidate_ids, orf_ids, transcript_ids)
Validate consistency of transcript IDs across datasets.
- static validate_biological_constraints(candidate)
Validate bioinformatics-specific constraints.
- Parameters:
candidate (SiRNACandidate)
- Return type:
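The composition and length checks named above (validate_nucleotide_sequence with allow_ambiguous, and validate_sirna_length) can be illustrated with set arithmetic over the IUPAC nucleotide alphabet. The helper names and the 19 to 23 nt bounds are assumptions; the reference does not state which alphabet or length limits the real validators enforce.

```python
# Unambiguous DNA/RNA bases and the IUPAC ambiguity codes.
STRICT_BASES = set("ACGTU")
IUPAC_AMBIGUOUS = set("RYSWKMBDHVN")


def invalid_bases(sequence: str, allow_ambiguous: bool = True) -> list:
    """Return the sorted characters in sequence that fall outside the allowed alphabet."""
    allowed = (STRICT_BASES | IUPAC_AMBIGUOUS) if allow_ambiguous else STRICT_BASES
    return sorted(set(sequence.upper()) - allowed)


def length_ok(sequence: str, min_len: int = 19, max_len: int = 23) -> bool:
    """Check length against illustrative bounds (assumed, not taken from the library)."""
    return min_len <= len(sequence) <= max_len
```

Returning the offending characters rather than a bare boolean makes it straightforward to build an error message for a validation report.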
Utilities
Logging Utilities
Logging utilities for siRNAforge toolkit.
This module provides a single point to configure logging for both console and an optional centralized log file. Call configure_logging once at application startup (CLI entrypoint) to enable file logging. Individual modules should use get_logger(__name__) to obtain a configured logger.
- sirnaforge.utils.logging_utils.configure_logging(level=None, log_file=None)
Configure root logger: console + optional rotating file handler.
- sirnaforge.utils.logging_utils.get_logger(name, level=None)
Get a logger with standard configuration.
This will return a child logger of the root logger configured by configure_logging. For scripts that don't call configure_logging, get_logger will still set a console handler on first use.
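The configure-once, get-child-loggers arrangement described above maps onto the standard library as follows. This sketch configures a named parent logger (sirnaforge_sketch) rather than the true root logger so that it does not disturb other logging setups; the format string, rotation size, and backup count are arbitrary illustrative values.

```python
import logging
from logging.handlers import RotatingFileHandler


def configure_logging_sketch(level="INFO", log_file=None):
    """Attach a console handler and, optionally, a rotating file handler."""
    parent = logging.getLogger("sirnaforge_sketch")
    parent.setLevel(level)
    parent.handlers.clear()  # keep repeated calls from stacking duplicate handlers
    formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    parent.addHandler(console)

    if log_file is not None:
        # Rotate at roughly 5 MB and keep three old files (illustrative values).
        file_handler = RotatingFileHandler(log_file, maxBytes=5_000_000, backupCount=3)
        file_handler.setFormatter(formatter)
        parent.addHandler(file_handler)
    return parent


def get_logger_sketch(name):
    """Return a child logger that inherits the parent's handlers and level."""
    return logging.getLogger("sirnaforge_sketch").getChild(name)


parent_logger = configure_logging_sketch("DEBUG")
log = get_logger_sketch("design")
log.debug("designing candidates")
```

Child loggers carry no handlers of their own; records propagate up to the configured parent, which is why a single configuration call at startup is enough.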
Modification Patterns
Utility functions for applying chemical modification patterns to siRNA candidates.
This module provides functions to apply standard modification patterns to siRNA candidates during the design workflow, enabling automated annotation of chemical modifications for downstream synthesis and analysis.
- sirnaforge.utils.modification_patterns.apply_standard_2ome_pattern(sequence)
Apply standard alternating 2'-O-methyl pattern.
This is the industry-standard pattern providing balanced nuclease resistance and RISC loading efficiency.
- Parameters:
sequence (str) – RNA sequence to modify
- Return type:
- Returns:
List containing one ChemicalModification with alternating positions
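An alternating pattern amounts to selecting every second position along the strand. The sketch below assumes 1-based positions starting at position 1 and uses a stand-in ModSketch record; whether the real function starts at an odd or even position, and the actual fields of ChemicalModification, are not stated in this reference.

```python
from dataclasses import dataclass


@dataclass
class ModSketch:
    """Stand-in for a modification record: a type label plus 1-based positions."""

    type: str
    positions: list


def alternating_2ome(sequence: str, start: int = 1) -> list:
    """Mark every second position, beginning at start, as 2'-O-methyl."""
    positions = list(range(start, len(sequence) + 1, 2))
    return [ModSketch(type="2OMe", positions=positions)]


mods = alternating_2ome("AUGCAUG")
```

For a 21-nt strand, starting at position 1 selects the 11 odd positions and starting at 2 selects the 10 even ones.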
- sirnaforge.utils.modification_patterns.apply_minimal_terminal_pattern(sequence)
Apply minimal terminal modifications for cost-effective protection.
Modifies only the 3' terminal positions to provide basic nuclease resistance while minimizing synthesis cost.
- Parameters:
sequence (str) – RNA sequence to modify
- Return type:
- Returns:
List containing one ChemicalModification with terminal positions
- sirnaforge.utils.modification_patterns.apply_maximal_stability_pattern(sequence)
Apply maximal stability pattern for in vivo applications.
Fully modified pattern similar to FDA-approved therapeutics, providing maximum nuclease resistance and extended serum half-life.
- Parameters:
sequence (str) – RNA sequence to modify
- Return type:
- Returns:
List containing ChemicalModifications (2OMe on all positions, PS at terminals)
- sirnaforge.utils.modification_patterns.get_modification_pattern(pattern_name, sequence)
Get modification pattern by name.
- Parameters:
- Return type:
- Returns:
List of ChemicalModification objects
- Raises:
ValueError – If pattern_name is not recognized
- sirnaforge.utils.modification_patterns.apply_modifications_to_candidate(candidate, pattern_name='standard_2ome', overhang='dTdT', target_gene=None)
Apply chemical modifications to a siRNA candidate.
This function annotates both guide and passenger strands with the specified modification pattern and overhang, updating the candidate's metadata fields.
- Parameters:
candidate (SiRNACandidate) – SiRNACandidate to annotate
pattern_name (str) – Modification pattern to apply (default: standard_2ome)
overhang (str) – Overhang sequence (default: dTdT)
target_gene (str | None) – Optional target gene name for metadata
- Return type:
- Returns:
Updated SiRNACandidate with modification metadata
Resource Resolver
Utilities for resolving user-provided input resources.
Supports downloading transcript FASTA files from remote locations and normalises them into local paths that the workflow can consume.
- class sirnaforge.utils.resource_resolver.InputSource(original, local_path, source_type, downloaded, size_bytes, sha256=None)
Bases: object
Normalized representation of a workflow input resource.
- Parameters:
- sirnaforge.utils.resource_resolver.resolve_input_source(input_location, destination_root, *, timeout=30.0)
Resolve a workflow input location into a local path.
- Parameters:
- Return type:
- Returns:
InputSource describing the normalized local resource.
- Raises:
FileNotFoundError – If a local file doesn't exist.
ValueError – If the URI scheme is unsupported.
httpx.HTTPStatusError – If the remote download fails with non-2xx status.
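The first two documented failure modes depend on how the input location is classified before any download happens. The sketch below shows one way to dispatch on the URI scheme using only the standard library; classify_input and outcome are hypothetical helpers, the set of supported schemes is an assumption, and no network request is made.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(location: str) -> str:
    """Classify a location as 'remote' or 'local', raising as the docs describe."""
    parsed = urlparse(location)
    scheme = parsed.scheme.lower()
    if scheme in ("http", "https"):
        return "remote"
    # An empty scheme is a plain path; a one-letter scheme is a Windows drive letter.
    if scheme in ("", "file") or len(scheme) == 1:
        path = Path(parsed.path) if scheme == "file" else Path(location)
        if not path.is_file():
            raise FileNotFoundError(f"Input file not found: {path}")
        return "local"
    raise ValueError(f"Unsupported URI scheme: {scheme}")


def outcome(location: str) -> str:
    """Report the classification, or the name of the exception that was raised."""
    try:
        return classify_input(location)
    except (FileNotFoundError, ValueError) as exc:
        return type(exc).__name__
```

Checking the scheme before touching the filesystem or network means an unsupported location fails immediately, before any timeout can apply.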
Chemical Modifications
Helper functions for working with siRNA chemical modifications metadata.
This module provides utilities for:
- Parsing FASTA headers to extract modification metadata
- Loading metadata from JSON sidecar files
- Encoding/decoding modification annotations
- sirnaforge.modifications.parse_chem_mods(chem_mods_str)
Parse ChemMods field from FASTA header.
- Parameters:
chem_mods_str (str) – String like '2OMe(1,4,6,11)+2F()'
- Return type:
- Returns:
List of ChemicalModification objects
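The example string '2OMe(1,4,6,11)+2F()' suggests a grammar of '+'-separated entries, each a modification name followed by a parenthesised, comma-separated position list that may be empty. The parser below is a sketch built on that reading alone; it returns plain tuples rather than ChemicalModification objects, and the real parser may accept more variants.

```python
import re

# One entry: a name without parentheses or '+', then '(' positions ')'.
_ENTRY = re.compile(r"^([^()+]+)\(([^()]*)\)$")


def parse_chem_mods_sketch(text: str) -> list:
    """Parse 'NAME(p1,p2,...)+NAME()' into a list of (name, positions) tuples."""
    mods = []
    for token in text.split("+"):
        token = token.strip()
        if not token:
            continue  # tolerate empty input or a trailing '+'
        match = _ENTRY.match(token)
        if match is None:
            raise ValueError(f"Malformed modification entry: {token!r}")
        name, raw_positions = match.groups()
        positions = [int(p) for p in raw_positions.split(",") if p.strip()]
        mods.append((name.strip(), positions))
    return mods


parsed = parse_chem_mods_sketch("2OMe(1,4,6,11)+2F()")
```

An empty position list, as in '2F()', is kept as an entry with no positions rather than dropped, so the declared modification type survives a round trip.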
- sirnaforge.modifications.parse_provenance(prov_str, url=None)
Parse Provenance field from FASTA header.
- Parameters:
- Return type:
- Returns:
Provenance object or None
- sirnaforge.modifications.load_metadata(json_path)
Load and validate metadata from JSON sidecar file using Pydantic.
- sirnaforge.modifications.merge_metadata_into_fasta(fasta_path, metadata_path, output_path)
Merge metadata from JSON into FASTA headers.
Uses Pydantic for automatic validation of metadata.
- Parameters:
- Return type:
- Returns:
Number of sequences with metadata applied
- Raises:
ValidationError – If metadata doesn't match StrandMetadata schema