Data Models, Filtering & Scoring: Technical Reference
Academic rigor note: This document provides comprehensive documentation of siRNAforge's data models, filtering criteria, and scoring algorithms with citations and justification for critical thresholds. Sections marked with [REVIEW NEEDED] indicate areas requiring additional expert review or validation.
Overview
siRNAforge implements a multi-stage computational pipeline for siRNA design that relies on validated data models and research-backed scoring algorithms. This document describes:
Data Models - Pydantic-validated structures for siRNA candidates and analysis results
Filter Criteria - Evidence-based thresholds for candidate quality control
Scoring Algorithms - Composite scoring with thermodynamic and empirical components
Threshold Justification - Literature citations and rationale for default parameters
1. Core Data Models
1.1 SiRNACandidate
The SiRNACandidate model represents a complete siRNA duplex with all computed properties.
from pydantic import BaseModel

class SiRNACandidate(BaseModel):
    """Individual siRNA candidate with computed thermodynamic and efficacy properties."""
    # Identity (unique identifier and source)
    id: str  # Format: SIRNAF_{transcript}_{start}_{end}
    transcript_id: str  # Source transcript (e.g., ENST00000269305)
    position: int  # 1-based start position in transcript
    # Duplex sequences
    guide_sequence: str  # Antisense strand (loaded into RISC)
    passenger_sequence: str  # Sense strand (typically degraded)
    # Basic properties
    gc_content: float  # GC percentage (optimal: 35-60%)
    length: int  # Duplex length (typically 21 nt)
    # Thermodynamic properties
    asymmetry_score: float  # RISC loading preference (optimal: ≥0.65)
    duplex_stability: float  # ΔG in kcal/mol (optimal: -15 to -25)
    # Secondary structure
    structure: str  # Dot-bracket notation
    mfe: float  # Minimum free energy (optimal: -2 to -8 kcal/mol)
    paired_fraction: float  # Fraction paired bases (optimal: 0.4-0.6)
    # Off-target metrics
    off_target_count: int  # Potential off-target sites (goal: ≤3)
    transcriptome_hits_0mm: int  # Perfect match hits
    transcriptome_hits_1mm: int  # 1-mismatch hits
    transcriptome_hits_2mm: int  # 2-mismatch hits
    mirna_hits_total: int  # Total miRNA seed matches
    mirna_hits_0mm_seed: int  # Perfect seed matches
    # Scoring
    composite_score: float  # Overall quality (0-100 scale)
    component_scores: dict  # Individual scoring components
    passes_filters: bool | FilterStatus  # Quality control status
Sequence Validation
All sequences undergo validation:
Allowed nucleotides: A, T, C, G, U
Length constraints: 19-23 nucleotides (siRNA length)
Strand matching: Guide and passenger must be same length
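The three validation rules above can be illustrated with a short standalone check. This is a minimal sketch in plain Python rather than the project's Pydantic validators; the function name `validate_duplex`, the constant names, and the case-insensitive comparison are assumptions made for illustration.

```python
# Hypothetical standalone check mirroring the three rules listed above.
ALLOWED_NUCLEOTIDES = set("ATCGU")
MIN_LENGTH, MAX_LENGTH = 19, 23


def validate_duplex(guide: str, passenger: str) -> list[str]:
    """Return a list of validation problems; an empty list means the duplex is valid."""
    problems = []
    for name, seq in (("guide", guide), ("passenger", passenger)):
        # Rule 1: only A, T, C, G, U are allowed (case-insensitive here by assumption).
        invalid = set(seq.upper()) - ALLOWED_NUCLEOTIDES
        if invalid:
            problems.append(f"{name}: invalid characters {sorted(invalid)}")
        # Rule 2: length must fall within 19-23 nucleotides.
        if not MIN_LENGTH <= len(seq) <= MAX_LENGTH:
            problems.append(f"{name}: length {len(seq)} outside {MIN_LENGTH}-{MAX_LENGTH}")
    # Rule 3: guide and passenger must have the same length.
    if len(guide) != len(passenger):
        problems.append("guide and passenger lengths differ")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue with a candidate at once.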
1.2 DesignParameters
Configuration model for the design workflow:
class DesignParameters(BaseModel):
    """Complete configuration for siRNA design workflow."""
    # Design mode
    design_mode: DesignMode  # "sirna" or "mirna"
    # Sequence parameters
    sirna_length: int = 21  # Duplex length (19-23 nt)
    top_n: int = 50  # Number of candidates to return
    # Quality control
    filters: FilterCriteria  # Threshold parameters
    scoring: ScoringWeights  # Component weights
    # Chemical modifications
    apply_modifications: bool = True
    modification_pattern: str = "standard_2ome"
    default_overhang: str = "dTdT"
1.3 FilterCriteria
Threshold parameters for quality filtering:
class FilterCriteria(BaseModel):
    """Quality filters based on thermodynamic and empirical criteria."""
    # GC content (literature: 30-60%, optimal: 40-55%)
    gc_min: float = 35.0
    gc_max: float = 60.0
    # Sequence composition
    max_poly_runs: int = 3  # Max consecutive identical nucleotides
    # Secondary structure
    max_paired_fraction: float = 0.6  # Prevent rigid structures
    # Thermodynamic asymmetry
    min_asymmetry_score: float = 0.65  # Guide strand selection
    # MFE thresholds (kcal/mol)
    mfe_min: float = -8.0  # Lower bound: more negative values are too stable
    mfe_max: float = -2.0  # Upper bound: less negative values are too unstable
    # Duplex stability (kcal/mol)
    duplex_stability_min: float = -25.0
    duplex_stability_max: float = -15.0
    # Melting temperature (°C, for mammalian cells)
    melting_temp_min: float = 60.0
    melting_temp_max: float = 78.0
    # End asymmetry ΔΔG (kcal/mol)
    delta_dg_end_min: float = 2.0
    delta_dg_end_max: float = 6.0
    # Off-target limits
    max_off_target_count: int = 3
1.4 OffTargetFilterCriteria
Specialized filtering for off-target analysis results:
class OffTargetFilterCriteria(BaseModel):
    """Off-target analysis filtering criteria."""
    # Transcriptome off-targets (mismatch tolerance)
    max_transcriptome_hits_0mm: int = 0  # Perfect matches
    max_transcriptome_hits_1mm: int = 5  # 1-mismatch hits
    max_transcriptome_hits_2mm: int = 20  # 2-mismatch hits
    # miRNA seed matches (positions 2-8)
    max_mirna_perfect_seed: int = 3
    max_mirna_1mm_seed: int = 10
    fail_on_high_risk_mirna: bool = True
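To show how limits like these might be applied to the hit counts stored on a candidate, here is a minimal sketch. It is not the project's implementation: `OffTargetLimits` and `exceeded_limits` are hypothetical names, the limits are treated as inclusive maxima by assumption, and `max_mirna_1mm_seed` and `fail_on_high_risk_mirna` are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffTargetLimits:
    """Hypothetical plain-Python stand-in carrying the defaults shown above."""
    max_transcriptome_hits_0mm: int = 0
    max_transcriptome_hits_1mm: int = 5
    max_transcriptome_hits_2mm: int = 20
    max_mirna_perfect_seed: int = 3


def exceeded_limits(counts: dict[str, int], limits: OffTargetLimits = OffTargetLimits()) -> list[str]:
    """Return the names of hit counts that exceed their limit (assumed inclusive maxima)."""
    checks = {
        "transcriptome_hits_0mm": limits.max_transcriptome_hits_0mm,
        "transcriptome_hits_1mm": limits.max_transcriptome_hits_1mm,
        "transcriptome_hits_2mm": limits.max_transcriptome_hits_2mm,
        "mirna_hits_0mm_seed": limits.max_mirna_perfect_seed,
    }
    # Missing counts are treated as zero hits.
    return [name for name, limit in checks.items() if counts.get(name, 0) > limit]
```

With the defaults, a candidate with a single perfect transcriptome match is flagged, while five 1-mismatch hits are still within the limit.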
1.5 ScoringWeights
Relative weights for composite scoring:
class ScoringWeights(BaseModel):
    """Component weights for composite scoring (must sum to 1.0)."""
    asymmetry: float = 0.25  # Thermodynamic asymmetry
    gc_content: float = 0.20  # GC optimization
    accessibility: float = 0.25  # Target accessibility
    off_target: float = 0.20  # Specificity
    empirical: float = 0.10  # Position-specific rules
2. Scoring Algorithms
2.1 Composite Score Calculation
The composite score integrates multiple evidence-based components as a weighted sum:
\[ S_{\text{composite}} = \sum_i w_i \, S_i \]
where \(w_i\) are configurable weights and \(S_i\) are normalized component scores (0-1).
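A weighted combination of the component scores, using the default weights from section 1.5, can be sketched as follows. The function name `composite_score` and the explicit sum-to-one check are assumptions for illustration; the result here stays on the 0-1 scale, and any rescaling to the 0-100 range of the `composite_score` field is outside this sketch.

```python
import math

# Default weights as listed in section 1.5.
DEFAULT_WEIGHTS = {
    "asymmetry": 0.25,
    "gc_content": 0.20,
    "accessibility": 0.25,
    "off_target": 0.20,
    "empirical": 0.10,
}


def composite_score(components: dict[str, float], weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of normalized (0-1) component scores."""
    # The weights are documented as summing to 1.0; verify before combining.
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")
    return sum(weights[name] * components[name] for name in weights)


example = {
    "asymmetry": 0.8,
    "gc_content": 1.0,
    "accessibility": 0.5,
    "off_target": 1.0,
    "empirical": 0.6,
}
print(round(composite_score(example), 3))  # 0.2 + 0.2 + 0.125 + 0.2 + 0.06 = 0.785
```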
2.2 Thermodynamic Asymmetry Score
Research basis: Khvorova et al. (2003), Schwarz et al. (2003)
RISC preferentially loads the strand with the less thermodynamically stable 5' end. The asymmetry score measures this preference.
Algorithm:
Calculate 5' end stability (positions 1-7): \(\Delta G_{5'}\)
Calculate 3' end stability (positions 15-21): \(\Delta G_{3'}\)
Compute asymmetry: \(\text{raw} = \Delta G_{5'} - \Delta G_{3'}\)
Normalize: \(\text{score} = \max(0, \min(1, (\text{raw} + 5) / 10))\)
Implementation (ViennaRNA):
def calculate_asymmetry_score(candidate) -> tuple[float, float, float]:
    """Returns (dg_5p, dg_3p, asymmetry_score)"""
    guide = candidate.guide_sequence
    passenger = candidate.passenger_sequence
    # End stabilities over positions 1-7 and 15-21
    dg_5p = calculate_end_stability(guide[:7], passenger[:7])
    dg_3p = calculate_end_stability(guide[14:21], passenger[14:21])
    # A less stable (less negative) 5' end gives a positive raw value
    asymmetry_raw = dg_5p - dg_3p
    # Map raw values in [-5, +5] kcal/mol onto [0, 1]
    asymmetry_score = max(0.0, min(1.0, (asymmetry_raw + 5.0) / 10.0))
    return dg_5p, dg_3p, asymmetry_score
Interpretation:
| Score | Interpretation |
|---|---|
| 0.8-1.0 | Excellent - strong guide strand bias |
| 0.65-0.8 | Good - likely correct strand selection |
| 0.5-0.65 | Moderate - mixed strand loading possible |
| <0.5 | Poor - passenger strand may dominate |
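The normalization step and the interpretation bands can be combined in a small worked example. This is a sketch with hypothetical function names, and it takes the two end stabilities as plain numbers instead of computing them with ViennaRNA. Assigning each boundary value (0.8, 0.65, 0.5) to the higher band is an assumption, since the table ranges overlap at their edges.

```python
def normalize_asymmetry(dg_5p: float, dg_3p: float) -> float:
    """Map raw asymmetry (dg_5p - dg_3p, kcal/mol) from [-5, +5] onto [0, 1], clamped."""
    raw = dg_5p - dg_3p
    return max(0.0, min(1.0, (raw + 5.0) / 10.0))


def interpret_asymmetry(score: float) -> str:
    """Label a score using the interpretation bands (boundaries go to the higher band)."""
    if score >= 0.8:
        return "excellent"
    if score >= 0.65:
        return "good"
    if score >= 0.5:
        return "moderate"
    return "poor"


# A 5' end that is 2 kcal/mol less stable than the 3' end: raw = +2.0, score = 0.7.
score = normalize_asymmetry(dg_5p=-6.0, dg_3p=-8.0)
print(round(score, 2), interpret_asymmetry(score))
```

Raw differences beyond ±5 kcal/mol saturate at 0 or 1, so the score cannot distinguish between very strongly asymmetric duplexes.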
2.3 GC Content Score
Research basis: Reynolds et al. (2004), Ui-Tei et al. (2004)
GC content affects duplex stability and target accessibility. The scoring uses a Gaussian penalty centered at optimal GC (40%):
\[ S_{\text{GC}} = \exp\left(-\left(\frac{\text{GC} - 40}{10}\right)^2\right) \]
Implementation:
import math

def _calculate_gc_score(gc_content: float) -> float:
    """Gaussian penalty around 40% GC."""
    return math.exp(-(((gc_content - 40) / 10) ** 2))
Interpretation:
| GC Range | Effect |
|---|---|
| <35% | Unstable duplex, poor RISC loading |
| 35-40% | Acceptable, monitor stability |
| 40-55% | Optimal range |
| 55-60% | Acceptable, may reduce accessibility |
| >60% | Overly stable, poor target release |
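To get a feel for how quickly the Gaussian penalty falls off, the scoring function can be evaluated at a few GC percentages. The sketch below restates the formula from the implementation with the center (40) and width (10) exposed as parameters; the name `gc_score` and those parameter names are illustrative, not part of the project's API.

```python
import math


def gc_score(gc_content: float, center: float = 40.0, width: float = 10.0) -> float:
    """Gaussian penalty: 1.0 at the center, decaying symmetrically on both sides."""
    return math.exp(-(((gc_content - center) / width) ** 2))


for gc in (30.0, 35.0, 40.0, 50.0, 60.0):
    print(f"{gc:.0f}% GC -> {gc_score(gc):.3f}")
```

The score is 1.0 at 40% GC, drops to about 0.37 (that is, exp(-1)) at both 30% and 50%, and is about 0.02 at 60%.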
2.4 Duplex Stability Score
Research basis: Naito et al. (2009), Ichihara et al. (2007)
Duplex formation ΔG affects RISC loading efficiency. The score is normalized from the ΔG range [-40, -5] kcal/mol, after clamping ΔG to that range:
\[ S_{\text{duplex}} = \frac{-\Delta G_{\text{clamped}} - 5}{40 - 5} \]
Implementation:
def _calculate_duplex_score(candidate) -> tuple[float, float]:
    """Returns (normalized_score, dg_value)"""
    guide = candidate.guide_sequence
    passenger = candidate.passenger_sequence
    dg = calculate_duplex_stability(guide, passenger)
    # Clamp to [-40, -5] kcal/mol before normalizing
    dg_clamped = max(-40.0, min(-5.0, dg))
    score = (-(dg_clamped) - 5.0) / (40.0 - 5.0)
    return max(0.0, min(1.0, score)), dg
Optimal range: -15 to -25 kcal/mol
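As a worked example of this normalization, the sketch below applies the same clamp-and-scale arithmetic to a few ΔG values. The function name `duplex_score` is hypothetical, and it takes ΔG directly as a number instead of computing it from the duplex.

```python
def duplex_score(dg: float) -> float:
    """Normalize a duplex ΔG (kcal/mol) from [-40, -5] onto [1, 0]; values outside are clamped."""
    dg_clamped = max(-40.0, min(-5.0, dg))
    return (-dg_clamped - 5.0) / (40.0 - 5.0)


for dg in (-5.0, -15.0, -25.0, -40.0):
    print(f"dG = {dg:6.1f} kcal/mol -> score {duplex_score(dg):.3f}")
```

Under this mapping the stated optimal range of -15 to -25 kcal/mol corresponds to scores of roughly 0.29 to 0.57; the score itself keeps rising for more stable duplexes up to -40 kcal/mol.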
2.5 Target Accessibility Score
Research basis: Tafer et al. (2008)
Target site accessibility affects siRNA efficacy. The score is based on guide strand secondary structure: \(S_{\text{access}} = 1 - f_{\text{paired}}\), where \(f_{\text{paired}}\) is the fraction of paired bases.
Implementation (ViennaRNA):
def _calculate_accessibility_score(candidate) -> float:
    """Accessibility inversely related to secondary structure."""
    guide = candidate.guide_sequence
    structure, mfe, paired_fraction = calculate_secondary_structure(guide)
    return 1.0 - paired_fraction
Optimal: paired_fraction 0.4-0.6 (moderate structure)
2.6 Off-Target Score
Specificity prediction based on internal repetitive sequences: every 7-mer window of the guide that occurs more than once adds 10 to a penalty \(P\), and the score is \(\exp(-P / 50)\).
Implementation:
import math

def _calculate_off_target_score(candidate) -> float:
    """Penalty for repetitive 7-mer sequences."""
    guide = candidate.guide_sequence
    penalty = 0
    for i in range(len(guide) - 6):
        seed = guide[i:i+7]
        # str.count counts non-overlapping occurrences
        if guide.count(seed) > 1:
            penalty += 10
    return math.exp(-penalty / 50)
[REVIEW NEEDED]: Current implementation is simplified. Full off-target analysis uses BWA-MEM2 alignment against reference genomes in the Nextflow pipeline.
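The behavior of the simplified repeat penalty is easiest to see on an extreme input. The sketch below restates the same loop as a standalone function over a plain string (the name `repeat_penalty_score` and the `k` parameter are illustrative) and applies it to an artificial 21-nt guide built from one 7-mer repeated three times.

```python
import math


def repeat_penalty_score(guide: str, k: int = 7) -> float:
    """Score in (0, 1]: each k-mer window that occurs more than once adds 10 to the penalty."""
    penalty = 0
    for i in range(len(guide) - k + 1):
        window = guide[i:i + k]
        # str.count counts non-overlapping occurrences of the window in the guide.
        if guide.count(window) > 1:
            penalty += 10
    return math.exp(-penalty / 50)


# Artificial worst case: all 15 windows of this 21-mer recur, so penalty = 150.
repetitive = "ACGUACG" * 3
print(round(repeat_penalty_score(repetitive), 4))  # exp(-3), about 0.0498
```

A guide with no repeated 7-mer keeps a score of exactly 1.0, so this component only separates candidates that contain internal repeats.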
2.7 Empirical Score (Reynolds Rules)
Research basis: Reynolds et al. (2004)
Position-specific sequence preferences:
def _calculate_empirical_score(candidate) -> float:
    """Simplified Reynolds rules."""
    guide = candidate.guide_sequence
    score = 0.5  # Base score
    # Prefer A/U at position 19 (3' end)
    if guide[18] in ["A", "U"]:
        score += 0.1
    # Prefer G/C at position 1
    if guide[0] in ["G", "C"]:
        score += 0.1
    # Avoid C at position 19
    if guide[18] == "C":
        score -= 0.1
    return max(0.0, min(1.0, score))
[REVIEW NEEDED]: Additional Reynolds criteria could be implemented:
Position 10 preferences
A/U content in positions 15-19
Avoid GGG stretches
3. Filter Implementation
3.1 Early Filtering (Enumeration Stage)
During candidate enumeration, fast filters are applied to reduce computational load:
def _enumerate_candidates(sequence, transcript_id):
    for i in range(len(sequence) - sirna_length + 1):
        target_seq = sequence[i:i+sirna_length]
        guide_seq = reverse_complement(target_seq)
        gc_content = calculate_gc_content(guide_seq)
        # Fast rejection
        fail_reason = None
        if not (gc_min <= gc_content <= gc_max):
            fail_reason = FilterStatus.GC_OUT_OF_RANGE
        elif has_poly_runs(guide_seq, max_poly_runs):
            fail_reason = FilterStatus.POLY_RUNS
        # candidate: the record built for this window (construction omitted in this excerpt)
        if fail_reason:
            # Store in rejected pool for "dirty control" analysis
            rejected.append(candidate)
        else:
            candidates.append(candidate)
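The enumeration excerpt relies on three helpers whose bodies are not shown. One plausible way to write them is sketched below; these are assumptions, not the project's code. In particular, `has_poly_runs` is taken to mean "any run longer than `max_poly_runs`", and `reverse_complement` here handles the RNA alphabet only.

```python
import re

# RNA-only complement table (assumption: input is already in the A/C/G/U alphabet).
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an RNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def calculate_gc_content(seq: str) -> float:
    """GC percentage on a 0-100 scale; an empty sequence yields 0.0."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def has_poly_runs(seq: str, max_poly_runs: int = 3) -> bool:
    """True if any nucleotide appears more than max_poly_runs times in a row."""
    # (.)\1{n,} matches one character followed by at least n repeats of it,
    # that is, a run of n + 1 or more identical characters.
    pattern = r"(.)\1{%d,}" % max_poly_runs
    return re.search(pattern, seq.upper()) is not None
```

With the default limit of 3, a run such as AAAA is rejected while AAA passes.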
3.2 Quality Filters (Scoring Stage)
Additional filters applied during scoring:
| Filter | Condition | Rationale |
|---|---|---|
| EXCESS_PAIRING | paired_fraction > 0.6 | Prevents rigid structures |
| LOW_ASYMMETRY | asymmetry_score < min_asymmetry_score | Ensures guide strand selection |
3.3 Filter Status Codes
from enum import Enum

class FilterStatus(str, Enum):
    PASS = "PASS"  # All criteria met
    GC_OUT_OF_RANGE = "GC_OUT_OF_RANGE"  # GC content outside bounds
    POLY_RUNS = "POLY_RUNS"  # Homopolymer runs exceed limit
    EXCESS_PAIRING = "EXCESS_PAIRING"  # Too much secondary structure
    LOW_ASYMMETRY = "LOW_ASYMMETRY"  # Poor thermodynamic asymmetry
    DIRTY_CONTROL = "DIRTY_CONTROL"  # Reserved for controls
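The scoring-stage filters from section 3.2 can be connected to these status codes with a small decision function. This is a sketch under assumptions: the function name `scoring_stage_status` is hypothetical, only the codes it needs are redefined so the block runs on its own, and the order in which the two checks are applied is a guess.

```python
from enum import Enum


class FilterStatus(str, Enum):
    """Subset of the status codes above, repeated so this sketch is self-contained."""
    PASS = "PASS"
    EXCESS_PAIRING = "EXCESS_PAIRING"
    LOW_ASYMMETRY = "LOW_ASYMMETRY"


def scoring_stage_status(
    paired_fraction: float,
    asymmetry_score: float,
    max_paired_fraction: float = 0.6,
    min_asymmetry_score: float = 0.65,
) -> FilterStatus:
    """Return the first failing status, or PASS if both thresholds are met."""
    if paired_fraction > max_paired_fraction:
        return FilterStatus.EXCESS_PAIRING
    if asymmetry_score < min_asymmetry_score:
        return FilterStatus.LOW_ASYMMETRY
    return FilterStatus.PASS
```

Because `FilterStatus` subclasses `str`, a returned status compares equal to its plain string value, which is convenient when results are written to CSV or JSON.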
4. Threshold Justification
4.1 GC Content: 35-60%
Literature support:
Reynolds et al. (2004): Optimal 30-52% for maximum silencing
Ui-Tei et al. (2004): Functional siRNAs have 35-65% GC
Jackson et al. (2006): Higher GC correlates with off-targets
Rationale: Balance between:
Lower bound (35%): Minimum duplex stability for RISC loading
Upper bound (60%): Maximum to prevent over-stabilization and off-targeting
4.2 Asymmetry Score: ≥0.65
Literature support:
Khvorova et al. (2003): Thermodynamic asymmetry determines strand selection
Schwarz et al. (2003): ΔΔG of 2+ kcal/mol ensures correct loading
Rationale: Score of 0.65 corresponds to approximately ΔΔG = 1.5 kcal/mol, providing >80% probability of correct strand selection.
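The correspondence between the 0.65 threshold and 1.5 kcal/mol follows from inverting the normalization in section 2.2, score = (raw + 5) / 10. A minimal sketch of that arithmetic, with a hypothetical function name, is shown below; the inversion only holds for scores strictly between 0 and 1, because clamped values lose the original difference.

```python
def score_to_raw_asymmetry(score: float) -> float:
    """Invert score = (raw + 5) / 10 to recover the raw end-stability difference (kcal/mol)."""
    if not 0.0 < score < 1.0:
        raise ValueError("inversion is only defined for unclamped scores in (0, 1)")
    return score * 10.0 - 5.0


print(round(score_to_raw_asymmetry(0.65), 2))  # 0.65 * 10 - 5 = 1.5
```

By the same formula, the 2 kcal/mol difference cited from Schwarz et al. would correspond to a score of 0.7.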
4.3 Poly-runs: ≤3 consecutive
Literature support:
Jackson et al. (2003): AAAA runs associated with off-targets
Synthesis considerations: Long homopolymers cause synthesis issues
Rationale: Practical limit balancing efficacy and manufacturability.
4.4 MFE: -2 to -8 kcal/mol
Literature support:
Tafer et al. (2008): Moderate structure optimal for target binding
Too stable (<-10): Impaired target access
Too unstable (>0): Poor duplex integrity
4.5 Melting Temperature: 60-78°C
Literature support:
Standard for mammalian cell culture at 37°C
Allows duplex stability while permitting RISC-mediated unwinding
[REVIEW NEEDED]: Temperature thresholds may need adjustment for:
Plant cells (different optimal ranges)
In vivo applications (serum stability requirements)
5. miRNA-Biogenesis Mode
5.1 MiRNADesignConfig
Specialized parameters for miRNA-like siRNA design:
class MiRNADesignConfig(BaseModel):
    """miRNA-biogenesis-aware configuration."""
    # Conservative thresholds
    gc_min: float = 30.0  # Relaxed lower bound
    gc_max: float = 52.0  # Stricter upper bound
    asymmetry_min: float = 0.65
    # Argonaute loading preferences
    scoring_weights: dict = {
        "ago_start_bonus": 0.1,  # A/U at position 1
        "pos1_mismatch_bonus": 0.05,  # G:U wobble preferred
        "seed_clean_bonus": 0.15,  # Clean seed region
        "supp_13_16_bonus": 0.1,  # 3' supplementary pairing
    }
5.2 miRNA-Specific Scoring
Position 1 analysis:
Argonaute preferentially loads strands with A/U at position 1
G:U wobble or mismatch at position 1 improves loading
Seed region (positions 2-8):
Critical for target recognition
Clean seed = lower off-target potential
3' Supplementary pairing (positions 13-16):
Contributes to target specificity
Lower stability preferred (more specific)
[REVIEW NEEDED]: miRNA-specific scoring weights are based on general principles. Experimental validation recommended for therapeutic applications.
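The three regions named above are defined by 1-based positions on the guide strand, which are easy to get wrong when translated into 0-based Python slices. The sketch below shows only that slicing; the function name `guide_regions` and the dictionary keys are hypothetical, the example sequence is arbitrary, and no bonus scoring is attempted.

```python
def guide_regions(guide: str) -> dict[str, object]:
    """Extract the position-1 base, seed (2-8), and 3' supplementary region (13-16)."""
    if len(guide) < 16:
        raise ValueError("guide must be at least 16 nt long")
    return {
        "position_1": guide[0],                # 1-based position 1
        "starts_with_au": guide[0] in "AU",    # Argonaute loading preference
        "seed_2_8": guide[1:8],                # 1-based positions 2-8 (7 nt)
        "supplementary_13_16": guide[12:16],   # 1-based positions 13-16 (4 nt)
    }


regions = guide_regions("UAGCUUAUCAGACUGAUGUUGA")
print(regions["seed_2_8"], regions["supplementary_13_16"])  # AGCUUAU CUGA
```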
6. Off-Target Analysis Models
6.1 OffTargetHit
class OffTargetHit(BaseModel):
    """Single off-target alignment from BWA analysis."""
    qname: str  # siRNA identifier
    qseq: str  # Query sequence
    rname: str  # Reference (chromosome/transcript)
    coord: int  # Alignment position
    strand: str  # + or -
    nm: int  # Edit distance
    seed_mismatches: int  # Mismatches in seed (pos 2-8)
    offtarget_score: float
6.2 MiRNAHit
class MiRNAHit(BaseModel):
    """miRNA seed match from database alignment."""
    species: str  # e.g., "hsa" (human)
    database: str  # mirgenedb, mirbase, etc.
    mirna_id: str  # e.g., hsa-miR-21-5p
    seed_mismatches: int  # Seed region mismatches
6.3 Supported miRNA Databases
| Database | Description |
|---|---|
| mirgenedb | High-confidence, manually curated |
| mirbase | Comprehensive, all mature miRNAs |
| | miRBase high-confidence subset |
| | miRNA family conservation data |
7. Chemical Modification Models
7.1 StrandMetadata
class StrandMetadata(BaseModel):
    """Complete strand metadata with modifications."""
    id: str
    sequence: str
    overhang: str  # e.g., "dTdT", "UU"
    chem_mods: list[ChemicalModification]
    provenance: Provenance
7.2 ChemicalModification
class ChemicalModification(BaseModel):
    """Position-specific chemical modification."""
    type: str  # 2OMe, 2F, PS, LNA
    positions: list[int]  # 1-based positions
7.3 Supported Modification Patterns
| Pattern | Description |
|---|---|
| standard_2ome | 2'-O-methyl at alternating positions |
| | Terminal modifications only |
| | Full backbone modifications |
| | No modifications |
8. Workflows Requiring Documentation
[DOCUMENTATION NEEDED]: The following workflows exist but require detailed documentation:
8.1 Nextflow Pipeline
Multi-genome off-target analysis
BWA-MEM2 alignment parameters
Species-specific reference handling
8.2 ORF Validation
Start/stop codon detection
Frame shift analysis
Kozak sequence scoring
8.3 Transcript Retrieval
Ensembl/RefSeq/GENCODE integration
Isoform selection criteria
Sequence validation
9. References
Khvorova A, Reynolds A, Jayasena SD (2003). Functional siRNAs and miRNAs exhibit strand bias. Cell 115(2):209-216.
Schwarz DS, Hutvágner G, Du T, Xu Z, Aronin N, Zamore PD (2003). Asymmetry in the assembly of the RNAi enzyme complex. Cell 115(2):199-208.
Reynolds A, Leake D, Boese Q, Scaringe S, Marshall WS, Khvorova A (2004). Rational siRNA design for RNA interference. Nature Biotechnology 22(3):326-330.
Ui-Tei K, Naito Y, Takahashi F, Haraguchi T, Ohki-Hamazaki H, Juni A, Ueda R, Saigo K (2004). Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference. Nucleic Acids Research 32(3):936-948.
Naito Y, Yoshimura J, Morishita S, Ui-Tei K (2009). siDirect 2.0: updated software for designing functional siRNA with reduced seed-dependent off-target effect. BMC Bioinformatics 10:392.
Ichihara M, Murakumo Y, Masuda A, Matsuura T, Asai N, Jijiwa M, Ishida M, Shinmi J, Yatsuya H, Qiao S, Takahashi M, Ohno K (2007). Thermodynamic instability of siRNA duplex is a prerequisite for dependable prediction of siRNA activities. Nucleic Acids Research 35(18):e123.
Tafer H, Ameres SL, Obernosterer G, Gebeshuber CA, Schroeder R, Martinez J, Hofacker IL (2008). The impact of target site accessibility on the design of effective siRNAs. Nature Biotechnology 26(5):578-583.
Jackson AL, Bartz SR, Schelter J, Kobayashi SV, Burchard J, Mao M, Li B, Cavet G, Linsley PS (2003). Expression profiling reveals off-target gene regulation by RNAi. Nature Biotechnology 21(6):635-637.
Appendix A: Default Parameter Summary
| Parameter | Default | Range or unit | Justification |
|---|---|---|---|
| sirna_length | 21 | 19-23 | Standard duplex length |
| gc_min | 35.0 | 0-100 | Minimum stability |
| gc_max | 60.0 | 0-100 | Maximum stability |
| max_poly_runs | 3 | 1+ | Synthesis/specificity |
| max_paired_fraction | 0.6 | 0-1 | Accessibility |
| min_asymmetry_score | 0.65 | 0.3-1 | Strand selection |
| mfe_min | -8.0 | kcal/mol | Structure stability |
| mfe_max | -2.0 | kcal/mol | Structure stability |
| duplex_stability_min | -25.0 | kcal/mol | Duplex formation |
| duplex_stability_max | -15.0 | kcal/mol | Duplex formation |
| melting_temp_min | 60.0 | °C | Mammalian cells |
| melting_temp_max | 78.0 | °C | Mammalian cells |
| max_off_target_count | 3 | 0+ | Specificity |
Appendix B: Scoring Weight Defaults
| Component | Weight | Rationale |
|---|---|---|
| Asymmetry | 0.25 | Most predictive single factor |
| GC Content | 0.20 | Stability/accessibility balance |
| Accessibility | 0.25 | Target site availability |
| Off-target | 0.20 | Specificity importance |
| Empirical | 0.10 | Position-specific fine-tuning |
Document version: 1.0
Last updated: Auto-generated from source code
Review status: Initial draft - Expert review recommended