siRNAforge Architecture
Short version (TL;DR)
What it is: A layered Python toolkit for end-to-end siRNA design.
How it's organized: CLI → Workflow → Core → Models → Data → Pipeline, with shared Utils/Validation.
How data flows: Gene query → transcripts → design → scores → ranked output.
How to extend: plug in new scorers, data providers, and output writers via small interfaces.
Use this page as a map. Skim the diagrams and the "Architectural Layers" section first; dive into details only as needed.
Overview
siRNAforge is built as a modern Python package with clear separation of concerns, type safety, and extensibility. The architecture follows domain-driven design principles with distinct layers for different responsibilities.
System Architecture Flow
Package Structure
Package structure diagram, summarized by group:
Top level: __init__.py (Version & Config), cli.py (Command Interface), workflow.py (Orchestration)
Core Algorithms: design.py (siRNA Design), thermodynamics.py (RNA Folding), off_target.py (Off-target Analysis)
Data Models: sirna.py (Pydantic Models), schemas.py (Pandera Schemas)
Data Access: gene_search.py (Gene/Transcript Search), orf_analysis.py (ORF Analysis), base.py (Base Classes)
Pipeline Integration: nextflow/ (Pipeline Modules), resources/ (Resource Mgmt)
Utilities & Validation: utils/ (Logging Utils), validation/ (Validation System)
Dependencies shown in the diagram:
cli.py → workflow.py
workflow.py → design.py, thermodynamics.py, off_target.py, gene_search.py, orf_analysis.py, nextflow/
design.py, thermodynamics.py, off_target.py, gene_search.py, orf_analysis.py → sirna.py
Directory Structure
src/sirnaforge/
├── __init__.py              # Package initialization and version
├── cli.py                   # Command-line interface (Typer/Rich)
├── workflow.py              # High-level workflow orchestration
│
├── core/                    # Core algorithms and business logic
│   ├── design.py            # siRNA design algorithms
│   ├── thermodynamics.py    # RNA folding and energy calculations
│   └── off_target.py        # Off-target prediction algorithms
│
├── models/                  # Data models and validation
│   ├── sirna.py             # Pydantic models for siRNA data
│   └── schemas.py           # Pandera validation schemas
│
├── data/                    # Data access and external APIs
│   ├── base.py              # Base classes for data providers
│   ├── gene_search.py       # Gene/transcript search functionality
│   └── orf_analysis.py      # Open reading frame analysis
│
├── pipeline/                # Pipeline and workflow integration
│   ├── nextflow/            # Nextflow workflow configs and runners
│   └── resources/           # Resource management
│
├── utils/                   # Shared utilities
│   └── logging_utils.py     # Logging configuration
│
└── validation/              # Data validation and QC
    ├── config.py            # Validation configuration
    ├── middleware.py        # Validation middleware
    └── utils.py             # Validation utilities
Architectural Layers
1. CLI Layer (cli.py)
Purpose: User interface and command orchestration
Technologies: Typer, Rich
Responsibilities:
Command parsing and validation
Progress indicators and user feedback
Error handling and user-friendly messages
Configuration management
Configuration & Default Resolution (sirnaforge.config)
The sirnaforge.config.reference_policy module centralizes how user inputs and default references are resolved. WorkflowInputSpec captures raw CLI/API inputs, while ReferencePolicyResolver produces metadata-rich ReferenceChoice objects that indicate whether a reference was explicitly provided, auto-selected, or disabled. The workflow records these choices (currently for transcriptome references) inside logs/workflow_summary.json so production runs can be audited without inspecting CLI arguments.
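The three outcomes described above (explicitly provided, auto-selected, disabled) can be sketched roughly as follows. This is a simplified illustration, not the actual `sirnaforge.config.reference_policy` code: the field names `transcriptome` and `allow_default`, the `source` labels, and the default reference name are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkflowInputSpec:
    """Raw inputs as received from the CLI or API (hypothetical fields)."""
    transcriptome: Optional[str] = None  # user-supplied reference, if any
    allow_default: bool = True           # False disables auto-selection


@dataclass(frozen=True)
class ReferenceChoice:
    """Resolved reference plus a record of how it was chosen."""
    value: Optional[str]
    source: str  # "explicit", "default", or "disabled" (assumed labels)


class ReferencePolicyResolver:
    """Turns raw inputs into an auditable ReferenceChoice."""

    def __init__(self, default_transcriptome: str) -> None:
        self.default_transcriptome = default_transcriptome

    def resolve_transcriptome(self, spec: WorkflowInputSpec) -> ReferenceChoice:
        if spec.transcriptome is not None:
            return ReferenceChoice(spec.transcriptome, "explicit")
        if spec.allow_default:
            return ReferenceChoice(self.default_transcriptome, "default")
        return ReferenceChoice(None, "disabled")


# "example_default_reference" is a placeholder name, not a real siRNAforge default.
resolver = ReferencePolicyResolver(default_transcriptome="example_default_reference")
choice = resolver.resolve_transcriptome(WorkflowInputSpec())
summary_entry = json.dumps(asdict(choice))  # the kind of record a summary file could hold
```

Keeping the `source` next to the resolved value is what makes a run auditable: the summary shows not only which reference was used but also why.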
2. Workflow Layer (workflow.py)
Purpose: High-level process orchestration
Pattern: Facade/Coordinator
Responsibilities:
Coordinate multi-step workflows
Handle data flow between components
Manage temporary files and outputs
Provide consistent APIs for different entry points
3. Core Layer (core/)
Purpose: Core algorithms and business logic
Pattern: Strategy/Template Method
Core Components:
design.py - siRNA Design Engine
class SiRNADesigner:
    """Main siRNA design orchestrator"""

class SiRNACandidate:
    """Individual siRNA candidate with scoring"""

class DesignParameters:
    """Configuration for design algorithms"""
thermodynamics.py - RNA Structure Analysis
class ThermodynamicsCalculator:
    """ViennaRNA integration for structure prediction and asymmetry scoring"""

class ThermodynamicAsymmetryScorer:
    """Calculate thermodynamic asymmetry for guide strand selection"""

class SecondaryStructure:
    """RNA secondary structure representation"""
Thermodynamic Asymmetry Implementation:
The thermodynamic asymmetry scoring is a critical component that predicts guide strand selection into RISC (RNA-induced silencing complex). This implementation is based on research showing that siRNAs with less stable 5′ ends on the guide strand are more effectively incorporated into RISC.
Key Research Foundation:
Khvorova A et al. (2003): Demonstrated thermodynamic asymmetry importance for RISC incorporation
Naito Y et al. (2009): Established thermodynamic stability as a major determinant of siRNA efficiency
Amarzguioui M and Prydz H (2004): Identified asymmetry as critical for distinguishing target genes
Ichihara M et al. (2017): Comprehensive principles including thermodynamic asymmetry for efficacy prediction
Algorithm Components:
5′ End Stability Analysis: Calculates free energy of duplex 5′ terminus (positions 1-4)
3′ End Stability Analysis: Calculates free energy of duplex 3′ terminus (positions -4 to -1)
Asymmetry Ratio Calculation: Measures stability difference (ΔG₃′ - ΔG₅′)
Strand Bias Prediction: Predicts likelihood of correct guide strand selection
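The four steps can be illustrated with a deliberately crude stand-in for the free-energy terms. The sketch below does not call ViennaRNA and does not use nearest-neighbor parameters; it counts hydrogen bonds per base pair (2 for A/U, 3 for G/C) as a rough stability proxy. The function names and the proxy itself are assumptions for illustration only.

```python
# Hydrogen bonds per base pair: a crude proxy, NOT a real free-energy model.
HBONDS = {"A": 2, "U": 2, "G": 3, "C": 3}


def end_stability(guide: str, end: str, window: int = 4) -> int:
    """Sum hydrogen bonds over the first (5') or last (3') `window` positions."""
    if end not in ("5p", "3p"):
        raise ValueError("end must be '5p' or '3p'")
    seq = guide.upper().replace("T", "U")
    if len(seq) < 2 * window:
        raise ValueError("sequence too short for the requested window")
    region = seq[:window] if end == "5p" else seq[-window:]
    return sum(HBONDS[base] for base in region)


def asymmetry_proxy(guide: str, window: int = 4) -> int:
    """Positive when the guide 5' end is less stable than its 3' end."""
    return end_stability(guide, "3p", window) - end_stability(guide, "5p", window)


def favors_guide(guide: str) -> bool:
    """Strand-bias call: a weaker 5' end is taken to favor guide loading."""
    return asymmetry_proxy(guide) > 0
```

For example, a guide that starts with `UAAU` and ends with `GGCC` has a proxy of 12 - 8 = 4, so this heuristic would predict correct guide strand selection. A production scorer would replace `end_stability` with real duplex free energies.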
off_target.py - Specificity Analysis
class OffTargetPredictor:
    """Multi-genome off-target prediction"""

class AlignmentResult:
    """Off-target alignment with scoring"""
4. Model Layer (models/)
Purpose: Data models and validation
Technologies: Pydantic for data validation
Pattern: Data Transfer Objects (DTOs)
Key Models:
class SiRNACandidate(BaseModel):
    """Complete siRNA candidate with all metadata"""
    sirna_id: str
    guide_sequence: str
    passenger_sequence: str
    position: int
    composite_score: float
    asymmetry_score: float
    gc_content: float

class DesignParameters(BaseModel):
    """Design configuration with validation"""
    sirna_length: int = Field(ge=19, le=23)
    gc_min: float = Field(ge=0, le=100)
    gc_max: float = Field(ge=0, le=100)
    top_n: int = Field(ge=1)

class FilterCriteria(BaseModel):
    """Quality filters for candidate selection"""
    gc_min: float = 30.0
    gc_max: float = 60.0
    max_poly_runs: int = 3
    min_asymmetry_score: float = 0.65

class ScoringWeights(BaseModel):
    """Relative weights for composite scoring"""
    asymmetry: float = 0.25
    gc_content: float = 0.20
    accessibility: float = 0.25
    off_target: float = 0.20
    empirical: float = 0.10
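One plausible way to combine these weights is a plain weighted sum over component scores normalized to [0, 1]. The page does not state the actual formula, so treat the sketch below as an assumption. It uses a standard-library dataclass as a stand-in for the Pydantic `ScoringWeights` model, and `composite_score` and `passes_gc_filter` are hypothetical helper names.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Weights:
    """Stand-in for ScoringWeights, using the default values listed above."""
    asymmetry: float = 0.25
    gc_content: float = 0.20
    accessibility: float = 0.25
    off_target: float = 0.20
    empirical: float = 0.10


def composite_score(components: dict[str, float], weights: Weights = Weights()) -> float:
    """Weighted sum of component scores, each expected in [0, 1]."""
    total = 0.0
    for f in fields(weights):
        value = components[f.name]  # KeyError if a component is missing
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{f.name} must be in [0, 1], got {value}")
        total += getattr(weights, f.name) * value
    return total


def passes_gc_filter(gc_content: float, gc_min: float = 30.0, gc_max: float = 60.0) -> bool:
    """GC window check in percent, using the FilterCriteria defaults."""
    return gc_min <= gc_content <= gc_max
```

Because the default weights sum to 1.0, a candidate that scores 1.0 on every component gets a composite score of 1.0, which keeps scores comparable if the weights are later retuned.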
5. Data Layer (data/)
Purpose: External data access and integration
Pattern: Repository/Adapter
Data Providers:
gene_search.py - Gene Information Retrieval
class GeneSearcher:
    """Multi-database gene search"""

class GeneSearchResult:
    """Complete gene search result"""

class EnsemblClient(AbstractDatabaseClient):
    """Ensembl REST API integration"""

class RefSeqClient(AbstractDatabaseClient):
    """RefSeq/NCBI integration"""

class GencodeClient(AbstractDatabaseClient):
    """GENCODE database integration"""
orf_analysis.py - Sequence Analysis
class ORFAnalyzer:
    """Open reading frame validation"""
6. Validation Layer (validation/)
Purpose: Data validation and quality control
Pattern: Middleware/Decorator
Technologies: Pandera for schema validation
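To make the Middleware/Decorator pattern concrete, here is a minimal sketch of a decorator that validates a function's output and reacts according to a strictness level. It is not the siRNAforge middleware: `Level` only mirrors the three strictness values used by the validation configuration, and `validated`, `only_rna_bases`, and `design_guides` are hypothetical names. It also uses a plain check function rather than Pandera schemas.

```python
import functools
import warnings
from enum import Enum
from typing import Callable


class Level(str, Enum):
    STRICT = "strict"
    WARNING = "warning"
    DISABLED = "disabled"


def validated(check: Callable[[list[str]], list[str]], level: Level = Level.STRICT):
    """Decorator factory: run `check` on the wrapped function's output.

    `check` returns a list of problem descriptions; an empty list means valid.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            output = func(*args, **kwargs)
            if level is Level.DISABLED:
                return output
            problems = check(output)
            if problems and level is Level.STRICT:
                raise ValueError("; ".join(problems))
            for problem in problems:  # WARNING level: report but continue
                warnings.warn(problem)
            return output
        return wrapper
    return decorator


def only_rna_bases(sequences: list[str]) -> list[str]:
    """Flag any sequence containing characters other than A, C, G, U."""
    return [f"invalid sequence: {s}" for s in sequences if set(s) - set("ACGU")]


@validated(only_rna_bases, level=Level.WARNING)
def design_guides() -> list[str]:
    # Toy output with one deliberately invalid entry.
    return ["ACGUACGUACGUACGUACGUA", "ACGTXX"]
```

The design function stays unaware of validation, so the same check can be attached at different pipeline stages or switched off without touching algorithm code.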
Validation Components:
config.py - Validation Configuration
class ValidationConfig(BaseModel):
    """Configuration for validation system"""

class ValidationLevel(str, Enum):
    """Validation strictness levels"""
    STRICT = "strict"
    WARNING = "warning"
    DISABLED = "disabled"

class ValidationStage(str, Enum):
    """Pipeline stages for validation"""
    INPUT = "input"
    DESIGN = "design"
    OUTPUT = "output"
utils.py - Validation Utilities
class ValidationResult:
    """Container for validation results"""

def validate_sirna_candidates(df: pd.DataFrame) -> ValidationResult:
    """Validate siRNA candidate data"""

def validate_fasta_sequences(sequences: list) -> ValidationResult:
    """Validate FASTA sequence data"""
7. Pipeline Layer (pipeline/)
Purpose: External pipeline integration
Technologies: Nextflow, Docker
Responsibilities:
Nextflow workflow orchestration
Docker container management
Batch processing coordination
Resource management
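As a rough idea of what workflow orchestration involves at the lowest level, the sketch below assembles a `nextflow run` command line. The `-profile` and `-resume` options and the double-dash pipeline parameters are standard Nextflow CLI conventions; the helper name `build_nextflow_command` and the parameter names used in the example are assumptions, not siRNAforge's actual runner API.

```python
from pathlib import Path
from typing import Optional


def build_nextflow_command(
    pipeline: Path,
    profile: str = "docker",
    resume: bool = False,
    params: Optional[dict[str, str]] = None,
) -> list[str]:
    """Return an argv list suitable for subprocess.run; nothing is executed here."""
    cmd = ["nextflow", "run", str(pipeline), "-profile", profile]
    if resume:
        cmd.append("-resume")  # reuse cached results from a previous run
    # Pipeline parameters use a double dash; sorted for reproducible output.
    for key, value in sorted((params or {}).items()):
        cmd += [f"--{key}", str(value)]
    return cmd


# "input" and "outdir" are illustrative parameter names.
example = build_nextflow_command(
    Path("main.nf"), resume=True, params={"outdir": "results", "input": "candidates.fasta"}
)
```

Building the command as a list rather than a shell string avoids quoting problems when paths contain spaces.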
Pipeline Components:
nextflow/ - Workflow Management
class NextflowConfig:
    """Nextflow execution configuration"""

class NextflowRunner:
    """Nextflow workflow execution"""
resources/ - Resource Management
class ResourceManager:
    """Compute resource allocation and monitoring"""
Design Principles
1. Type Safety
All components use comprehensive type hints and Pydantic validation:
from typing import List, Optional
from pydantic import BaseModel, Field

class DesignParameters(BaseModel):
    sirna_length: int = Field(21, ge=19, le=23)
    top_candidates: int = Field(10, ge=1)
    gc_content_range: tuple[float, float] = (30.0, 60.0)
2. Separation of Concerns
Each layer has distinct responsibilities:
CLI: User interaction
Workflow: Process orchestration
Core: Algorithm implementation
Models: Data representation
Data: External integration
3. Dependency Injection
Components use constructor injection and composition patterns:
class SiRNADesigner:
    def __init__(self, parameters: DesignParameters) -> None:
        """Initialize designer with configuration parameters."""
        self.parameters = parameters
        # Helper components are created lazily, e.g.
        # ThermodynamicsCalculator() is instantiated only when required

class GeneSearcher:
    def __init__(self, timeout: int = 30, max_retries: int = 3) -> None:
        """Initialize with configurable database clients."""
        self.clients: dict[DatabaseType, AbstractDatabaseClient] = {
            DatabaseType.ENSEMBL: EnsemblClient(timeout=timeout),
            DatabaseType.REFSEQ: RefSeqClient(timeout=timeout),
            DatabaseType.GENCODE: GencodeClient(timeout=timeout),
        }
4. Error Handling
Comprehensive error handling with custom exceptions:
class SiRNAForgeException(Exception):
    """Base exception for all siRNAforge errors"""

class DesignException(SiRNAForgeException):
    """siRNA design specific errors"""

class ValidationException(SiRNAForgeException):
    """Input validation errors"""
5. Configuration Management
Centralized configuration with environment support:
class WorkflowConfig(BaseModel):
    """Workflow-specific configuration"""

class ValidationConfig(BaseModel):
    """Validation configuration with environment variable support"""

class NextflowConfig(BaseModel):
    """Nextflow execution parameters"""

    class Config:
        env_prefix = "SIRNAFORGE_"
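The effect of an environment prefix can be shown with the standard library alone. In Pydantic itself, `env_prefix` is honored by settings classes (`BaseSettings`) rather than by plain `BaseModel` subclasses, so the sketch below sidesteps that and reads the variables by hand. The class name `NextflowSettings` and its fields `profile` and `max_cpus` are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping

ENV_PREFIX = "SIRNAFORGE_"


@dataclass(frozen=True)
class NextflowSettings:
    """Illustrative settings object; field names are assumptions."""
    profile: str = "docker"
    max_cpus: int = 4

    @classmethod
    def from_env(cls, environ: Mapping[str, str] = os.environ) -> "NextflowSettings":
        """Read SIRNAFORGE_PROFILE and SIRNAFORGE_MAX_CPUS, falling back to defaults."""
        return cls(
            profile=environ.get(ENV_PREFIX + "PROFILE", cls.profile),
            max_cpus=int(environ.get(ENV_PREFIX + "MAX_CPUS", cls.max_cpus)),
        )


# Passing a mapping explicitly keeps the example independent of the real environment.
settings = NextflowSettings.from_env({"SIRNAFORGE_MAX_CPUS": "8"})
```

Accepting the environment as a parameter also makes the loader easy to unit test without patching `os.environ`.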
Data Flow
1. Complete Workflow
graph TD
A[Gene Query] --> B[Gene Search]
B --> C[Transcript Retrieval]
C --> D[ORF Analysis]
D --> E[siRNA Design]
E --> F[Thermodynamic Analysis]
F --> G[Off-target Prediction]
G --> H[Scoring & Ranking]
H --> I[Output Generation]
2. Component Interaction
graph LR
CLI --> Workflow
Workflow --> Core
Core --> Models
Core --> Data
Pipeline --> Core
Utils --> All[All Components]
Extension Points
1. Custom Scoring Functions
The current implementation uses fixed scoring within SiRNADesigner._score_candidates(). Custom scoring can be implemented by subclassing SiRNADesigner and overriding that method:
class CustomSiRNADesigner(SiRNADesigner):
    def _score_candidates(self, candidates: list[SiRNACandidate]) -> list[SiRNACandidate]:
        """Override with custom scoring logic"""
        for candidate in candidates:
            candidate.composite_score = self._custom_score(candidate)
        return candidates

    def _custom_score(self, candidate: SiRNACandidate) -> float:
        # Replace with custom scoring logic that returns a float
        raise NotImplementedError("supply custom scoring logic")
2. Additional Data Sources
Extend the database client system:
class CustomDatabaseClient(AbstractDatabaseClient):
    async def search_gene(self, query: str) -> GeneSearchResult:
        # Replace with custom gene search logic that returns a GeneSearchResult
        raise NotImplementedError("supply custom gene search logic")

# Register with the searcher (assumes DatabaseType has a CUSTOM member)
searcher = GeneSearcher()
searcher.clients[DatabaseType.CUSTOM] = CustomDatabaseClient()
3. New Validation Rules
Extend the validation system:
from sirnaforge.validation.middleware import ValidationMiddleware

class CustomValidationRules:
    def validate_custom_criteria(self, candidates: pd.DataFrame) -> ValidationResult:
        # Replace with custom validation logic that returns a ValidationResult
        raise NotImplementedError("supply custom validation logic")

# Use with validation middleware
validator = ValidationMiddleware(custom_rules=CustomValidationRules())
Performance Considerations
1. Asynchronous Operations
External API calls and I/O operations use asyncio:
import asyncio

async def search_multiple_databases(query: str) -> list[SearchResult]:
    # Each search_* call is a coroutine; gather runs them concurrently
    tasks = [
        search_ensembl(query),
        search_refseq(query),
        search_gencode(query),
    ]
    return await asyncio.gather(*tasks)
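A runnable version of the same pattern, with stub coroutines standing in for the real database clients, might look like this. `fake_search` and `search_all` are hypothetical names, and the delays only simulate network latency.

```python
import asyncio


async def fake_search(database: str, query: str, delay: float) -> dict[str, str]:
    """Stand-in for a real database client call."""
    await asyncio.sleep(delay)
    return {"database": database, "query": query}


async def search_all(query: str) -> list[dict[str, str]]:
    tasks = [
        fake_search("ensembl", query, 0.02),
        fake_search("refseq", query, 0.01),
        fake_search("gencode", query, 0.03),
    ]
    # gather returns results in the order the awaitables were passed,
    # regardless of which one finishes first.
    return await asyncio.gather(*tasks)


results = asyncio.run(search_all("TP53"))
```

Total wall time is close to the slowest call rather than the sum of all three. Passing `return_exceptions=True` to `asyncio.gather` is worth considering when one failing database should not discard the other results.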
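2. Memory Management
Large datasets are processed in chunks:
from collections.abc import Iterator
from pathlib import Path

def design_from_large_file(file_path: Path) -> Iterator[SiRNACandidate]:
    # Only one chunk of sequences is held in memory at a time
    for chunk in read_fasta_chunks(file_path, chunk_size=1000):
        yield from design_candidates(chunk)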
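The chunked reader itself is not shown on this page. A minimal sketch of what a helper like `read_fasta_chunks` could do, using only the standard library, is given below; the record format (header, sequence) and the implementation are assumptions.

```python
from itertools import islice
from pathlib import Path
from typing import Iterator


def read_fasta(file_path: Path) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) records one at a time."""
    header, parts = None, []
    with open(file_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(parts)
                header, parts = line[1:], []
            elif line and header is not None:
                parts.append(line)  # sequence lines may be wrapped
    if header is not None:
        yield header, "".join(parts)


def read_fasta_chunks(file_path: Path, chunk_size: int = 1000) -> Iterator[list[tuple[str, str]]]:
    """Group records into lists of at most `chunk_size` records."""
    records = read_fasta(file_path)
    while chunk := list(islice(records, chunk_size)):
        yield chunk
```

Because both functions are generators, peak memory is bounded by `chunk_size` records rather than by the file size.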
3. Caching
Expensive computations are cached:
from functools import lru_cache

@lru_cache(maxsize=1000)
def calculate_thermodynamics(sequence: str) -> ThermodynamicResult:
    # Expensive ViennaRNA calculation goes here; results are memoized per sequence
    ...
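The caching behavior can be observed with a cheap stand-in function and `cache_info()`. Here `gc_fraction` is a hypothetical substitute for the expensive thermodynamics call.

```python
from functools import lru_cache


@lru_cache(maxsize=1000)
def gc_fraction(sequence: str) -> float:
    """Cheap stand-in for an expensive per-sequence calculation."""
    return sum(base in "GC" for base in sequence) / len(sequence)


first = gc_fraction("GCAU")   # computed and stored
second = gc_fraction("GCAU")  # served from the cache
info = gc_fraction.cache_info()
```

After these two calls, `info.hits` is 1 and `info.misses` is 1. Since `lru_cache` keys on the exact argument, normalizing sequences (for example to upper case) before the call avoids storing the same result under several keys.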
Testing Architecture
1. Unit Tests (tests/unit/)
Test individual components in isolation
Mock external dependencies
Focus on algorithm correctness
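A unit test in this style replaces the external client with a mock so that no network call is made. The function under test and the client method `fetch_transcripts` are hypothetical, chosen only to show the pattern with `unittest.mock`.

```python
from unittest.mock import Mock


def count_transcripts(client, gene: str) -> int:
    """Toy function under test: asks a client for transcripts and counts them."""
    return len(client.fetch_transcripts(gene))


def test_count_transcripts_uses_client() -> None:
    client = Mock()
    # Placeholder transcript IDs; no real database is contacted.
    client.fetch_transcripts.return_value = ["tx1", "tx2"]

    assert count_transcripts(client, "TP53") == 2
    client.fetch_transcripts.assert_called_once_with("TP53")
```

The final assertion checks the interaction as well as the result, which catches regressions where the function stops passing the gene name through.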
2. Integration Tests (tests/integration/)
Test component interactions
Use real external services (with rate limiting)
Validate end-to-end workflows
3. Pipeline Tests (tests/pipeline/)
Test Nextflow pipeline components
Container-based testing
Resource usage validation
Deployment Architecture
1. Local Development
uv for dependency management
Direct Python execution
Local debugging and testing
2. Container Deployment
Multi-stage Docker builds
Optimized for size and security
Environment-specific configurations
3. Pipeline Deployment
Nextflow for workflow orchestration
Support for multiple execution platforms
Resource management and monitoring
Future Architecture Considerations
1. Microservices
Potential split into specialized services
API gateway for service coordination
Independent scaling of components
2. Cloud Integration
Cloud storage for large datasets
Serverless functions for lightweight operations
Managed services for databases
3. Plugin System
Dynamic loading of algorithms
Third-party extensions
Community contributions
This architecture provides a solid foundation for current needs while maintaining flexibility for future enhancements.