Chemical Modification Annotation Specification
Version: 1.0 Date: 2025-10-19 Status: Implemented
Overview
This specification defines how chemical modifications, overhangs, and provenance metadata are represented and managed in siRNAforge. The system provides a first-class data model for annotating modifications alongside siRNA sequences, ensuring annotations survive all pipeline stages without manual parsing.
Data Models
ChemicalModification
Represents a specific type of chemical modification and the positions where it occurs in the sequence.
Fields:
type (str): Modification type (e.g., "2OMe", "2F", "PS", "LNA")
positions (list[int]): 1-based positions in the sequence where this modification occurs
Example:
from sirnaforge.models.modifications import ChemicalModification
mod = ChemicalModification(type="2OMe", positions=[1, 4, 6, 11, 13, 16, 19])
Provenance
Tracks the origin and validation status of siRNA sequences.
Fields:
source_type (SourceType): Type of source (patent, publication, clinical_trial, database, designed, other)
identifier (str): Source identifier (e.g., patent number, DOI, PubMed ID)
url (str, optional): URL to the source document
Example:
from sirnaforge.models.modifications import Provenance, SourceType
prov = Provenance(
    source_type=SourceType.PATENT,
    identifier="US10060921B2",
    url="https://patents.google.com/patent/US10060921B2"
)
StrandMetadata
Complete metadata for a single siRNA strand, including sequence, modifications, overhangs, and provenance.
Fields:
id (str): Unique identifier for this strand
sequence (str): RNA or DNA sequence
overhang (str, optional): Overhang sequence (e.g., "dTdT" for DNA, "UU" for RNA)
chem_mods (list[ChemicalModification]): List of chemical modifications
notes (str, optional): Additional notes or comments
provenance (Provenance, optional): Source and validation information
confirmation_status (ConfirmationStatus): Experimental confirmation status (pending, confirmed)
Example:
from sirnaforge.models.modifications import (
    StrandMetadata,
    ChemicalModification,
    Provenance,
    SourceType,
    ConfirmationStatus
)
metadata = StrandMetadata(
    id="patisiran_ttr_guide",
    sequence="AUGGAAUACUCUUGGUUAC",
    overhang="dTdT",
    chem_mods=[
        ChemicalModification(type="2OMe", positions=[1, 4, 6, 11, 13, 16, 19])
    ],
    provenance=Provenance(
        source_type=SourceType.PATENT,
        identifier="US10060921B2"
    ),
    confirmation_status=ConfirmationStatus.CONFIRMED
)
SequenceRecord
Associates a strand with its target and role information.
Fields:
target_gene (str): Target gene symbol
strand_role (StrandRole): Role of this strand (guide, sense, antisense, passenger)
metadata (StrandMetadata): Complete strand metadata
FASTA Header Encoding
The system uses standardized key-value pairs in FASTA headers to encode metadata:
>patisiran_ttr_guide Target=TTR; Role=guide; Confirmed=confirmed; Overhang=dTdT; ChemMods=2OMe(1,4,6,11,13,16,19); Provenance=Patent:US10060921B2; URL=https://patents.google.com/patent/US10060921B2
AUGGAAUACUCUUGGUUAC
Format Rules:
Fields are separated by "; " (semicolon followed by a space)
Key-value pairs use "=" (equals sign)
ChemMods syntax: TYPE(pos1,pos2,...) with multiple types separated by |
Multiple modifications: ChemMods=2OMe(1,4,6)|2F(2,5,8)|PS()
Empty positions allowed: 2F() means the modification type is annotated but no specific positions are given
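The format rules above can be sketched as a small parser. This is a minimal illustration using only the standard library; the function names parse_header_fields and parse_chem_mods are hypothetical and are not the siRNAforge implementation, which may handle edge cases differently.

```python
import re

# Hypothetical helpers for illustration only (not siRNAforge's own parser).
# One ChemMods entry looks like TYPE(pos1,pos2,...) and may have no positions.
_CHEM_MOD_RE = re.compile(r"^([^()|]+)\(([\d,\s]*)\)$")


def parse_header_fields(description: str) -> dict:
    """Split 'Key=Value; Key2=Value2' into a dict of raw string values."""
    fields = {}
    for part in description.split("; "):
        key, sep, value = part.partition("=")  # split on the first '=' only
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def parse_chem_mods(value: str) -> list:
    """Parse 'TYPE(p1,p2)|TYPE2()' into a list of type/positions dicts."""
    mods = []
    for token in value.split("|"):
        token = token.strip()
        if not token:
            continue
        match = _CHEM_MOD_RE.match(token)
        if match is None:
            raise ValueError(f"Malformed ChemMods entry: {token!r}")
        mod_type, pos_text = match.groups()
        positions = [int(p) for p in pos_text.split(",") if p.strip()]
        mods.append({"type": mod_type.strip(), "positions": positions})
    return mods


fields = parse_header_fields("Target=TTR; Role=guide; ChemMods=2OMe(1,4,6)|2F(2,5,8)|PS()")
print(fields["Target"])
print(parse_chem_mods(fields["ChemMods"]))
```

Splitting on the first "=" only keeps values that themselves contain "=" (for example, URLs with query strings) intact.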
JSON Sidecar Format
Metadata can be stored in separate JSON files for easier curation:
{
  "patisiran_ttr_guide": {
    "id": "patisiran_ttr_guide",
    "sequence": "AUGGAAUACUCUUGGUUAC",
    "target_gene": "TTR",
    "strand_role": "guide",
    "overhang": "dTdT",
    "chem_mods": [
      {
        "type": "2OMe",
        "positions": [1, 4, 6, 11, 13, 16, 19]
      }
    ],
    "provenance": {
      "source_type": "patent",
      "identifier": "US10060921B2",
      "url": "https://patents.google.com/patent/US10060921B2"
    },
    "confirmation_status": "confirmed",
    "notes": "Alnylam's patisiran (Onpattro) - FDA approved"
  }
}
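When curating sidecar files by hand, a quick sanity check can catch typos before merging. The sketch below uses only the standard json module; validate_sidecar is a hypothetical helper, and the two checks it performs (the "id" field matches its key, and positions fall within the sequence length) are assumptions about what is worth checking, not rules this specification mandates.

```python
import json


def validate_sidecar(text: str) -> list:
    """Return human-readable problems found in a sidecar JSON document.

    Hypothetical helper for illustration; not part of siRNAforge.
    """
    problems = []
    for key, entry in json.loads(text).items():
        if entry.get("id", key) != key:
            problems.append(f"{key}: 'id' field does not match its key")
        seq_len = len(entry.get("sequence", ""))
        for mod in entry.get("chem_mods", []):
            for pos in mod.get("positions", []):
                # Positions are 1-based, so valid values are 1..len(sequence).
                if seq_len and not 1 <= pos <= seq_len:
                    problems.append(
                        f"{key}: {mod.get('type', '?')} position {pos} "
                        f"is outside 1..{seq_len}"
                    )
    return problems


example = """
{
  "patisiran_ttr_guide": {
    "id": "patisiran_ttr_guide",
    "sequence": "AUGGAAUACUCUUGGUUAC",
    "chem_mods": [{"type": "2OMe", "positions": [1, 4, 6, 11, 13, 16, 19]}]
  }
}
"""
print(validate_sidecar(example))
```

Entries without a "sequence" key skip the range check, since the sequence may live only in the FASTA file.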
Python API
Loading Metadata
from sirnaforge.modifications import load_metadata
# Load from JSON file
metadata_dict = load_metadata("path/to/metadata.json")
# Access specific strand
strand_meta = metadata_dict["patisiran_ttr_guide"]
print(strand_meta["overhang"]) # "dTdT"
print(strand_meta["chem_mods"]) # List of modifications
Parsing FASTA Headers
from Bio import SeqIO
from sirnaforge.modifications import parse_header
# Parse FASTA file
records = SeqIO.parse("sequences.fasta", "fasta")
for record in records:
    metadata = parse_header(record)
    print(f"ID: {metadata['id']}")
    print(f"Target: {metadata.get('target_gene')}")
    print(f"Modifications: {metadata.get('chem_mods')}")
Merging Metadata into FASTA
from sirnaforge.modifications import merge_metadata_into_fasta
# Merge JSON metadata into FASTA headers
count = merge_metadata_into_fasta(
    fasta_path="sequences.fasta",
    metadata_path="metadata.json",
    output_path="sequences_annotated.fasta"
)
print(f"Updated {count} sequences")
Creating Metadata Programmatically
from sirnaforge.models.modifications import (
    StrandMetadata,
    ChemicalModification,
    SequenceRecord,
    StrandRole
)
# Create strand metadata
metadata = StrandMetadata(
    id="my_sirna_guide",
    sequence="AUCGAUCGAUCGAUCGAUCGA",
    overhang="dTdT",
    chem_mods=[
        ChemicalModification(type="2OMe", positions=[1, 3, 5])
    ]
)
# Create full sequence record
record = SequenceRecord(
    target_gene="KRAS",
    strand_role=StrandRole.GUIDE,
    metadata=metadata
)
# Generate FASTA
fasta = record.to_fasta()
print(fasta)
CLI Usage
Show Sequences with Metadata
Display sequences in table format:
sirnaforge sequences show sequences.fasta
Display specific sequence:
sirnaforge sequences show sequences.fasta --id patisiran_ttr_guide
Output as JSON:
sirnaforge sequences show sequences.fasta --format json
Output as FASTA:
sirnaforge sequences show sequences.fasta --format fasta
Annotate FASTA with Metadata
Merge metadata from JSON into FASTA headers:
sirnaforge sequences annotate sequences.fasta metadata.json -o annotated.fasta
The annotate command:
Reads the input FASTA file
Loads metadata from the JSON file
Matches sequences by ID
Generates updated FASTA headers with metadata
Writes annotated sequences to output file
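The five steps above can be sketched in plain Python. This is an illustration only: build_header and annotate_fasta_text are hypothetical names, the sketch works on in-memory strings rather than files, and it emits only a subset of header keys (Target, Role, Overhang, ChemMods). The actual command may order or format fields differently.

```python
def build_header(seq_id: str, meta: dict) -> str:
    """Render a FASTA header line from one metadata entry (subset of keys)."""
    fields = []
    if meta.get("target_gene"):
        fields.append(f"Target={meta['target_gene']}")
    if meta.get("strand_role"):
        fields.append(f"Role={meta['strand_role']}")
    if meta.get("overhang"):
        fields.append(f"Overhang={meta['overhang']}")
    if meta.get("chem_mods"):
        entries = []
        for mod in meta["chem_mods"]:
            positions = ",".join(str(p) for p in mod.get("positions", []))
            entries.append(f"{mod['type']}({positions})")
        fields.append("ChemMods=" + "|".join(entries))
    if not fields:
        return f">{seq_id}"
    return f">{seq_id} " + "; ".join(fields)


def annotate_fasta_text(fasta_text: str, metadata: dict) -> tuple:
    """Rewrite headers whose ID appears in metadata; return (text, count)."""
    out, updated = [], 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split(maxsplit=1)
            seq_id = parts[0] if parts else ""
            if seq_id in metadata:  # match sequences by ID
                out.append(build_header(seq_id, metadata[seq_id]))
                updated += 1
                continue
        out.append(line)  # sequence lines and unmatched headers pass through
    return "\n".join(out) + "\n", updated


fasta = ">sirna_001\nAUCGAUCGAUCGAUCGAUCGA\n"
meta = {
    "sirna_001": {
        "target_gene": "BRCA1",
        "strand_role": "guide",
        "overhang": "dTdT",
        "chem_mods": [{"type": "2OMe", "positions": [1, 4, 6]}],
    }
}
annotated, count = annotate_fasta_text(fasta, meta)
print(annotated, end="")
print(f"Updated {count} sequences")
```

Headers whose ID has no metadata entry are written unchanged, so a partial sidecar file does not drop sequences.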
Integration with SiRNACandidate
The SiRNACandidate model includes optional metadata fields:
from sirnaforge.models.sirna import SiRNACandidate
from sirnaforge.models.modifications import StrandMetadata
# Create candidate with metadata
candidate = SiRNACandidate(
    id="sirna_001",
    transcript_id="ENST00000123456",
    position=100,
    guide_sequence="AUCGAUCGAUCGAUCGAUCGA",
    passenger_sequence="UCGAUCGAUCGAUCGAUCGAU",
    gc_content=47.6,  # 10 G/C out of 21 nt in the guide sequence
    length=21,
    asymmetry_score=0.75,
    composite_score=85.2,
    guide_metadata=StrandMetadata(
        id="sirna_001_guide",
        sequence="AUCGAUCGAUCGAUCGAUCGA",
        overhang="dTdT"
    )
)
# Generate FASTA with metadata
fasta = candidate.to_fasta(include_metadata=True)
Modification Type Reference
Common chemical modifications:
| Type | Full Name | Description |
|---|---|---|
| 2OMe | 2'-O-methyl | Ribose 2' position methylation (nuclease resistance) |
| 2F | 2'-fluoro | Ribose 2' position fluorination (enhanced stability) |
| PS | Phosphorothioate | Phosphate backbone sulfur substitution (nuclease resistance) |
| LNA | Locked Nucleic Acid | Bicyclic ribose analog (enhanced binding affinity) |
| | Peptide Nucleic Acid | Peptide backbone (high stability) |
| | 2'-O-methoxyethyl | Ribose modification (improved pharmacokinetics) |
Note: The system accepts free-text modification types, allowing for proprietary or novel modifications to be annotated.
Position Numbering
All positions are 1-based (first nucleotide is position 1)
Positions are relative to the 5' end of the strand
Example: For sequence "AUCGAUCG", position 1 is "A", position 4 is "G"
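Because Python strings are 0-indexed, code that consumes these annotations has to subtract one when looking up a nucleotide. A minimal sketch follows; base_at is a hypothetical helper, not a siRNAforge function.

```python
def base_at(sequence: str, position: int) -> str:
    """Return the nucleotide at a 1-based position counted from the 5' end."""
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside 1..{len(sequence)}")
    return sequence[position - 1]  # convert 1-based position to 0-based index


print(base_at("AUCGAUCG", 1))  # A
print(base_at("AUCGAUCG", 4))  # G
```

Rejecting position 0 explicitly matters here: without the range check, an off-by-one error would silently read index -1 and return the last nucleotide.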
Backward Compatibility
FASTA files without metadata annotations work seamlessly
Missing metadata fields default to None or empty values
Legacy workflows continue to function without modification
The system gracefully handles partial metadata
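One way to picture this behavior is a container whose fields all carry defaults, so a bare or partially annotated header still yields a usable object. The sketch below is an assumption-laden illustration: HeaderMetadata and read_header are hypothetical names, only three header keys are handled, and the real models may be implemented differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HeaderMetadata:
    """Hypothetical container; every field except id has a default."""
    id: str
    target_gene: Optional[str] = None
    strand_role: Optional[str] = None
    overhang: Optional[str] = None
    chem_mods: list = field(default_factory=list)  # left unparsed in this sketch


# Header keys handled by this sketch, mapped to attribute names.
_KEY_TO_ATTR = {"Target": "target_gene", "Role": "strand_role", "Overhang": "overhang"}


def read_header(line: str) -> HeaderMetadata:
    """Read a FASTA header, ignoring anything that is not a known Key=Value."""
    seq_id, _, rest = line.lstrip(">").partition(" ")
    meta = HeaderMetadata(id=seq_id)
    for part in rest.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep and key in _KEY_TO_ATTR:
            setattr(meta, _KEY_TO_ATTR[key], value.strip())
    return meta


print(read_header(">sirna_001"))
print(read_header(">sirna_002 Target=TTR; Overhang=dTdT"))
```

A legacy header with a free-text description and no Key=Value pairs passes through with only the id set.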
Best Practices
Consistent IDs: Use consistent sequence IDs between FASTA and JSON files
Validate Sources: Include provenance information for all curated sequences
Version Control: Store metadata JSON files in version control alongside FASTA
Documentation: Use the notes field to document important details
Confirmation Status: Mark experimentally validated sequences as "confirmed"
Example Workflow
Create Initial FASTA:
# Start with basic sequences
cat > sequences.fasta << EOF
>sirna_001
AUCGAUCGAUCGAUCGAUCGA
EOF
Curate Metadata:
# Create JSON with modifications
cat > metadata.json << EOF
{
  "sirna_001": {
    "id": "sirna_001",
    "target_gene": "BRCA1",
    "strand_role": "guide",
    "overhang": "dTdT",
    "chem_mods": [{"type": "2OMe", "positions": [1, 4, 6]}]
  }
}
EOF
Merge Metadata:
sirnaforge sequences annotate sequences.fasta metadata.json -o annotated.fasta
View Results:
sirnaforge sequences show annotated.fasta
Use in Workflows:
from sirnaforge.modifications import parse_header
from Bio import SeqIO
for record in SeqIO.parse("annotated.fasta", "fasta"):
    metadata = parse_header(record)
    # Process with full metadata available
Future Extensions
Potential future enhancements:
Validation rules for position-specific chemistry compatibility
Support for duplex-level annotations (e.g., lipid conjugates)
Delivery vehicle and tropism annotations
Integration with chemical synthesis planning tools
Visualization of modification patterns
References
siRNAforge GitHub: https://github.com/austin-s-h/sirnaforge
Issue #[number]: Implement best practice system for metadata of chemical modifications