| Title: | R/Bioconductor Variant Simulator with HGVS Notation |
|---|---|
| Description: | Simulates all possible single nucleotide variants (SNVs) across MANE Select transcripts including coding sequence (CDS), untranslated regions (UTRs), and canonical splice sites. Outputs variants in HGVS notation. Provides a comprehensive HGVS toolkit: parsing, syntactic and semantic validation, normalization (3' shifting, common affix trimming), format conversion (HGVS <-> VCF <-> SPDI), transcription mapping (genomic <-> coding), translation to protein consequence, backtranslation from protein to coding variants, variant extraction from sequence alignment, and liftover between genome assemblies. |
| Authors: | Liu Sun [aut, cre] (ORCID: <https://orcid.org/0000-0002-1459-9478>) |
| Maintainer: | Liu Sun <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.99.1 |
| Built: | 2026-06-10 09:54:24 UTC |
| Source: | https://github.com/BiocStaging/rvarsim |
Given an observed protein (p.) variant, infers all possible coding (c.) variants that could produce it by enumerating possible codon changes in the genetic code.
backtranslate_hgvs(hgvs_p, txdb, bsgenome, genetic_code = NULL)backtranslate_hgvs(hgvs_p, txdb, bsgenome, genetic_code = NULL)
hgvs_p |
Character vector of HGVS p. notation strings
(e.g., |
txdb |
A |
bsgenome |
A |
genetic_code |
Genetic code table (default: standard code). |
For example, p.Arg215Gly could be caused by any of the 6 AGA/Gly codon pair changes. This function enumerates all possibilities.
A list of character vectors, where each element is a vector of possible HGVS c. notation strings for the corresponding protein variant.
# The backtranslate function requires a TxDb and BSgenome with # matching seqlevels (use same reference for both): if (requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE) && requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { cat("Ready for backtranslation") }# The backtranslate function requires a TxDb and BSgenome with # matching seqlevels (use same reference for both): if (requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE) && requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { cat("Ready for backtranslation") }
Aligns a reference and observed sequence, identifies differences (substitutions, insertions, deletions), and generates HGVS variant descriptions.
extract_hgvs( ref_seq, obs_seq, accession, notation = "c", start_pos = 1L, algorithm = c("needleman", "smith") )extract_hgvs( ref_seq, obs_seq, accession, notation = "c", start_pos = 1L, algorithm = c("needleman", "smith") )
ref_seq |
Character string: reference sequence. |
obs_seq |
Character string: observed (variant) sequence. |
accession |
Accession number for the HGVS output
(e.g., |
notation |
HGVS notation prefix. Default |
start_pos |
Starting position offset. For c. notation, 1 = first base of CDS. For g. notation, the genomic position of the first base in the reference sequence. |
algorithm |
Alignment algorithm: |
A character vector of HGVS variant description strings, one
per detected variant. Returns "=" if sequences are
identical.
ref <- "ATGCGTACGTAG" obs <- "ATGCATACCTAG" extract_hgvs(ref, obs, "NM_000546.6", "c", 1)ref <- "ATGCGTACGTAG" obs <- "ATGCATACCTAG" extract_hgvs(ref, obs, "NM_000546.6", "c", 1)
Retrieves MANE Select transcripts using AnnotationHub and ensembldb. MANE Select transcripts are a single representative transcript per protein-coding gene, jointly curated by NCBI and EMBL-EBI.
fetch_mane_txdb( txdb = NULL, ah_id = NULL, species = "Homo sapiens", assembly = "GRCh38" )fetch_mane_txdb( txdb = NULL, ah_id = NULL, species = "Homo sapiens", assembly = "GRCh38" )
txdb |
An |
ah_id |
Optional AnnotationHub ID (e.g. |
species |
Species identifier. Default |
assembly |
Genome assembly name. Default |
An EnsDb object containing only MANE Select transcripts,
or the original TxDb/EnsDb if filtering is not
possible.
# Auto-fetch requires AnnotationHub (internet): if (requireNamespace("AnnotationHub", quietly = TRUE) && requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE)) { cat("Ready for MANE transcript fetching") }# Auto-fetch requires AnnotationHub (internet): if (requireNamespace("AnnotationHub", quietly = TRUE) && requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE)) { cat("Ready for MANE transcript fetching") }
Converts a variants data.frame (from generate_variants)
into HGVS-compliant coding DNA notation.
format_hgvs(variants, use_transcript_version = FALSE)format_hgvs(variants, use_transcript_version = FALSE)
variants |
A |
use_transcript_version |
Logical. If |
HGVS conventions used:
CDS: NM_.*:c.[pos][ref]>[alt] (e.g., c.215C>G)
5' UTR: c.-[pos][ref]>[alt] (e.g., c.-14G>A)
3' UTR: c.*[pos][ref]>[alt] (e.g., c.*32T>C)
Splice donor: c.[boundary]+[offset][ref]>[alt]
(e.g., c.453+1G>T)
Splice acceptor: c.[boundary]-[offset][ref]>[alt]
(e.g., c.612-2A>G)
A data.frame with all original columns plus:
HGVS coding DNA notation string.
HGVS genomic notation string
(e.g., NC_000001.11:g.123456A>G).
# Self-contained format demo: df <- data.frame( transcript_id = "NM_TEST", region = c("cds", "five_utr"), genomic_pos = c(215L, 30L), genomic_ref = c("C", "G"), genomic_alt = c("G", "A"), cds_pos = c(215L, -14L), exon_boundary = c(NA_integer_, NA_integer_), offset = c(NA_integer_, NA_integer_), stringsAsFactors = FALSE ) format_hgvs(df)# Self-contained format demo: df <- data.frame( transcript_id = "NM_TEST", region = c("cds", "five_utr"), genomic_pos = c(215L, 30L), genomic_ref = c("C", "G"), genomic_alt = c("G", "A"), cds_pos = c(215L, -14L), exon_boundary = c(NA_integer_, NA_integer_), offset = c(NA_integer_, NA_integer_), stringsAsFactors = FALSE ) format_hgvs(df)
For each position across CDS, UTR, and splice site regions, extracts the reference base from the genome and generates the three possible alternative alleles (A, C, G, T minus the reference).
generate_variants( tx_struct, bsgenome, regions = c("cds", "five_utr", "three_utr", "splice_site"), transcript_ids = NULL )generate_variants( tx_struct, bsgenome, regions = c("cds", "five_utr", "three_utr", "splice_site"), transcript_ids = NULL )
tx_struct |
A |
bsgenome |
A |
regions |
Character vector specifying which region types to
include. Options: |
transcript_ids |
Optional character vector of transcript IDs
to limit generation. |
A data.frame with columns:
Transcript identifier.
Region type: "cds", "five_utr", "three_utr", "splice_donor", "splice_acceptor".
Genomic position (1-based).
Reference allele at this position.
Alternative allele.
Position relative to CDS start (1 = A of ATG; negative = 5'UTR; NA for splice sites described by boundary).
For splice sites: the nearest coding coordinate (NA for other regions).
For splice sites: the intronic offset
(+1, +2, -1, -2; NA otherwise).
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { cat("Generate variants with BSgenome.Hsapiens.UCSC.hg38") }if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { cat("Generate variants with BSgenome.Hsapiens.UCSC.hg38") }
Extracts the coding sequence (CDS), 5' UTR, 3' UTR, and canonical splice site positions for one or more transcripts from a TxDb/EnsDb.
get_transcript_structure(txdb, transcript_ids = NULL)get_transcript_structure(txdb, transcript_ids = NULL)
txdb |
A |
transcript_ids |
Optional character vector of transcript IDs
to limit extraction. If |
Canonical splice sites are defined as the first and last two positions of each intron (±1, ±2 from exon boundaries).
A list with components:
A GRangesList of CDS exons by transcript.
A GRangesList of 5' UTR regions by transcript.
A GRangesList of 3' UTR regions by transcript.
A GRangesList of all exons by transcript.
A GRanges of splice site positions
(donor +1,+2 and acceptor -1,-2).
Named integer vector of CDS start positions (genomic coordinate of ATG).
Named integer vector of CDS end positions (genomic coordinate of stop codon last base).
Named character vector of strand ("+" or "-").
library(EnsDb.Hsapiens.v86) mane <- fetch_mane_txdb(EnsDb.Hsapiens.v86) struct <- get_transcript_structure(mane, transcript_ids = NULL)library(EnsDb.Hsapiens.v86) mane <- fetch_mane_txdb(EnsDb.Hsapiens.v86) struct <- get_transcript_structure(mane, transcript_ids = NULL)
Converts between HGVS notation and common variant representation formats: VCF (Variant Call Format), SPDI (Sequence Position Deletion Insertion), and VRS (Variation Representation Specification, basic support).
A data.frame or character vector, depending on the
conversion function used.
VCF uses 1-based positions: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO. Conversion from HGVS c. notation requires transcript-to-genomic mapping.
NCBI's SPDI uses 0-based interbase coordinates: sequence_id:position:deleted_sequence:inserted_sequence
SPDI uses 0-based interbase coordinates: seqid:position:deletion:insertion
hgvs_to_spdi(hgvs_strings, assembly = "GRCh38")hgvs_to_spdi(hgvs_strings, assembly = "GRCh38")
hgvs_strings |
Character vector of HGVS g. notation strings. |
assembly |
Genome assembly name. |
Character vector of SPDI strings.
hgvs_to_spdi("NC_000001.11:g.123456A>G")hgvs_to_spdi("NC_000001.11:g.123456A>G")
Convert HGVS to VCF data.frame
hgvs_to_vcf(hgvs_strings, txdb = NULL, bsgenome = NULL)hgvs_to_vcf(hgvs_strings, txdb = NULL, bsgenome = NULL)
hgvs_strings |
Character vector of HGVS variant descriptions. |
txdb |
Optional |
bsgenome |
Optional |
A data.frame with VCF columns: CHROM, POS, ID, REF,
ALT, QUAL, FILTER, INFO.
# VCF conversion: vcf <- hgvs_to_vcf("NC_000001.11:g.123456A>G") # SPDI round-trip: hgvs_to_spdi("NC_000001.11:g.123456A>G")# VCF conversion: vcf <- hgvs_to_vcf("NC_000001.11:g.123456A>G") # SPDI round-trip: hgvs_to_spdi("NC_000001.11:g.123456A>G")
Check if an HGVS string is valid (convenience wrapper)
is_valid_hgvs(hgvs_string, ...)is_valid_hgvs(hgvs_string, ...)
hgvs_string |
A single HGVS variant string. |
... |
Additional arguments passed to |
TRUE if both syntactic and semantic checks pass.
is_valid_hgvs("NM_000546.6:c.215C>G")is_valid_hgvs("NM_000546.6:c.215C>G")
Lift over a c. variant by converting to genomic, lifting, then converting back
liftover_c_hgvs( hgvs_c, source_txdb, source_bsgenome, chain_file, target_txdb, target_bsgenome )liftover_c_hgvs( hgvs_c, source_txdb, source_bsgenome, chain_file, target_txdb, target_bsgenome )
hgvs_c |
A HGVS c. notation string. |
source_txdb |
TxDb for the source assembly. |
source_bsgenome |
BSgenome for the source assembly. |
chain_file |
Path to chain file. |
target_txdb |
TxDb for the target assembly. |
target_bsgenome |
BSgenome for the target assembly. |
HGVS c. notation string in the target assembly, or NA.
if (requireNamespace("rtracklayer", quietly = TRUE)) { cat("rtracklayer available for c. liftover") }if (requireNamespace("rtracklayer", quietly = TRUE)) { cat("rtracklayer available for c. liftover") }
Converts HGVS variants between genome assemblies (e.g., hg19 to
hg38) using chain files via the rtracklayer package.
Supports both g. and c. notation (c. requires TxDb for the target
assembly).
liftover_hgvs(hgvs_strings, chain_file, target_bsgenome = NULL)liftover_hgvs(hgvs_strings, chain_file, target_bsgenome = NULL)
hgvs_strings |
Character vector of HGVS variant descriptions. |
chain_file |
Path to a UCSC chain file (e.g.,
|
target_bsgenome |
A |
Character vector of HGVS strings mapped to the target
assembly, or NA for variants that cannot be mapped.
# Liftover requires rtracklayer and a chain file: if (requireNamespace("rtracklayer", quietly = TRUE)) { cat("rtracklayer available for liftover") }# Liftover requires rtracklayer and a chain file: if (requireNamespace("rtracklayer", quietly = TRUE)) { cat("rtracklayer available for liftover") }
Applies HGVS normalization rules: 3' shifting for indels in repetitive regions, rewriting insertions as duplications when applicable, and trimming common prefixes/suffixes.
normalize_hgvs(hgvs_strings, bsgenome = NULL, txdb = NULL)normalize_hgvs(hgvs_strings, bsgenome = NULL, txdb = NULL)
hgvs_strings |
Character vector of HGVS variant descriptions. |
bsgenome |
A |
txdb |
Optional |
Normlization ensures a single canonical representation for each variant, which is critical for comparison and deduplication.
A character vector of normalized HGVS strings, or the original string if normalization cannot be applied.
# Normalize an HGVS string: normalize_hgvs("NM_000546.6:c.215C>G")# Normalize an HGVS string: normalize_hgvs("NM_000546.6:c.215C>G")
Parses HGVS-formatted variant strings into structured R list objects. Supports all major HGVS notation types: substitution (>), deletion (del), insertion (ins), duplication (dup), inversion (inv), deletion- insertion (delins), repeat, and frameshift variants across c., g., p., n., m., and r. notation prefixes.
parse_hgvs(hgvs_strings, strict = FALSE)parse_hgvs(hgvs_strings, strict = FALSE)
hgvs_strings |
Character vector of HGVS variant descriptions. |
strict |
Logical. If |
A list of parsed HGVS objects, each with components:
Variant type: "substitution", "deletion", "insertion", "duplication", "inversion", "delins", "frameshift", "repeat", or "unknown".
Accession number (e.g., "NM_000546.6").
Notation prefix: "c", "g", "p", "n", "m", "r".
Reference allele/sequence.
Alternate allele/sequence.
List with start, end,
offset_start, offset_end, is_utr,
utr_offset.
Original input string.
parse_hgvs(c("NM_000546.6:c.215C>G", "NC_000001.11:g.123456delA"))parse_hgvs(c("NM_000546.6:c.215C>G", "NC_000001.11:g.123456delA"))
Main entry point for the variant simulation pipeline. Fetches MANE Select transcripts, extracts transcript structure, generates all possible single nucleotide variants across CDS, UTR, and canonical splice site regions, and formats them in HGVS notation.
simulate_variants( txdb = NULL, bsgenome = NULL, transcript_ids = NULL, regions = c("cds", "five_utr", "three_utr", "splice_site"), use_messages = TRUE )simulate_variants( txdb = NULL, bsgenome = NULL, transcript_ids = NULL, regions = c("cds", "five_utr", "three_utr", "splice_site"), use_messages = TRUE )
txdb |
An |
bsgenome |
A |
transcript_ids |
Optional character vector of transcript IDs
to simulate. |
regions |
Character vector. Which regions to simulate.
Options: |
use_messages |
Logical. If |
A data.frame with columns:
Transcript identifier.
Region type.
Genomic position (1-based).
Reference allele.
Alternative allele.
CDS-relative position (NA for splice sites).
Exon boundary coordinate (splice sites only).
Intronic offset (splice sites only).
HGVS coding DNA notation.
HGVS genomic notation.
fetch_mane_txdb for obtaining MANE Select transcripts.
get_transcript_structure for transcript structure
extraction.
generate_variants for the variant generation step.
format_hgvs for HGVS formatting details.
# Check validity of HGVS strings: is_valid_hgvs("NM_000546.6:c.215C>G") # Simulate requires matching TxDb + BSgenome with same seqlevels style: # library(EnsDb.Hsapiens.v86) # library(BSgenome.Hsapiens.UCSC.hg38) # simulate_variants(EnsDb.Hsapiens.v86, BSgenome.Hsapiens.UCSC.hg38, # transcript_ids = "ENST00000357654")# Check validity of HGVS strings: is_valid_hgvs("NM_000546.6:c.215C>G") # Simulate requires matching TxDb + BSgenome with same seqlevels style: # library(EnsDb.Hsapiens.v86) # library(BSgenome.Hsapiens.UCSC.hg38) # simulate_variants(EnsDb.Hsapiens.v86, BSgenome.Hsapiens.UCSC.hg38, # transcript_ids = "ENST00000357654")
Convert SPDI to HGVS g. notation
spdi_to_hgvs(spdi_strings)spdi_to_hgvs(spdi_strings)
spdi_strings |
Character vector of SPDI strings. |
Character vector of HGVS g. notation strings.
spdi_to_hgvs("NC_000001.11:123455:A:G")spdi_to_hgvs("NC_000001.11:123455:A:G")
Converts HGVS variants between genomic (g.) and coding (c.) notation using a TxDb/EnsDb to resolve exon/CDS structure. Handles intronic offsets, UTR positions, and splice boundary coordinates.
transcribe_hgvs( hgvs_strings, txdb, bsgenome, transcript_id = NULL, direction = c("c_to_g", "g_to_c", "auto") ) c_to_g(hgvs_strings, txdb, bsgenome, transcript_id = NULL) g_to_c(hgvs_strings, txdb, bsgenome, transcript_id = NULL)transcribe_hgvs( hgvs_strings, txdb, bsgenome, transcript_id = NULL, direction = c("c_to_g", "g_to_c", "auto") ) c_to_g(hgvs_strings, txdb, bsgenome, transcript_id = NULL) g_to_c(hgvs_strings, txdb, bsgenome, transcript_id = NULL)
hgvs_strings |
Character vector of HGVS variant descriptions. |
txdb |
A |
bsgenome |
A |
transcript_id |
Optional transcript ID to use when disambiguating c. notation (required when multiple transcripts share the same accession). |
direction |
Direction of transcription: |
A list of HGVS variant strings in the target notation.
# Transcribe needs a TxDb with matching BSgenome seqlevels: if (requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE) && requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { cat("Ready for transcription mapping") }# Transcribe needs a TxDb with matching BSgenome seqlevels: if (requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE) && requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { cat("Ready for transcription mapping") }
Translates CDS nucleotide variants to protein (p.) HGVS notation using the genetic code and CDS structure from a TxDb/EnsDb.
translate_hgvs(hgvs_strings, txdb, bsgenome, genetic_code = NULL)translate_hgvs(hgvs_strings, txdb, bsgenome, genetic_code = NULL)
hgvs_strings |
Character vector of HGVS c. notation strings. |
txdb |
A |
bsgenome |
A |
genetic_code |
A named integer vector or genetic code table
mapping codons to single-letter amino acids. If |
Supported consequence types:
Missense: p.Arg215Gly (substitution causing amino
acid change)
Nonsense: p.Trp215Ter (premature stop codon)
Silent: p.(=) (no amino acid change)
Frameshift: p.Arg215LeufsTer5 (frameshift with
altered reading frame)
Inframe deletion: p.Lys215_Gly217del
Inframe insertion: p.Lys215_Gly216insAla
Character vector of HGVS p. notation strings, or
NA for variants that cannot be translated (e.g.,
non-coding variants).
# Translate requires a TxDb with matching BSgenome: if (requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE) && requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { cat("Ready for translation") }# Translate requires a TxDb with matching BSgenome: if (requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE) && requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { cat("Ready for translation") }
Performs syntactic and semantic validation of HGVS variant strings. Syntactic validation checks that the string follows correct HGVS grammar. Semantic validation verifies that positions are in range, reference alleles match expected sequence, and the variant is biologically plausible.
validate_hgvs(hgvs_strings, txdb = NULL, bsgenome = NULL)validate_hgvs(hgvs_strings, txdb = NULL, bsgenome = NULL)
hgvs_strings |
Character vector of HGVS variant descriptions. |
txdb |
Optional |
bsgenome |
Optional |
A data.frame with columns:
Original input string.
Logical: passes syntactic checks.
Logical: passes semantic checks.
Error message from syntactic check (or NA).
Error message from semantic check (or NA).
Variant type if successfully parsed.
validate_hgvs(c("NM_000546.6:c.215C>G", "invalid string"))validate_hgvs(c("NM_000546.6:c.215C>G", "invalid string"))
Convert VCF row to HGVS notation
vcf_to_hgvs(vcf_df, assembly = "GRCh38")vcf_to_hgvs(vcf_df, assembly = "GRCh38")
vcf_df |
A |
assembly |
Genome assembly name for creating NC_ accession (e.g., "GRCh38"). |
Character vector of HGVS g. notation strings.
vcf <- data.frame(CHROM = "1", POS = 123456L, REF = "A", ALT = "G") vcf_to_hgvs(vcf, "GRCh38")vcf <- data.frame(CHROM = "1", POS = 123456L, REF = "A", ALT = "G") vcf_to_hgvs(vcf, "GRCh38")