rvarsim: Variant Simulation with HGVS Notation

Introduction

rvarsim simulates all possible single nucleotide variants (SNVs) across MANE Select transcripts and outputs them in HGVS notation. It also provides a toolkit for working with HGVS variant descriptions: parsing, validation, normalization, format conversion, mapping between coding and genomic coordinates, translation, and liftover.

Variant Simulation Pipeline

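The pipeline generates every possible SNV for a reference transcript in four steps (fetch the MANE Select transcripts, extract one transcript's structure, generate the variants, and format them as HGVS):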

library(rvarsim)
library(EnsDb.Hsapiens.v86)
library(BSgenome.Hsapiens.UCSC.hg38)

# Fetch MANE Select transcripts
mane <- fetch_mane_txdb(EnsDb.Hsapiens.v86)

# Get transcript structure
struct <- get_transcript_structure(mane, "ENST00000357654")

# Generate variants
vars <- generate_variants(struct, BSgenome.Hsapiens.UCSC.hg38)

# Add HGVS notation
hgvs <- format_hgvs(vars)
head(hgvs[, c("region", "genomic_ref", "genomic_alt", "hgvs_c")])

Or use the all-in-one wrapper:

result <- simulate_variants(
    txdb     = EnsDb.Hsapiens.v86,
    bsgenome = BSgenome.Hsapiens.UCSC.hg38,
    transcript_ids = "ENST00000357654",
    regions  = c("cds", "splice_site")
)
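
To make "all possible SNVs" concrete: each reference base has three alternate alleles, so a region of n bases yields 3n variants. The following is a minimal base-R sketch of that enumeration, not the package's implementation. `enumerate_snvs` is a hypothetical helper, and it assumes the input is a coding sequence whose first base is c.1.

```r
# Hypothetical helper (not part of rvarsim): enumerate every SNV of a short
# coding sequence. Assumes the first base of `dna` is CDS position `start`.
enumerate_snvs <- function(dna, start = 1L) {
    bases <- c("A", "C", "G", "T")
    ref <- strsplit(toupper(dna), "", fixed = TRUE)[[1]]
    stopifnot(length(ref) > 0L, all(ref %in% bases))
    pos <- seq_along(ref) + start - 1L
    # For each reference base, keep the three bases that differ from it.
    alts <- lapply(ref, function(b) setdiff(bases, b))
    out <- data.frame(
        pos = rep(pos, each = 3L),
        ref = rep(ref, each = 3L),
        alt = unlist(alts),
        stringsAsFactors = FALSE
    )
    # Simple substitution syntax: <position><ref>><alt>
    out$hgvs_c <- paste0("c.", out$pos, out$ref, ">", out$alt)
    out
}

snvs <- enumerate_snvs("ATG")
nrow(snvs)        # 3 bases x 3 alternates = 9 rows
head(snvs, 3)     # the three substitutions at c.1
```

The sketch ignores strand, introns, UTRs, and splice-site offsets, all of which a real transcript-aware generator has to handle.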

HGVS Parsing and Validation

library(rvarsim)
# Parse HGVS strings into structured objects
variant <- parse_hgvs("NM_000546.6:c.215C>G")[[1]]
variant$type        # "substitution"
## [1] "substitution"
variant$reference   # "C"
## [1] "C"
variant$alternate   # "G"
## [1] "G"
variant$position$start  # 215
## [1] 215
# Validate
is_valid_hgvs("NM_000546.6:c.215C>G")  # TRUE
## [1] TRUE
is_valid_hgvs("garbage string")        # FALSE
## [1] FALSE
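
As a rough illustration of what parsing a simple substitution involves, the base-R sketch below splits an HGVS string into the same kinds of fields shown above using a regular expression. `parse_substitution` is a hypothetical helper, not the package's parser, and it only recognizes plain substitutions at integer positions (no intronic offsets, UTR positions, indels, or protein-level descriptions).

```r
# Hypothetical helper (not part of rvarsim): parse "<accession>:<type>.<pos><ref>><alt>".
parse_substitution <- function(x) {
    pattern <- "^([^:]+):([cgmnr])\\.([0-9]+)([ACGT])>([ACGT])$"
    m <- regmatches(x, regexec(pattern, x))[[1]]
    # regmatches() yields character(0) when the pattern does not match.
    if (length(m) == 0L) return(NULL)
    list(
        accession  = m[2],
        coord_type = m[3],
        position   = as.integer(m[4]),
        reference  = m[5],
        alternate  = m[6]
    )
}

p <- parse_substitution("NM_000546.6:c.215C>G")
str(p)
is.null(parse_substitution("garbage string"))  # TRUE: not a simple substitution
```

Returning `NULL` for unparseable input keeps a validity check down to a single `is.null()` call in this sketch.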

Format Conversion

# HGVS to VCF
vcf <- hgvs_to_vcf("NC_000001.11:g.123456A>G")
print(vcf)
##       CHROM    POS ID REF ALT QUAL FILTER                          INFO
## 1 NC_000001 123456  .   A   G    .      . HGVS=NC_000001.11:g.123456A>G
# SPDI conversion
cat(hgvs_to_spdi("NC_000001.11:g.123456A>G"), "\n")
## NC_000001.11:123455:A:G
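
The SPDI position above is one less than the HGVS position because SPDI uses 0-based interbase coordinates while HGVS is 1-based. The base-R sketch below reproduces that arithmetic for a simple genomic substitution. `g_sub_to_spdi` is a hypothetical helper written for illustration, not the package's converter.

```r
# Hypothetical helper (not part of rvarsim): convert a simple g. substitution to SPDI.
g_sub_to_spdi <- function(x) {
    pattern <- "^([^:]+):g\\.([0-9]+)([ACGT])>([ACGT])$"
    m <- regmatches(x, regexec(pattern, x))[[1]]
    if (length(m) == 0L) stop("not a simple genomic substitution: ", x)
    # HGVS positions are 1-based; SPDI positions are 0-based.
    paste(m[2], as.integer(m[3]) - 1L, m[4], m[5], sep = ":")
}

spdi <- g_sub_to_spdi("NC_000001.11:g.123456A>G")
spdi
```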

Transcription Mapping

# Coding to genomic
g_vars <- c_to_g("ENST00000357654:c.215C>G",
                 EnsDb.Hsapiens.v86,
                 BSgenome.Hsapiens.UCSC.hg38)

# Genomic to coding
c_vars <- g_to_c("1:g.7577120C>G",
                 EnsDb.Hsapiens.v86,
                 BSgenome.Hsapiens.UCSC.hg38)
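
The idea behind coding-to-genomic mapping can be sketched without any annotation packages: walk the coding exons in transcript order and find the one that contains the CDS position. The example below is a simplified base-R illustration, not the package's method. `cds_to_genomic` is a hypothetical helper, the exon coordinates are made-up toy values, and only plus-strand transcripts are handled.

```r
# Hypothetical helper (not part of rvarsim): map a CDS position to a genomic
# position for a plus-strand transcript, given coding-exon start/end coordinates.
cds_to_genomic <- function(cds_pos, exon_start, exon_end) {
    widths  <- exon_end - exon_start + 1L
    cum_end <- cumsum(widths)               # last CDS position covered by each exon
    stopifnot(cds_pos >= 1L, cds_pos <= cum_end[length(cum_end)])
    i <- which(cds_pos <= cum_end)[1]       # first exon that reaches cds_pos
    offset <- cds_pos - (cum_end[i] - widths[i]) - 1L
    exon_start[i] + offset
}

# Toy transcript with two coding exons: 101-110 (10 bp) and 201-220 (20 bp).
toy_start <- c(101L, 201L)
toy_end   <- c(110L, 220L)
cds_to_genomic(10L, toy_start, toy_end)  # last base of exon 1
cds_to_genomic(12L, toy_start, toy_end)  # second base of exon 2
```

A minus-strand transcript would need the exons in reverse genomic order and positions counted down from each exon's end, with the alleles reverse-complemented.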

Translation

translate_hgvs("ENST00000357654:c.215C>G",
               EnsDb.Hsapiens.v86,
               BSgenome.Hsapiens.UCSC.hg38)
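
Translation starts from simple codon arithmetic: a CDS position determines which codon is affected and which base within that codon changes. The base-R sketch below shows only that step, as an illustration rather than the package's implementation. `codon_of` is a hypothetical helper.

```r
# Hypothetical helper (not part of rvarsim): locate a CDS position within its codon.
codon_of <- function(cds_pos) {
    stopifnot(cds_pos >= 1L)
    list(
        codon       = (cds_pos - 1L) %/% 3L + 1L,  # 1-based codon (amino acid) number
        codon_index = (cds_pos - 1L) %% 3L + 1L    # 1, 2, or 3 within the codon
    )
}

loc <- codon_of(215L)
loc$codon        # 72
loc$codon_index  # 2: the second base of codon 72
```

With the codon number and the in-codon index known, the reference codon can be read from the CDS, the base swapped, and both codons looked up in a genetic code table.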

Variant Extraction

extract_hgvs("ATGCGTACGTAG", "ATGCATACCTAG",
             "NM_000546.6", "c", 1)
## [1] "NM_000546.6:c.5G>A" "NM_000546.6:c.9G>C"
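
For two equal-length sequences, the extraction above amounts to a position-by-position comparison. The base-R sketch below reproduces the same two substitutions. `diff_substitutions` is a hypothetical helper, not the package's implementation, and it does not detect insertions or deletions.

```r
# Hypothetical helper (not part of rvarsim): report mismatches between two
# equal-length sequences as HGVS substitutions.
diff_substitutions <- function(ref, alt, accession, coord_type = "c", start = 1L) {
    stopifnot(nchar(ref) == nchar(alt))
    r <- strsplit(ref, "", fixed = TRUE)[[1]]
    a <- strsplit(alt, "", fixed = TRUE)[[1]]
    idx <- which(r != a)
    # paste0() would recycle zero-length input to "", so return early.
    if (length(idx) == 0L) return(character(0))
    paste0(accession, ":", coord_type, ".", idx + start - 1L, r[idx], ">", a[idx])
}

found <- diff_substitutions("ATGCGTACGTAG", "ATGCATACCTAG", "NM_000546.6", "c", 1L)
found
```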

Session Information

sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rvarsim_0.99.1   BiocStyle_2.41.0
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.53.0             SummarizedExperiment_1.43.0
##  [3] rjson_0.2.23                xfun_0.58                  
##  [5] bslib_0.11.0                Biobase_2.73.1             
##  [7] lattice_0.22-9              vctrs_0.7.3                
##  [9] tools_4.6.0                 bitops_1.0-9               
## [11] generics_0.1.4              stats4_4.6.0               
## [13] curl_7.1.0                  parallel_4.6.0             
## [15] AnnotationDbi_1.75.0        RSQLite_3.53.1             
## [17] blob_1.3.0                  BiocBaseUtils_1.15.1       
## [19] Matrix_1.7-5                BSgenome_1.81.0            
## [21] S4Vectors_0.51.3            cigarillo_1.3.0            
## [23] lifecycle_1.0.5             compiler_4.6.0             
## [25] Rsamtools_2.29.0            Biostrings_2.81.3          
## [27] Seqinfo_1.3.0               codetools_0.2-20           
## [29] GenomeInfoDb_1.49.1         htmltools_0.5.9            
## [31] sys_3.4.3                   buildtools_1.0.0           
## [33] sass_0.4.10                 lazyeval_0.2.3             
## [35] RCurl_1.98-1.19             yaml_2.3.12                
## [37] crayon_1.5.3                jquerylib_0.1.4            
## [39] BiocParallel_1.47.0         cachem_1.1.0               
## [41] DelayedArray_0.39.3         abind_1.4-8                
## [43] digest_0.6.39               restfulr_0.0.16            
## [45] maketools_1.3.2             fastmap_1.2.0              
## [47] grid_4.6.0                  cli_3.6.6                  
## [49] SparseArray_1.13.2          S4Arrays_1.13.0            
## [51] GenomicFeatures_1.65.0      XML_3.99-0.23              
## [53] UCSC.utils_1.9.0            bit64_4.8.2                
## [55] rmarkdown_2.31              XVector_0.53.0             
## [57] httr_1.4.8                  matrixStats_1.5.0          
## [59] bit_4.6.0                   otel_0.2.0                 
## [61] png_0.1-9                   memoise_2.0.1              
## [63] evaluate_1.0.5              knitr_1.51                 
## [65] GenomicRanges_1.65.0        IRanges_2.47.2             
## [67] BiocIO_1.23.3               rtracklayer_1.73.0         
## [69] rlang_1.2.0                 DBI_1.3.0                  
## [71] ensembldb_2.37.3            BiocManager_1.30.27        
## [73] BiocGenerics_0.59.7         jsonlite_2.0.0             
## [75] AnnotationFilter_1.37.0     R6_2.6.1                   
## [77] ProtGenerics_1.45.0         MatrixGenerics_1.25.0      
## [79] GenomicAlignments_1.49.0