Geneslator: an R package for comprehensive gene identifier conversion and annotation

Introduction

Gene identifier conversion and annotation are common and critical tasks in bioinformatics research. Existing databases and tools use different naming conventions for genes or provide only partial annotations, which makes it hard to integrate data from multiple sources. geneslator addresses this problem by providing a unified interface for genome annotation across different databases in several model organisms.

Key Features:

  • Multiple database integration: Integrates data from cross-organism databases (NCBI, Ensembl, UniProt, Alliance of Genome Resources, GO, KEGG, Reactome, WikiPathways) and organism-specific resources (HGNC, MGI, RGD, SGD, WormBase, FlyBase, ZFIN, TAIR);
  • Archive search: Supports searching using both current and archived gene identifiers in NCBI and Ensembl databases;
  • Alias resolution: Supports automatic disambiguation between symbols and aliases in annotations involving gene symbols;
  • Multi-organism support: Currently supports 8 model organisms (human, mouse, rat, yeast, worm, fly, zebrafish, and arabidopsis).

Installation

# Install package devtools, if missing
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}

# Install the development version of geneslator from GitHub
devtools::install_github("knowmics-lab/geneslator", build_vignettes = TRUE)

Load the package

library(geneslator)

Import annotation databases

geneslator provides species-specific annotation databases for several organisms. Annotation databases are stored as SQLite files in different versions of a Zenodo record at https://doi.org/10.5281/zenodo.20448208. Each release corresponds to a specific version of the databases. Versions are tagged as year.month, where year and month denote the year and month in which the release was published (e.g. ‘2026.03’ for March 2026). Databases are updated on a monthly basis.

Type availableDatabases() to retrieve the list of available databases and supported species in the most recent release.

# List organisms annotated in geneslator
availableDatabases()
#>                   Name                 Organism  TaxID
#> 8     org.Athaliana.db     Arabidopsis thaliana   3702
#> 6      org.Celegans.db   Caenorhabditis elegans   6239
#> 4        org.Drerio.db              Danio rerio   7955
#> 5 org.Dmelanogaster.db  Drosophila melanogaster   7227
#> 1      org.Hsapiens.db             Homo sapiens   9606
#> 2     org.Mmusculus.db             Mus musculus  10090
#> 3   org.Rnorvegicus.db        Rattus norvegicus  10116
#> 7   org.Scerevisiae.db Saccharomyces cerevisiae 559292
#>                                MD5 Version                     DOI
#> 8 5161342725c3bc0f7ad5cbe32558f7d4 2026.05 10.5281/zenodo.20457977
#> 6 e862899bb1328407e4641f29f04f5ef3 2026.05 10.5281/zenodo.20457977
#> 4 51fd51a03511d84116436b78633b5eff 2026.05 10.5281/zenodo.20457977
#> 5 479daba8a3fbd4baaaa43224726b775d 2026.05 10.5281/zenodo.20457977
#> 1 ae0b03569e27aec470ed3bef8404238d 2026.05 10.5281/zenodo.20457977
#> 2 81f413bda4ff3ffab4b71d95300c53f9 2026.05 10.5281/zenodo.20457977
#> 3 0828846be2b79802aa70c87aabda15eb 2026.05 10.5281/zenodo.20457977
#> 7 788863602c38669e5b11038a362aab7e 2026.05 10.5281/zenodo.20457977
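
The MD5 column lists a checksum for each database file. As a minimal sketch, assuming you have a local copy of a database file, such a checksum can be compared with the one computed by tools::md5sum() from base R. The helper verify_md5() is hypothetical and not part of geneslator, and the temporary file below only stands in for a real database. Whether geneslator performs a similar check internally is not stated here.

```r
# Hypothetical helper (not part of geneslator): TRUE when the MD5 checksum
# of a local file equals the expected value
verify_md5 <- function(path, expected) {
  actual <- unname(tools::md5sum(path))
  identical(tolower(actual), tolower(expected))
}

# Demonstration on a temporary placeholder file, not a real database
tmp <- tempfile(fileext = ".sqlite")
writeLines("placeholder content", tmp)
expected <- unname(tools::md5sum(tmp))

ok <- verify_md5(tmp, expected)           # checksum matches
bad <- verify_md5(tmp, strrep("0", 32))   # checksum does not match
```

With a real download, expected would be the value printed in the MD5 column for the corresponding database and release.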

The parameter release.version can be used to retrieve the list of all available databases in an older release.

# List organisms annotated in geneslator (release December 2025)
availableDatabases(release.version = "2025.12")
#>                   Name                 Organism  TaxID
#> 1     org.Athaliana.db     Arabidopsis thaliana   3702
#> 6      org.Celegans.db   Caenorhabditis elegans   6239
#> 8        org.Drerio.db              Danio rerio   7955
#> 2 org.Dmelanogaster.db  Drosophila melanogaster   7227
#> 3      org.Hsapiens.db             Homo sapiens   9606
#> 4     org.Mmusculus.db             Mus musculus  10090
#> 5   org.Rnorvegicus.db        Rattus norvegicus  10116
#> 7   org.Scerevisiae.db Saccharomyces cerevisiae 559292
#>                                MD5 Version                     DOI
#> 1 a292153eee87600c5d8c27977fe7ea45 2025.12 10.5281/zenodo.20448209
#> 6 fb4f03098e379712c17196a1f2b6c6a4 2025.12 10.5281/zenodo.20448209
#> 8 e292dcb2cca5c038d9c369b31ca16d8c 2025.12 10.5281/zenodo.20448209
#> 2 06031138af0a7e44af7d9f938f8f4239 2025.12 10.5281/zenodo.20448209
#> 3 6b6ffd437724b029e3ec5f24ab866d97 2025.12 10.5281/zenodo.20448209
#> 4 1f5af73caf5e89f65e7bcf31669f62d0 2025.12 10.5281/zenodo.20448209
#> 5 7cb6dbed9441b0b142032a5206b66126 2025.12 10.5281/zenodo.20448209
#> 7 7b36f0be0eecce6e12bf05f32d5d8779 2025.12 10.5281/zenodo.20448209

A complete list of all available release versions can be obtained with availableVersions().

# List all available release versions
availableVersions()
#> [1] "2025.12" "2026.03" "2026.04" "2026.05"
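
Because release tags follow the year.month pattern, they can be turned into dates and compared. The sketch below, written in base R, picks the most recent release available on or before a given date, which may help when reproducing an older analysis. Both helpers are hypothetical and not part of geneslator, and treating each release as published on the first day of its month is an assumption made only for illustration.

```r
# Release tags as printed by availableVersions() in this vignette
versions <- c("2025.12", "2026.03", "2026.04", "2026.05")

# Hypothetical helper: convert "YYYY.MM" tags to the first day of that month
# (assumption: the exact publication day is unknown, so day 1 is used)
release_date <- function(tags) {
  as.Date(paste0(sub(".", "-", tags, fixed = TRUE), "-01"))
}

# Hypothetical helper: most recent release tag dated on or before `date`
latest_release_before <- function(tags, date) {
  eligible <- tags[release_date(tags) <= as.Date(date)]
  if (length(eligible) == 0) {
    return(NA_character_)
  }
  eligible[which.max(release_date(eligible))]
}

chosen <- latest_release_before(versions, "2026-03-15")
chosen
```

The selected tag could then be passed as the release.version argument described in this section.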

To query the database of a specific organism org, you first need to import it with the GeneslatorDb function. org can be either the scientific name of the organism (e.g. “Homo sapiens”) or its Taxonomy ID (e.g. “10090” for Mouse). The function creates a new GeneslatorDb object for the requested database, which is then exported to the user's global environment as a variable with the same name as the SQLite annotation database (e.g. org.Hsapiens.db for Human, org.Mmusculus.db for Mouse).

# Import human annotation db (after downloading it from remote repository)
GeneslatorDb("Homo sapiens")
# Info about the imported human annotation database object
org.Hsapiens.db
#> An object of class "GeneslatorDb"
#> Slot "db":
#> OrgDb object:
#> | DBSCHEMAVERSION: 2.1
#> | DBSCHEMA: NOSCHEMA_DB
#> | ORGANISM: Homo sapiens
#> | SPECIES: Homo sapiens
#> | CENTRALID: GID
#> | Taxonomy ID: 9606
#> | Db type: OrgDb
#> | Supporting package: AnnotationDbi
# Import mouse annotation database using its Taxonomy ID
GeneslatorDb("10090")
# Info about the imported mouse annotation database object
org.Mmusculus.db
#> An object of class "GeneslatorDb"
#> Slot "db":
#> OrgDb object:
#> | DBSCHEMAVERSION: 2.1
#> | DBSCHEMA: NOSCHEMA_DB
#> | ORGANISM: Mus musculus
#> | SPECIES: Mus musculus
#> | CENTRALID: GID
#> | Taxonomy ID: 10090
#> | Db type: OrgDb
#> | Supporting package: AnnotationDbi
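
The relation between the org argument and the name of the exported variable can be pictured with a small lookup table. The following is a sketch in base R, using three rows copied from the availableDatabases() output shown earlier. The function resolve_db_name() is hypothetical and only illustrates the idea; it is not how geneslator is known to implement the lookup.

```r
# Toy catalog with values copied from the availableDatabases() output
catalog <- data.frame(
  Name     = c("org.Hsapiens.db", "org.Mmusculus.db", "org.Scerevisiae.db"),
  Organism = c("Homo sapiens", "Mus musculus", "Saccharomyces cerevisiae"),
  TaxID    = c("9606", "10090", "559292"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: accept a scientific name or a Taxonomy ID and
# return the name of the matching annotation database
resolve_db_name <- function(org, catalog) {
  hit <- catalog$Organism == org | catalog$TaxID == org
  if (!any(hit)) {
    stop("Unknown organism: ", org)
  }
  catalog$Name[hit][1]
}

by_name  <- resolve_db_name("Homo sapiens", catalog)
by_taxid <- resolve_db_name("10090", catalog)
```

Here by_name and by_taxid hold the names of the human and mouse databases, respectively.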

When called for the first time on a specific organism, the GeneslatorDb function downloads the annotation database from the remote repository, stores a local copy in your R cache folder and finally imports the database. Later calls to GeneslatorDb simply import the database from your cache, unless a new version of the database is present in the remote repository. In that case, you are notified and can choose whether to update the local copy in your R cache before importing the database.

# Import human db again. Now cache data will be used to import db
GeneslatorDb("Homo sapiens")
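
The download-once-then-reuse behaviour described above follows a common caching pattern, sketched below in base R. Everything in this block is hypothetical: cached_fetch() and fake_download() are illustrative names, the cache lives in a temporary directory, and no real database is downloaded. The caching mechanism actually used by geneslator is not described here, and the sketch omits the check for newer remote versions.

```r
# Hypothetical helper: fetch a file only if it is not already in the cache
cached_fetch <- function(file_name, cache_dir, fetch) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  local_path <- file.path(cache_dir, file_name)
  if (!file.exists(local_path)) {
    fetch(local_path)  # first call: "download" into the cache
  }
  local_path           # later calls: reuse the cached copy
}

# Stand-in for a real download; counts how many times it is invoked
downloads <- 0
fake_download <- function(dest) {
  downloads <<- downloads + 1
  writeLines("placeholder database", dest)
}

cache_dir <- tempfile("geneslator-cache-demo-")
first  <- cached_fetch("org.Hsapiens.db.sqlite", cache_dir, fake_download)
second <- cached_fetch("org.Hsapiens.db.sqlite", cache_dir, fake_download)
downloads
```

After the two calls, the download function has run only once and both calls return the same local path.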

By default, GeneslatorDb queries the latest release. To retrieve an older version of the database, you can set the release.version parameter to the desired release version. Again, a local copy of the database (distinct from the latest release) will be stored into your R cache folder, so that future calls to the same database will simply import it from your cache.

# Import yeast annotation db from release 2025.12 (December 2025)
GeneslatorDb("Saccharomyces cerevisiae", release.version = "2025.12")
# Info about the imported yeast annotation database object
org.Scerevisiae.db
#> An object of class "GeneslatorDb"
#> Slot "db":
#> OrgDb object:
#> | DBSCHEMAVERSION: 2.1
#> | DBSCHEMA: NOSCHEMA_DB
#> | ORGANISM: Saccharomyces cerevisiae
#> | SPECIES: Saccharomyces cerevisiae
#> | CENTRALID: GID
#> | Taxonomy ID: 559292
#> | Db type: OrgDb
#> | Supporting package: AnnotationDbi

Columns and values of annotation databases

Annotation databases are internally represented as collections of R dataframes that can be queried through functions that map a set of values of an input column (the key) of a dataframe to the corresponding values of one or more output columns of the same or a different dataframe.

Function keytypes() lists all columns that can be used as keys.

# Get all columns that can be used as keys in mouse annotation db
geneslator::keytypes(org.Mmusculus.db)
#>  [1] "ALIAS"          "ENSEMBL"        "ENSEMBLOLD"     "ENTREZID"      
#>  [5] "ENTREZIDOLD"    "GENENAME"       "GENETYPE"       "GO"            
#>  [9] "KEGGPATH"       "MGI"            "ORTHOFLY"       "ORTHOHUMAN"    
#> [13] "ORTHORAT"       "ORTHOWORM"      "ORTHOYEAST"     "ORTHOZEBRAFISH"
#> [17] "REACTOMEPATH"   "SYMBOL"         "UNIPROT"        "WIKIPATH"

Similarly, function columns() lists all possible output columns.

# Get all available types of output values in mouse annotation db
geneslator::columns(org.Mmusculus.db)
#>  [1] "ALIAS"            "ENSEMBL"          "ENSEMBLOLD"       "ENTREZID"        
#>  [5] "ENTREZIDOLD"      "GENENAME"         "GENETYPE"         "GO"              
#>  [9] "GOEVIDENCE"       "GONAME"           "GOTYPE"           "KEGGPATH"        
#> [13] "KEGGPATHNAME"     "MGI"              "ORTHOFLY"         "ORTHOHUMAN"      
#> [17] "ORTHORAT"         "ORTHOWORM"        "ORTHOYEAST"       "ORTHOZEBRAFISH"  
#> [21] "REACTOMEPATH"     "REACTOMEPATHNAME" "SYMBOL"           "UNIPROT"         
#> [25] "WIKIPATH"         "WIKIPATHNAME"

Note that the output of the two functions is different, because only identifier columns can be used as keys, while any column can be an output column. Type help("columns","geneslator") to see the complete list of columns available in the annotation databases of geneslator, together with their description.
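
The difference between the two outputs can be made explicit with setdiff(). The sketch below uses base R only and copies the two vectors from the outputs printed above for the mouse database, so it runs without geneslator. The variable names are chosen for illustration.

```r
# Values copied from the keytypes() output for the mouse database
key_cols <- c("ALIAS", "ENSEMBL", "ENSEMBLOLD", "ENTREZID", "ENTREZIDOLD",
              "GENENAME", "GENETYPE", "GO", "KEGGPATH", "MGI", "ORTHOFLY",
              "ORTHOHUMAN", "ORTHORAT", "ORTHOWORM", "ORTHOYEAST",
              "ORTHOZEBRAFISH", "REACTOMEPATH", "SYMBOL", "UNIPROT",
              "WIKIPATH")

# Values copied from the columns() output for the mouse database
out_cols <- c("ALIAS", "ENSEMBL", "ENSEMBLOLD", "ENTREZID", "ENTREZIDOLD",
              "GENENAME", "GENETYPE", "GO", "GOEVIDENCE", "GONAME", "GOTYPE",
              "KEGGPATH", "KEGGPATHNAME", "MGI", "ORTHOFLY", "ORTHOHUMAN",
              "ORTHORAT", "ORTHOWORM", "ORTHOYEAST", "ORTHOZEBRAFISH",
              "REACTOMEPATH", "REACTOMEPATHNAME", "SYMBOL", "UNIPROT",
              "WIKIPATH", "WIKIPATHNAME")

# Columns that can be returned as output but not used as keys
output_only <- setdiff(out_cols, key_cols)
output_only
#> [1] "GOEVIDENCE"       "GONAME"           "GOTYPE"           "KEGGPATHNAME"    
#> [5] "REACTOMEPATHNAME" "WIKIPATHNAME"
```

With an imported database, the same comparison could be written directly on the results of columns() and keytypes().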

Function keys() is used to retrieve all values of a column in an annotation database.

# Get the first 10 Entrez IDs in mouse annotation db
head(geneslator::keys(org.Mmusculus.db, keytype = "ENTREZID"), 10)
#>  [1] "100008564" "100008567" "100009600" "100009609" "100009614" "100009664"
#>  [7] "100009698" "100010"    "100012"    "100014"

Query the annotation databases

Columns of the annotation databases can be queried using properly re-defined versions of the well-known query functions select() and mapIds() of AnnotationDbi R package.

The select() function allows you to query an input key column of the annotation database (keytype argument) and retrieve related information across one or more other columns (columns argument).

The output of select() is a dataframe with all columns specified by keytype and columns arguments and one row for each mapping found between input and output values.

# Map NCBI Gene IDs to gene symbols and Ensembl IDs in Human
genes <- c("1", "2", "9")
result <- geneslator::select(org.Hsapiens.db, keys = genes,
            columns = c("SYMBOL", "ENSEMBL"), keytype = "ENTREZID")
result
#>   ENTREZID SYMBOL         ENSEMBL
#> 1        1   A1BG ENSG00000121410
#> 2        2    A2M ENSG00000175899
#> 3        9   NAT1 ENSG00000171428

Unlike select(), mapIds() maps an input key column (argument keytype) to a single output column (argument column).

# Convert gene symbols to ENTREZ IDs (first match only)
genes <- c("TP53", "BRCA1", "EGFR")
entrez_ids <- geneslator::mapIds(org.Hsapiens.db, keys = genes, 
            column = "ENTREZID", keytype = "SYMBOL")
entrez_ids
#>   TP53  BRCA1   EGFR 
#> "7157"  "672" "1956"

By default, the return type is a named vector, where each value is the first mapping found (if any) for a given key, even if multiple output values map to that key. However, this behaviour can be changed through the multiVals parameter, which also controls the shape of the output result. For example, multiVals="list" produces a list object with all matches found for each input.

# Get all possible mappings as a list
entrez_list <- geneslator::mapIds(org.Hsapiens.db, keys = genes,
            column = "ENTREZID", keytype = "SYMBOL", multiVals = "list")
entrez_list
#> $TP53
#> [1] "7157"
#> 
#> $BRCA1
#> [1] "672"
#> 
#> $EGFR
#> [1] "1956"
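
In the example above every symbol maps to a single identifier, so the two return shapes carry the same information. The difference shows up when a key has several matches. The sketch below reproduces the two shapes in base R on a toy mapping table; the gene names and identifiers are invented placeholders, and the code only mimics the behaviour described in the text rather than calling mapIds().

```r
# Toy one-to-many mapping (placeholder names, not real genes)
mapping <- data.frame(
  KEY   = c("GENE_A", "GENE_A", "GENE_B"),
  VALUE = c("id1", "id2", "id3"),
  stringsAsFactors = FALSE
)
keys <- c("GENE_A", "GENE_B", "GENE_C")

# List shape: every match for each key, NA when nothing is found
all_matches <- lapply(setNames(keys, keys), function(k) {
  hits <- mapping$VALUE[mapping$KEY == k]
  if (length(hits) == 0) NA_character_ else hits
})

# Named-vector shape: only the first match for each key
first_match <- vapply(all_matches, function(v) v[1], character(1))

all_matches$GENE_A
first_match
```

GENE_A keeps both identifiers in the list shape but only id1 in the named vector, while the unmatched GENE_C yields NA in both.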

Search options

Search using aliases

In select() and mapIds() functions, by default, queries of annotation databases involving gene symbols are performed by first looking at column “SYMBOL” and, if no mapping is found using “SYMBOL”, the query is performed using the “ALIAS” column. This is helpful when users unknowingly start from a list of names that is actually a mix of official gene symbols and aliases.

This behaviour of select() and mapIds() can be controlled through the boolean parameter search.aliases, whose default value is TRUE.

In the following example, “BRCAI” is actually an alias of the BRCA1 gene, while “PTEN” is the official symbol of the PTEN gene. When these two keys (treated as SYMBOL) are mapped to ENTREZID with select(), BRCAI is correctly recognized as an alias of BRCA1 and mapped to the NCBI gene id of BRCA1.

# Map gene symbols to their NCBI gene ids, querying also the ALIAS column 
# if needed
result <- geneslator::select(org.Hsapiens.db, keys = c("BRCAI","PTEN"),
            columns = "ENTREZID", keytype = "SYMBOL")
result
#>   SYMBOL ENTREZID
#> 1  BRCAI      672
#> 2   PTEN     5728

Whenever ALIAS column is used in place of SYMBOL column (as in this example), a warning message is sent to the user. If we repeat the same query with search.aliases=FALSE, select() is unable to map BRCAI to the correct NCBI gene id.

# Map gene symbols to their NCBI gene ids, querying only the SYMBOL column 
result <- geneslator::select(org.Hsapiens.db, keys = c("BRCAI","PTEN"),
            columns = "ENTREZID", keytype = "SYMBOL", search.aliases = FALSE)
result
#>   SYMBOL ENTREZID
#> 1  BRCAI     <NA>
#> 2   PTEN     5728
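
The two-step lookup (SYMBOL first, ALIAS as a fallback) can be pictured with a toy table in base R. This is only a sketch under stated assumptions: lookup_entrez() and its search_aliases argument are hypothetical names that mirror the behaviour described in this section, and the tables contain just the two genes from the example. It is not the implementation used by geneslator.

```r
# Toy tables holding only the genes used in the example above
genes <- data.frame(
  SYMBOL   = c("BRCA1", "PTEN"),
  ENTREZID = c("672", "5728"),
  stringsAsFactors = FALSE
)
aliases <- data.frame(
  ALIAS  = "BRCAI",
  SYMBOL = "BRCA1",
  stringsAsFactors = FALSE
)

# Hypothetical helper: try SYMBOL first, fall back to ALIAS if allowed
lookup_entrez <- function(keys, search_aliases = TRUE) {
  vapply(keys, function(k) {
    hit <- genes$ENTREZID[genes$SYMBOL == k]
    if (length(hit) == 0 && search_aliases) {
      official <- aliases$SYMBOL[aliases$ALIAS == k]
      if (length(official) > 0) {
        warning("'", k, "' was resolved through the ALIAS column",
                call. = FALSE)
        hit <- genes$ENTREZID[genes$SYMBOL %in% official]
      }
    }
    if (length(hit) > 0) hit[1] else NA_character_
  }, character(1))
}

with_aliases    <- lookup_entrez(c("BRCAI", "PTEN"))
without_aliases <- lookup_entrez(c("BRCAI", "PTEN"), search_aliases = FALSE)
```

The first call resolves BRCAI to 672 and raises a warning; the second leaves BRCAI as NA, matching the two outputs shown above.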

Search using archived identifiers

Gene identifiers and symbols can change over time or become deprecated as a result of periodic updates of databases such as NCBI or Ensembl. This can be troublesome in annotation tasks, especially when a user starts from an old set of identifiers or symbols. To overcome this, annotation databases in geneslator contain the columns “ENTREZIDOLD” and “ENSEMBLOLD”, which collect old gene identifiers of the NCBI Gene and Ensembl databases. By default, these columns are queried by the select() and mapIds() methods whenever a gene cannot be annotated using current identifiers. This behaviour can be controlled through the boolean parameter search.archives, whose default value is TRUE.

For example, in the following query, key “3” is an archived NCBI Gene identifier. By using archived data, select() is still able to map NCBI Gene ID “3” to a gene symbol (“PZP2P” in the release used here).

# Map NCBI gene id 3 to gene symbol, using both current and old identifiers
result <- geneslator::select(org.Hsapiens.db, keys = "3", columns = "SYMBOL",
            keytype = "ENTREZID")
result
#>   ENTREZID SYMBOL
#> 1        3  PZP2P

Whenever archived identifiers are used to solve a query (as in this example), a warning message is sent to the user. If we set search.archives=FALSE, select() is unable to map the identifier to the correct symbol.

# Map NCBI gene id 3 to gene symbol, using only current identifiers 
result <- geneslator::select(org.Hsapiens.db, keys = "3", columns = "SYMBOL",
            keytype = "ENTREZID", search.archives = FALSE)
result
#>   ENTREZID SYMBOL
#> 1        3     NA

Orthologs mapping

In queries involving ortholog mapping, by default, select() returns all possible ortholog mappings. This behaviour is controlled by the parameter orthologs.mapping, whose default value is “multiple”.

# Get orthologs of human genes CHC1 and SCAMP5 in worm and fly 
result <- geneslator::select(org.Hsapiens.db, keys = c("CHC1","SCAMP5"),
            columns = c("ORTHOWORM", "ORTHOFLY"), keytype = "SYMBOL")
result
#>   SYMBOL ORTHOWORM ORTHOFLY
#> 1   CHC1     ran-3  CG33288
#> 2   CHC1     ran-3   CG7420
#> 3   CHC1     ran-3     Rcc1
#> 4 SCAMP5     scm-1    Scamp

To get only the first ortholog, set orthologs.mapping="single":

result <- geneslator::select(org.Hsapiens.db, keys = c("CHC1","SCAMP5"),
            columns = c("ORTHOWORM", "ORTHOFLY"), keytype = "SYMBOL",
            orthologs.mapping = "single")
result
#>   SYMBOL ORTHOWORM ORTHOFLY
#> 1   CHC1     ran-3  CG33288
#> 2 SCAMP5     scm-1    Scamp
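
The relation between the two results can be reproduced in base R by keeping the first row for each key. The sketch below starts from the rows printed in the "multiple" output above; reducing them with duplicated() is an assumption about what "first ortholog" means, made for illustration only, and is not necessarily how geneslator selects the row.

```r
# Rows copied from the output obtained with the default "multiple" mapping
multiple <- data.frame(
  SYMBOL    = c("CHC1", "CHC1", "CHC1", "SCAMP5"),
  ORTHOWORM = c("ran-3", "ran-3", "ran-3", "scm-1"),
  ORTHOFLY  = c("CG33288", "CG7420", "Rcc1", "Scamp"),
  stringsAsFactors = FALSE
)

# Keep only the first row found for each input symbol
single <- multiple[!duplicated(multiple$SYMBOL), ]
rownames(single) <- NULL
single
#>   SYMBOL ORTHOWORM ORTHOFLY
#> 1   CHC1     ran-3  CG33288
#> 2 SCAMP5     scm-1    Scamp
```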

For mapIds() function, the option orthologs.mapping is absent, because the number of mapped orthologs can be directly controlled through parameter multiVals.

Session Information

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] AnnotationDbi_1.75.0 IRanges_2.47.2       S4Vectors_0.51.3    
#> [4] Biobase_2.73.1       BiocGenerics_0.59.7  generics_0.1.4      
#> [7] geneslator_0.99.2    BiocStyle_2.41.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] utf8_1.2.6          sass_0.4.10         xml2_1.5.2         
#>  [4] RSQLite_3.53.1      zen4R_0.10.5        digest_0.6.39      
#>  [7] evaluate_1.0.5      fastmap_1.2.0       blob_1.3.0         
#> [10] plyr_1.8.9          jsonlite_2.0.0      DBI_1.3.0          
#> [13] BiocManager_1.30.27 httr_1.4.8          XML_3.99-0.23      
#> [16] Biostrings_2.81.3   jquerylib_0.1.4     cli_3.6.6          
#> [19] rlang_1.2.0         crayon_1.5.3        XVector_0.53.0     
#> [22] bit64_4.8.2         cachem_1.1.0        yaml_2.3.12        
#> [25] otel_0.2.0          tools_4.6.0         memoise_2.0.1      
#> [28] curl_7.1.0          buildtools_1.0.0    vctrs_0.7.3        
#> [31] R6_2.6.1            png_0.1-9           lifecycle_1.0.5    
#> [34] KEGGREST_1.53.0     Seqinfo_1.3.0       bit_4.6.0          
#> [37] pkgconfig_2.0.3     bslib_0.11.0        Rcpp_1.1.1-1.1     
#> [40] xfun_0.58           keyring_1.4.1       sys_3.4.3          
#> [43] knitr_1.51          htmltools_0.5.9     rmarkdown_2.31     
#> [46] maketools_1.3.2     compiler_4.6.0
