---
title: "Geneslator: an R package for comprehensive gene identifier conversion 
and annotation"
author:
- name: Giovanni Micale
  affiliation: University of Catania
  email: giovanni.micale@unict.it
- name: Giulia Cavallaro
  affiliation: University of Catania
  email: giuliacavallaro96@outlook.it
- name: Grete Francesca Privitera
  affiliation: University of Catania
  email: grete.privitera@unict.it
date: "`r Sys.Date()`"
package: geneslator
abstract: >
    Geneslator is a comprehensive R package designed for accurate gene 
    identifier conversion and genome annotation across multiple organisms. 
    The package integrates data from several cross-organism databases and 
    organism-specific resources within a single, coherent framework. This 
    vignette demonstrates the main features of geneslator and provides 
    practical examples of its usage.
vignette: >
    %\VignetteEncoding{UTF-8}
    %\VignetteIndexEntry{Geneslator R package for gene annotation}
    %\VignetteEngine{knitr::rmarkdown}
output:
    BiocStyle::html_document:
        toc: true
        toc_float: true
        number_sections: true
editor_options:
    markdown:
        wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", warning = FALSE,
message = FALSE)
```

```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```


# Introduction

Gene identifier conversion and annotation are common and critical tasks in
bioinformatics research. Existing databases and tools use different
naming conventions for genes or provide only partial annotations, which makes it 
challenging to integrate data from multiple sources. **geneslator** addresses 
this problem by providing a unified interface for genome annotation across 
different databases in several model organisms.

Key Features:

-   **Multiple database integration**: Integrates data from cross-organism 
    databases (NCBI, Ensembl, UniProt, Alliance of Genome Resources, GO, KEGG, 
    Reactome, WikiPathways) and organism-specific resources (HGNC, MGI, RGD, 
    SGD, WormBase, FlyBase, ZFIN, TAIR);
-   **Archive search**: Supports searching using both current and archived gene 
    identifiers in NCBI and Ensembl databases;
-   **Alias resolution**: Supports automatic disambiguation between symbols and 
    aliases in annotations involving gene symbols;
-   **Multi-organism support**: Currently supports 8 model organisms
    (human, mouse, rat, yeast, worm, fly, zebrafish, and arabidopsis).


# Installation

```{r installation, eval=FALSE}
# Install package devtools, if missing
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}

# Install the development version of geneslator from GitHub
devtools::install_github("knowmics-lab/geneslator", build_vignettes = TRUE)
```


# Load the package

```{r load-package}
library(geneslator)
```


# Import annotation databases

**geneslator** provides species-specific annotation databases for several
organisms. Annotation databases are stored as SQLite files in different
versions of a Zenodo record at https://doi.org/10.5281/zenodo.20448208. Each 
release refers to a specific version of the databases. Versions are tagged 
as `year.month`, where `year` and `month` denote the year and the month of the 
publication of the release (e.g. '2026.03' for March 2026). 
Databases are updated on a monthly basis.
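
As a small illustration of the `year.month` convention, the base R sketch below 
builds a tag from a date and picks the most recent tag from a set. It assumes 
that months are always zero-padded, as in the example tags above; the dates and 
tags used here are illustrative only.

```{r toy-release-tags, eval=TRUE}
# Build the tag for a given publication date (format: year.month)
release_date <- as.Date("2026-03-15")
tag <- format(release_date, "%Y.%m")
tag

# Zero-padded tags sort chronologically as plain character strings
tags <- c("2025.12", "2026.03", "2025.09")
sorted_tags <- sort(tags)
latest <- sorted_tags[length(sorted_tags)]
latest

# Keep tags as strings: a numeric value loses its trailing zero
numeric_tag <- as.character(2025.10)
numeric_tag
```

The last line shows why a release tag is best written as a quoted string 
rather than as a number.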

Type `availableDatabases()` to retrieve the list of available databases and 
supported species in the most recent release.

```{r available-organisms, eval=TRUE}
# List organisms annotated in geneslator
availableDatabases()
```

The parameter `release.version` can be used to retrieve the list of all 
available databases in an older release.

```{r available-organisms-older, eval=TRUE}
# List organisms annotated in geneslator (release December 2025)
availableDatabases(release.version = "2025.12")
```

A complete list of all available release versions can be obtained with 
`availableVersions()`.

```{r available-versions, eval=TRUE}
# List all available release versions of the annotation databases
availableVersions()
```

To query the database of a specific organism `org`, first import it with the 
`GeneslatorDb` function. `org` can be either the scientific name of the 
organism (e.g. "Homo sapiens") or its Taxonomy ID (e.g. "10090" for mouse). 
The function creates a new `GeneslatorDb` object for the requested database 
and exports it to the user's global environment as a variable with the same 
name as the SQLite annotation database (e.g. `org.Hsapiens.db` for human, 
`org.Mmusculus.db` for mouse).

```{r geneslator-db, eval=TRUE}
# Import human annotation db (after downloading it from remote repository)
GeneslatorDb("Homo sapiens")
# Info about the imported human annotation database object
org.Hsapiens.db
# Import mouse annotation database using its Taxonomy ID
GeneslatorDb("10090")
# Info about the imported mouse annotation database object
org.Mmusculus.db
```

When called for the first time on a specific organism, `GeneslatorDb` 
downloads the annotation database from the remote repository, stores a local 
copy in your R cache folder and finally imports the database. Subsequent calls 
to `GeneslatorDb` simply import the database from your cache, unless a newer 
version of the database is available in the remote repository. In that case, 
you are notified and can choose whether to update your local copy in the R 
cache before the database is imported.

```{r geneslator-db-cache, eval=TRUE}
# Import human db again. Now cache data will be used to import db
GeneslatorDb("Homo sapiens")
```

By default, `GeneslatorDb` queries the latest release. To retrieve an older 
version of the database, set the `release.version` parameter to the desired 
release version. Again, a local copy of the database (distinct from the latest 
release) is stored in your R cache folder, so that future requests for the 
same database simply import it from your cache.

```{r geneslator-db-older, eval=TRUE}
# Import yeast annotation db from release 2025.12 (December 2025)
GeneslatorDb("Saccharomyces cerevisiae", release.version = "2025.12")
# Info about the imported yeast annotation database object
org.Scerevisiae.db
```


# Columns and values of annotation databases

Annotation databases are internally represented as collections of R 
dataframes that can be queried through functions that map a set of values of an 
input column (the key) of a dataframe to the corresponding values of one or 
more output columns of the same or a different dataframe.
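
The idea of mapping key values to output columns across data frames can be 
pictured with a toy join in base R. The tables, gene names and identifiers 
below are hypothetical and do not reflect the actual schema of the 
**geneslator** databases; the sketch only shows why a key with several 
mappings produces several output rows.

```{r toy-join, eval=TRUE}
# Toy tables standing in for two data frames of an annotation database
# (hypothetical content, not taken from geneslator)
symbols <- data.frame(ENTREZID = c("1", "2", "9"),
                        SYMBOL = c("GENEA", "GENEB", "GENEC"))
ensembl <- data.frame(ENTREZID = c("1", "2", "2"),
                        ENSEMBL = c("ENSG_A", "ENSG_B1", "ENSG_B2"))

# Keep only the requested keys, then attach the output column
query_keys <- c("1", "2")
hits <- merge(symbols[symbols$ENTREZID %in% query_keys, ], ensembl,
                by = "ENTREZID", all.x = TRUE)
hits
```

Key "2" appears twice in the result because it maps to two values of the 
output column.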

Function `keytypes()` lists all columns that can be used as keys.

```{r keytypes, eval=TRUE}
# Get all columns that can be used as keys in mouse annotation db
geneslator::keytypes(org.Mmusculus.db)
```

Similarly, function `columns()` lists all possible output columns. 

```{r columns, eval=TRUE}
# Get all available types of output values in mouse annotation db
geneslator::columns(org.Mmusculus.db)
```

Note that the output of the two functions is different, because only identifier 
columns can be used as keys, while any column can be an output column. Type 
`help("columns","geneslator")` to see the complete list of columns available in 
the annotation databases of **geneslator**, together with their description.

Function `keys()` is used to retrieve all values of a column in an annotation 
database.

```{r keys, eval=TRUE}
# Get the first 10 Entrez IDs in mouse annotation db
head(geneslator::keys(org.Mmusculus.db, keytype = "ENTREZID"), 10)
```


# Query the annotation databases

Columns of the annotation databases can be queried with `select()` and 
`mapIds()`, which **geneslator** redefines from the well-known query functions 
of the same name in the **AnnotationDbi** R package.

The `select()` function allows you to query an input key column of the 
annotation database (`keytype` argument) and retrieve related information 
across one or more other columns (`columns` argument).

The output of `select()` is a dataframe with all columns specified by `keytype` 
and `columns` arguments and one row for each mapping found between input and 
output values.

```{r select-example, eval=TRUE}
# Map NCBI Gene IDs to gene symbols and Ensembl IDs in Human
genes <- c("1", "2", "9")
result <- geneslator::select(org.Hsapiens.db, keys = genes,
            columns = c("SYMBOL", "ENSEMBL"), keytype = "ENTREZID")
result
```

Unlike `select()`, `mapIds()` maps an input key column (argument `keytype`) to 
a single output column (argument `column`).

```{r mapids-example, eval=TRUE}
# Convert gene symbols to ENTREZ IDs (first match only)
genes <- c("TP53", "BRCA1", "EGFR")
entrez_ids <- geneslator::mapIds(org.Hsapiens.db, keys = genes, 
            column = "ENTREZID", keytype = "SYMBOL")
entrez_ids
```

By default, the return type is a named vector, where each value is the first 
mapping found (if any) for a given key, even if multiple output values map to 
that key. However, this behaviour can be changed through the `multiVals` 
parameter, which also controls the shape of the output result. For example, 
`multiVals="list"` produces a list object with all matches found for each input.

```{r mapids-multi, eval=TRUE}
# Get all possible mappings as a list
entrez_list <- geneslator::mapIds(org.Hsapiens.db, keys = genes,
            column = "ENTREZID", keytype = "SYMBOL", multiVals = "list")
entrez_list
```
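
The difference between the default behaviour and `multiVals="list"` can also 
be seen on a toy mapping table in base R. The keys and values below are 
hypothetical, and the sketch only mimics the shape of the two outputs; it is 
not how `mapIds()` is implemented.

```{r toy-multivals, eval=TRUE}
# Hypothetical key-to-value mappings; key "K2" has two matches
mappings <- data.frame(key = c("K1", "K2", "K2", "K3"),
                        value = c("10", "20", "21", NA))

# All matches per key, analogous in shape to multiVals = "list"
all_matches <- split(mappings$value, mappings$key)

# First match per key as a named vector, analogous to the default
first_match <- vapply(all_matches, function(v) v[1], character(1))
first_match

# Keys with more than one match: the default would drop values for these
ambiguous <- names(all_matches)[lengths(all_matches) > 1]
ambiguous
```

A similar `lengths()` check on a list returned with `multiVals="list"` is one 
way to spot keys for which the default would silently keep only one value.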


# Search options


## Search using aliases

By default, `select()` and `mapIds()` resolve queries involving gene symbols 
by first looking at the "SYMBOL" column and, if no mapping is found there, by 
falling back to the "ALIAS" column. This is helpful when users unknowingly 
start from a list of names that mixes official gene symbols and aliases.

This behaviour of `select()` and `mapIds()` can be controlled through the 
boolean parameter `search.aliases`, whose default value is `TRUE`.

In the following example, "BRCAI" is actually an alias of the BRCA1 gene, 
while "PTEN" is the official symbol of the PTEN gene. When these two keys 
(treated as SYMBOL) are mapped to ENTREZID with `select()`, BRCAI is correctly 
recognised as an alias of BRCA1 and mapped to the NCBI Gene ID of BRCA1.

```{r aliases, eval=TRUE}
# Map gene symbols to their NCBI gene ids, querying also the ALIAS column 
# if needed
result <- geneslator::select(org.Hsapiens.db, keys = c("BRCAI","PTEN"),
            columns = "ENTREZID", keytype = "SYMBOL")
result
```

Whenever the ALIAS column is used in place of the SYMBOL column (as in this 
example), a warning message is sent to the user. If we repeat the same query 
with `search.aliases=FALSE`, `select()` is unable to map BRCAI to the correct 
NCBI Gene ID.

```{r no-aliases, eval=TRUE}
# Map gene symbols to their NCBI gene ids, querying only the SYMBOL column 
result <- geneslator::select(org.Hsapiens.db, keys = c("BRCAI","PTEN"),
            columns = "ENTREZID", keytype = "SYMBOL", search.aliases = FALSE)
result
```
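
The symbol-then-alias lookup order described in this section can be sketched 
in base R as follows. This is a minimal illustration under stated assumptions, 
not the package's implementation: `lookup_symbol()` and the two small tables 
are hypothetical, and the layout of the returned data frame is chosen for 
simplicity.

```{r toy-alias-fallback, eval=TRUE}
# Hypothetical symbol and alias tables (toy data, not the package's schema)
symbol_tab <- data.frame(SYMBOL = c("BRCA1", "PTEN"),
                            ENTREZID = c("672", "5728"))
alias_tab <- data.frame(ALIAS = c("BRCAI", "MMAC1"),
                        ENTREZID = c("672", "5728"))

lookup_symbol <- function(keys, search.aliases = TRUE) {
    # First pass: official symbols only
    ids <- symbol_tab$ENTREZID[match(keys, symbol_tab$SYMBOL)]
    if (search.aliases) {
        # Second pass: aliases, only for keys left unmapped
        unmapped <- is.na(ids)
        ids[unmapped] <- alias_tab$ENTREZID[
            match(keys[unmapped], alias_tab$ALIAS)]
    }
    data.frame(SYMBOL = keys, ENTREZID = ids)
}

with_alias <- lookup_symbol(c("BRCAI", "PTEN"))
without_alias <- lookup_symbol(c("BRCAI", "PTEN"), search.aliases = FALSE)
with_alias
without_alias
```

Because aliases are consulted only for keys that the first pass leaves 
unmapped, an official symbol always takes precedence over an identical alias 
in this sketch.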


## Search using archived identifiers

Gene identifiers and symbols can change over time or become deprecated, as a 
result of periodic updates of databases such as NCBI or Ensembl. This can be 
troublesome in annotation tasks, especially when a user starts from an old set 
of identifiers or symbols. To overcome this, annotation databases in 
**geneslator** contain the columns "ENTREZIDOLD" and "ENSEMBLOLD", which 
collect old gene identifiers from the NCBI Gene and Ensembl databases.
By default, these columns are queried by `select()` and `mapIds()` methods 
whenever a gene cannot be annotated using current identifiers. This behaviour 
can be controlled through the boolean parameter `search.archives`, whose 
default value is `TRUE`.

For example, in the following query key "3" corresponds to the old NCBI Gene 
identifier of gene "A2MP1". By using archived data, `select()` is able to 
correctly map NCBI Gene ID "3" to gene symbol "A2MP1".

```{r archives, eval=TRUE}
# Map NCBI gene id 3 to gene symbol, using both current and old identifiers
result <- geneslator::select(org.Hsapiens.db, keys = "3", columns = "SYMBOL",
            keytype = "ENTREZID")
result
```

Whenever archived identifiers are used to solve a query (as in this example), 
a warning message is sent to the user. If we set `search.archives=FALSE`, 
`select()` is unable to map the identifier to the correct symbol.

```{r no-archives, eval=TRUE}
# Map NCBI gene id 3 to gene symbol, using only current identifiers 
result <- geneslator::select(org.Hsapiens.db, keys = "3", columns = "SYMBOL",
            keytype = "ENTREZID", search.archives = FALSE)
result
```
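
Since the use of archived identifiers is reported through warnings, it can be 
handy to record those warnings programmatically, for instance to log which 
keys were outdated. The base R sketch below shows one way to do this with 
`withCallingHandlers()`. Here `toy_query()` is a hypothetical stand-in for a 
real query, and the wording of its warning is invented for the example.

```{r toy-capture-warnings, eval=TRUE}
# Stand-in for a query that warns when it falls back to archived identifiers
# (hypothetical function and warning text, used only for illustration)
toy_query <- function(key) {
    if (key == "3") warning("archived identifier used for key ", key)
    data.frame(ENTREZID = key, SYMBOL = "A2MP1")
}

captured <- character(0)
result <- withCallingHandlers(
    toy_query("3"),
    warning = function(w) {
        # Record the message, then let the call continue without printing it
        captured <<- c(captured, conditionMessage(w))
        invokeRestart("muffleWarning")
    }
)
result
captured
```

The query result is still returned in full, while the warning text ends up in 
`captured` for later inspection.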


## Orthologs mapping

In queries involving ortholog mapping, by default, `select()` returns all 
possible ortholog mappings. This behaviour is controlled by the 
`orthologs.mapping` parameter, whose default value is "multiple".

```{r orthologs, eval=TRUE}
# Get worm and fly orthologs of human genes CHC1 and SCAMP5
result <- geneslator::select(org.Hsapiens.db, keys = c("CHC1","SCAMP5"),
            columns = c("ORTHOWORM", "ORTHOFLY"), keytype = "SYMBOL")
result
```

To get only the first ortholog, set `orthologs.mapping="single"`:

```{r orthologs-single, eval=TRUE}
result <- geneslator::select(org.Hsapiens.db, keys = c("CHC1","SCAMP5"),
            columns = c("ORTHOWORM", "ORTHOFLY"), keytype = "SYMBOL",
            orthologs.mapping = "single")
result
```

For the `mapIds()` function, the `orthologs.mapping` option is not available, 
because the number of mapped orthologs can be controlled directly through the 
`multiVals` parameter.
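
The effect of keeping all ortholog mappings versus only the first one can be 
pictured on a toy one-to-many table in base R. The gene names below are 
hypothetical, and taking the first row per key with `duplicated()` is an 
assumption made for this sketch; it does not claim to reproduce how the 
package selects the retained ortholog.

```{r toy-orthologs-single, eval=TRUE}
# Hypothetical one-to-many ortholog table (toy gene names)
orthologs <- data.frame(SYMBOL = c("G1", "G1", "G2"),
                        ORTHOFLY = c("flyA", "flyB", "flyC"))

# "multiple": keep every mapping, one row per ortholog
multiple <- orthologs

# "single": keep only the first row found for each key
single <- orthologs[!duplicated(orthologs$SYMBOL), ]
single
```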


# Session Information

```{r session-info}
sessionInfo()
```


# References

-   Micale G, Cavallaro G, Privitera GF (2026). geneslator: A
    Comprehensive Gene Identifier Conversion Tool. R package version
    0.99.0.

-   Pages H, Carlson M, Falcon S, Li N (2024). AnnotationDbi:
    Manipulation of SQLite-based annotations in Bioconductor. R package.

-   NCBI Gene: <https://www.ncbi.nlm.nih.gov/gene>

-   Ensembl: <https://www.ensembl.org>

-   UniProt: <https://www.uniprot.org>

-   Gene Ontology: <http://geneontology.org>

-   KEGG: <https://www.kegg.jp>

-   Reactome: <https://reactome.org>

-   WikiPathways: <https://www.wikipathways.org>

-   Alliance of Genome Resources: <https://www.alliancegenome.org>