---
title: Gene set searches with the Gesel database
author:
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
package: gesel
date: "Revised: November 4, 2024"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Working with the Gesel database}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE}
library(BiocStyle)
self <- Githubpkg("gesel-inc/gesel-R", "gesel")
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
```

# Introduction

The `r self` package implements an R client to the [Gesel database](https://doi.org/10.21105/joss.05777) for gene set searching.
The Gesel database is a collection of static files containing information about gene sets, which can be hosted anywhere - via a file server, on a HPC's shared filesystem etc. 
Clients like the `r self` package then use HTTP range requests to perform a variety of queries.
No custom server logic is required, greatly reducing effort and cost required to keep Gesel up and running.
Most database files do not need to be downloaded to the client, allowing us to easily scale with increasing numbers of gene sets in the database.

More details can be found at https://github.com/gesel-inc.
A web interface is also available at https://gesel-inc.github.io/web-app. 

# Quick start

The raison d'etre of the Gesel database is to find overlaps between known gene sets and a user-supplied list of genes.
Let's say we have several human genes of interest:

```{r}
my.genes <- c("SNAP25", "NEUROD6", "GAD1", "GAD2", "RELN")
```

We use `querySets()` to identify all gene sets that overlap with `my.genes`.
This also gives the enrichment $p$-value from a hypergeometric test.

```{r}
library(gesel)
res <- querySets("9606", genes = my.genes)
head(res)
```

Or we can filter by keywords in the gene set names or descriptions:

```{r}
res <- querySets("9606", text = "brain")
head(res)

# Or, filtering by both genes and text. 
res <- querySets("9606", genes = my.genes, text = "neur*")
head(res)
```

We can also load all available gene sets for a species for use in other gene set-related packages, e.g., `r Biocpkg("fgsea")`.

```{r}
everything <- loadAllSets("9606", type = "symbol", as.compressed = TRUE)
everything
S4Vectors::mcols(everything)
```

# Using low-level functions

## Finding overlaps

The functions described above are convenient but opinionated wrappers around some of `r self`'s low-level functions.
Developers may prefer to use these low-level functions directly for greater control of the output, e.g., to avoid unnecessary work. 
To demonstrate, we first map our genes of interest to Gesel gene indices, which are Gesel's internal identifiers for each gene.

```{r}
library(gesel)
gene.idx <- searchGenes("9606", my.genes)
gene.idx # this is a list of integer vectors, in case of 1:many mappings.
gene.idx <- unlist(gene.idx)
fetchAllGenes("9606")[gene.idx,] # double-checking that we got it right.
```

Now we find all sets with overlapping genes.
The `searchOverlappingSets()` identifies all sets that overlap at least one gene in the user-supplied set.

```{r}
overlaps <- searchOverlappingSets("9606", gene.idx, counts.only=FALSE)
head(overlaps$overlap) # set index and the identities of overlapping genes.
```

The set index can be converted back to information about each set, along with the collection from which it came:

```{r}
first.set.info <- fetchSomeSets("9606", head(overlaps$overlap$set)) # getting details of the first few sets
first.set.info
first.col.info <- fetchSomeCollections("9606", first.set.info$collection) # getting details of the collections
first.col.info
```

## Searching on text

We can also search for gene sets based on the text in their names or descriptions.
The `searchSetText()` function accepts a query string with multiple words and the usual `*` (any characters) and `?` (one character) wildcards.

```{r}
chits <- searchSetText("9606", "cancer")
fetchSomeSets("9606", head(chits))
ihits <- searchSetText("9606", "innate immun*")
fetchSomeSets("9606", head(ihits))
thits <- searchSetText("9606", "cd? t cell")
fetchSomeSets("9606", head(thits))
```

Users can construct powerful queries by intersecting the sets recovered from `searchSetText()` with those from `findOverlappingSets()`.

```{r}
cancer.sets <- intersect(chits, overlaps$overlap$set)
info <- fetchSomeSets("9606", cancer.sets)
m <- match(cancer.sets, overlaps$overlap$set)
info$count <- overlaps$overlap$count[m]
info$pvalue <- overlaps$overlap$pvalue[m]
info[head(order(info$pvalue)),]
```

## Fetching all data

Gesel is designed around partial extraction from the database files, but it may be more efficient to pull all of the data into the R session at once.
This is most useful for the gene set details, which can be retrieved _en masse_:

```{r}
set.info <- fetchAllSets("9606")
nrow(set.info)
head(set.info)
```

The set indices can then be used to directly subset the `set.info` data frame by row.
In fact, calling `fetchSomeSets()` after `fetchAllSets()` will automatically use the data frame created by the latter,
instead of attempting another request to the database.

```{r}
head(set.info[cancer.sets,])
head(fetchSomeSets("9606", cancer.sets)) # same results
```

The same approach can be used to extract collection information, via `fetchAllCollections()`;
gene set membership, via `fetchGenesForAllSets()`;
and the sets containing each gene, via `fetchSetsForAllGenes()`.

## Flushing the cache

`r self` uses a lot of in-memory caching to reduce the number of requests to the database files within a single R session.
On rare occasions, the cache may become outdated, e.g., if the database files are updated while an R session is running.
Users can prompt `r self` to re-acquire all data by flushing the cache:

```{r}
flushMemoryCache()
```

# Customizing the database

## Creating the custom database

Users can easily create custom Gesel database for their own gene set collections.
This is most relevant to large research organizations that have accumulated many gene sets over time and want to easily search them.
To demonstrate, let's mock up some gene set information:

```{r}
mock.collection.info <- data.frame(
    title=c("FOO", "BAR"),
    description=c("I am a foo", "I am a bar"),
    maintainer=c("Aaron", "Aaron"),
    source=c("https://foo", "https://bar")
)

mock.set.info <- list(
    data.frame(
        name=sprintf("FOO_%i", seq_len(20)),
        description=sprintf("this is FOO %i", seq_len(20))
    ),
    data.frame(
        name=sprintf("BAR_%i", seq_len(50)),
        description=sprintf("this is BAR %i", seq_len(50))
    )
)

mock.num.genes <- 10000
mock.set.membership <- list(
    lapply(seq_len(nrow(mock.set.info[[1]])), function(i) {
        sample(mock.num.genes, sample(500, 1))
    }),
    lapply(seq_len(nrow(mock.set.info[[2]])), function(i) {
        sample(mock.num.genes, sample(200, 1))
    })
)
```

Now we create the Gesel database files:

```{r}
mock.db.dir <- tempfile()
dir.create(mock.db.dir)
prepareDatabaseFiles(
    "12345", # mock NCBI taxonomy ID
    mock.collection.info, 
    mock.set.info, 
    mock.set.membership,
    mock.num.genes,
    path = mock.db.dir
)
list.files(mock.db.dir)
```

We also create gene annotation files to map Gesel's internal gene indices to more recognizable identifiers.
Most downstream users will expect files for Ensembl, Entrez and gene symbols.

```{r}
mock.ref.genes <- list()
for (type in c("ensembl", "entrez", "symbol")) {
    gene.ids <- sprintf("%s_%05d", type, seq_len(19999))

    # Generating a 1:many mapping of genes to names to make things interesting.
    genes.plus <- mock.num.genes * 1.1 
    chosen <- sample(gene.ids, genes.plus, replace=TRUE)
    ids <- sample(mock.num.genes, genes.plus, replace=TRUE)
    mock.ids <- split(chosen, factor(ids, seq_len(mock.num.genes)))

    # Cleaning things up before saving to file.
    mock.ids <- lapply(mock.ids, unique)
    mock.ref.genes[[type]] <- unname(mock.ids)
}

mock.gene.dir <- tempfile()
dir.create(mock.gene.dir)
prepareGeneFiles(
    "12345",
    mock.ref.genes,
    path = mock.gene.dir
)
list.files(mock.gene.dir)
```

## Reading custom files

In each `r self` function, the `config=` argument determines how data is read from the database files.
By default, `r self` makes HTTP requests to the pre-built database files in the [feedstock](https://github.com/gesel-inc/feedstock).
For our custom files, we can simply direct `r self` to look inside our new directory:

```{r}
custom.config <- newConfig(
    fetch.file=function(x) file.path(mock.db.dir, x),
    fetch.gene=function(x) file.path(mock.gene.dir, x),
    fetch.ranges=function(n, s, e) readDatabaseRanges(mock.db.dir, n, s, e) 
)
```

Passing `config = custom.config` to any `r self` function will instruct it to use our custom database files:

```{r}
# Trying a low-level function:
head(fetchAllSets("12345", config = custom.config))

# Trying one of the more user-friendly functions:
head(querySets(
    "12345",
    genes = sprintf("ensembl_%05d", 1:100),
    text = "FOO",
    config = custom.config
))
```

It is possible to do some fairly sophisticated things with `newConfig()`, depending on the environment.
For example, in hybrid environments, we can directly read files from the filesystem when running `r self` on-prem for reduced latency,
but then fall back to a HTTP request if the R session is running on the cloud. 

## Contributing custom sets

Alternatively, you might consider contributing your custom gene sets to the default Gesel database at https://github.com/gesel-inc/feedstock.
Contributed gene sets will be searchable with the default `r self` functions, which may be more convenient than creating a custom database.

# Session information {-}

```{r}
sessionInfo()
```