Gene set searches with the Gesel database

Introduction

The gesel package implements an R client to the Gesel database for gene set searching. The Gesel database is a collection of static files containing information about gene sets, which can be hosted anywhere - via a file server, on a HPC’s shared filesystem etc. Clients like the gesel package then use HTTP range requests to perform a variety of queries. No custom server logic is required, greatly reducing effort and cost required to keep Gesel up and running. Most database files do not need to be downloaded to the client, allowing us to easily scale with increasing numbers of gene sets in the database.

More details can be found at https://github.com/gesel-inc. A web interface is also available at https://gesel-inc.github.io/web-app.

Quick start

The raison d’etre of the Gesel database is to find overlaps between known gene sets and a user-supplied list of genes. Let’s say we have several human genes of interest:

my.genes <- c("SNAP25", "NEUROD6", "GAD1", "GAD2", "RELN")

We use querySets() to identify all gene sets that overlap with my.genes. This also gives the enrichment \(p\)-value from a hypergeometric test.

library(gesel)
res <- querySets("9606", genes = my.genes)
head(res)
##                                                       name
## 1                                               MODULE_563
## 2 REACTOME_GABA_SYNTHESIS_RELEASE_REUPTAKE_AND_DEGRADATION
## 3                                               GO:0004351
## 4                                               GO:0006540
## 5                  REACTOME_NEUROTRANSMITTER_RELEASE_CYCLE
## 6                                  CACTGTG_MIR128A_MIR128B
##                                                                                                      description
## 1                                               https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MODULE_563
## 2 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_GABA_SYNTHESIS_RELEASE_REUPTAKE_AND_DEGRADATION
## 3                                                                               glutamate decarboxylase activity
## 4                                                                                      gamma-aminobutyrate shunt
## 5                  https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_NEUROTRANSMITTER_RELEASE_CYCLE
## 6                                  https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/CACTGTG_MIR128A_MIR128B
##   size
## 1   15
## 2   19
## 3    2
## 4    2
## 5   51
## 6  340
##                                                                         collection
## 1                           MSigDB C4: computational gene sets, CM: cancer modules
## 2            MSigDB C2: curated gene sets, CP: Canonical pathways, Reactome subset
## 3                                                                    Gene ontology
## 4                                                                    Gene ontology
## 5            MSigDB C2: curated gene sets, CP: Canonical pathways, Reactome subset
## 6 MSigDB C3: regulatory target gene sets, MIR: microRNA targets, MIR_LEGACY subset
##     set count       pvalue
## 1 30349     3 3.315470e-10
## 2 23511     3 7.059884e-10
## 3  1638     2 1.056967e-08
## 4  2860     2 1.056967e-08
## 5 23902     3 1.515581e-08
## 6 25759     4 1.822241e-08

Or we can filter by keywords in the gene set names or descriptions:

res <- querySets("9606", text = "brain")
head(res)
##                                                   name
## 1   BYSTRYKH_HEMATOPOIESIS_STEM_CELL_AND_BRAIN_QTL_CIS
## 2 BYSTRYKH_HEMATOPOIESIS_STEM_CELL_AND_BRAIN_QTL_TRANS
## 3                       CHESLER_BRAIN_D6MIT150_QTL_CIS
## 4                     CHESLER_BRAIN_D6MIT150_QTL_TRANS
## 5                     CHESLER_BRAIN_HIGHEST_EXPRESSION
## 6               CHESLER_BRAIN_HIGHEST_GENETIC_VARIANCE
##                                                                                                  description
## 1   https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/BYSTRYKH_HEMATOPOIESIS_STEM_CELL_AND_BRAIN_QTL_CIS
## 2 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/BYSTRYKH_HEMATOPOIESIS_STEM_CELL_AND_BRAIN_QTL_TRANS
## 3                       https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/CHESLER_BRAIN_D6MIT150_QTL_CIS
## 4                     https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/CHESLER_BRAIN_D6MIT150_QTL_TRANS
## 5                     https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/CHESLER_BRAIN_HIGHEST_EXPRESSION
## 6               https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/CHESLER_BRAIN_HIGHEST_GENETIC_VARIANCE
##   size                                                            collection
## 1   69 MSigDB C2: curated gene sets, CGP: chemical and genetic perturbations
## 2  155 MSigDB C2: curated gene sets, CGP: chemical and genetic perturbations
## 3    6 MSigDB C2: curated gene sets, CGP: chemical and genetic perturbations
## 4    9 MSigDB C2: curated gene sets, CGP: chemical and genetic perturbations
## 5   41 MSigDB C2: curated gene sets, CGP: chemical and genetic perturbations
## 6   38 MSigDB C2: curated gene sets, CGP: chemical and genetic perturbations
##     set
## 1 19536
## 2 19537
## 3 19647
## 4 19648
## 5 19649
## 6 19650
# Or, filtering by both genes and text. 
res <- querySets("9606", genes = my.genes, text = "neur*")
head(res)
##                                               name
## 1          REACTOME_NEUROTRANSMITTER_RELEASE_CYCLE
## 2 WP_CELL_LINEAGE_MAP_FOR_NEURONAL_DIFFERENTIATION
## 3                  MANNO_MIDBRAIN_NEUROTYPES_HGABA
## 4                   MANNO_MIDBRAIN_NEUROTYPES_HNBM
## 5                         REACTOME_NEURONAL_SYSTEM
## 6                 MANNO_MIDBRAIN_NEUROTYPES_HNBML5
##                                                                                              description
## 1          https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_NEUROTRANSMITTER_RELEASE_CYCLE
## 2 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WP_CELL_LINEAGE_MAP_FOR_NEURONAL_DIFFERENTIATION
## 3                  https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MANNO_MIDBRAIN_NEUROTYPES_HGABA
## 4                   https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MANNO_MIDBRAIN_NEUROTYPES_HNBM
## 5                         https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_NEURONAL_SYSTEM
## 6                 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MANNO_MIDBRAIN_NEUROTYPES_HNBML5
##   size
## 1   51
## 2  132
## 3 1105
## 4  295
## 5  411
## 6  465
##                                                                  collection
## 1     MSigDB C2: curated gene sets, CP: Canonical pathways, Reactome subset
## 2 MSigDB C2: curated gene sets, CP: Canonical pathways, WikiPathways subset
## 3                                  MSigDB C8: cell type signature gene sets
## 4                                  MSigDB C8: cell type signature gene sets
## 5     MSigDB C2: curated gene sets, CP: Canonical pathways, Reactome subset
## 6                                  MSigDB C8: cell type signature gene sets
##     set count       pvalue
## 1 23902     3 1.515581e-08
## 2 24904     3 2.719047e-07
## 3 42292     4 2.028974e-06
## 4 42295     3 3.056438e-06
## 5 23898     3 8.256205e-06
## 6 42297     3 1.194453e-05

We can also load all available gene sets for a species for use in other gene set-related packages, e.g., fgsea.

everything <- loadAllSets("9606", type = "symbol", as.compressed = TRUE)
everything
## CharacterList of length 42580
## [[1]] PIGV ALG12
## [[2]] ERCC8 ERCC6 LIG4 TNP1 UNG XRCC1 SIRT1 APTX NEIL3 TDP1 APLF XNDC1N
## [[3]] ENDOG ERCC1 ERCC4 MRE11 RBBP8 SETMAR EXOG RAD50 ASTE1 DCLRE1C
## [[4]] ENO1 ENO2 ENO3 ENO4
## [[5]] LCT
## [[6]] SLC5A1 SLC5A2
## [[7]] IL7R KPNA1 KPNA2 RTEL1 SMARCAD1
## [[8]] MRE11 RAD50
## [[9]] PRC1 KIF23
## [[10]] GAA
## ...
## <42570 more elements>
S4Vectors::mcols(everything)
## DataFrame with 42580 rows and 4 columns
##                         name            description      size
##                  <character>            <character> <integer>
## 1                 GO:0000009 alpha-1,6-mannosyltr..         2
## 2                 GO:0000012 single strand break ..        12
## 3                 GO:0000014 single-stranded DNA ..        10
## 4                 GO:0000015 phosphopyruvate hydr..         4
## 5                 GO:0000016       lactase activity         1
## ...                      ...                    ...       ...
## 42576 HALLMARK_UNFOLDED_PR.. https://www.gsea-msi..       113
## 42577 HALLMARK_UV_RESPONSE.. https://www.gsea-msi..       144
## 42578 HALLMARK_UV_RESPONSE.. https://www.gsea-msi..       158
## 42579 HALLMARK_WNT_BETA_CA.. https://www.gsea-msi..        42
## 42580 HALLMARK_XENOBIOTIC_.. https://www.gsea-msi..       200
##                   collection
##                  <character>
## 1              Gene ontology
## 2              Gene ontology
## 3              Gene ontology
## 4              Gene ontology
## 5              Gene ontology
## ...                      ...
## 42576 MSigDB H: hallmark g..
## 42577 MSigDB H: hallmark g..
## 42578 MSigDB H: hallmark g..
## 42579 MSigDB H: hallmark g..
## 42580 MSigDB H: hallmark g..

Using low-level functions

Finding overlaps

The functions described above are convenient but opinionated wrappers around some of gesel’s low-level functions. Developers may prefer to use these low-level functions directly for greater control of the output, e.g., to avoid unnecessary work. To demonstrate, we first map our genes of interest to Gesel gene indices, which are Gesel’s internal identifiers for each gene.

library(gesel)
gene.idx <- searchGenes("9606", my.genes)
gene.idx # this is a list of integer vectors, in case of 1:many mappings.
## [[1]]
## [1] 4645
## 
## [[2]]
## [1] 12799
## 
## [[3]]
## [1] 1758
## 
## [[4]]
## [1] 1759
## 
## [[5]]
## [1] 3916
gene.idx <- unlist(gene.idx)
fetchAllGenes("9606")[gene.idx,] # double-checking that we got it right.
##       entrez      ensembl  symbol
## 4645    6616 ENSG0000....  SNAP25
## 12799  63974 ENSG0000.... NEUROD6
## 1758    2571 ENSG0000....    GAD1
## 1759    2572 ENSG0000....    GAD2
## 3916    5649 ENSG0000....    RELN

Now we find all sets with overlapping genes. The searchOverlappingSets() identifies all sets that overlap at least one gene in the user-supplied set.

overlaps <- searchOverlappingSets("9606", gene.idx, counts.only=FALSE)
head(overlaps$overlap) # set index and the identities of overlapping genes.
##     set count genes size      pvalue
## 58   58     1  1759  700 0.077915063
## 62   62     1  4645   66 0.007563568
## 237 237     1 12799 1113 0.121555167
## 274 274     1  3916  117 0.013376739
## 299 299     1 12799 1089 0.119065304
## 301 301     1 12799 1223 0.132895108

The set index can be converted back to information about each set, along with the collection from which it came:

first.set.info <- fetchSomeSets("9606", head(overlaps$overlap$set)) # getting details of the first few sets
first.set.info
##         name
## 1 GO:0000139
## 2 GO:0000149
## 3 GO:0000785
## 4 GO:0000902
## 5 GO:0000978
## 6 GO:0000981
##                                                             description size
## 1                                                        Golgi membrane  700
## 2                                                         SNARE binding   66
## 3                                                             chromatin 1113
## 4                                                    cell morphogenesis  117
## 5 RNA polymerase II cis-regulatory region sequence-specific DNA binding 1089
## 6 DNA-binding transcription factor activity, RNA polymerase II-specific 1223
##   collection number
## 1          1     58
## 2          1     62
## 3          1    237
## 4          1    274
## 5          1    299
## 6          1    301
first.col.info <- fetchSomeCollections("9606", first.set.info$collection) # getting details of the collections
first.col.info
##           title
## 1 Gene ontology
## 2 Gene ontology
## 3 Gene ontology
## 4 Gene ontology
## 5 Gene ontology
## 6 Gene ontology
##                                                                                                             description
## 1 Gene sets defined from the Gene Ontology (version 2026-01-23), sourced from AH121953 in Bioconductor's AnnotationHub.
## 2 Gene sets defined from the Gene Ontology (version 2026-01-23), sourced from AH121953 in Bioconductor's AnnotationHub.
## 3 Gene sets defined from the Gene Ontology (version 2026-01-23), sourced from AH121953 in Bioconductor's AnnotationHub.
## 4 Gene sets defined from the Gene Ontology (version 2026-01-23), sourced from AH121953 in Bioconductor's AnnotationHub.
## 5 Gene sets defined from the Gene Ontology (version 2026-01-23), sourced from AH121953 in Bioconductor's AnnotationHub.
## 6 Gene sets defined from the Gene Ontology (version 2026-01-23), sourced from AH121953 in Bioconductor's AnnotationHub.
##   maintainer                                                           source
## 1  Aaron Lun https://github.com/gesel-inc/feedstock/blob/go-v1.1.0/go/build.R
## 2  Aaron Lun https://github.com/gesel-inc/feedstock/blob/go-v1.1.0/go/build.R
## 3  Aaron Lun https://github.com/gesel-inc/feedstock/blob/go-v1.1.0/go/build.R
## 4  Aaron Lun https://github.com/gesel-inc/feedstock/blob/go-v1.1.0/go/build.R
## 5  Aaron Lun https://github.com/gesel-inc/feedstock/blob/go-v1.1.0/go/build.R
## 6  Aaron Lun https://github.com/gesel-inc/feedstock/blob/go-v1.1.0/go/build.R
##   start  size
## 1     1 18864
## 2     1 18864
## 3     1 18864
## 4     1 18864
## 5     1 18864
## 6     1 18864

Searching on text

We can also search for gene sets based on the text in their names or descriptions. The searchSetText() function accepts a query string with multiple words and the usual * (any characters) and ? (one character) wildcards.

chits <- searchSetText("9606", "cancer")
fetchSomeSets("9606", head(chits))
##                                    name
## 1      ABDULRAHMAN_KIDNEY_CANCER_VHL_DN
## 2               ACEVEDO_LIVER_CANCER_DN
## 3               ACEVEDO_LIVER_CANCER_UP
## 4 ACEVEDO_LIVER_CANCER_WITH_H3K27ME3_DN
## 5 ACEVEDO_LIVER_CANCER_WITH_H3K27ME3_UP
## 6  ACEVEDO_LIVER_CANCER_WITH_H3K9ME3_DN
##                                                                                   description
## 1      https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ABDULRAHMAN_KIDNEY_CANCER_VHL_DN
## 2               https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_DN
## 3               https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_UP
## 4 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_WITH_H3K27ME3_DN
## 5 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_WITH_H3K27ME3_UP
## 6  https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_WITH_H3K9ME3_DN
##   size collection number
## 1   13          3      6
## 2  540          3     14
## 3  972          3     15
## 4  220          3     16
## 5  289          3     17
## 6  119          3     18
ihits <- searchSetText("9606", "innate immun*")
fetchSomeSets("9606", head(ihits))
##                                                                             name
## 1 REACTOME_DENGUE_VIRUS_ACTIVATES_MODULATES_INNATE_AND_ADAPTIVE_IMMUNE_RESPONSES
## 2                                                  REACTOME_INNATE_IMMUNE_SYSTEM
## 3                REACTOME_REGULATION_OF_INNATE_IMMUNE_RESPONSES_TO_CYTOSOLIC_DNA
## 4                REACTOME_SARS_COV_1_ACTIVATES_MODULATES_INNATE_IMMUNE_RESPONSES
## 5   REACTOME_SARS_COV_2_ACTIVATES_MODULATES_INNATE_AND_ADAPTIVE_IMMUNE_RESPONSES
## 6               WP_PATHWAYS_OF_NUCLEIC_ACID_METABOLISM_AND_INNATE_IMMUNE_SENSING
##                                                                                                                            description
## 1 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_DENGUE_VIRUS_ACTIVATES_MODULATES_INNATE_AND_ADAPTIVE_IMMUNE_RESPONSES
## 2                                                  https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_INNATE_IMMUNE_SYSTEM
## 3                https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_REGULATION_OF_INNATE_IMMUNE_RESPONSES_TO_CYTOSOLIC_DNA
## 4                https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_SARS_COV_1_ACTIVATES_MODULATES_INNATE_IMMUNE_RESPONSES
## 5   https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_SARS_COV_2_ACTIVATES_MODULATES_INNATE_AND_ADAPTIVE_IMMUNE_RESPONSES
## 6               https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WP_PATHWAYS_OF_NUCLEIC_ACID_METABOLISM_AND_INNATE_IMMUNE_SENSING
##   size collection number
## 1   32          5    359
## 2 1113          5    725
## 3   15          5   1269
## 4   41          5   1426
## 5  126          5   1431
## 6   16          6    680
thits <- searchSetText("9606", "cd? t cell")
fetchSomeSets("9606", head(thits))
##                                                    name
## 1 HP_DECREASED_ANTI_CD3_28_INDUCED_T_CELL_PROLIFERATION
## 2         GAVISH_3CA_METAPROGRAM_CD4_T_CELLS_CELL_CYCLE
## 3                     HP_ABNORMAL_CD4_T_CELL_PROPORTION
## 4              HP_ABNORMAL_CD4_T_CELL_SUBSET_PROPORTION
## 5               HP_ABNORMAL_NAIVE_CD4_T_CELL_PROPORTION
## 6              HP_DECREASED_NAIVE_CD4_T_CELL_PROPORTION
##                                                                                                   description
## 1 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HP_DECREASED_ANTI_CD3_28_INDUCED_T_CELL_PROLIFERATION
## 2         https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/GAVISH_3CA_METAPROGRAM_CD4_T_CELLS_CELL_CYCLE
## 3                     https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HP_ABNORMAL_CD4_T_CELL_PROPORTION
## 4              https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HP_ABNORMAL_CD4_T_CELL_SUBSET_PROPORTION
## 5               https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HP_ABNORMAL_NAIVE_CD4_T_CELL_PROPORTION
## 6              https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HP_DECREASED_NAIVE_CD4_T_CELL_PROPORTION
##   size collection number
## 1   11         14   2207
## 2   49         11     53
## 3   47         14    388
## 4   23         14    389
## 5   11         14    850
## 6    9         14   2274

Users can construct powerful queries by intersecting the sets recovered from searchSetText() with those from findOverlappingSets().

cancer.sets <- intersect(chits, overlaps$overlap$set)
info <- fetchSomeSets("9606", cancer.sets)
m <- match(cancer.sets, overlaps$overlap$set)
info$count <- overlaps$overlap$count[m]
info$pvalue <- overlaps$overlap$pvalue[m]
info[head(order(info$pvalue)),]
##                                                       name
## 5                      LOPES_METHYLATED_IN_COLON_CANCER_UP
## 7  SCHLESINGER_H3K27ME3_IN_NORMAL_AND_METHYLATED_IN_CANCER
## 14                     WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 12         WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 3                  KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 13                                WANG_HCP_PROSTATE_CANCER
##                                                                                                      description
## 5                      https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LOPES_METHYLATED_IN_COLON_CANCER_UP
## 7  https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SCHLESINGER_H3K27ME3_IN_NORMAL_AND_METHYLATED_IN_CANCER
## 14                     https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 12         https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 3                  https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 13                                https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_HCP_PROSTATE_CANCER
##    size collection number count      pvalue
## 5    28          3   1799     1 0.003214398
## 7    28          3   2717     1 0.003214398
## 14   30          3   3242     1 0.003443681
## 12   37          3   3194     1 0.004245840
## 3    97          3   1524     1 0.011100320
## 13  106          3   3206     1 0.012125228

Fetching all data

Gesel is designed around partial extraction from the database files, but it may be more efficient to pull all of the data into the R session at once. This is most useful for the gene set details, which can be retrieved en masse:

set.info <- fetchAllSets("9606")
nrow(set.info)
## [1] 42580
head(set.info)
##         name                                        description size collection
## 1 GO:0000009             alpha-1,6-mannosyltransferase activity    2          1
## 2 GO:0000012                         single strand break repair   12          1
## 3 GO:0000014 single-stranded DNA endodeoxyribonuclease activity   10          1
## 4 GO:0000015                  phosphopyruvate hydratase complex    4          1
## 5 GO:0000016                                   lactase activity    1          1
## 6 GO:0000017                          alpha-glucoside transport    2          1
##   number
## 1      1
## 2      2
## 3      3
## 4      4
## 5      5
## 6      6

The set indices can then be used to directly subset the set.info data frame by row. In fact, calling fetchSomeSets() after fetchAllSets() will automatically use the data frame created by the latter, instead of attempting another request to the database.

head(set.info[cancer.sets,])
##                                                      name
## 19180                             ACEVEDO_LIVER_CANCER_DN
## 19820                             DELYS_THYROID_CANCER_DN
## 20690             KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 20913    LIU_OVARIAN_CANCER_TUMORS_AND_XENOGRAFTS_XDGS_DN
## 20965                 LOPES_METHYLATED_IN_COLON_CANCER_UP
## 21857 SATO_SILENCED_BY_METHYLATION_IN_PANCREATIC_CANCER_1
##                                                                                                     description
## 19180                             https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_DN
## 19820                             https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/DELYS_THYROID_CANCER_DN
## 20690             https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 20913    https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LIU_OVARIAN_CANCER_TUMORS_AND_XENOGRAFTS_XDGS_DN
## 20965                 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LOPES_METHYLATED_IN_COLON_CANCER_UP
## 21857 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SATO_SILENCED_BY_METHYLATION_IN_PANCREATIC_CANCER_1
##       size collection number
## 19180  540          3     14
## 19820  233          3    654
## 20690   97          3   1524
## 20913 1716          3   1747
## 20965   28          3   1799
## 21857  422          3   2691
head(fetchSomeSets("9606", cancer.sets)) # same results
##                                                  name
## 1                             ACEVEDO_LIVER_CANCER_DN
## 2                             DELYS_THYROID_CANCER_DN
## 3             KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 4    LIU_OVARIAN_CANCER_TUMORS_AND_XENOGRAFTS_XDGS_DN
## 5                 LOPES_METHYLATED_IN_COLON_CANCER_UP
## 6 SATO_SILENCED_BY_METHYLATION_IN_PANCREATIC_CANCER_1
##                                                                                                 description
## 1                             https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_DN
## 2                             https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/DELYS_THYROID_CANCER_DN
## 3             https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 4    https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LIU_OVARIAN_CANCER_TUMORS_AND_XENOGRAFTS_XDGS_DN
## 5                 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LOPES_METHYLATED_IN_COLON_CANCER_UP
## 6 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SATO_SILENCED_BY_METHYLATION_IN_PANCREATIC_CANCER_1
##   size collection number
## 1  540          3     14
## 2  233          3    654
## 3   97          3   1524
## 4 1716          3   1747
## 5   28          3   1799
## 6  422          3   2691

The same approach can be used to extract collection information, via fetchAllCollections(); gene set membership, via fetchGenesForAllSets(); and the sets containing each gene, via fetchSetsForAllGenes().

Flushing the cache

gesel uses a lot of in-memory caching to reduce the number of requests to the database files within a single R session. On rare occasions, the cache may become outdated, e.g., if the database files are updated while an R session is running. Users can prompt gesel to re-acquire all data by flushing the cache:

flushMemoryCache()

Customizing the database

Creating the custom database

Users can easily create custom Gesel database for their own gene set collections. This is most relevant to large research organizations that have accumulated many gene sets over time and want to easily search them. To demonstrate, let’s mock up some gene set information:

mock.collection.info <- data.frame(
    title=c("FOO", "BAR"),
    description=c("I am a foo", "I am a bar"),
    maintainer=c("Aaron", "Aaron"),
    source=c("https://foo", "https://bar")
)

mock.set.info <- list(
    data.frame(
        name=sprintf("FOO_%i", seq_len(20)),
        description=sprintf("this is FOO %i", seq_len(20))
    ),
    data.frame(
        name=sprintf("BAR_%i", seq_len(50)),
        description=sprintf("this is BAR %i", seq_len(50))
    )
)

mock.num.genes <- 10000
mock.set.membership <- list(
    lapply(seq_len(nrow(mock.set.info[[1]])), function(i) {
        sample(mock.num.genes, sample(500, 1))
    }),
    lapply(seq_len(nrow(mock.set.info[[2]])), function(i) {
        sample(mock.num.genes, sample(200, 1))
    })
)

Now we create the Gesel database files:

mock.db.dir <- tempfile()
dir.create(mock.db.dir)
prepareDatabaseFiles(
    "12345", # mock NCBI taxonomy ID
    mock.collection.info, 
    mock.set.info, 
    mock.set.membership,
    mock.num.genes,
    path = mock.db.dir
)
list.files(mock.db.dir)
##  [1] "12345_collections.tsv"                  
##  [2] "12345_collections.tsv.gz"               
##  [3] "12345_collections.tsv.ranges.gz"        
##  [4] "12345_gene2set.tsv"                     
##  [5] "12345_gene2set.tsv.gz"                  
##  [6] "12345_gene2set.tsv.ranges.gz"           
##  [7] "12345_set2gene.tsv"                     
##  [8] "12345_set2gene.tsv.gz"                  
##  [9] "12345_set2gene.tsv.ranges.gz"           
## [10] "12345_sets.tsv"                         
## [11] "12345_sets.tsv.gz"                      
## [12] "12345_sets.tsv.ranges.gz"               
## [13] "12345_tokens-descriptions.tsv"          
## [14] "12345_tokens-descriptions.tsv.gz"       
## [15] "12345_tokens-descriptions.tsv.ranges.gz"
## [16] "12345_tokens-names.tsv"                 
## [17] "12345_tokens-names.tsv.gz"              
## [18] "12345_tokens-names.tsv.ranges.gz"

We also create gene annotation files to map Gesel’s internal gene indices to more recognizable identifiers. Most downstream users will expect files for Ensembl, Entrez and gene symbols.

mock.ref.genes <- list()
for (type in c("ensembl", "entrez", "symbol")) {
    gene.ids <- sprintf("%s_%05d", type, seq_len(19999))

    # Generating a 1:many mapping of genes to names to make things interesting.
    genes.plus <- mock.num.genes * 1.1 
    chosen <- sample(gene.ids, genes.plus, replace=TRUE)
    ids <- sample(mock.num.genes, genes.plus, replace=TRUE)
    mock.ids <- split(chosen, factor(ids, seq_len(mock.num.genes)))

    # Cleaning things up before saving to file.
    mock.ids <- lapply(mock.ids, unique)
    mock.ref.genes[[type]] <- unname(mock.ids)
}

mock.gene.dir <- tempfile()
dir.create(mock.gene.dir)
prepareGeneFiles(
    "12345",
    mock.ref.genes,
    path = mock.gene.dir
)
list.files(mock.gene.dir)
## [1] "12345_gene-type-ensembl.tsv.gz" "12345_gene-type-entrez.tsv.gz" 
## [3] "12345_gene-type-symbol.tsv.gz"  "12345_gene-types.tsv"          
## [5] "12345_gene-version.tsv"

Reading custom files

In each gesel function, the config= argument determines how data is read from the database files. By default, gesel makes HTTP requests to the pre-built database files in the feedstock. For our custom files, we can simply direct gesel to look inside our new directory:

custom.config <- newConfig(
    fetch.file=function(x) file.path(mock.db.dir, x),
    fetch.gene=function(x) file.path(mock.gene.dir, x),
    fetch.ranges=function(n, s, e) readDatabaseRanges(mock.db.dir, n, s, e) 
)

Passing config = custom.config to any gesel function will instruct it to use our custom database files:

# Trying a low-level function:
head(fetchAllSets("12345", config = custom.config))
##    name   description size collection number
## 1 FOO_1 this is FOO 1  268          1      1
## 2 FOO_2 this is FOO 2  250          1      2
## 3 FOO_3 this is FOO 3  494          1      3
## 4 FOO_4 this is FOO 4  344          1      4
## 5 FOO_5 this is FOO 5  205          1      5
## 6 FOO_6 this is FOO 6   69          1      6
# Trying one of the more user-friendly functions:
head(querySets(
    "12345",
    genes = sprintf("ensembl_%05d", 1:100),
    text = "FOO",
    config = custom.config
))
##     name    description size collection set count      pvalue
## 1  FOO_5  this is FOO 5  205        FOO   5     5 0.007283379
## 2  FOO_7  this is FOO 7  292        FOO   7     4 0.097075705
## 3  FOO_4  this is FOO 4  344        FOO   4     4 0.150703791
## 4 FOO_15 this is FOO 15  234        FOO  15     3 0.165590697
## 5 FOO_13 this is FOO 13  169        FOO  13     2 0.270774204
## 6  FOO_9  this is FOO 9  192        FOO   9     2 0.322410207

It is possible to do some fairly sophisticated things with newConfig(), depending on the environment. For example, in hybrid environments, we can directly read files from the filesystem when running gesel on-prem for reduced latency, but then fall back to a HTTP request if the R session is running on the cloud.

Contributing custom sets

Alternatively, you might consider contributing your custom gene sets to the default Gesel database at https://github.com/gesel-inc/feedstock. Contributed gene sets will be searchable with the default gesel functions, which may be more convenient than creating a custom database.

Session information

sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gesel_0.99.4     BiocStyle_2.41.0
## 
## loaded via a namespace (and not attached):
##  [1] cli_3.6.6           knitr_1.51          rlang_1.2.0        
##  [4] xfun_0.58           otel_0.2.0          generics_0.1.4     
##  [7] jsonlite_2.0.0      glue_1.8.1          S4Vectors_0.51.3   
## [10] buildtools_1.0.0    htmltools_0.5.9     maketools_1.3.2    
## [13] sys_3.4.3           stats4_4.6.0        sass_0.4.10        
## [16] rmarkdown_2.31      rappdirs_0.3.4      evaluate_1.0.5     
## [19] jquerylib_0.1.4     fastmap_1.2.0       IRanges_2.47.2     
## [22] yaml_2.3.12         lifecycle_1.0.5     httr2_1.2.2        
## [25] BiocManager_1.30.27 compiler_4.6.0      Rcpp_1.1.1-1.1     
## [28] digest_0.6.39       R6_2.6.1            curl_7.1.0         
## [31] magrittr_2.0.5      bslib_0.11.0        tools_4.6.0        
## [34] BiocGenerics_0.59.7 cachem_1.1.0