The gesel package implements an R client to the Gesel database for gene set searching. The Gesel database is a collection of static files containing information about gene sets, which can be hosted anywhere - via a file server, on a HPC’s shared filesystem etc. Clients like the gesel package then use HTTP range requests to perform a variety of queries. No custom server logic is required, greatly reducing effort and cost required to keep Gesel up and running. Most database files do not need to be downloaded to the client, allowing us to easily scale with increasing numbers of gene sets in the database.
More details can be found at https://github.com/gesel-inc. A web interface is also available at https://gesel-inc.github.io/web-app.
The raison d’etre of the Gesel database is to find overlaps between known gene sets and a user-supplied list of genes. Let’s say we have several human genes of interest:
We use querySets() to identify all gene sets that
overlap with my.genes. This also gives the enrichment \(p\)-value from a hypergeometric test.
## name
## 1 MODULE_563
## 2 REACTOME_GABA_SYNTHESIS_RELEASE_REUPTAKE_AND_DEGRADATION
## 3 GO:0004351
## 4 GO:0006540
## 5 REACTOME_NEUROTRANSMITTER_RELEASE_CYCLE
## 6 CACTGTG_MIR128A_MIR128B
## description
## 1 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MODULE_563
## 2 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_GABA_SYNTHESIS_RELEASE_REUPTAKE_AND_DEGRADATION
## 3 glutamate decarboxylase activity
## 4 gamma-aminobutyrate shunt
## 5 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_NEUROTRANSMITTER_RELEASE_CYCLE
## 6 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/CACTGTG_MIR128A_MIR128B
## size
## 1 15
## 2 19
## 3 2
## 4 2
## 5 51
## 6 340
## collection
## 1 MSigDB C4: computational gene sets, CM: cancer modules
## 2 MSigDB C2: curated gene sets, CP: Canonical pathways, Reactome subset
## 3 Gene ontology
## 4 Gene ontology
## 5 MSigDB C2: curated gene sets, CP: Canonical pathways, Reactome subset
## 6 MSigDB C3: regulatory target gene sets, MIR: microRNA targets, MIR_LEGACY subset
## set count pvalue
## 1 30349 3 3.315470e-10
## 2 23511 3 7.059884e-10
## 3 1638 2 1.056967e-08
## 4 2860 2 1.056967e-08
## 5 23902 3 1.515581e-08
## 6 25759 4 1.822241e-08
Or we can filter by keywords in the gene set names or descriptions:
## name
## 1 BYSTRYKH_HEMATOPOIESIS_STEM_CELL_AND_BRAIN_QTL_CIS
## 2 BYSTRYKH_HEMATOPOIESIS_STEM_CELL_AND_BRAIN_QTL_TRANS
## 3 CHESLER_BRAIN_D6MIT150_QTL_CIS
## 4 CHESLER_BRAIN_D6MIT150_QTL_TRANS
## 5 CHESLER_BRAIN_HIGHEST_EXPRESSION
## 6 CHESLER_BRAIN_HIGHEST_GENETIC_VARIANCE
## description
## 1 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/BYSTRYKH_HEMATOPOIESIS_STEM_CELL_AND_BRAIN_QTL_CIS
## 2 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/BYSTRYKH_HEMATOPOIESIS_STEM_CELL_AND_BRAIN_QTL_TRANS
## 3 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/CHESLER_BRAIN_D6MIT150_QTL_CIS
## 4 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/CHESLER_BRAIN_D6MIT150_QTL_TRANS
## 5 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/CHESLER_BRAIN_HIGHEST_EXPRESSION
## 6 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/CHESLER_BRAIN_HIGHEST_GENETIC_VARIANCE
## size collection
## 1 69 MSigDB C2: curated gene sets, CGP: chemical and genetic perturbations
## 2 155 MSigDB C2: curated gene sets, CGP: chemical and genetic perturbations
## 3 6 MSigDB C2: curated gene sets, CGP: chemical and genetic perturbations
## 4 9 MSigDB C2: curated gene sets, CGP: chemical and genetic perturbations
## 5 41 MSigDB C2: curated gene sets, CGP: chemical and genetic perturbations
## 6 38 MSigDB C2: curated gene sets, CGP: chemical and genetic perturbations
## set
## 1 19536
## 2 19537
## 3 19647
## 4 19648
## 5 19649
## 6 19650
# Or, filtering by both genes and text.
res <- querySets("9606", genes = my.genes, text = "neur*")
head(res)## name
## 1 REACTOME_NEUROTRANSMITTER_RELEASE_CYCLE
## 2 WP_CELL_LINEAGE_MAP_FOR_NEURONAL_DIFFERENTIATION
## 3 MANNO_MIDBRAIN_NEUROTYPES_HGABA
## 4 MANNO_MIDBRAIN_NEUROTYPES_HNBM
## 5 REACTOME_NEURONAL_SYSTEM
## 6 MANNO_MIDBRAIN_NEUROTYPES_HNBML5
## description
## 1 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_NEUROTRANSMITTER_RELEASE_CYCLE
## 2 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WP_CELL_LINEAGE_MAP_FOR_NEURONAL_DIFFERENTIATION
## 3 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MANNO_MIDBRAIN_NEUROTYPES_HGABA
## 4 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MANNO_MIDBRAIN_NEUROTYPES_HNBM
## 5 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_NEURONAL_SYSTEM
## 6 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MANNO_MIDBRAIN_NEUROTYPES_HNBML5
## size
## 1 51
## 2 132
## 3 1105
## 4 295
## 5 411
## 6 465
## collection
## 1 MSigDB C2: curated gene sets, CP: Canonical pathways, Reactome subset
## 2 MSigDB C2: curated gene sets, CP: Canonical pathways, WikiPathways subset
## 3 MSigDB C8: cell type signature gene sets
## 4 MSigDB C8: cell type signature gene sets
## 5 MSigDB C2: curated gene sets, CP: Canonical pathways, Reactome subset
## 6 MSigDB C8: cell type signature gene sets
## set count pvalue
## 1 23902 3 1.515581e-08
## 2 24904 3 2.719047e-07
## 3 42292 4 2.028974e-06
## 4 42295 3 3.056438e-06
## 5 23898 3 8.256205e-06
## 6 42297 3 1.194453e-05
We can also load all available gene sets for a species for use in other gene set-related packages, e.g., fgsea.
## CharacterList of length 42580
## [[1]] PIGV ALG12
## [[2]] ERCC8 ERCC6 LIG4 TNP1 UNG XRCC1 SIRT1 APTX NEIL3 TDP1 APLF XNDC1N
## [[3]] ENDOG ERCC1 ERCC4 MRE11 RBBP8 SETMAR EXOG RAD50 ASTE1 DCLRE1C
## [[4]] ENO1 ENO2 ENO3 ENO4
## [[5]] LCT
## [[6]] SLC5A1 SLC5A2
## [[7]] IL7R KPNA1 KPNA2 RTEL1 SMARCAD1
## [[8]] MRE11 RAD50
## [[9]] PRC1 KIF23
## [[10]] GAA
## ...
## <42570 more elements>
## DataFrame with 42580 rows and 4 columns
## name description size
## <character> <character> <integer>
## 1 GO:0000009 alpha-1,6-mannosyltr.. 2
## 2 GO:0000012 single strand break .. 12
## 3 GO:0000014 single-stranded DNA .. 10
## 4 GO:0000015 phosphopyruvate hydr.. 4
## 5 GO:0000016 lactase activity 1
## ... ... ... ...
## 42576 HALLMARK_UNFOLDED_PR.. https://www.gsea-msi.. 113
## 42577 HALLMARK_UV_RESPONSE.. https://www.gsea-msi.. 144
## 42578 HALLMARK_UV_RESPONSE.. https://www.gsea-msi.. 158
## 42579 HALLMARK_WNT_BETA_CA.. https://www.gsea-msi.. 42
## 42580 HALLMARK_XENOBIOTIC_.. https://www.gsea-msi.. 200
## collection
## <character>
## 1 Gene ontology
## 2 Gene ontology
## 3 Gene ontology
## 4 Gene ontology
## 5 Gene ontology
## ... ...
## 42576 MSigDB H: hallmark g..
## 42577 MSigDB H: hallmark g..
## 42578 MSigDB H: hallmark g..
## 42579 MSigDB H: hallmark g..
## 42580 MSigDB H: hallmark g..
The functions described above are convenient but opinionated wrappers around some of gesel’s low-level functions. Developers may prefer to use these low-level functions directly for greater control of the output, e.g., to avoid unnecessary work. To demonstrate, we first map our genes of interest to Gesel gene indices, which are Gesel’s internal identifiers for each gene.
library(gesel)
gene.idx <- searchGenes("9606", my.genes)
gene.idx # this is a list of integer vectors, in case of 1:many mappings.## [[1]]
## [1] 4645
##
## [[2]]
## [1] 12799
##
## [[3]]
## [1] 1758
##
## [[4]]
## [1] 1759
##
## [[5]]
## [1] 3916
gene.idx <- unlist(gene.idx)
fetchAllGenes("9606")[gene.idx,] # double-checking that we got it right.## entrez ensembl symbol
## 4645 6616 ENSG0000.... SNAP25
## 12799 63974 ENSG0000.... NEUROD6
## 1758 2571 ENSG0000.... GAD1
## 1759 2572 ENSG0000.... GAD2
## 3916 5649 ENSG0000.... RELN
Now we find all sets with overlapping genes. The
searchOverlappingSets() identifies all sets that overlap at
least one gene in the user-supplied set.
overlaps <- searchOverlappingSets("9606", gene.idx, counts.only=FALSE)
head(overlaps$overlap) # set index and the identities of overlapping genes.## set count genes size pvalue
## 58 58 1 1759 700 0.077915063
## 62 62 1 4645 66 0.007563568
## 237 237 1 12799 1113 0.121555167
## 274 274 1 3916 117 0.013376739
## 299 299 1 12799 1089 0.119065304
## 301 301 1 12799 1223 0.132895108
The set index can be converted back to information about each set, along with the collection from which it came:
first.set.info <- fetchSomeSets("9606", head(overlaps$overlap$set)) # getting details of the first few sets
first.set.info## name
## 1 GO:0000139
## 2 GO:0000149
## 3 GO:0000785
## 4 GO:0000902
## 5 GO:0000978
## 6 GO:0000981
## description size
## 1 Golgi membrane 700
## 2 SNARE binding 66
## 3 chromatin 1113
## 4 cell morphogenesis 117
## 5 RNA polymerase II cis-regulatory region sequence-specific DNA binding 1089
## 6 DNA-binding transcription factor activity, RNA polymerase II-specific 1223
## collection number
## 1 1 58
## 2 1 62
## 3 1 237
## 4 1 274
## 5 1 299
## 6 1 301
first.col.info <- fetchSomeCollections("9606", first.set.info$collection) # getting details of the collections
first.col.info## title
## 1 Gene ontology
## 2 Gene ontology
## 3 Gene ontology
## 4 Gene ontology
## 5 Gene ontology
## 6 Gene ontology
## description
## 1 Gene sets defined from the Gene Ontology (version 2026-01-23), sourced from AH121953 in Bioconductor's AnnotationHub.
## 2 Gene sets defined from the Gene Ontology (version 2026-01-23), sourced from AH121953 in Bioconductor's AnnotationHub.
## 3 Gene sets defined from the Gene Ontology (version 2026-01-23), sourced from AH121953 in Bioconductor's AnnotationHub.
## 4 Gene sets defined from the Gene Ontology (version 2026-01-23), sourced from AH121953 in Bioconductor's AnnotationHub.
## 5 Gene sets defined from the Gene Ontology (version 2026-01-23), sourced from AH121953 in Bioconductor's AnnotationHub.
## 6 Gene sets defined from the Gene Ontology (version 2026-01-23), sourced from AH121953 in Bioconductor's AnnotationHub.
## maintainer source
## 1 Aaron Lun https://github.com/gesel-inc/feedstock/blob/go-v1.1.0/go/build.R
## 2 Aaron Lun https://github.com/gesel-inc/feedstock/blob/go-v1.1.0/go/build.R
## 3 Aaron Lun https://github.com/gesel-inc/feedstock/blob/go-v1.1.0/go/build.R
## 4 Aaron Lun https://github.com/gesel-inc/feedstock/blob/go-v1.1.0/go/build.R
## 5 Aaron Lun https://github.com/gesel-inc/feedstock/blob/go-v1.1.0/go/build.R
## 6 Aaron Lun https://github.com/gesel-inc/feedstock/blob/go-v1.1.0/go/build.R
## start size
## 1 1 18864
## 2 1 18864
## 3 1 18864
## 4 1 18864
## 5 1 18864
## 6 1 18864
We can also search for gene sets based on the text in their names or
descriptions. The searchSetText() function accepts a query
string with multiple words and the usual * (any characters)
and ? (one character) wildcards.
## name
## 1 ABDULRAHMAN_KIDNEY_CANCER_VHL_DN
## 2 ACEVEDO_LIVER_CANCER_DN
## 3 ACEVEDO_LIVER_CANCER_UP
## 4 ACEVEDO_LIVER_CANCER_WITH_H3K27ME3_DN
## 5 ACEVEDO_LIVER_CANCER_WITH_H3K27ME3_UP
## 6 ACEVEDO_LIVER_CANCER_WITH_H3K9ME3_DN
## description
## 1 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ABDULRAHMAN_KIDNEY_CANCER_VHL_DN
## 2 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_DN
## 3 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_UP
## 4 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_WITH_H3K27ME3_DN
## 5 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_WITH_H3K27ME3_UP
## 6 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_WITH_H3K9ME3_DN
## size collection number
## 1 13 3 6
## 2 540 3 14
## 3 972 3 15
## 4 220 3 16
## 5 289 3 17
## 6 119 3 18
## name
## 1 REACTOME_DENGUE_VIRUS_ACTIVATES_MODULATES_INNATE_AND_ADAPTIVE_IMMUNE_RESPONSES
## 2 REACTOME_INNATE_IMMUNE_SYSTEM
## 3 REACTOME_REGULATION_OF_INNATE_IMMUNE_RESPONSES_TO_CYTOSOLIC_DNA
## 4 REACTOME_SARS_COV_1_ACTIVATES_MODULATES_INNATE_IMMUNE_RESPONSES
## 5 REACTOME_SARS_COV_2_ACTIVATES_MODULATES_INNATE_AND_ADAPTIVE_IMMUNE_RESPONSES
## 6 WP_PATHWAYS_OF_NUCLEIC_ACID_METABOLISM_AND_INNATE_IMMUNE_SENSING
## description
## 1 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_DENGUE_VIRUS_ACTIVATES_MODULATES_INNATE_AND_ADAPTIVE_IMMUNE_RESPONSES
## 2 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_INNATE_IMMUNE_SYSTEM
## 3 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_REGULATION_OF_INNATE_IMMUNE_RESPONSES_TO_CYTOSOLIC_DNA
## 4 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_SARS_COV_1_ACTIVATES_MODULATES_INNATE_IMMUNE_RESPONSES
## 5 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_SARS_COV_2_ACTIVATES_MODULATES_INNATE_AND_ADAPTIVE_IMMUNE_RESPONSES
## 6 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WP_PATHWAYS_OF_NUCLEIC_ACID_METABOLISM_AND_INNATE_IMMUNE_SENSING
## size collection number
## 1 32 5 359
## 2 1113 5 725
## 3 15 5 1269
## 4 41 5 1426
## 5 126 5 1431
## 6 16 6 680
## name
## 1 HP_DECREASED_ANTI_CD3_28_INDUCED_T_CELL_PROLIFERATION
## 2 GAVISH_3CA_METAPROGRAM_CD4_T_CELLS_CELL_CYCLE
## 3 HP_ABNORMAL_CD4_T_CELL_PROPORTION
## 4 HP_ABNORMAL_CD4_T_CELL_SUBSET_PROPORTION
## 5 HP_ABNORMAL_NAIVE_CD4_T_CELL_PROPORTION
## 6 HP_DECREASED_NAIVE_CD4_T_CELL_PROPORTION
## description
## 1 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HP_DECREASED_ANTI_CD3_28_INDUCED_T_CELL_PROLIFERATION
## 2 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/GAVISH_3CA_METAPROGRAM_CD4_T_CELLS_CELL_CYCLE
## 3 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HP_ABNORMAL_CD4_T_CELL_PROPORTION
## 4 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HP_ABNORMAL_CD4_T_CELL_SUBSET_PROPORTION
## 5 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HP_ABNORMAL_NAIVE_CD4_T_CELL_PROPORTION
## 6 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HP_DECREASED_NAIVE_CD4_T_CELL_PROPORTION
## size collection number
## 1 11 14 2207
## 2 49 11 53
## 3 47 14 388
## 4 23 14 389
## 5 11 14 850
## 6 9 14 2274
Users can construct powerful queries by intersecting the sets
recovered from searchSetText() with those from
findOverlappingSets().
cancer.sets <- intersect(chits, overlaps$overlap$set)
info <- fetchSomeSets("9606", cancer.sets)
m <- match(cancer.sets, overlaps$overlap$set)
info$count <- overlaps$overlap$count[m]
info$pvalue <- overlaps$overlap$pvalue[m]
info[head(order(info$pvalue)),]## name
## 5 LOPES_METHYLATED_IN_COLON_CANCER_UP
## 7 SCHLESINGER_H3K27ME3_IN_NORMAL_AND_METHYLATED_IN_CANCER
## 14 WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 12 WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 3 KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 13 WANG_HCP_PROSTATE_CANCER
## description
## 5 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LOPES_METHYLATED_IN_COLON_CANCER_UP
## 7 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SCHLESINGER_H3K27ME3_IN_NORMAL_AND_METHYLATED_IN_CANCER
## 14 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 12 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 3 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 13 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_HCP_PROSTATE_CANCER
## size collection number count pvalue
## 5 28 3 1799 1 0.003214398
## 7 28 3 2717 1 0.003214398
## 14 30 3 3242 1 0.003443681
## 12 37 3 3194 1 0.004245840
## 3 97 3 1524 1 0.011100320
## 13 106 3 3206 1 0.012125228
Gesel is designed around partial extraction from the database files, but it may be more efficient to pull all of the data into the R session at once. This is most useful for the gene set details, which can be retrieved en masse:
## [1] 42580
## name description size collection
## 1 GO:0000009 alpha-1,6-mannosyltransferase activity 2 1
## 2 GO:0000012 single strand break repair 12 1
## 3 GO:0000014 single-stranded DNA endodeoxyribonuclease activity 10 1
## 4 GO:0000015 phosphopyruvate hydratase complex 4 1
## 5 GO:0000016 lactase activity 1 1
## 6 GO:0000017 alpha-glucoside transport 2 1
## number
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
The set indices can then be used to directly subset the
set.info data frame by row. In fact, calling
fetchSomeSets() after fetchAllSets() will
automatically use the data frame created by the latter, instead of
attempting another request to the database.
## name
## 19180 ACEVEDO_LIVER_CANCER_DN
## 19820 DELYS_THYROID_CANCER_DN
## 20690 KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 20913 LIU_OVARIAN_CANCER_TUMORS_AND_XENOGRAFTS_XDGS_DN
## 20965 LOPES_METHYLATED_IN_COLON_CANCER_UP
## 21857 SATO_SILENCED_BY_METHYLATION_IN_PANCREATIC_CANCER_1
## description
## 19180 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_DN
## 19820 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/DELYS_THYROID_CANCER_DN
## 20690 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 20913 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LIU_OVARIAN_CANCER_TUMORS_AND_XENOGRAFTS_XDGS_DN
## 20965 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LOPES_METHYLATED_IN_COLON_CANCER_UP
## 21857 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SATO_SILENCED_BY_METHYLATION_IN_PANCREATIC_CANCER_1
## size collection number
## 19180 540 3 14
## 19820 233 3 654
## 20690 97 3 1524
## 20913 1716 3 1747
## 20965 28 3 1799
## 21857 422 3 2691
## name
## 1 ACEVEDO_LIVER_CANCER_DN
## 2 DELYS_THYROID_CANCER_DN
## 3 KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 4 LIU_OVARIAN_CANCER_TUMORS_AND_XENOGRAFTS_XDGS_DN
## 5 LOPES_METHYLATED_IN_COLON_CANCER_UP
## 6 SATO_SILENCED_BY_METHYLATION_IN_PANCREATIC_CANCER_1
## description
## 1 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/ACEVEDO_LIVER_CANCER_DN
## 2 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/DELYS_THYROID_CANCER_DN
## 3 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 4 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LIU_OVARIAN_CANCER_TUMORS_AND_XENOGRAFTS_XDGS_DN
## 5 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LOPES_METHYLATED_IN_COLON_CANCER_UP
## 6 https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SATO_SILENCED_BY_METHYLATION_IN_PANCREATIC_CANCER_1
## size collection number
## 1 540 3 14
## 2 233 3 654
## 3 97 3 1524
## 4 1716 3 1747
## 5 28 3 1799
## 6 422 3 2691
The same approach can be used to extract collection information, via
fetchAllCollections(); gene set membership, via
fetchGenesForAllSets(); and the sets containing each gene,
via fetchSetsForAllGenes().
gesel uses a lot of in-memory caching to reduce the number of requests to the database files within a single R session. On rare occasions, the cache may become outdated, e.g., if the database files are updated while an R session is running. Users can prompt gesel to re-acquire all data by flushing the cache:
Users can easily create custom Gesel database for their own gene set collections. This is most relevant to large research organizations that have accumulated many gene sets over time and want to easily search them. To demonstrate, let’s mock up some gene set information:
mock.collection.info <- data.frame(
title=c("FOO", "BAR"),
description=c("I am a foo", "I am a bar"),
maintainer=c("Aaron", "Aaron"),
source=c("https://foo", "https://bar")
)
mock.set.info <- list(
data.frame(
name=sprintf("FOO_%i", seq_len(20)),
description=sprintf("this is FOO %i", seq_len(20))
),
data.frame(
name=sprintf("BAR_%i", seq_len(50)),
description=sprintf("this is BAR %i", seq_len(50))
)
)
mock.num.genes <- 10000
mock.set.membership <- list(
lapply(seq_len(nrow(mock.set.info[[1]])), function(i) {
sample(mock.num.genes, sample(500, 1))
}),
lapply(seq_len(nrow(mock.set.info[[2]])), function(i) {
sample(mock.num.genes, sample(200, 1))
})
)Now we create the Gesel database files:
mock.db.dir <- tempfile()
dir.create(mock.db.dir)
prepareDatabaseFiles(
"12345", # mock NCBI taxonomy ID
mock.collection.info,
mock.set.info,
mock.set.membership,
mock.num.genes,
path = mock.db.dir
)
list.files(mock.db.dir)## [1] "12345_collections.tsv"
## [2] "12345_collections.tsv.gz"
## [3] "12345_collections.tsv.ranges.gz"
## [4] "12345_gene2set.tsv"
## [5] "12345_gene2set.tsv.gz"
## [6] "12345_gene2set.tsv.ranges.gz"
## [7] "12345_set2gene.tsv"
## [8] "12345_set2gene.tsv.gz"
## [9] "12345_set2gene.tsv.ranges.gz"
## [10] "12345_sets.tsv"
## [11] "12345_sets.tsv.gz"
## [12] "12345_sets.tsv.ranges.gz"
## [13] "12345_tokens-descriptions.tsv"
## [14] "12345_tokens-descriptions.tsv.gz"
## [15] "12345_tokens-descriptions.tsv.ranges.gz"
## [16] "12345_tokens-names.tsv"
## [17] "12345_tokens-names.tsv.gz"
## [18] "12345_tokens-names.tsv.ranges.gz"
We also create gene annotation files to map Gesel’s internal gene indices to more recognizable identifiers. Most downstream users will expect files for Ensembl, Entrez and gene symbols.
mock.ref.genes <- list()
for (type in c("ensembl", "entrez", "symbol")) {
gene.ids <- sprintf("%s_%05d", type, seq_len(19999))
# Generating a 1:many mapping of genes to names to make things interesting.
genes.plus <- mock.num.genes * 1.1
chosen <- sample(gene.ids, genes.plus, replace=TRUE)
ids <- sample(mock.num.genes, genes.plus, replace=TRUE)
mock.ids <- split(chosen, factor(ids, seq_len(mock.num.genes)))
# Cleaning things up before saving to file.
mock.ids <- lapply(mock.ids, unique)
mock.ref.genes[[type]] <- unname(mock.ids)
}
mock.gene.dir <- tempfile()
dir.create(mock.gene.dir)
prepareGeneFiles(
"12345",
mock.ref.genes,
path = mock.gene.dir
)
list.files(mock.gene.dir)## [1] "12345_gene-type-ensembl.tsv.gz" "12345_gene-type-entrez.tsv.gz"
## [3] "12345_gene-type-symbol.tsv.gz" "12345_gene-types.tsv"
## [5] "12345_gene-version.tsv"
In each gesel function, the
config= argument determines how data is read from the
database files. By default, gesel makes HTTP
requests to the pre-built database files in the feedstock. For our
custom files, we can simply direct gesel to look
inside our new directory:
custom.config <- newConfig(
fetch.file=function(x) file.path(mock.db.dir, x),
fetch.gene=function(x) file.path(mock.gene.dir, x),
fetch.ranges=function(n, s, e) readDatabaseRanges(mock.db.dir, n, s, e)
)Passing config = custom.config to any gesel function will
instruct it to use our custom database files:
## name description size collection number
## 1 FOO_1 this is FOO 1 268 1 1
## 2 FOO_2 this is FOO 2 250 1 2
## 3 FOO_3 this is FOO 3 494 1 3
## 4 FOO_4 this is FOO 4 344 1 4
## 5 FOO_5 this is FOO 5 205 1 5
## 6 FOO_6 this is FOO 6 69 1 6
# Trying one of the more user-friendly functions:
head(querySets(
"12345",
genes = sprintf("ensembl_%05d", 1:100),
text = "FOO",
config = custom.config
))## name description size collection set count pvalue
## 1 FOO_5 this is FOO 5 205 FOO 5 5 0.007283379
## 2 FOO_7 this is FOO 7 292 FOO 7 4 0.097075705
## 3 FOO_4 this is FOO 4 344 FOO 4 4 0.150703791
## 4 FOO_15 this is FOO 15 234 FOO 15 3 0.165590697
## 5 FOO_13 this is FOO 13 169 FOO 13 2 0.270774204
## 6 FOO_9 this is FOO 9 192 FOO 9 2 0.322410207
It is possible to do some fairly sophisticated things with
newConfig(), depending on the environment. For example, in
hybrid environments, we can directly read files from the filesystem when
running gesel on-prem for
reduced latency, but then fall back to a HTTP request if the R session
is running on the cloud.
Alternatively, you might consider contributing your custom gene sets to the default Gesel database at https://github.com/gesel-inc/feedstock. Contributed gene sets will be searchable with the default gesel functions, which may be more convenient than creating a custom database.
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gesel_0.99.4 BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.6 knitr_1.51 rlang_1.2.0
## [4] xfun_0.58 otel_0.2.0 generics_0.1.4
## [7] jsonlite_2.0.0 glue_1.8.1 S4Vectors_0.51.3
## [10] buildtools_1.0.0 htmltools_0.5.9 maketools_1.3.2
## [13] sys_3.4.3 stats4_4.6.0 sass_0.4.10
## [16] rmarkdown_2.31 rappdirs_0.3.4 evaluate_1.0.5
## [19] jquerylib_0.1.4 fastmap_1.2.0 IRanges_2.47.2
## [22] yaml_2.3.12 lifecycle_1.0.5 httr2_1.2.2
## [25] BiocManager_1.30.27 compiler_4.6.0 Rcpp_1.1.1-1.1
## [28] digest_0.6.39 R6_2.6.1 curl_7.1.0
## [31] magrittr_2.0.5 bslib_0.11.0 tools_4.6.0
## [34] BiocGenerics_0.59.7 cachem_1.1.0