| Title: | Tidy Interface to GO Semantic SQL via DuckDB |
|---|---|
| Description: | Provides a tidyverse-oriented user interface to Gene Ontology data via the Semantic SQL representation, accessed through DuckDB. Replaces the GO.db + AnnotationDbi::select nexus with lazy tibble-based operations for term lookup, ancestor/descendant traversal, and gene-GO annotation queries. The Semantic SQL resource is managed by the ontoProc2 package via BiocFileCache. |
| Authors: | Vincent Carey [aut, cre] (ORCID: <https://orcid.org/0000-0003-4046-0063>), Claude Sonnet46 [eng] |
| Maintainer: | Vincent Carey <[email protected]> |
| License: | Artistic-2.0 |
| Version: | 0.99.4 |
| Built: | 2026-05-14 19:40:18 UTC |
| Source: | https://github.com/BiocStaging/GO.ddb |
Converts the semsql SQLite tables to parquet format and writes them
to the BiocFileCache-managed parquet directory. This is a one-time
setup step: subsequent calls to make_go_con(backend = "parquet")
or make_go_con(backend = "auto") use the cached files
without re-converting.
build_parquet_cache(
  sqlite_path = NULL,
  ontology = "go",
  out_dir = NULL,
  tables = c("statements", "entailed_edge", "term_association")
)
sqlite_path |
character scalar path to the semsql SQLite file.
If |
ontology |
character scalar. Default |
out_dir |
path to desired folder for parquet cache management |
tables |
character vector of table names to convert. Default covers the three tables named in the usage above: statements, entailed_edge, and term_association. |
the parquet cache directory path, invisibly.
if (!has_parquet_cache()) {
  build_parquet_cache()
}
Disconnects the DuckDB instance and clears the package cache. The next call to any query function will trigger reconnection automatically.
disconnect_go()
NULL invisibly.
make_go_con, go_connection_active
GO.ddb::make_go_con()
GO.ddb::disconnect_go()
GO.ddb::go_connection_active()
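The reconnect-on-demand behaviour described for disconnect_go() can be pictured with a small base R sketch. It assumes the connection is held in a package-level environment; `.state`, `open_connection()`, `get_connection()`, and `drop_connection()` are hypothetical names, and a counter stands in for a real DuckDB connection. This is an illustration of the pattern, not the package's code.

```r
# Hypothetical sketch of a cached, lazily re-opened connection.
# A string handle stands in for a real DBI connection; none of these
# names come from GO.ddb.
.state <- new.env()
.state$con <- NULL
.state$opened <- 0L

open_connection <- function() {
  .state$opened <- .state$opened + 1L
  paste0("connection-", .state$opened)
}

get_connection <- function() {
  # Reconnect only when the cache is empty.
  if (is.null(.state$con)) {
    .state$con <- open_connection()
  }
  .state$con
}

drop_connection <- function() {
  # Clear the cached handle; the next get_connection() call reconnects.
  .state$con <- NULL
  invisible(NULL)
}

first <- get_connection()
again <- get_connection()   # reuses the cached handle
drop_connection()
fresh <- get_connection()   # opens a new handle
```

Keeping the handle in an environment lets every query helper share one connection without passing it as an argument.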
Useful for diagnostics, custom queries, or passing to dbplyr directly.
Returns NULL if no connection is active.
get_go_con()
a DBIConnection or NULL.
make_go_con()
con <- get_go_con()
DBI::dbGetQuery(con, "SELECT database_name, schema_name, table_name FROM duckdb_tables()")
disconnect_go()
Uses the precomputed transitive closure in the semsql
entailed_edge table to find all ancestors of the supplied
GO CURIEs under the specified relations. Returns a long-format lazy
tibble suitable for direct use with dplyr, ggraph, or
gene set enrichment tooling.
go_ancestors(
  ids,
  relations = unname(GO_RELATIONS[c("is_a", "part_of")]),
  include_self = FALSE
)
ids |
character vector of GO CURIEs. |
relations |
character vector of predicate CURIEs to traverse.
Defaults to |
include_self |
logical. If |
a lazy tbl_duckdb with columns:
the query term CURIE
ancestor term CURIE
predicate CURIE
Call dplyr::collect() to materialize.
make_go_con()
go_ancestors("GO:0006954") |> dplyr::collect()
# is_a only
go_ancestors("GO:0006954", relations = GO_RELATIONS["is_a"]) |> dplyr::collect()
# multiple query terms
go_ancestors(c("GO:0006954", "GO:0008150"), relations = unname(GO_RELATIONS[c("is_a", "part_of")])) |> dplyr::count(id)
disconnect_go()
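Because the closure is precomputed, an ancestor lookup reduces to a row filter rather than a graph walk. The base R sketch below illustrates this on an invented edge table with the subject, predicate, and object columns of entailed_edge. The function `toy_ancestors()`, the `T:` identifiers, the output column name `ancestor_id`, and the reading of `include_self` as "keep self-edges" are assumptions made for illustration.

```r
# Toy stand-in for entailed_edge: one row per entailed (closure) edge.
# Identifiers are invented; the real table holds GO CURIEs.
edges <- data.frame(
  subject   = c("T:3", "T:3", "T:2", "T:3", "T:3"),
  predicate = c("rdfs:subClassOf", "rdfs:subClassOf", "rdfs:subClassOf",
                "BFO:0000050", "rdfs:subClassOf"),
  object    = c("T:2", "T:1", "T:1", "T:4", "T:3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: filter closure rows whose subject is a query term.
toy_ancestors <- function(ids, relations, include_self = FALSE) {
  keep <- edges$subject %in% ids & edges$predicate %in% relations
  if (!include_self) {
    # Drop reflexive rows (subject == object).
    keep <- keep & edges$subject != edges$object
  }
  hits <- edges[keep, ]
  data.frame(id = hits$subject, ancestor_id = hits$object,
             predicate = hits$predicate, stringsAsFactors = FALSE)
}

isa       <- toy_ancestors("T:3", "rdfs:subClassOf")
both      <- toy_ancestors("T:3", c("rdfs:subClassOf", "BFO:0000050"))
with_self <- toy_ancestors("T:3", "rdfs:subClassOf", include_self = TRUE)
```

A descendant lookup would use the same filter with the roles of subject and object swapped.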
Lightweight predicate used in examples and tests to guard against running query code when no connection has been established and no automatic reconnection is desired.
go_connection_active()
logical scalar.
Uses the precomputed transitive closure in the semsql
entailed_edge table to find all descendants of the supplied
GO CURIEs under the specified relations. Returns a long-format lazy
tibble.
go_descendants(
  ids,
  relations = unname(GO_RELATIONS[c("is_a", "part_of")]),
  include_self = FALSE
)
ids |
character vector of GO CURIEs. |
relations |
character vector of predicate CURIEs to traverse.
Defaults to |
include_self |
logical. If |
a lazy tbl_duckdb with columns:
the query term CURIE
descendant term CURIE
predicate CURIE
Call dplyr::collect() to materialize.
make_go_con()
go_descendants("GO:0006950") |> dplyr::collect()
# count descendants per ontology namespace
go_descendants("GO:0008150") |>
  dplyr::left_join(
    go_terms() |> dplyr::select(descendant_id = id, ontology),
    by = "descendant_id"
  ) |>
  dplyr::count(ontology) |>
  dplyr::collect()
disconnect_go()
The entailed_edge table contains the precomputed transitive
closure of all object property relations in GO, including
rdfs:subClassOf (is_a) and BFO:0000050 (part_of).
It is the basis for go_ancestors and
go_descendants.
go_entailed_edges(con = NULL, schema = NULL)
con |
optional |
schema |
optional character schema name. If |
Self-edges (subject == object) encode reflexivity under the
transitive closure and are excluded by default in
go_ancestors and go_descendants.
a lazy tbl_duckdb with columns
subject, predicate, object.
make_go_con()
go_entailed_edges()
disconnect_go()
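To make the relationship between direct edges and a reflexive transitive closure concrete, the sketch below derives the closure of a tiny single-relation graph by repeated self-joins in base R. It is only an illustration: the semsql build computes entailed edges with its own tooling, and `close_edges()` and the `T:` identifiers are hypothetical.

```r
# Direct edges of a toy hierarchy: T:3 -> T:2 -> T:1 (invented IDs).
direct <- data.frame(
  subject = c("T:3", "T:2"),
  object  = c("T:2", "T:1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: reflexive transitive closure by fixpoint iteration.
close_edges <- function(direct) {
  nodes <- unique(c(direct$subject, direct$object))
  # Start from the direct edges plus one self-edge per node.
  self <- data.frame(subject = nodes, object = nodes,
                     stringsAsFactors = FALSE)
  closure <- unique(rbind(direct, self))
  repeat {
    # Join a -> b with b -> c to propose a -> c.
    left  <- data.frame(a = closure$subject, b = closure$object,
                        stringsAsFactors = FALSE)
    right <- data.frame(b = closure$subject, c = closure$object,
                        stringsAsFactors = FALSE)
    step <- merge(left, right, by = "b")
    proposed <- data.frame(subject = step$a, object = step$c,
                           stringsAsFactors = FALSE)
    grown <- unique(rbind(closure, proposed))
    if (nrow(grown) == nrow(closure)) break  # nothing new: fixpoint reached
    closure <- grown
  }
  closure
}

closure <- close_edges(direct)
n_self  <- sum(closure$subject == closure$object)
```

The result holds the two direct edges, the inferred edge from T:3 to T:1, and three self-edges, which are the rows that go_ancestors and go_descendants exclude by default.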
A named character vector mapping human-readable relation names to their
CURIE representations as they appear in the semsql entailed_edge
table. Counts are from a representative 2024 GO build:
GO_RELATIONS
An object of class character of length 5.
rdfs:subClassOf (is_a): 1,360,314 entailed edges
BFO:0000050 (part_of): 353,639 edges
BFO:0000051: 1,038,218 edges (inverse of part_of)
BFO:0000066: 26,234 edges (biological process)
RO:0001025: 1,459 edges (cellular component)
Pass one or more values from this vector as the relations argument
to go_ancestors and go_descendants.
GO_RELATIONS
GO_RELATIONS["is_a"]
unname(GO_RELATIONS[c("is_a", "part_of")])
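The difference between the named and unnamed forms used above is ordinary R vector behaviour, shown here on a toy copy. The vector `relations` is a stand-in that holds only the two mappings stated in this manual (is_a and part_of); the real GO_RELATIONS has five elements.

```r
# Toy copy with the two mappings documented above; not the full vector.
relations <- c(is_a = "rdfs:subClassOf", part_of = "BFO:0000050")

named <- relations["is_a"]                        # keeps the name "is_a"
bare  <- unname(relations[c("is_a", "part_of")])  # plain character vector

names(named)  # "is_a"
names(bare)   # NULL
```

Either form carries the same CURIE values; `unname()` only drops the human-readable labels.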
The statements table contains all RDF triples from the GO OWL
source, including term labels (rdfs:label), definitions
(IAO:0000115), namespaces (oio:hasOBONamespace), and
deprecation flags (owl:deprecated).
go_statements(con = NULL, schema = NULL)
con |
optional |
schema |
optional character schema name. If |
a lazy tbl_duckdb with columns
stanza, subject, predicate, object, value, datatype, language.
make_go_con()
go_statements()
disconnect_go()
Retrieves all synonym types for GO terms from the semsql
statements table. Four synonym scopes are recognised by the
OBO format and all are present in the GO semsql build:
hasExactSynonym, hasRelatedSynonym,
hasNarrowSynonym, and hasBroadSynonym.
go_synonyms(ids = NULL, types = c("exact", "related", "narrow", "broad"))
ids |
optional character vector of GO CURIEs to restrict results.
If |
types |
character vector of synonym scopes to include. Default
includes all four. Elements must be one or more of
|
a lazy tbl_duckdb with columns:
GO CURIE
synonym string
one of "exact", "related",
"narrow", "broad"
Call dplyr::collect() to materialize.
GO.ddb::make_go_con()
# All synonyms for a term
GO.ddb::go_synonyms("GO:0006954") |> dplyr::collect()
# Exact synonyms only across all terms
GO.ddb::go_synonyms(types = "exact") |> dplyr::collect()
# Synonyms for multiple terms
GO.ddb::go_synonyms(c("GO:0006954", "GO:0008150")) |> dplyr::collect()
GO.ddb::disconnect_go()
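One way to picture the mapping from the four synonym predicates to the scope labels accepted by `types` is the base R sketch below. The helper `scope_of()`, the toy rows, the optional `oio:` prefix, and the output column names are assumptions for illustration; the package may derive the labels differently.

```r
# Hypothetical helper: turn "hasExactSynonym" (optionally prefixed, e.g.
# "oio:hasExactSynonym") into the scope label "exact".
scope_of <- function(predicate) {
  tolower(sub("^(.*:)?has(.*)Synonym$", "\\2", predicate))
}

# Invented rows standing in for synonym statements.
synonyms <- data.frame(
  id        = c("T:1", "T:1", "T:2"),
  predicate = c("hasExactSynonym", "hasBroadSynonym", "oio:hasRelatedSynonym"),
  synonym   = c("toy synonym a", "toy synonym b", "toy synonym c"),
  stringsAsFactors = FALSE
)
synonyms$type <- scope_of(synonyms$predicate)

# Restrict to requested scopes, as a types = "exact" call would.
exact_only <- synonyms[synonyms$type %in% "exact", c("id", "synonym", "type")]
```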
Reconstructs term labels, definitions, ontology namespace (BP/MF/CC),
and deprecation status from the semsql statements table.
Filters to GO-prefixed identifiers, excluding imported terms from
Uberon, CHEBI, RO, and other ontologies present in the GO OWL source.
go_terms(include_deprecated = FALSE, con = NULL, schema = NULL)
include_deprecated |
logical. If |
con |
optional |
schema |
optional character schema name for testing. |
Uses %like% rather than startsWith() for the GO prefix
filter, because dbplyr does not translate startsWith() to DuckDB's
starts_with() function and the untranslated call causes a catalog error.
a lazy tbl_duckdb with columns:
GO CURIE, e.g. "GO:0006954"
human-readable term name
IAO:0000115 term definition
namespace string: "biological_process",
"molecular_function", or "cellular_component"
logical
make_go_con()
# All non-deprecated terms
go_terms()
# Biological process terms only
go_terms() |>
  dplyr::filter(ontology == "biological_process") |>
  dplyr::collect()
# Include deprecated terms
go_terms(include_deprecated = TRUE) |>
  dplyr::filter(deprecated) |>
  dplyr::select(id, label) |>
  dplyr::collect()
disconnect_go()
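The reconstruction of a term table from triples can be sketched in base R on a few invented rows. The predicates are those named for the statements table (rdfs:label, IAO:0000115, oio:hasOBONamespace, owl:deprecated); the identifiers and values are made up, the helper `field()` is hypothetical, and the use of the `value` column for literals is an assumption. The sketch runs on an in-memory data.frame, so `startsWith()` is safe here, unlike in a dbplyr query.

```r
# Invented triples: two GO-prefixed subjects and one imported term.
statements <- data.frame(
  subject   = c("GO:9000001", "GO:9000001", "GO:9000001",
                "GO:9000002", "GO:9000002", "UBERON:9000003"),
  predicate = c("rdfs:label", "IAO:0000115", "oio:hasOBONamespace",
                "rdfs:label", "owl:deprecated", "rdfs:label"),
  value     = c("toy term", "toy definition", "biological_process",
                "obsolete toy term", "true", "imported toy term"),
  stringsAsFactors = FALSE
)

# Keep GO-prefixed subjects only, dropping imported terms.
go_only <- statements[startsWith(statements$subject, "GO:"), ]

# Hypothetical helper: values of one predicate, named by subject.
field <- function(predicate) {
  rows <- go_only[go_only$predicate == predicate, ]
  setNames(rows$value, rows$subject)
}

ids <- unique(go_only$subject)
terms <- data.frame(
  id         = ids,
  label      = unname(field("rdfs:label")[ids]),
  definition = unname(field("IAO:0000115")[ids]),
  ontology   = unname(field("oio:hasOBONamespace")[ids]),
  deprecated = ids %in% names(which(field("owl:deprecated") == "true")),
  stringsAsFactors = FALSE
)

# Analogue of include_deprecated = FALSE.
current <- terms[!terms$deprecated, ]
```

Subjects that lack a given predicate simply get NA in that column, which is why the deprecated toy term has no definition or namespace.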
Test whether a local parquet cache exists for an ontology
Checks that the parquet cache directory exists and contains at least
the two required files: statements.parquet and
entailed_edge.parquet.
has_parquet_cache(ontology = "go")
ontology |
character scalar. Default |
logical scalar.
has_parquet_cache()
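The check described above (directory present, both required files present) can be approximated in base R as follows. `cache_is_complete()` is a hypothetical stand-in that takes an explicit directory; the real function resolves the BiocFileCache-managed location from `ontology`, which this sketch does not attempt. Empty files are created only to exercise the check.

```r
# Hypothetical stand-in for the completeness check; not the package's code.
cache_is_complete <- function(dir,
                              required = c("statements.parquet",
                                           "entailed_edge.parquet")) {
  dir.exists(dir) && all(file.exists(file.path(dir, required)))
}

# Exercise the check in a throwaway directory.
demo_dir <- tempfile("toy_parquet_cache")
dir.create(demo_dir)
before <- cache_is_complete(demo_dir)   # directory exists, files missing

# Empty placeholder files, not real parquet data.
invisible(file.create(file.path(demo_dir, c("statements.parquet",
                                            "entailed_edge.parquet"))))
after <- cache_is_complete(demo_dir)    # both required files present

unlink(demo_dir, recursive = TRUE)
```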
Maps one or more GO CURIEs to their term label, definition, or ontology
namespace. Returns a lazy tibble — call dplyr::collect() to
materialize results.
lookup_curie(curies, mapto = c("term", "definition", "ontology", "all"))
curies |
character vector of GO CURIEs in the form
|
mapto |
character scalar, one of:
Partial matching is supported via |
a lazy tbl_duckdb with column id and the
requested field(s). Call dplyr::collect() to materialize.
make_go_con()
lookup_curie(c("GO:0006954", "GO:0008150"), mapto = "term") |> dplyr::collect()
lookup_curie("GO:0006954", mapto = "all") |> dplyr::collect()
# partial matching works
lookup_curie("GO:0006954", mapto = "def") |> dplyr::collect()
disconnect_go()
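The partial matching shown in the last example behaves like base R's `match.arg()`, which resolves a unique prefix against the allowed choices. Whether lookup_curie uses `match.arg()` internally is an assumption here, and `resolve_mapto()` is a hypothetical helper that only demonstrates the mechanism.

```r
# Hypothetical helper mirroring the mapto choices of lookup_curie().
resolve_mapto <- function(mapto = c("term", "definition", "ontology", "all")) {
  match.arg(mapto)
}

resolve_mapto("def")  # unique prefix, resolves to "definition"
resolve_mapto()       # no argument, falls back to the first choice
```

A value that matches no choice raises an error rather than being silently ignored.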
Retrieves GO data either from a local parquet cache or from the semsql
SQLite file managed by ontoProc2::semsql_connect(), then loads
it into an in-process DuckDB instance.
make_go_con(ontology = "go", backend = c("auto", "parquet", "sqlite"))
ontology |
character scalar. Default |
backend |
one of:
|
NULL invisibly.
build_parquet_cache, has_parquet_cache,
disconnect_go
# auto selects parquet if available, SQLite otherwise
make_go_con()
go_connection_active()
disconnect_go()
# force parquet (must have run build_parquet_cache() first)
if (has_parquet_cache()) {
  make_go_con(backend = "parquet")
  disconnect_go()
}
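The backend choice described in the example comments can be summarized as a small decision function. `choose_backend()` and its `cache_available` argument are hypothetical, and raising an error when parquet is forced without a cache is an assumption; the package may handle that case differently.

```r
# Hypothetical sketch of the "auto" rule: parquet if cached, else SQLite.
choose_backend <- function(backend = c("auto", "parquet", "sqlite"),
                           cache_available) {
  backend <- match.arg(backend)
  if (backend == "auto") {
    return(if (cache_available) "parquet" else "sqlite")
  }
  if (backend == "parquet" && !cache_available) {
    # Assumed behaviour: forcing parquet needs a previously built cache.
    stop("no parquet cache found; build it first")
  }
  backend
}

auto_cached   <- choose_backend("auto", cache_available = TRUE)
auto_uncached <- choose_backend("auto", cache_available = FALSE)
forced_sqlite <- choose_backend("sqlite", cache_available = TRUE)
```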
Provides a familiar interface for users migrating from
AnnotationDbi::select(GO.db, ...). Unlike the lazy tibbles
returned by go_terms and lookup_curie,
this function returns an eager data.frame to match the
AnnotationDbi contract.
select_go(
  keys,
  columns = c("TERM", "DEFINITION", "ONTOLOGY"),
  keytype = "GOID"
)
keys |
character vector of GO CURIEs (e.g. |
columns |
character vector of columns to return. Valid values:
|
keytype |
character scalar. Only |
a data.frame with column GOID and the requested
additional columns, in the same format as
AnnotationDbi::select(GO.db, ...).
GO.ddb::make_go_con()
# Direct replacement for AnnotationDbi::select(GO.db, ...)
GO.ddb::select_go(
  keys = c("GO:0006954", "GO:0008150"),
  columns = c("TERM", "ONTOLOGY")
)
# With synonyms
GO.ddb::select_go(
  keys = "GO:0006954",
  columns = c("TERM", "DEFINITION", "ONTOLOGY", "SYNONYM")
)
GO.ddb::disconnect_go()
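For readers curious how a collected term table relates to the AnnotationDbi-style result, the sketch below renames columns and abbreviates namespaces on a toy data.frame. `to_select_format()` is hypothetical, the `definition` column name is assumed, and the abbreviation to BP/MF/CC is an assumption based on the GO.db convention; the actual conversion inside select_go may differ.

```r
# Hypothetical conversion from tidy column names to GO.db-style names.
to_select_format <- function(df) {
  col_map <- c(id = "GOID", label = "TERM",
               definition = "DEFINITION", ontology = "ONTOLOGY")
  ns_map  <- c(biological_process = "BP",
               molecular_function = "MF",
               cellular_component = "CC")
  if ("ontology" %in% names(df)) {
    # Assumed: namespaces are abbreviated as in GO.db.
    df$ontology <- unname(ns_map[df$ontology])
  }
  names(df) <- unname(col_map[names(df)])
  df
}

# Toy eager result with the tidy column names.
tidy <- data.frame(
  id       = "GO:0006954",
  label    = "inflammatory response",
  ontology = "biological_process",
  stringsAsFactors = FALSE
)
converted <- to_select_format(tidy)
```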