---
title: "ontoProc2 -- leveraging semantic SQL for ontology analysis in Bioconductor"
author: "Vincent J. Carey, stvjc at channing.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{ontoProc2 -- leveraging INCAtools semantic SQL for ontology analysis in Bioconductor}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
---

# Introduction

The ontoProc2 package has two aims:

- to give convenient access to the ontologies that have been transformed to "semantic SQL" in the [INCAtools semantic SQL project](https://github.com/INCATools/semantic-sql);
- to simplify operations that have been available in [ontoProc](https://git.bioconductor.org/packages/ontoProc), which will be deprecated in 2027.

This package is a "second generation" approach to ontology management
and processing for Bioconductor.  The ontoProc package was introduced in
2018 and took a number of experimental
approaches to interface design and content caching.  ontoProc2
streamlines the identification and management of diverse ontologies
by building directly on semantic SQL concepts.

For example, with ontoProc the call `ontoProc::getOnto('cellOnto')` produces
an `ontology_index` instance.  To work with the Cell Ontology in ontoProc2,
use `semsql_connect(ontology = "cl")`.  The abbreviated names for ontologies follow
those of the Open Biological and Biomedical Ontologies ([OBO](https://github.com/OBOFoundry/)) Foundry.

# Installation

Prior to the release of Bioconductor 3.24, use `BiocManager::install("vjcitn/ontoProc2")`.

For Bioconductor 3.24 and subsequent versions, use `BiocManager::install("ontoProc2")`.

# Acquiring ontologies

## Make a connection

The best way to work with an ontology in this system is
to use `semsql_connect`.  The `ontology` argument is
a short string that the INCAtools project uses as part
of the filename for the ontology.  For the Gene Ontology, the
string is `"go"`.

```{r doini, message=FALSE}
library(ontoProc2)
goss <- semsql_connect(ontology = "go")
goss
```
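
The short string maps onto a file name in the INCAtools distribution. The sketch below shows one way such a mapping could look. `semsql_filename` and `semsql_url` are hypothetical helpers, not ontoProc2 functions, and the base URL is an assumption based on the download location advertised by the semantic-sql project; `semsql_connect` may resolve files differently.

```r
# Hypothetical helpers for illustration; they are not part of ontoProc2.
semsql_filename <- function(ontology) {
  stopifnot(is.character(ontology), length(ontology) == 1L, nzchar(ontology))
  # OBO Foundry short names are conventionally written in lower case
  paste0(tolower(ontology), ".db")
}

# Assumed download location; semsql_connect() may resolve files differently.
semsql_url <- function(ontology,
                       base = "https://s3.amazonaws.com/bbop-sqlite") {
  paste(base, semsql_filename(ontology), sep = "/")
}

go_file <- semsql_filename("go")
cl_url  <- semsql_url("cl")
```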
## Make a report

The `report` method provides details.
```{r lkrep1}
report(goss)
```

## Probe the back end

The back end is SQLite.  We can enumerate the
tables available:
```{r lktbs}
library(dplyr)
library(DBI)
allt <- dbListTables(goss@con)
length(allt)
head(allt)
```

## Use tidy methods

Individual tables are readily accessible.

```{r lkti,message=FALSE}
library(DT)
tbl(goss@con, "statements")
tbl(goss@con, "statements") |>
  head(20) |>
  as.data.frame() |>
  datatable()
```
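
Because the connection is a DBI connection, `tbl()` returns a lazy reference: dplyr verbs are translated to SQL, and rows are read into R only when the result is collected. The following sketch illustrates this on a small in-memory SQLite table. The rows and the `EX:` identifiers are invented, and the RSQLite and dbplyr packages are assumed to be installed.

```r
library(DBI)
library(dplyr)

# Toy stand-in for the 'statements' table; all rows are invented.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "statements", data.frame(
  subject   = c("EX:1", "EX:2", "EX:2"),
  predicate = c("rdfs:label", "rdfs:label", "rdfs:subClassOf"),
  object    = c(NA, NA, "EX:1"),
  value     = c("thing", "special thing", NA)
))

# Build the query lazily; no rows are fetched yet.
label_query <- tbl(con, "statements") |>
  filter(predicate == "rdfs:label") |>
  select(subject, value)

show_query(label_query)              # print the SQL that dplyr generated
label_rows <- collect(label_query)   # run the query and bring rows into R

dbDisconnect(con)
```

Deferring `collect()` keeps filtering in the database, which matters for large tables such as `statements`.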

To investigate the ontology, searching through RDF labels is a natural approach.

```{r dosrc}
search_labels(goss, "apoptosis") |>
  head() |>
  datatable()
```

Additional filtering is useful here to focus on GO terms.  The `_riog...`
subjects have special roles in RDF inference; they will be addressed in
vignettes to be added in the future.

Let's improve the query:
```{r dosrc2}
search_labels(goss, "apoptosis") |>
  filter(grepl("^GO:", subject)) |>
  head() |>
  datatable()
```
Clearly it will be valuable to filter away obsolete terms.  We will investigate
the use of edge tables to accomplish this in a future vignette.
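
Until then, a rough label-based filter is possible. The sketch below works on an invented data frame with the same `subject` and `label` columns as the search result. It relies on the GO convention of prefixing the labels of obsolete terms with "obsolete", which is only a heuristic and not the edge-table approach mentioned above.

```r
# Toy search result; identifiers and labels are invented for illustration.
hits <- data.frame(
  subject = c("GO:9000001", "_:b0", "GO:9000002", "EX:0000003"),
  label   = c("toy apoptosis process", "toy apoptosis process",
              "obsolete toy apoptosis", "toy apoptosis assay")
)

is_go       <- startsWith(hits$subject, "GO:")   # keep GO CURIEs only
is_obsolete <- grepl("^obsolete ", hits$label)   # heuristic on the label text

current_go <- hits[is_go & !is_obsolete, ]
current_go
```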


# Transformation to ontology_index instances

The [ontologyX suite](https://academic.oup.com/bioinformatics/article/33/7/1104/2843897) of
Daniel Greene and colleagues provides very convenient ontology handling functions.
We can transform the SQLite data to this format.  We'll illustrate with the Cell Ontology.

```{r lkoi, cache=TRUE}
clss <- semsql_connect(ontology = "cl")
cloi <- semsql_to_oi(clss@con)
cloi
```

A convenience function assists with visualizations:

```{r dopl}
onto_plot2(cloi, c("CL:0000624", "CL:0000492", "CL:0000793", "CL:0000803"))
```
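
To show the kind of structure that results, here is a minimal sketch that builds a small `ontology_index` by hand from a toy table of subclass edges. It does not reproduce what `semsql_to_oi` does internally, and the `EX:` identifiers and names are invented. Functions are called with the `ontologyIndex::` prefix because ontoProc2 also provides a `get_ancestors` function.

```r
# Toy subclass edges (child -> parent); identifiers are invented.
edges <- data.frame(
  subject = c("EX:2", "EX:3", "EX:4"),
  object  = c("EX:1", "EX:1", "EX:2")
)
ids <- union(edges$object, edges$subject)

# ontology_index() expects a named list giving the parents of each term;
# the root term gets an empty character vector.
parents <- lapply(setNames(nm = ids),
                  function(id) edges$object[edges$subject == id])

toy <- ontologyIndex::ontology_index(
  parents = parents,
  name    = setNames(c("root", "branch", "leaf a", "leaf b"), ids)
)

# Ancestors of a term; the result includes the term itself.
anc <- ontologyIndex::get_ancestors(toy, "EX:4")
anc
```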

# Background

The S7 class design in this package was initiated by a request to Anthropic Claude
to use S7 in establishing code that mirrors the tasks accomplished in the
[INCAtools Jupyter notebook](https://github.com/INCATools/semantic-sql/blob/main/notebooks/SemanticSQL-Tutorial.ipynb).

## Searching in label text

The code of `search_labels` is:
```{r lksrch}
library(S7)
method(search_labels, SemsqlConn)
```

## Exploring concept properties with 'edge tables'

The INCAtools notebook discusses the fact that `rdfs_label_statement` is a SQLite "view" rather than a stored table.
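
A view is a stored query that can be read like a table. The sketch below builds a toy `statements` table in memory, defines a label view over it, and checks the SQLite catalog. The view definition is an assumption about how such a view could be written, not a copy of the semantic SQL schema, and the rows are invented. RSQLite is assumed to be installed.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy stand-in for the 'statements' table; all rows are invented.
dbWriteTable(con, "statements", data.frame(
  subject   = c("EX:1", "EX:2", "EX:2"),
  predicate = c("rdfs:label", "rdfs:label", "rdfs:subClassOf"),
  object    = c(NA, NA, "EX:1"),
  value     = c("thing", "special thing", NA)
))

# Assumed definition: the view selects only the label statements.
dbExecute(con, "
  CREATE VIEW rdfs_label_statement AS
  SELECT * FROM statements WHERE predicate = 'rdfs:label'")

# The SQLite catalog records whether each object is a table or a view.
catalog <- dbGetQuery(con,
  "SELECT name, type FROM sqlite_master ORDER BY name")

# A view is queried exactly like a table.
labels <- dbGetQuery(con, "SELECT subject, value FROM rdfs_label_statement")

dbDisconnect(con)
```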

The notebook indicates that a SPARQL query on an RDF store for the following computation
would be "quite hard".  We want to find all the "edges" leading from "enteric neuron", that is,
the set of subject-predicate-object statements that have "enteric
neuron" as subject.

In this code we use the concept of a "CURIE" (Compact Uniform Resource Identifier):
a prefix indicating the source ontology in which the concept is defined,
followed by a colon and a local identifier.  In OBO ontologies the local identifier
is typically a fixed-length number, as in `CL:0000624`.
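
As a small illustration of this structure, the base R sketch below splits CURIEs into prefix and local identifier and expands an OBO CURIE to its PURL. `split_curie` and `curie_to_obo_iri` are hypothetical helpers, not ontoProc2 functions, and the expansion rule applies only to OBO-style prefixes.

```r
# Hypothetical helpers for illustration; they are not part of ontoProc2.
split_curie <- function(curie) {
  stopifnot(is.character(curie), all(grepl(":", curie, fixed = TRUE)))
  data.frame(
    curie    = curie,
    prefix   = sub(":.*$", "", curie),      # text before the first colon
    local_id = sub("^[^:]*:", "", curie)    # text after the first colon
  )
}

# OBO PURLs replace the colon with an underscore after a fixed base.
curie_to_obo_iri <- function(curie) {
  paste0("http://purl.obolibrary.org/obo/", sub(":", "_", curie, fixed = TRUE))
}

parts <- split_curie(c("CL:0000624", "GO:0006915"))
iri   <- curie_to_obo_iri("CL:0000624")
```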

```{r doentr}
if (!is_connected(clss)) clss <- reconnect(clss)
entcurie <- search_labels(clss, "enteric neuron") |>
  filter(grepl("^CL:", subject)) |>
  dplyr::select(subject) |>
  unlist()
entcurie
get_direct_edges(clss, entcurie)
```

Here the underlying code is performing a join:
```{r lkmeth}
method(get_direct_edges, SemsqlConn)
```
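
For readers less familiar with SQL joins, the following sketch shows the general idea on toy tables: edges for one subject are joined to a label table so that each object CURIE is shown with its label. This is an illustration under invented data, not the body of the ontoProc2 method, and RSQLite is assumed to be installed.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy edge and label tables; identifiers and labels are invented.
dbWriteTable(con, "edge", data.frame(
  subject   = c("EX:2", "EX:2", "EX:3"),
  predicate = c("rdfs:subClassOf", "EX:part_of", "rdfs:subClassOf"),
  object    = c("EX:1", "EX:9", "EX:1")
))
dbWriteTable(con, "rdfs_label_statement", data.frame(
  subject = c("EX:1", "EX:2", "EX:9"),
  value   = c("thing", "special thing", "container")
))

# Join each edge to the label of its object; LEFT JOIN keeps edges
# whose object has no label. The subject is passed as a bound parameter.
labelled <- dbGetQuery(con, "
  SELECT e.subject, e.predicate, e.object, l.value AS object_label
  FROM edge AS e
  LEFT JOIN rdfs_label_statement AS l ON e.object = l.subject
  WHERE e.subject = ?
  ORDER BY e.object", params = list("EX:2"))

dbDisconnect(con)
```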

## Generalizing a concept: Ancestors

The notebook mentions that the "entailed edges" table (`entailed_edge`) includes
all statements that can be inferred from the application of
base axioms of the ontology.

```{r lkent}
get_ancestors(clss, entcurie)
```
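
The relation between direct and entailed edges can be pictured as a transitive closure. The base R sketch below computes that closure for a toy chain of subclass edges. It is only an illustration with invented identifiers; the real table is produced upstream from the ontology's axioms and covers more than subclass chains.

```r
# Toy direct subclass edges forming a chain EX:4 -> EX:3 -> EX:2 -> EX:1.
direct <- data.frame(
  subject = c("EX:4", "EX:3", "EX:2"),
  object  = c("EX:3", "EX:2", "EX:1")
)

# Repeatedly join known edges to direct edges until no new pairs appear.
entail <- function(direct) {
  closure <- unique(direct)
  repeat {
    left  <- setNames(closure, c("subject", "mid"))
    right <- setNames(direct, c("mid", "object"))
    step  <- merge(left, right, by = "mid")[, c("subject", "object")]
    grown <- unique(rbind(closure, step))
    if (nrow(grown) == nrow(closure)) break
    closure <- grown
  }
  closure
}

entailed <- entail(direct)

# Every ancestor of EX:4 is now one lookup away.
ancestors_of_4 <- sort(entailed$object[entailed$subject == "EX:4"])
ancestors_of_4
```

With the closure stored, finding ancestors is a simple filter rather than a recursive traversal.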

## Working with multiple ontologies

The INCAtools notebook includes an example of finding all neurons
that are part of the forebrain.  This involves identifying
CURIEs for relations and anatomical structures, thus working
with the Relation Ontology (RO) and UBERON.

```{r getmore}
ub <- semsql_connect(ontology = "uberon")
ro <- semsql_connect(ontology = "ro")
```

First question: What's the CURIE for "forebrain" in UBERON?
```{r lkub}
fbcur <- search_labels(ub, "forebrain", limit = 1000) |>
  filter(label == "forebrain") |>
  select(subject) |>
  unlist()
fbcur
```
Second question: What's the CURIE for "has soma location" in RO?
```{r lkro}
loccur <- search_labels(ro, "has soma location") |>
  select(subject) |>
  unlist()
loccur
```
Third question: What's the CURIE for "neuron" in CL?
```{r lkcurn}
ncur <- search_labels(clss, "neuron", limit = 1000) |>
  filter(label == "neuron") |>
  select(subject) |>
  unlist()
ncur
```

Now we use three steps to obtain the solution.

First, enumerate all cell types that are located in forebrain.
```{r infb}
clinfb <- tbl(clss@con, "entailed_edge") |>
  filter(predicate == loccur, object == fbcur) |>
  select(subject) |>
  collect() |>
  unlist()
length(clinfb)
```

Second, filter these to those related to "neuron" by `rdfs:subClassOf`.
```{r isne}
clisneur <- tbl(clss@con, "entailed_edge") |>
  filter(predicate == "rdfs:subClassOf", object == ncur) |>
  filter(subject %in% clinfb) |>
  select(subject) |>
  collect() |>
  unlist()
length(clisneur)
```

Finally, get the labels.
```{r doint}
tbl(clss@con, "rdfs_label_statement") |>
  filter(subject %in% clisneur) |>
  select(subject, value) |>
  collect() |>
  DT::datatable()
```
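
The steps above amount to intersecting two sets of subjects drawn from the same edge table. A minimal base R sketch on a toy edge table follows; the `EX:` identifiers are invented stand-ins for the CL, RO and UBERON CURIEs, and `subjects_with` is a hypothetical helper.

```r
# Toy entailed edges; identifiers are invented for illustration.
entailed <- data.frame(
  subject   = c("EX:a", "EX:a", "EX:b", "EX:c"),
  predicate = c("EX:located_in", "rdfs:subClassOf",
                "EX:located_in", "rdfs:subClassOf"),
  object    = c("EX:region", "EX:kind", "EX:region", "EX:kind")
)

# Hypothetical helper: subjects having a given predicate and object.
subjects_with <- function(edges, predicate, object) {
  unique(edges$subject[edges$predicate == predicate & edges$object == object])
}

in_region <- subjects_with(entailed, "EX:located_in", "EX:region")
of_kind   <- subjects_with(entailed, "rdfs:subClassOf", "EX:kind")

# Subjects that satisfy both conditions.
both <- intersect(in_region, of_kind)
both
```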

# Session information

```{r lksess}
sessionInfo()
```