---
title: "Integrative Multi-Omics Analysis with MultiOmicsBridge"
author:
  - name: "Subhadip Jana"
    email: "subhadipjana1409@gmail.com"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Integrative Multi-Omics Analysis with MultiOmicsBridge}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    collapse  = TRUE,
    comment   = "#>",
    fig.width = 7,
    fig.height = 5,
    message   = TRUE,
    warning   = FALSE
)
```

# Introduction

Multi-omics integration is now central to mechanistic studies of complex
diseases. Combining host transcriptomics (bulk RNA-seq) with gut microbiome
profiling (16S rRNA or metagenomics) has revealed key host-microbe
interactions in inflammatory bowel disease, tuberculosis, HIV, and metabolic
syndrome. Yet the computational toolkit for this specific integration
scenario remains fragmented: no single Bioconductor package takes a
researcher from raw paired data through joint dimensionality reduction,
biomarker discovery, and diagnostic classification.

`MultiOmicsBridge` closes this gap. The package provides a unified,
reproducible workflow with five modules:

| Module | Functions | Purpose |
|---|---|---|
| 1. Harmonization | `loadHostData()`, `loadMicrobiomeData()`, `matchSamples()` | Import, normalize, pair data |
| 2. Joint Dim. Reduction | `jointDimReduction()` | DIABLO sparse multi-block PLS-DA |
| 3. Biomarker Discovery | `biomarkerDiscovery()` | Cross-omics ranked biomarkers |
| 4. Classification | `diagnosticClassifier()` | Host-only vs microbiome-only vs joint |
| 5. Visualization | `plotIntegration()`, `plotBiomarkerNetwork()`, `plotClassifierComparison()`, `plotSankey()` | Publication figures |

The one-call wrapper `MultiOmicsBridgeAnalysis()` executes all five modules
and returns a structured `MOBResult` S4 object.

# Installation

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MultiOmicsBridge")
```

# Simulated Multi-Omics Dataset

We demonstrate the package on a simulated dataset that mimics a real
paired study: 20 samples (10 healthy controls, 10 disease cases), 500 host
genes, and 80 microbial taxa. To model a biologically realistic signal, we
inject upregulation of the first 20 genes and enrichment of the first 5
taxa in the disease group.

```{r simulate}
library(MultiOmicsBridge)
library(SummarizedExperiment)

set.seed(2026)
n_genes   <- 500
n_taxa    <- 80
n_samples <- 20

# Host RNA-seq count matrix (genes x samples)
host_counts <- matrix(
    rpois(n_genes * n_samples, lambda = 150L),
    nrow = n_genes, ncol = n_samples
)
rownames(host_counts) <- paste0("Gene",   seq_len(n_genes))
colnames(host_counts) <- paste0("Sample", seq_len(n_samples))

# Inject transcriptional signal: genes 1-20 upregulated in disease
host_counts[seq_len(20), seq(11, n_samples)] <-
    host_counts[seq_len(20), seq(11, n_samples)] * 5L

# Microbiome count matrix (taxa x samples)
mb_counts <- matrix(
    rpois(n_taxa * n_samples, lambda = 40L),
    nrow = n_taxa, ncol = n_samples
)
rownames(mb_counts) <- paste0("Taxon",  seq_len(n_taxa))
colnames(mb_counts) <- paste0("Sample", seq_len(n_samples))

# Inject microbial enrichment: taxa 1-5 enriched in disease
mb_counts[seq_len(5), seq(11, n_samples)] <-
    mb_counts[seq_len(5), seq(11, n_samples)] * 4L

# Sample metadata
col_data <- data.frame(
    condition = rep(c("Control", "Disease"), each = 10),
    age       = sample(30:65, n_samples, replace = TRUE),
    sex       = sample(c("M", "F"), n_samples, replace = TRUE),
    row.names = paste0("Sample", seq_len(n_samples))
)

cat(sprintf("Host:       %d genes × %d samples\n", nrow(host_counts), ncol(host_counts)))
cat(sprintf("Microbiome: %d taxa  × %d samples\n", nrow(mb_counts), ncol(mb_counts)))
```

# Module 1: Data Harmonization

## Step 1.1 — Load host RNA-seq data

`loadHostData()` applies TMM normalization (via `edgeR`) followed by
limma-voom precision weighting. Both raw counts and log2-CPM values are
stored in the output `SummarizedExperiment`.

```{r load_host}
host_se <- loadHostData(host_counts, col_data = col_data)
host_se
assayNames(host_se)
```

## Step 1.2 — Load microbiome data

`loadMicrobiomeData()` applies centered log-ratio (CLR) transformation.
This is the compositionally appropriate normalization that removes the
unit-sum constraint inherent to relative abundance data.

```{r load_mb}
mb_se <- loadMicrobiomeData(mb_counts, normalization = "CLR",
                             min_prevalence = 0.1)
mb_se
assayNames(mb_se)

# Verify CLR: each sample should have zero mean
cat("Column means of CLR matrix (should be ≈ 0):\n")
print(round(colMeans(assay(mb_se, "CLR")), 10)[1:5])
```

## Step 1.3 — Match paired samples

`matchSamples()` finds the intersection of sample names across both SEs
and packages them into a `MultiAssayExperiment` (MAE). Unpaired samples
are transparently reported and excluded.

```{r match_samples}
mae <- matchSamples(host_se, mb_se, min_paired = 10)
mae
```

All 20 samples are paired in this dataset. In real studies, `matchSamples()`
will report and exclude any samples missing from either platform.

# Module 2: Joint Dimensionality Reduction

`jointDimReduction()` runs DIABLO (Data Integration Analysis for Biomarker
discovery using Latent cOmponents) from the `mixOmics` package. DIABLO
simultaneously maximises the covariance between the host and microbiome
latent variates while discriminating between outcome groups, using L1
sparsity to select only the most informative features.

```{r dim_reduction}
outcome <- col_data$condition   # "Control" or "Disease"

dr_result <- jointDimReduction(
    mae,
    outcome         = outcome,
    n_components    = 2L,
    n_features_host = 40L,
    n_features_mb   = 15L,
    design_off_diag = 0.1
)

cat("Score matrix dimensions:", dim(dr_result$scores), "\n")
cat("Host loading matrix dimensions:", dim(dr_result$host_loadings), "\n")
cat("Microbiome loading matrix dimensions:", dim(dr_result$mb_loadings), "\n")
cat("\nExplained variance per component:\n")
print(round(dr_result$explained_variance, 4))
```

# Module 3: Biomarker Discovery

`biomarkerDiscovery()` ranks host genes and microbial taxa using their
DIABLO sparse loading scores (L2 norm across components) and annotates
each with its maximum Spearman correlation to a feature in the other omics
layer. High-loading features with strong cross-omics correlations represent
the most credible multi-omics biomarker candidates.

```{r biomarker_discovery}
bm_table <- biomarkerDiscovery(mae, dr_result, n_biomarkers = 30)

# Show top 10 biomarkers
bm_df <- as.data.frame(bm_table)
bm_df_sorted <- bm_df[order(bm_df$loading_score, decreasing = TRUE), ]
head(bm_df_sorted[, c("feature","omics_layer","loading_score",
                        "component","max_cross_cor","top_partner")], 10)
```

```{r biomarker_summary}
n_host <- sum(bm_df$omics_layer == "host")
n_mb   <- sum(bm_df$omics_layer == "microbiome")
cat(sprintf("Host biomarkers     : %d\n", n_host))
cat(sprintf("Microbiome biomarkers: %d\n", n_mb))

# Are the injected genes recovered?
injected <- paste0("Gene", seq_len(20))
detected <- intersect(bm_df$feature, injected)
cat(sprintf("Injected genes recovered: %d / 20\n", length(detected)))
```

# Module 4: Diagnostic Classification

`diagnosticClassifier()` trains three Random Forest models using nested
cross-validation and compares their AUC-ROC values. This comparison
directly quantifies the added diagnostic value of multi-omics integration.

```{r classifier}
clf_results <- diagnosticClassifier(
    mae,
    outcome         = outcome,
    biomarker_table = bm_table,
    cv_folds        = 5L,
    seed            = 42L
)

cat("Classifier AUC-ROC (mean ± SD across 5-fold CV):\n")
for (nm in c("host_only", "microbiome_only", "joint")) {
    r <- clf_results[[nm]]
    cat(sprintf("  %-22s  %.3f ± %.3f\n",
                paste0(nm, ":"), r$mean_auc, r$sd_auc))
}

delta <- clf_results$joint$mean_auc - clf_results$host_only$mean_auc
cat(sprintf("\nMulti-omics gain vs host-only: +%.3f AUC\n", delta))
```

# Full Pipeline with One Call

For standard analyses, `MultiOmicsBridgeAnalysis()` runs all modules in
sequence and returns a `MOBResult` S4 object:

```{r full_pipeline}
result <- MultiOmicsBridgeAnalysis(
    mae,
    outcome,
    n_components    = 2L,
    n_features_host = 40L,
    n_features_mb   = 15L,
    n_biomarkers    = 30L,
    cv_folds        = 5L,
    seed            = 42L
)

result
```

Access individual results with typed accessor methods:

```{r accessors}
# Integration scores: samples x components
head(integrationScores(result))

# Top biomarkers
head(as.data.frame(biomarkers(result))[, c("feature","omics_layer",
                                             "loading_score")])

# Classifier performance
perf <- performance(result)
cat(sprintf("Joint AUC: %.3f\n", perf$joint$mean_auc))
```

# Module 5: Visualization

## Joint integration biplot

```{r plot_integration, fig.cap="DIABLO sample scores coloured by outcome group. Loading vectors show the top contributing features per omics layer."}
plotIntegration(result, outcome = outcome, n_loading_arrows = 5)
```

## Cross-omics biomarker network

```{r plot_network, fig.cap="Clustered heatmap of Spearman correlations between the top host genes and microbial taxa. Strong positive/negative correlations indicate candidate host-microbe interactions."}
plotBiomarkerNetwork(result, mae, n_host = 15, n_mb = 10)
```

## Classifier comparison (bar chart)

```{r plot_bar, fig.cap="Mean cross-validated AUC-ROC for host-only, microbiome-only, and joint classifiers. The multi-omics advantage is immediately visible."}
plotClassifierComparison(result, type = "bar")
```

## Classifier comparison (ROC curves)

```{r plot_roc, fig.cap="Overlaid ROC curves from the last cross-validation fold for each classifier configuration."}
plotClassifierComparison(result, type = "roc")
```

## Feature flow (Sankey) diagram

```{r plot_sankey, fig.cap="Feature flow from omics layer through selected biomarkers to outcome classes. Edge width is proportional to loading score."}
plotSankey(result, n_features = 8)
```

## Structured text report

```{r report}
generateReport(result, n_top = 8)
```

# Working with Real Data

## HMP2 / IBDMDB dataset

The IBDMDB dataset (Franzosa et al. 2019) is the recommended primary
validation dataset. To use it with MultiOmicsBridge:

```{r real_data, eval=FALSE}
# The HMP2 data is available from the IBDMDB portal.
# After downloading, load as:
library(MultiOmicsBridge)

# Host RNA-seq
host_se <- loadHostData(
    counts   = host_count_matrix,   # genes x samples
    col_data = sample_metadata
)

# Microbiome 16S
mb_se <- loadMicrobiomeData(
    taxa_table    = taxa_count_matrix,   # taxa x samples
    normalization = "CLR"
)

# Match and run
mae    <- matchSamples(host_se, mb_se)
result <- MultiOmicsBridgeAnalysis(mae, outcome = sample_metadata$diagnosis)
result
```

## Complex disease contexts

MultiOmicsBridge is designed to generalize to complex disease contexts.
For tuberculosis studies (e.g. GEO: GSE79362), the same workflow applies:

```{r disease_context, eval=FALSE}
# TB study: blood RNA-seq + sputum microbiome
host_se <- loadHostData(blood_counts, col_data = patient_metadata)
mb_se   <- loadMicrobiomeData(sputum_taxa, normalization = "CLR")
mae     <- matchSamples(host_se, mb_se)

result  <- MultiOmicsBridgeAnalysis(
    mae,
    outcome         = patient_metadata$tb_status,
    n_features_host = 100,
    n_features_mb   = 30,
    cv_folds        = 5
)
plotClassifierComparison(result, type = "bar")
```

# Design Principles

## Why CLR for microbiome data?

Microbiome count data is **compositional**: only relative abundances are
observed. Analyzing composites with Euclidean distances or Pearson
correlations leads to spurious results (the Aitchison problem). The CLR
transformation maps compositional data to real space by:

$$clr(x_j) = \log\left(\frac{x_j + \delta}{g_\delta(x)}\right)$$

where $g_\delta(x) = \exp\left(\frac{1}{D}\sum_{k}\log(x_k + \delta)\right)$
is the geometric mean with pseudocount $\delta$. This removes the unit-sum
constraint and enables standard Euclidean geometry, making correlations
between host genes and microbial taxa statistically valid.

## Why DIABLO for integration?

Unlike concatenation (simply merging feature matrices) or independent
analysis of each omics layer, DIABLO enforces **cross-block correlation**:
the latent variates from the host and microbiome blocks are constrained to
covary. This means DIABLO identifies features that change **together**
across conditions, which is biologically more meaningful than features that
are simply marginally associated with the outcome in each dataset
independently.

## Quantifying the multi-omics advantage

A key scientific contribution of every multi-omics study is demonstrating
that integrating two data types improves diagnostic performance over either
alone. `diagnosticClassifier()` makes this comparison automatic and
transparent, following best practices from clinical prediction modelling:
nested cross-validation with stratified folds to avoid overfitting bias.

# Session information

```{r session}
sessionInfo()
```