--- title: "Integrative Multi-Omics Analysis with MultiOmicsBridge" author: - name: "Subhadip Jana" email: "subhadipjana1409@gmail.com" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_float: true number_sections: true vignette: > %\VignetteIndexEntry{Integrative Multi-Omics Analysis with MultiOmicsBridge} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, message = TRUE, warning = FALSE ) ``` # Introduction Multi-omics integration is now central to mechanistic studies of complex diseases. Combining host transcriptomics (bulk RNA-seq) with gut microbiome profiling (16S rRNA or metagenomics) has revealed key host-microbe interactions in inflammatory bowel disease, tuberculosis, HIV, and metabolic syndrome. Yet the computational toolkit for this specific integration scenario remains fragmented: no single Bioconductor package takes a researcher from raw paired data through joint dimensionality reduction, biomarker discovery, and diagnostic classification. `MultiOmicsBridge` closes this gap. The package provides a unified, reproducible workflow with five modules: | Module | Functions | Purpose | |---|---|---| | 1. Harmonization | `loadHostData()`, `loadMicrobiomeData()`, `matchSamples()` | Import, normalize, pair data | | 2. Joint Dim. Reduction | `jointDimReduction()` | DIABLO sparse multi-block PLS-DA | | 3. Biomarker Discovery | `biomarkerDiscovery()` | Cross-omics ranked biomarkers | | 4. Classification | `diagnosticClassifier()` | Host-only vs microbiome-only vs joint | | 5. Visualization | `plotIntegration()`, `plotBiomarkerNetwork()`, `plotClassifierComparison()`, `plotSankey()` | Publication figures | The one-call wrapper `MultiOmicsBridgeAnalysis()` executes all five modules and returns a structured `MOBResult` S4 object. # Installation ```{r install, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("MultiOmicsBridge") ``` # Simulated Multi-Omics Dataset We demonstrate the package on a simulated dataset that mimics a real paired study: 20 samples (10 healthy controls, 10 disease cases), 500 host genes, and 80 microbial taxa. To model a biologically realistic signal, we inject upregulation of the first 20 genes and enrichment of the first 5 taxa in the disease group. ```{r simulate} library(MultiOmicsBridge) library(SummarizedExperiment) set.seed(2026) n_genes <- 500 n_taxa <- 80 n_samples <- 20 # Host RNA-seq count matrix (genes x samples) host_counts <- matrix( rpois(n_genes * n_samples, lambda = 150L), nrow = n_genes, ncol = n_samples ) rownames(host_counts) <- paste0("Gene", seq_len(n_genes)) colnames(host_counts) <- paste0("Sample", seq_len(n_samples)) # Inject transcriptional signal: genes 1-20 upregulated in disease host_counts[seq_len(20), seq(11, n_samples)] <- host_counts[seq_len(20), seq(11, n_samples)] * 5L # Microbiome count matrix (taxa x samples) mb_counts <- matrix( rpois(n_taxa * n_samples, lambda = 40L), nrow = n_taxa, ncol = n_samples ) rownames(mb_counts) <- paste0("Taxon", seq_len(n_taxa)) colnames(mb_counts) <- paste0("Sample", seq_len(n_samples)) # Inject microbial enrichment: taxa 1-5 enriched in disease mb_counts[seq_len(5), seq(11, n_samples)] <- mb_counts[seq_len(5), seq(11, n_samples)] * 4L # Sample metadata col_data <- data.frame( condition = rep(c("Control", "Disease"), each = 10), age = sample(30:65, n_samples, replace = TRUE), sex = sample(c("M", "F"), n_samples, replace = TRUE), row.names = paste0("Sample", seq_len(n_samples)) ) cat(sprintf("Host: %d genes × %d samples\n", nrow(host_counts), ncol(host_counts))) cat(sprintf("Microbiome: %d taxa × %d samples\n", nrow(mb_counts), ncol(mb_counts))) ``` # Module 1: Data Harmonization ## Step 1.1 — Load host RNA-seq data `loadHostData()` applies TMM normalization (via `edgeR`) followed by limma-voom precision weighting. Both raw counts and log2-CPM values are stored in the output `SummarizedExperiment`. ```{r load_host} host_se <- loadHostData(host_counts, col_data = col_data) host_se assayNames(host_se) ``` ## Step 1.2 — Load microbiome data `loadMicrobiomeData()` applies centered log-ratio (CLR) transformation. This is the compositionally appropriate normalization that removes the unit-sum constraint inherent to relative abundance data. ```{r load_mb} mb_se <- loadMicrobiomeData(mb_counts, normalization = "CLR", min_prevalence = 0.1) mb_se assayNames(mb_se) # Verify CLR: each sample should have zero mean cat("Column means of CLR matrix (should be ≈ 0):\n") print(round(colMeans(assay(mb_se, "CLR")), 10)[1:5]) ``` ## Step 1.3 — Match paired samples `matchSamples()` finds the intersection of sample names across both SEs and packages them into a `MultiAssayExperiment` (MAE). Unpaired samples are transparently reported and excluded. ```{r match_samples} mae <- matchSamples(host_se, mb_se, min_paired = 10) mae ``` All 20 samples are paired in this dataset. In real studies, `matchSamples()` will report and exclude any samples missing from either platform. # Module 2: Joint Dimensionality Reduction `jointDimReduction()` runs DIABLO (Data Integration Analysis for Biomarker discovery using Latent cOmponents) from the `mixOmics` package. DIABLO simultaneously maximises the covariance between the host and microbiome latent variates while discriminating between outcome groups, using L1 sparsity to select only the most informative features. ```{r dim_reduction} outcome <- col_data$condition # "Control" or "Disease" dr_result <- jointDimReduction( mae, outcome = outcome, n_components = 2L, n_features_host = 40L, n_features_mb = 15L, design_off_diag = 0.1 ) cat("Score matrix dimensions:", dim(dr_result$scores), "\n") cat("Host loading matrix dimensions:", dim(dr_result$host_loadings), "\n") cat("Microbiome loading matrix dimensions:", dim(dr_result$mb_loadings), "\n") cat("\nExplained variance per component:\n") print(round(dr_result$explained_variance, 4)) ``` # Module 3: Biomarker Discovery `biomarkerDiscovery()` ranks host genes and microbial taxa using their DIABLO sparse loading scores (L2 norm across components) and annotates each with its maximum Spearman correlation to a feature in the other omics layer. High-loading features with strong cross-omics correlations represent the most credible multi-omics biomarker candidates. ```{r biomarker_discovery} bm_table <- biomarkerDiscovery(mae, dr_result, n_biomarkers = 30) # Show top 10 biomarkers bm_df <- as.data.frame(bm_table) bm_df_sorted <- bm_df[order(bm_df$loading_score, decreasing = TRUE), ] head(bm_df_sorted[, c("feature","omics_layer","loading_score", "component","max_cross_cor","top_partner")], 10) ``` ```{r biomarker_summary} n_host <- sum(bm_df$omics_layer == "host") n_mb <- sum(bm_df$omics_layer == "microbiome") cat(sprintf("Host biomarkers : %d\n", n_host)) cat(sprintf("Microbiome biomarkers: %d\n", n_mb)) # Are the injected genes recovered? injected <- paste0("Gene", seq_len(20)) detected <- intersect(bm_df$feature, injected) cat(sprintf("Injected genes recovered: %d / 20\n", length(detected))) ``` # Module 4: Diagnostic Classification `diagnosticClassifier()` trains three Random Forest models using nested cross-validation and compares their AUC-ROC values. This comparison directly quantifies the added diagnostic value of multi-omics integration. ```{r classifier} clf_results <- diagnosticClassifier( mae, outcome = outcome, biomarker_table = bm_table, cv_folds = 5L, seed = 42L ) cat("Classifier AUC-ROC (mean ± SD across 5-fold CV):\n") for (nm in c("host_only", "microbiome_only", "joint")) { r <- clf_results[[nm]] cat(sprintf(" %-22s %.3f ± %.3f\n", paste0(nm, ":"), r$mean_auc, r$sd_auc)) } delta <- clf_results$joint$mean_auc - clf_results$host_only$mean_auc cat(sprintf("\nMulti-omics gain vs host-only: +%.3f AUC\n", delta)) ``` # Full Pipeline with One Call For standard analyses, `MultiOmicsBridgeAnalysis()` runs all modules in sequence and returns a `MOBResult` S4 object: ```{r full_pipeline} result <- MultiOmicsBridgeAnalysis( mae, outcome, n_components = 2L, n_features_host = 40L, n_features_mb = 15L, n_biomarkers = 30L, cv_folds = 5L, seed = 42L ) result ``` Access individual results with typed accessor methods: ```{r accessors} # Integration scores: samples x components head(integrationScores(result)) # Top biomarkers head(as.data.frame(biomarkers(result))[, c("feature","omics_layer", "loading_score")]) # Classifier performance perf <- performance(result) cat(sprintf("Joint AUC: %.3f\n", perf$joint$mean_auc)) ``` # Module 5: Visualization ## Joint integration biplot ```{r plot_integration, fig.cap="DIABLO sample scores coloured by outcome group. Loading vectors show the top contributing features per omics layer."} plotIntegration(result, outcome = outcome, n_loading_arrows = 5) ``` ## Cross-omics biomarker network ```{r plot_network, fig.cap="Clustered heatmap of Spearman correlations between the top host genes and microbial taxa. Strong positive/negative correlations indicate candidate host-microbe interactions."} plotBiomarkerNetwork(result, mae, n_host = 15, n_mb = 10) ``` ## Classifier comparison (bar chart) ```{r plot_bar, fig.cap="Mean cross-validated AUC-ROC for host-only, microbiome-only, and joint classifiers. The multi-omics advantage is immediately visible."} plotClassifierComparison(result, type = "bar") ``` ## Classifier comparison (ROC curves) ```{r plot_roc, fig.cap="Overlaid ROC curves from the last cross-validation fold for each classifier configuration."} plotClassifierComparison(result, type = "roc") ``` ## Feature flow (Sankey) diagram ```{r plot_sankey, fig.cap="Feature flow from omics layer through selected biomarkers to outcome classes. Edge width is proportional to loading score."} plotSankey(result, n_features = 8) ``` ## Structured text report ```{r report} generateReport(result, n_top = 8) ``` # Working with Real Data ## HMP2 / IBDMDB dataset The IBDMDB dataset (Franzosa et al. 2019) is the recommended primary validation dataset. To use it with MultiOmicsBridge: ```{r real_data, eval=FALSE} # The HMP2 data is available from the IBDMDB portal. # After downloading, load as: library(MultiOmicsBridge) # Host RNA-seq host_se <- loadHostData( counts = host_count_matrix, # genes x samples col_data = sample_metadata ) # Microbiome 16S mb_se <- loadMicrobiomeData( taxa_table = taxa_count_matrix, # taxa x samples normalization = "CLR" ) # Match and run mae <- matchSamples(host_se, mb_se) result <- MultiOmicsBridgeAnalysis(mae, outcome = sample_metadata$diagnosis) result ``` ## Complex disease contexts MultiOmicsBridge is designed to generalize to complex disease contexts. For tuberculosis studies (e.g. GEO: GSE79362), the same workflow applies: ```{r disease_context, eval=FALSE} # TB study: blood RNA-seq + sputum microbiome host_se <- loadHostData(blood_counts, col_data = patient_metadata) mb_se <- loadMicrobiomeData(sputum_taxa, normalization = "CLR") mae <- matchSamples(host_se, mb_se) result <- MultiOmicsBridgeAnalysis( mae, outcome = patient_metadata$tb_status, n_features_host = 100, n_features_mb = 30, cv_folds = 5 ) plotClassifierComparison(result, type = "bar") ``` # Design Principles ## Why CLR for microbiome data? Microbiome count data is **compositional**: only relative abundances are observed. Analyzing composites with Euclidean distances or Pearson correlations leads to spurious results (the Aitchison problem). The CLR transformation maps compositional data to real space by: $$clr(x_j) = \log\left(\frac{x_j + \delta}{g_\delta(x)}\right)$$ where $g_\delta(x) = \exp\left(\frac{1}{D}\sum_{k}\log(x_k + \delta)\right)$ is the geometric mean with pseudocount $\delta$. This removes the unit-sum constraint and enables standard Euclidean geometry, making correlations between host genes and microbial taxa statistically valid. ## Why DIABLO for integration? Unlike concatenation (simply merging feature matrices) or independent analysis of each omics layer, DIABLO enforces **cross-block correlation**: the latent variates from the host and microbiome blocks are constrained to covary. This means DIABLO identifies features that change **together** across conditions, which is biologically more meaningful than features that are simply marginally associated with the outcome in each dataset independently. ## Quantifying the multi-omics advantage A key scientific contribution of every multi-omics study is demonstrating that integrating two data types improves diagnostic performance over either alone. `diagnosticClassifier()` makes this comparison automatic and transparent, following best practices from clinical prediction modelling: nested cross-validation with stratified folds to avoid overfitting bias. # Session information ```{r session} sessionInfo() ```