--- title: "Introduction to WOVEN: Multi-Omics Integration for Incomplete Patient Data" author: "Nathan Bresette, Ai-Ling Lin, Jianlin Cheng" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Introduction to WOVEN} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 5, warning = FALSE, message = FALSE ) ``` # Overview **WOVEN** (Weighted Omics View Embedding via Nystrom) is a supervised multi-omics integration method designed for clinical cohorts where patients are missing entire assay blocks. Standard methods such as `mixOmics::DIABLO` enforce a strict intersection constraint: every patient must have every modality, so patients missing even one block are discarded. In a typical three-platform study this can eliminate 50--80% of enrolled subjects, systematically excluding the sickest patients. WOVEN solves this by learning projection matrices W from **anchor subjects** (fully-observed complete cases), then projecting block-missing subjects using their available views. No feature-level imputation is performed. # Installation ```{r install, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("woven") ``` # Preparing Your Data ## Input format `woven()` takes a named list of matrices --- one per omics platform --- where: - Rows are **subjects** (same order across all matrices) - Columns are **features** (genes, CpGs, proteins, metabolites, etc.) - Subjects missing an **entire modality block** have that row set to `NA` ```{r data-format, eval=FALSE} # Correct format: row = subject, col = feature # Subjects missing a platform have that entire row set to NA X_list <- list( RNA = rna_matrix, # n x p_rna, missing rows = all NA Methyl = methyl_matrix, # n x p_meth, missing rows = all NA Prot = prot_matrix # n x p_prot, missing rows = all NA ) ``` ## Pre-processing recommendations **WOVEN expects pre-processed, approximately continuous data.** Apply standard platform-specific normalization before passing data to `woven()`: | Platform | Recommended pre-processing | |---|---| | RNA-seq (bulk) | `log1p(CPM)` then `scale()` per gene | | DNA methylation | M-values (`log2(beta/(1-beta))`), then `scale()` | | Proteomics (LC-MS) | `log2(intensity)`, median-center per sample | | Metabolomics | `log2` transform, then `scale()` | | Microbiome (16S) | CLR transform (e.g. `compositions::clr()`) | Raw counts, raw beta values, or raw intensities will produce poorly scaled projection matrices. When in doubt, `scale()` each modality matrix so all features have mean 0 and SD 1. ```{r preprocess, eval=FALSE} # Example: log1p + scale for RNA-seq X_rna_scaled <- scale(log1p(raw_counts)) # Then set block-missing rows to NA (do NOT impute) X_rna_scaled[subjects_missing_rna, ] <- NA ``` # Quick Start ## Simulate a three-modality dataset ```{r simulate} library(woven) set.seed(42) n <- 150 # total subjects K <- 3 # latent dimensions to learn n_groups <- 3 # True group labels (e.g. CN, MCI, Dementia) Y <- factor(rep(c("CN", "MCI", "Dementia"), each = n / n_groups), levels = c("CN", "MCI", "Dementia")) # Simulate three pre-processed modality matrices with group signal make_modality <- function(p, Y, signal = 3) { X <- matrix(rnorm(length(Y) * p), length(Y), p) for (g in levels(Y)) X[Y == g, seq_len(8)] <- X[Y == g, seq_len(8)] + signal colnames(X) <- paste0("Feature_", seq_len(p)) X } X_rna <- make_modality(300, Y, signal = 3) X_meth <- make_modality(120, Y, signal = 4) X_prot <- make_modality(40, Y, signal = 5) ``` ## Induce block-level missingness ```{r missingness} set.seed(7) V <- 3 miss_mask <- matrix(runif(n * V) < 0.35, n, V) # Every subject must retain at least one modality for (i in which(rowSums(miss_mask) == V)) miss_mask[i, sample(V, 1)] <- FALSE # Apply missingness: entire row -> NA for missing modalities X_rna[ miss_mask[, 1], ] <- NA X_meth[miss_mask[, 2], ] <- NA X_prot[miss_mask[, 3], ] <- NA n_anchors <- sum(rowSums(miss_mask) == 0) cat(sprintf( "Anchors (all 3 views): %d / %d (%.0f%%)\n", n_anchors, n, 100 * n_anchors / n )) ``` ## Fit WOVEN Pass a **named** list so modality labels appear in all plots automatically. `anchor_idx` is optional -- WOVEN detects anchors from the NA pattern. ```{r fit} fit <- woven( X_list = list(RNA = X_rna, Methylation = X_meth, Proteomics = X_prot), Y = Y, K = K, lambdas = 0.01, # graph Laplacian regularization (robust to this) gamma_y = 5.0, # label supervision strength (any value > 0 works) k_nn = 10L # k-NN graph for Laplacian ) fit # prints summary + guided next steps ``` # Exploring Results ## Latent space scatter `plot()` shows all scored subjects colored by group. Solid points are anchors; faded points are block-missing subjects projected from their available views. Ellipses show the 68% confidence region per group (anchors only). ```{r plot-scatter, fig.cap="WOVEN latent space. All subjects are scored regardless of missingness."} plot(fit, labels = Y) ``` ## VIP scores: which features matter most? Variable Importance in Projection (VIP) scores rank features by their contribution to the shared latent space across all K dimensions. VIP > 1 indicates above-average importance. ```{r plot-vip, fig.cap="Top RNA features by VIP score."} woven_plot_vip(fit, modality = "RNA", n_top = 15) ``` Plot VIP for a different modality: ```{r plot-vip2, fig.cap="Top methylation features by VIP score."} woven_plot_vip(fit, modality = "Methylation", n_top = 15) ``` ## Feature loadings: direction of effect Loadings show which features push subjects in the positive vs. negative direction along a given latent dimension. Equivalent to `plotLoadings()` in DIABLO. ```{r plot-loadings, fig.width=10, fig.cap="Top feature loadings for latent dimension 1 across all three modalities."} woven_plot_loadings(fit, dim = 1L, n_top = 10) ``` ## Variance explained: choosing K Use this to verify that your chosen K captures most of the shared signal. If the last dimension still explains substantial variance, increase K. ```{r plot-variance, fig.cap="Proportion of shared variance explained per latent dimension."} woven_plot_variance(fit) ``` ## Quantitative metrics `woven_metrics()` computes silhouette score, Davies-Bouldin index, NMI, and effective sample size (ESS) retention directly from the fit object. ```{r metrics} woven_metrics(fit, Y) ``` ESS = 1.00 means every subject with at least one observed modality was scored. Compare this to DIABLO, which can only score subjects present in all modalities. # Predicting New Subjects ```{r predict} # Simulate 20 new subjects with partial missingness set.seed(99) X_new <- list( RNA = make_modality(300, Y[1:20], signal = 3), Methylation = make_modality(120, Y[1:20], signal = 4), Proteomics = make_modality(40, Y[1:20], signal = 5) ) # Some new subjects are missing a modality X_new$RNA[1:5, ] <- NA X_new$Proteomics[6:10, ] <- NA pred <- woven_predict(fit, X_new) head(pred[, 1:3]) ``` `predicted_class` returns original label names ("CN", "MCI", "Dementia"), not integer codes. # Comparing WOVEN and DIABLO: Effective Sample Size The key advantage of WOVEN over `mixOmics::DIABLO` is **effective sample size (ESS)**. DIABLO requires every subject to have every modality; WOVEN scores all subjects with at least one observed view. ```{r ess-comparison} n_anchors <- length(fit$anchor_idx) # what DIABLO would use n_woven <- sum(!is.na(fit$Z[, 1])) # what WOVEN scores cat(sprintf("DIABLO ESS: %d / %d (%.0f%%)\n", n_anchors, n, round(100 * n_anchors / n))) cat(sprintf("WOVEN ESS: %d / %d (%.0f%%)\n", n_woven, n, round(100 * n_woven / n))) ``` In real cohorts, the subjects DIABLO discards are often the sickest patients (those who miss a blood draw or imaging scan due to disease severity), introducing systematic bias in comparative effectiveness estimates. # Accessing Raw Results The fit object exposes all internal quantities for custom downstream analyses: ```{r access} dim(fit$Z) # n x K consensus latent scores for ALL subjects names(fit$W_list) # one projection matrix per modality dim(fit$W_list$RNA) # p_rna x K # Row names on Z come from row names of your input matrices # (set rownames on X_list matrices to get named rows in fit$Z) ``` `fit$Z` can be passed directly to survival analysis, clustering, or any downstream model as a low-dimensional representation of the full cohort. # Session Info ```{r session} sessionInfo() ```