Introduction to WOVEN: Multi-Omics Integration for Incomplete Patient Data

Overview

WOVEN (Weighted Omics View Embedding via Nystrom) is a supervised multi-omics integration method designed for clinical cohorts where patients are missing entire assay blocks. Standard methods such as mixOmics::DIABLO enforce a strict intersection constraint: every patient must have every modality, so patients missing even one block are discarded. In a typical three-platform study this can eliminate 50–80% of enrolled subjects, and the discarded subjects are often the sickest patients.

WOVEN addresses this by learning one projection matrix W per modality from anchor subjects (complete cases with every modality observed), then projecting block-missing subjects using only the views they have. No feature-level imputation is performed.
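
The anchor-then-project idea can be sketched in base R. This is a minimal illustration under stated assumptions, not WOVEN's solver: the consensus embedding is a plain SVD of the concatenated anchor data, each per-view projection is fit by least squares, and block-missing subjects are scored by averaging over the views they have. All object names are hypothetical.

```r
set.seed(1)
n <- 30; K <- 2
views <- list(
    A = matrix(rnorm(n * 5), n, 5),
    B = matrix(rnorm(n * 4), n, 4)
)
views$B[1:10, ] <- NA                      # subjects 1-10 lack view B

# Which subjects are observed in which view (n x V logical matrix)
observed <- sapply(views, function(X) stats::complete.cases(X))
anchors  <- rowSums(observed) == length(views)

# Stand-in consensus embedding of the anchors (assumption: plain SVD)
Xa  <- do.call(cbind, lapply(views, function(X) X[anchors, , drop = FALSE]))
sv  <- svd(Xa)
Z_a <- sv$u[, seq_len(K), drop = FALSE] %*% diag(sv$d[seq_len(K)], K)

# One projection matrix per view, fit on anchors only by least squares
W <- lapply(views, function(X) qr.solve(X[anchors, , drop = FALSE], Z_a))

# Score every subject from the views it actually has
Z <- matrix(0, n, K)
for (v in names(views)) {
    rows <- observed[, v]
    Z[rows, ] <- Z[rows, ] + views[[v]][rows, , drop = FALSE] %*% W[[v]]
}
Z <- Z / rowSums(observed)                 # average over available views
```

Every row of Z is filled, including the ten subjects that lack view B, and no feature value is imputed along the way.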

Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("woven")

Preparing Your Data

Input format

woven() takes a named list of matrices — one per omics platform — where:

  • Rows are subjects (same order across all matrices)
  • Columns are features (genes, CpGs, proteins, metabolites, etc.)
  • Subjects missing an entire modality have their row in that matrix set to NA

# Correct format: row = subject, col = feature
# A subject missing a platform has its entire row in that matrix set to NA
X_list <- list(
    RNA   = rna_matrix,    # n x p_rna,  missing rows = all NA
    Methyl = methyl_matrix, # n x p_meth, missing rows = all NA
    Prot  = prot_matrix    # n x p_prot, missing rows = all NA
)
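
Before fitting, it can help to confirm that the list really follows this layout. The helper below is a hypothetical base-R check, not part of the woven package: it verifies equal row counts, warns about rows that are only partly NA, and counts anchors.

```r
# Hypothetical helper: verify the block-missing layout of a list of matrices
check_block_missing <- function(X_list) {
    stopifnot(is.list(X_list), !is.null(names(X_list)))
    n_rows <- vapply(X_list, nrow, integer(1))
    if (length(unique(n_rows)) != 1L)
        stop("All matrices must have the same number of rows (subjects).")

    # Fraction of NA entries per subject, per modality (n x V matrix)
    na_frac <- sapply(X_list, function(X) rowMeans(is.na(X)))
    partial <- na_frac > 0 & na_frac < 1
    if (any(partial))
        warning("Some rows are only partly NA; expected all-or-nothing rows.")

    missing_block <- na_frac == 1
    list(
        n_subjects        = unname(n_rows[1]),
        n_anchors         = sum(rowSums(missing_block) == 0),
        n_unscored        = sum(rowSums(missing_block) == ncol(missing_block)),
        per_block_missing = colSums(missing_block)
    )
}

# Toy input: 6 subjects, two platforms; subject 2 lacks RNA, subject 5 lacks Prot
rna  <- matrix(rnorm(6 * 4), 6, 4); rna[2, ]  <- NA
prot <- matrix(rnorm(6 * 3), 6, 3); prot[5, ] <- NA
res  <- check_block_missing(list(RNA = rna, Prot = prot))
res
```

A non-zero n_unscored flags subjects with no observed modality at all, who cannot be projected from any view.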

Pre-processing recommendations

WOVEN expects pre-processed, approximately continuous data. Apply standard platform-specific normalization before passing data to woven():

Platform             Recommended pre-processing
RNA-seq (bulk)       log1p(CPM), then scale() per gene
DNA methylation      M-values (log2(beta/(1-beta))), then scale()
Proteomics (LC-MS)   log2(intensity), median-center per sample
Metabolomics         log2 transform, then scale()
Microbiome (16S)     CLR transform (e.g. compositions::clr())

Raw counts, raw beta values, or raw intensities will produce poorly scaled projection matrices. When in doubt, scale() each modality matrix so all features have mean 0 and SD 1.

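# Example: log1p(CPM) + scale for RNA-seq
# raw_counts is subjects x genes, so library size is the row sum
cpm          <- raw_counts / rowSums(raw_counts) * 1e6
X_rna_scaled <- scale(log1p(cpm))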

# Then set block-missing rows to NA (do NOT impute)
X_rna_scaled[subjects_missing_rna, ] <- NA
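
Two of the transforms recommended above can be written in a few lines of base R. The sketch below uses toy values; the clamp eps and the pseudocount are assumed choices, not package defaults.

```r
# Beta values -> M-values; clamp away from 0 and 1 to avoid infinite logits
beta <- matrix(c(0.10, 0.50, 0.90,
                 0.25, 0.50, 0.75), nrow = 2, byrow = TRUE)
eps  <- 1e-6                                   # assumed clamp, tune as needed
beta <- pmin(pmax(beta, eps), 1 - eps)
M    <- log2(beta / (1 - beta))
M[, 2]                                         # beta = 0.5 maps to M = 0

# CLR for compositional counts: log of each part minus the row-wise mean log
counts <- matrix(c(10, 20, 70,
                    5,  5, 90), nrow = 2, byrow = TRUE)
pseudo <- 0.5                                  # assumed pseudocount for zeros
logc   <- log(counts + pseudo)
clr    <- logc - rowMeans(logc)
rowSums(clr)                                   # each row sums to (numerically) zero
```

Both results are still subjects x features matrices, so scale() and the block-missing NA assignment apply to them unchanged.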

Quick Start

Simulate a three-modality dataset

library(woven)
set.seed(42)

n       <- 150   # total subjects
K       <- 3     # latent dimensions to learn
n_groups <- 3

# True group labels (e.g. CN, MCI, Dementia)
Y <- factor(rep(c("CN", "MCI", "Dementia"), each = n / n_groups),
            levels = c("CN", "MCI", "Dementia"))

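# Simulate one pre-processed modality matrix.
# Note: the same offset is added to the first 8 features of every class,
# so this shifts those features globally rather than separating the classes.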
make_modality <- function(p, Y, signal = 3) {
    X <- matrix(rnorm(length(Y) * p), length(Y), p)
    for (g in levels(Y))
        X[Y == g, seq_len(8)] <- X[Y == g, seq_len(8)] + signal
    colnames(X) <- paste0("Feature_", seq_len(p))
    X
}

X_rna   <- make_modality(300, Y, signal = 3)
X_meth  <- make_modality(120, Y, signal = 4)
X_prot  <- make_modality(40,  Y, signal = 5)
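
As written, make_modality() adds the same offset to the first eight features of every class, so the class means coincide. Where class separation is wanted in the simulated data, a variant along the following lines gives each class its own offset. The name make_modality_sep and the n_signal argument are hypothetical, and the outputs printed in this vignette come from make_modality(), not from this variant.

```r
# Hypothetical variant: the class at position g in levels(Y) is shifted
# by (g - 1) * signal on the first n_signal features
make_modality_sep <- function(p, Y, signal = 3, n_signal = 8) {
    stopifnot(is.factor(Y), p >= n_signal)
    X <- matrix(rnorm(length(Y) * p), length(Y), p)
    shift <- (as.integer(Y) - 1) * signal      # one offset per subject
    X[, seq_len(n_signal)] <- X[, seq_len(n_signal)] + shift
    colnames(X) <- paste0("Feature_", seq_len(p))
    X
}

set.seed(42)
Y_demo <- factor(rep(c("CN", "MCI", "Dementia"), each = 50),
                 levels = c("CN", "MCI", "Dementia"))
X_demo <- make_modality_sep(40, Y_demo, signal = 3)
class_means <- tapply(X_demo[, 1], Y_demo, mean)
class_means                                    # roughly 0, 3 and 6
```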

Induce block-level missingness

set.seed(7)
V        <- 3
miss_mask <- matrix(runif(n * V) < 0.35, n, V)

# Every subject must retain at least one modality
for (i in which(rowSums(miss_mask) == V))
    miss_mask[i, sample(V, 1)] <- FALSE

# Apply missingness: entire row -> NA for missing modalities
X_rna[ miss_mask[, 1], ] <- NA
X_meth[miss_mask[, 2], ] <- NA
X_prot[miss_mask[, 3], ] <- NA

n_anchors <- sum(rowSums(miss_mask) == 0)
cat(sprintf(
    "Anchors (all 3 views): %d / %d (%.0f%%)\n",
    n_anchors, n, 100 * n_anchors / n
))
#> Anchors (all 3 views): 47 / 150 (31%)
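
Beyond the anchor count, it is often useful to see which combinations of modalities the remaining subjects have. The sketch below rebuilds a mask with the same recipe as above, under hypothetical column names, and tabulates the observed patterns.

```r
set.seed(7)
n <- 150; V <- 3
miss <- matrix(runif(n * V) < 0.35, n, V,
               dimnames = list(NULL, c("RNA", "Methyl", "Prot")))
for (i in which(rowSums(miss) == V))
    miss[i, sample(V, 1)] <- FALSE

# Label each subject by the set of modalities it still has
pattern <- apply(!miss, 1, function(obs)
    paste(colnames(miss)[obs], collapse = "+"))
sort(table(pattern), decreasing = TRUE)

# Subjects available per modality when every observed block is used
colSums(!miss)
```

The "RNA+Methyl+Prot" row of the table is the anchor set; every other row is a group of subjects that an intersection-only method would drop.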

Fit WOVEN

Pass a named list so modality labels appear in all plots automatically. anchor_idx is optional – WOVEN detects anchors from the NA pattern.

fit <- woven(
    X_list  = list(RNA = X_rna, Methylation = X_meth, Proteomics = X_prot),
    Y       = Y,
    K       = K,
    lambdas = 0.01,   # graph Laplacian regularization (robust to this)
    gamma_y = 5.0,    # label supervision strength (any value > 0 works)
    k_nn    = 10L     # k-NN graph for Laplacian
)

fit   # prints summary + guided next steps
#> WOVEN fit
#>   Modalities : 3    Subjects: 150    Dimensions: 3
#>   Anchors    : 47 (31%)    Scored: 150 (100%)
#>   Solver     : mcca_dual (closed-form, globally optimal)
#>   gamma_y    : 5.00    lambda: 0.01/0.01/0.01    k_nn: 10
#>   Singular values: 5.719, 4.846, 1.999
#>   Classes    : CN, MCI, Dementia
#>   Modalities : RNA, Methylation, Proteomics
#> 
#>   -- Next steps --
#>   plot(fit, labels = Y)                         # latent space scatter
#>   woven_plot_vip(fit, modality = "RNA")        # top features by VIP
#>   woven_plot_loadings(fit, dim = 1)             # loadings per modality
#>   woven_plot_variance(fit)                      # variance per dimension
#>   woven_metrics(fit, Y)                         # silhouette, NMI, ESS
#>   woven_predict(fit, X_list_new)                # predict on new data

Exploring Results

Latent space scatter

plot() shows all scored subjects colored by group. Solid points are anchors; faded points are block-missing subjects projected from their available views. Ellipses show the 68% confidence region per group (anchors only).

plot(fit, labels = Y)

WOVEN latent space. All subjects are scored regardless of missingness.

VIP scores: which features matter most?

Variable Importance in Projection (VIP) scores rank features by their contribution to the shared latent space across all K dimensions. VIP > 1 indicates above-average importance.

woven_plot_vip(fit, modality = "RNA", n_top = 15)

Top RNA features by VIP score.

Plot VIP for a different modality:

woven_plot_vip(fit, modality = "Methylation", n_top = 15)

Top methylation features by VIP score.
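
The "VIP > 1 means above average" rule follows from how VIP is normalised in classical PLS: the squared scores average to exactly 1 across features. The sketch below implements that classical definition on a stand-in loading matrix. Whether woven computes VIP in exactly this way is an assumption, and vip_scores and the weights s are hypothetical.

```r
# Classical PLS-style VIP from a p x K loading matrix W and non-negative
# per-dimension weights s (for example, variance explained per dimension)
vip_scores <- function(W, s) {
    stopifnot(ncol(W) == length(s), all(s >= 0))
    W2 <- sweep(W^2, 2, colSums(W^2), "/")     # squared, column-normalised loadings
    sqrt(nrow(W) * as.vector(W2 %*% s) / sum(s))
}

set.seed(3)
W <- matrix(rnorm(20 * 3), 20, 3)              # stand-in for a projection matrix
rownames(W) <- paste0("Feature_", 1:20)
vip <- setNames(vip_scores(W, s = c(5, 3, 1)), rownames(W))

mean(vip^2)                                    # equals 1 by construction
head(sort(vip, decreasing = TRUE), 5)          # top features
```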

Feature loadings: direction of effect

Loadings show which features push subjects in the positive vs. negative direction along a given latent dimension. The plot is analogous to plotLoadings() in mixOmics.

woven_plot_loadings(fit, dim = 1L, n_top = 10)

Top feature loadings for latent dimension 1 across all three modalities.
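
The same ranking can be pulled out as a table for reporting. This is a minimal sketch on a stand-in loading matrix; with a real fit, a matrix such as fit$W_list$RNA would take the place of W. The helper top_loadings is hypothetical.

```r
set.seed(11)
W <- matrix(rnorm(12 * 3), 12, 3,
            dimnames = list(paste0("Feature_", 1:12), paste0("Dim", 1:3)))

# Hypothetical helper: signed loadings of the n_top features with the
# largest absolute loading on one latent dimension
top_loadings <- function(W, dim = 1L, n_top = 10L) {
    w   <- W[, dim]
    idx <- order(abs(w), decreasing = TRUE)[seq_len(min(n_top, length(w)))]
    data.frame(feature   = names(w)[idx],
               loading   = unname(w[idx]),
               direction = ifelse(w[idx] > 0, "positive", "negative"),
               row.names = NULL)
}

tl <- top_loadings(W, dim = 1L, n_top = 5L)
tl
```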

Variance explained: choosing K

Use woven_plot_variance() to check that your chosen K captures most of the shared signal. If the last dimension still explains a substantial share of the variance, refit with a larger K.

woven_plot_variance(fit)

Proportion of shared variance explained per latent dimension.
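
A rough version of this check can be done by hand from the singular values printed by the fit. The sketch assumes that the share of each dimension is proportional to its squared singular value; how woven_plot_variance() computes its proportions is not stated here, so treat this as an approximation.

```r
# Singular values as printed by the fit above
d <- c(5.719, 4.846, 1.999)

# Assumption: shared variance per dimension is proportional to d^2
prop <- d^2 / sum(d^2)
round(prop, 3)
round(cumsum(prop), 3)
```

Under that assumption the third dimension carries well under a tenth of the total across the three retained dimensions. These shares are relative to the retained dimensions only, so they cannot show what a fourth dimension would add.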

Quantitative metrics

woven_metrics() computes silhouette score, Davies-Bouldin index, NMI, and effective sample size (ESS) retention directly from the fit object.

woven_metrics(fit, Y)
#>     Silhouette Davies-Bouldin            NMI            ESS 
#>     0.01947443     4.17068379     0.06068035     1.00000000

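ESS = 1.00 means every subject with at least one observed modality was scored. Compare this to DIABLO, which can only score subjects present in all modalities. The silhouette and NMI values near zero indicate that the classes are not separated in this latent space, which is expected for this simulation because make_modality() adds the same offset to every class.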
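
To calibrate how to read the silhouette column, the sketch below computes mean silhouette width with cluster::silhouette() on two toy embeddings, one with class-specific centres and one without. The data and the helper mean_sil are hypothetical and do not reproduce woven_metrics().

```r
library(cluster)   # recommended package shipped with standard R installations

set.seed(5)
g <- rep(1:3, each = 40)                        # integer class labels

# Two toy 2-D embeddings: one with class-specific centres, one without
Z_sep  <- matrix(rnorm(240), 120, 2) + 10 * cbind(g, -g)
Z_none <- matrix(rnorm(240), 120, 2)

mean_sil <- function(Z, g) mean(silhouette(g, dist(Z))[, "sil_width"])

mean_sil(Z_sep,  g)   # high: classes are compact and far apart
mean_sil(Z_none, g)   # near 0: labels carry no geometric structure
```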

Predicting New Subjects

# Simulate 20 new subjects with partial missingness
set.seed(99)
X_new <- list(
    RNA         = make_modality(300, Y[1:20], signal = 3),
    Methylation = make_modality(120, Y[1:20], signal = 4),
    Proteomics  = make_modality(40,  Y[1:20], signal = 5)
)
# Some new subjects are missing a modality
X_new$RNA[1:5, ]        <- NA
X_new$Proteomics[6:10, ] <- NA

pred <- woven_predict(fit, X_new)
head(pred[, 1:3])
#>   predicted_class confidence      p_CN
#> 1             MCI  0.6216826 0.1592255
#> 2             MCI  0.6459433 0.1093645
#> 3              CN  0.7694142 0.7694142
#> 4        Dementia  0.5644197 0.2521266
#> 5        Dementia  0.6529592 0.2326491
#> 6        Dementia  0.5116187 0.1828805

predicted_class returns original label names (“CN”, “MCI”, “Dementia”), not integer codes.
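
In the output above, row 3 has a confidence equal to its p_CN, which is consistent with confidence being the largest class probability. Under that reading, the first two columns can be derived from the probability columns as sketched below. The column names p_MCI and p_Dementia and the toy values are assumptions.

```r
# Toy class-probability matrix; column names beyond p_CN are assumed
probs <- rbind(
    c(p_CN = 0.16, p_MCI = 0.62, p_Dementia = 0.22),
    c(p_CN = 0.77, p_MCI = 0.13, p_Dementia = 0.10),
    c(p_CN = 0.25, p_MCI = 0.19, p_Dementia = 0.56)
)
classes <- sub("^p_", "", colnames(probs))

best <- max.col(probs, ties.method = "first")   # index of the largest probability
pred_demo <- data.frame(
    predicted_class = factor(classes[best], levels = classes),
    confidence      = probs[cbind(seq_len(nrow(probs)), best)]
)
pred_demo
```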

Comparing WOVEN and DIABLO: Effective Sample Size

The key advantage of WOVEN over mixOmics::DIABLO is effective sample size (ESS). DIABLO requires every subject to have every modality; WOVEN scores all subjects with at least one observed view.

n_anchors <- length(fit$anchor_idx)    # what DIABLO would use
n_woven   <- sum(!is.na(fit$Z[, 1]))  # what WOVEN scores

cat(sprintf("DIABLO ESS: %d / %d (%.0f%%)\n",
            n_anchors, n, round(100 * n_anchors / n)))
#> DIABLO ESS: 47 / 150 (31%)
cat(sprintf("WOVEN  ESS: %d / %d (%.0f%%)\n",
            n_woven,   n, round(100 * n_woven   / n)))
#> WOVEN  ESS: 150 / 150 (100%)

In real cohorts, the subjects DIABLO discards are often the sickest patients (those who miss a blood draw or imaging scan due to disease severity), introducing systematic bias in comparative effectiveness estimates.
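
The size of the complete-case loss can be anticipated with simple arithmetic. If each of V blocks is missing independently with probability q, the expected fraction of complete cases is (1 - q)^V. Independence is an assumption; in real cohorts missingness is usually correlated across blocks.

```r
# Expected complete-case fraction when each of V blocks is missing
# independently with probability q
complete_fraction <- function(q, V) (1 - q)^V

complete_fraction(0.35, 3)          # 0.274625, the setting simulated above
sapply(c(0.1, 0.2, 0.35, 0.5), complete_fraction, V = 3)
```

The simulated cohort retained 47 of 150 anchors (31%), close to the 27% expected at q = 0.35.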

Accessing Raw Results

The fit object exposes all internal quantities for custom downstream analyses:

dim(fit$Z)          # n x K consensus latent scores for ALL subjects
#> [1] 150   3
names(fit$W_list)   # one projection matrix per modality
#> [1] "RNA"         "Methylation" "Proteomics"
dim(fit$W_list$RNA) # p_rna x K
#> [1] 300   3

# Row names on Z come from row names of your input matrices
# (set rownames on X_list matrices to get named rows in fit$Z)

fit$Z can be passed directly to survival analysis, clustering, or any downstream model as a low-dimensional representation of the full cohort.
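
As a minimal sketch of such downstream use, the block below runs k-means and a linear model on a stand-in score matrix. Z_demo, its group structure, and the outcome are simulated for illustration; with a real fit, fit$Z would take the place of Z_demo.

```r
set.seed(2)
# Stand-in for fit$Z: 90 subjects x 3 latent dimensions, three shifted groups
Z_demo <- matrix(rnorm(90 * 3), 90, 3) + rep(c(0, 5, 10), each = 30)
rownames(Z_demo) <- paste0("S", seq_len(90))
group  <- rep(c("A", "B", "C"), each = 30)

# Unsupervised clustering on the latent scores
km  <- kmeans(Z_demo, centers = 3, nstart = 20)
tab <- table(cluster = km$cluster, group = group)
tab

# The scores also work as covariates in an ordinary regression model
outcome <- 0.5 * Z_demo[, 1] + rnorm(90)
coef(lm(outcome ~ Z_demo))
```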

Session Info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 26.04 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] woven_0.99.0   rmarkdown_2.31
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.7.3           cli_3.6.6             knitr_1.51           
#>  [4] rlang_1.2.0           xfun_0.59             otel_0.2.0           
#>  [7] S7_0.2.2              jsonlite_2.0.0        labeling_0.4.3       
#> [10] glue_1.8.1            buildtools_1.0.0      htmltools_0.5.9      
#> [13] maketools_1.3.2       sys_3.4.3             sass_0.4.10          
#> [16] MatrixGenerics_1.25.0 scales_1.4.0          grid_4.6.0           
#> [19] evaluate_1.0.5        jquerylib_0.1.4       fastmap_1.2.0        
#> [22] yaml_2.3.12           lifecycle_1.0.5       cluster_2.1.8.2      
#> [25] compiler_4.6.0        RColorBrewer_1.1-3    farver_2.1.2         
#> [28] lattice_0.22-9        digest_0.6.39         R6_2.6.1             
#> [31] RANN_2.6.2            parallel_4.6.0        bslib_0.11.0         
#> [34] Matrix_1.7-5          withr_3.0.3           gtable_0.3.6         
#> [37] tools_4.6.0           matrixStats_1.5.0     ggplot2_4.0.3        
#> [40] cachem_1.1.0