WOVEN (Weighted Omics View Embedding via Nystrom) is
a supervised multi-omics integration method designed for clinical
cohorts where patients are missing entire assay blocks. Standard methods
such as mixOmics::DIABLO enforce a strict intersection
constraint: every patient must have every modality, so patients missing
even one block are discarded. In a typical three-platform study this can
eliminate 50–80% of enrolled subjects, and the discarded subjects are
often the sickest patients.
WOVEN addresses this by learning one projection matrix W per modality from anchor subjects (complete cases with every modality observed), then projecting block-missing subjects using only their available views. No feature-level imputation is performed.
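The anchor-then-project idea can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: `project_available()` is a hypothetical helper, the projection matrices in `W_list` are random stand-ins for learned ones, and plain averaging across observed views is an assumed combination rule.

```r
# Toy data: 6 subjects, 2 modalities; a missing modality is an all-NA row.
set.seed(1)
X_list <- list(
  A = matrix(rnorm(6 * 4), 6, 4),
  B = matrix(rnorm(6 * 3), 6, 3)
)
X_list$A[2, ] <- NA   # subject 2 lacks modality A
X_list$B[5, ] <- NA   # subject 5 lacks modality B

# observed[i, v] is TRUE when subject i has no NA in modality v
observed <- sapply(X_list, function(X) rowSums(is.na(X)) == 0)

# Anchors are the subjects observed in every modality
anchor_idx <- which(rowSums(observed) == ncol(observed))

# Hypothetical projection matrices (p_v x K); a real fit would learn
# these from the anchor rows only
K <- 2
W_list <- lapply(X_list, function(X) matrix(rnorm(ncol(X) * K), ncol(X), K))

# Score each subject by averaging the projections of its observed views
# (assumed rule; no feature values are imputed)
project_available <- function(X_list, W_list, observed) {
  Z_sum <- matrix(0, nrow(observed), ncol(W_list[[1]]))
  for (v in seq_along(X_list)) {
    rows <- observed[, v]
    Z_sum[rows, ] <- Z_sum[rows, ] +
      X_list[[v]][rows, , drop = FALSE] %*% W_list[[v]]
  }
  Z_sum / rowSums(observed)   # divide each row by its number of observed views
}

Z <- project_available(X_list, W_list, observed)
anchor_idx
```

Subjects 2 and 5 still receive scores, each computed from the single modality they have.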
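woven() takes a named list of matrices, one per omics platform, in which
rows are subjects (in the same order across matrices), columns are
features, and a subject missing a platform has that entire row set to NA.
WOVEN expects pre-processed, approximately continuous
data. Apply standard platform-specific normalization before
passing data to woven():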
| Platform | Recommended pre-processing |
|---|---|
| RNA-seq (bulk) | log1p(CPM) then scale() per gene |
| DNA methylation | M-values (log2(beta/(1-beta))), then scale() |
| Proteomics (LC-MS) | log2(intensity), median-center per sample |
| Metabolomics | log2 transform, then scale() |
| Microbiome (16S) | CLR transform (e.g. compositions::clr()) |
Raw counts, raw beta values, or raw intensities will produce poorly
scaled projection matrices. When in doubt, scale() each
modality matrix so all features have mean 0 and SD 1.
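Three rows of the table can be sketched in base R as follows. The data are simulated, subjects are assumed to be in rows, and the pseudocount of 1 in the CLR step is an assumption (the table suggests compositions::clr() as one option; the base-R version here only shows the arithmetic).

```r
set.seed(1)
n <- 5  # subjects in rows, features in columns

# RNA-seq: raw counts -> CPM per subject -> log1p -> scale per gene
counts <- matrix(rpois(n * 4, lambda = 50), n, 4)
cpm    <- counts / rowSums(counts) * 1e6
X_rna  <- scale(log1p(cpm))

# Methylation: beta values -> M-values -> scale per probe
beta   <- matrix(runif(n * 3, 0.05, 0.95), n, 3)
X_meth <- scale(log2(beta / (1 - beta)))

# Microbiome: counts -> centered log-ratio per subject
otu     <- matrix(rpois(n * 6, lambda = 20), n, 6)
log_otu <- log(otu + 1)                 # pseudocount avoids log(0)
X_micro <- log_otu - rowMeans(log_otu)  # each row now sums to zero

round(colMeans(X_rna), 10)      # per-gene means are zero after scale()
round(apply(X_meth, 2, sd), 10) # per-probe SDs are one after scale()
```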
library(woven)
set.seed(42)
n <- 150 # total subjects
K <- 3 # latent dimensions to learn
n_groups <- 3
# True group labels (e.g. CN, MCI, Dementia)
Y <- factor(rep(c("CN", "MCI", "Dementia"), each = n / n_groups),
            levels = c("CN", "MCI", "Dementia"))
# Simulate three pre-processed modality matrices with group signal
make_modality <- function(p, Y, signal = 3) {
  X <- matrix(rnorm(length(Y) * p), length(Y), p)
  # Shift the first 8 features for each group's rows
  # (as written, the same shift is added to every group)
  for (g in levels(Y))
    X[Y == g, seq_len(8)] <- X[Y == g, seq_len(8)] + signal
  colnames(X) <- paste0("Feature_", seq_len(p))
  X
}
X_rna <- make_modality(300, Y, signal = 3)
X_meth <- make_modality(120, Y, signal = 4)
X_prot <- make_modality(40, Y, signal = 5)

set.seed(7)
V <- 3
miss_mask <- matrix(runif(n * V) < 0.35, n, V)
# Every subject must retain at least one modality
for (i in which(rowSums(miss_mask) == V))
  miss_mask[i, sample(V, 1)] <- FALSE
# Apply missingness: entire row -> NA for missing modalities
X_rna[ miss_mask[, 1], ] <- NA
X_meth[miss_mask[, 2], ] <- NA
X_prot[miss_mask[, 3], ] <- NA
n_anchors <- sum(rowSums(miss_mask) == 0)
cat(sprintf(
  "Anchors (all 3 views): %d / %d (%.0f%%)\n",
  n_anchors, n, 100 * n_anchors / n
))
#> Anchors (all 3 views): 47 / 150 (31%)

Pass a named list so modality labels appear in all
plots automatically. anchor_idx is optional; WOVEN detects
anchors from the NA pattern.
fit <- woven(
  X_list = list(RNA = X_rna, Methylation = X_meth, Proteomics = X_prot),
  Y = Y,
  K = K,
  lambdas = 0.01, # graph Laplacian regularization (robust to this)
  gamma_y = 5.0, # label supervision strength (any value > 0 works)
  k_nn = 10L # k-NN graph for Laplacian
)
fit # prints summary + guided next steps
#> WOVEN fit
#> Modalities : 3 Subjects: 150 Dimensions: 3
#> Anchors : 47 (31%) Scored: 150 (100%)
#> Solver : mcca_dual (closed-form, globally optimal)
#> gamma_y : 5.00 lambda: 0.01/0.01/0.01 k_nn: 10
#> Singular values: 5.719, 4.846, 1.999
#> Classes : CN, MCI, Dementia
#> Modalities : RNA, Methylation, Proteomics
#>
#> -- Next steps --
#> plot(fit, labels = Y) # latent space scatter
#> woven_plot_vip(fit, modality = "RNA") # top features by VIP
#> woven_plot_loadings(fit, dim = 1) # loadings per modality
#> woven_plot_variance(fit) # variance per dimension
#> woven_metrics(fit, Y) # silhouette, NMI, ESS
#> woven_predict(fit, X_list_new) # predict on new data

plot() shows all scored subjects colored by group. Solid
points are anchors; faded points are block-missing subjects projected
from their available views. Ellipses show the 68% confidence region per
group (anchors only).
WOVEN latent space. All subjects are scored regardless of missingness.
Variable Importance in Projection (VIP) scores rank features by their contribution to the shared latent space across all K dimensions. VIP > 1 indicates above-average importance.
Top RNA features by VIP score.
Plot VIP for a different modality:
Top methylation features by VIP score.
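A PLS-style VIP score of the kind described above can be sketched as follows. This is an illustration only, not the package's implementation: `vip_scores()` is a hypothetical helper, the weight matrix `W` is random, and using squared singular values as per-dimension weights is an assumption. With this definition the mean of the squared scores is 1, which is why a score above 1 reads as above-average importance.

```r
# VIP_j = sqrt( p * sum_k( s_k * (w_jk / ||w_k||)^2 ) / sum_k s_k )
vip_scores <- function(W, dim_weight) {
  p <- nrow(W)
  W_unit <- sweep(W, 2, sqrt(colSums(W^2)), "/")  # unit-norm columns
  vip <- sqrt(p * drop(W_unit^2 %*% dim_weight) / sum(dim_weight))
  names(vip) <- rownames(W)
  vip
}

set.seed(1)
# Toy p x K weight matrix standing in for one modality's projection matrix
W <- matrix(rnorm(10 * 3), 10, 3,
            dimnames = list(paste0("Feature_", 1:10), NULL))
sv <- c(5.719, 4.846, 1.999)  # singular values taken from the printed fit

vip <- vip_scores(W, dim_weight = sv^2)
head(sort(vip, decreasing = TRUE), 3)  # top three features
mean(vip^2)                            # equals 1 up to rounding
```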
Loadings show which features push subjects in the positive
vs. negative direction along a given latent dimension. Equivalent to
plotLoadings() in DIABLO.
Top feature loadings for latent dimension 1 across all three modalities.
Use this to verify that your chosen K captures most of the shared signal. If the last dimension still explains substantial variance, increase K.
Proportion of shared variance explained per latent dimension.
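As a worked example, the singular values printed by the fit can be turned into per-dimension proportions by hand. The sketch assumes the proportions are taken from squared singular values; woven_plot_variance() may define them differently.

```r
sv <- c(5.719, 4.846, 1.999)  # singular values from the printed fit summary

prop <- sv^2 / sum(sv^2)      # assumed definition of "proportion explained"
round(prop, 3)
#> [1] 0.543 0.390 0.066

cumsum(prop)                  # cumulative share across dimensions
```

Under that assumption the third dimension carries under 7% of the total, so the check described above would not suggest increasing K for this fit.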
woven_metrics() computes silhouette score,
Davies-Bouldin index, NMI, and effective sample size (ESS) retention
directly from the fit object.
woven_metrics(fit, Y)
#>     Silhouette Davies-Bouldin            NMI            ESS
#>     0.01947443     4.17068379     0.06068035     1.00000000

ESS = 1.00 means every subject with at least one observed modality was scored. Compare this to DIABLO, which can only score subjects present in all modalities.
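Two of these quantities can be sketched in base R on toy scores. The helper `mean_silhouette()` is hypothetical, and both definitions are assumptions rather than the internals of woven_metrics(): the silhouette is computed on the known labels in the latent space, and ESS is the fraction of subjects with a non-missing score.

```r
# Mean silhouette width of known labels in a score matrix Z
mean_silhouette <- function(Z, labels) {
  labels <- as.character(labels)
  D <- as.matrix(dist(Z))
  s <- vapply(seq_len(nrow(D)), function(i) {
    own <- labels == labels[i]
    own[i] <- FALSE
    if (!any(own)) return(0)  # singleton group: width 0 by convention
    other <- labels != labels[i]
    a <- mean(D[i, own])                                # own-group distance
    b <- min(tapply(D[i, other], labels[other], mean))  # nearest other group
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

set.seed(1)
# Two tight, well-separated toy groups plus one subject with no score
Z <- rbind(matrix(rnorm(20, 0, 0.1), 10, 2),
           matrix(rnorm(20, 5, 0.1), 10, 2),
           c(NA, NA))
labels <- c(rep(c("a", "b"), each = 10), "a")

scored <- !is.na(Z[, 1])
ess <- mean(scored)  # fraction of subjects that received a score
sil <- mean_silhouette(Z[scored, ], labels[scored])
c(Silhouette = sil, ESS = ess)
```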
# Simulate 20 new subjects with partial missingness
set.seed(99)
X_new <- list(
  RNA = make_modality(300, Y[1:20], signal = 3),
  Methylation = make_modality(120, Y[1:20], signal = 4),
  Proteomics = make_modality(40, Y[1:20], signal = 5)
)
# Some new subjects are missing a modality
X_new$RNA[1:5, ] <- NA
X_new$Proteomics[6:10, ] <- NA
pred <- woven_predict(fit, X_new)
head(pred[, 1:3])
#>   predicted_class confidence      p_CN
#> 1             MCI  0.6216826 0.1592255
#> 2             MCI  0.6459433 0.1093645
#> 3              CN  0.7694142 0.7694142
#> 4        Dementia  0.5644197 0.2521266
#> 5        Dementia  0.6529592 0.2326491
#> 6        Dementia  0.5116187 0.1828805

predicted_class returns original label names (“CN”,
“MCI”, “Dementia”), not integer codes.
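In the output above, row 3 is predicted as CN and its confidence equals p_CN, which is consistent with confidence being the largest class probability. The sketch below shows that relationship on made-up probabilities; it is an inference from the printed output, not documented behavior of woven_predict().

```r
# Hypothetical class-probability matrix (each row sums to 1)
probs <- rbind(
  c(CN = 0.16, MCI = 0.62, Dementia = 0.22),
  c(CN = 0.77, MCI = 0.13, Dementia = 0.10),
  c(CN = 0.25, MCI = 0.19, Dementia = 0.56)
)

best <- max.col(probs, ties.method = "first")  # index of the largest column

pred <- data.frame(
  # keep the original label names as factor levels, not integer codes
  predicted_class = factor(colnames(probs)[best], levels = colnames(probs)),
  confidence      = probs[cbind(seq_len(nrow(probs)), best)],
  p_CN            = probs[, "CN"]
)
pred
```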
The key advantage of WOVEN over mixOmics::DIABLO is
effective sample size (ESS). DIABLO requires every
subject to have every modality; WOVEN scores all subjects with at least
one observed view.
n_anchors <- length(fit$anchor_idx) # what DIABLO would use
n_woven <- sum(!is.na(fit$Z[, 1])) # what WOVEN scores
cat(sprintf("DIABLO ESS: %d / %d (%.0f%%)\n",
            n_anchors, n, round(100 * n_anchors / n)))
#> DIABLO ESS: 47 / 150 (31%)
cat(sprintf("WOVEN ESS: %d / %d (%.0f%%)\n",
            n_woven, n, round(100 * n_woven / n)))
#> WOVEN ESS: 150 / 150 (100%)

In real cohorts, the subjects DIABLO discards are often the sickest patients (those who miss a blood draw or imaging scan due to disease severity), introducing systematic bias in comparative effectiveness estimates.
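The size of the complete-case subset can be estimated before any model is fit. If each of V blocks is missing independently with probability q, the expected complete-case fraction is (1 - q)^V. The sketch below uses the simulation settings from this vignette (q = 0.35, V = 3, n = 150); independence across blocks is an assumption that real cohorts often violate.

```r
q <- 0.35  # per-block missingness rate used in the simulation
V <- 3     # number of modalities
n <- 150   # subjects

p_complete <- (1 - q)^V  # about 0.275
n * p_complete           # about 41 expected complete cases

# Monte Carlo check of the formula
set.seed(1)
sim <- replicate(2000, {
  m <- matrix(runif(n * V) < q, n, V)
  mean(rowSums(m) == 0)  # fraction of subjects with no missing block
})
c(formula = p_complete, simulated = mean(sim))
```

The 47 anchors observed above (31%) sit within ordinary sampling variation of this expectation.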
The fit object exposes all internal quantities for custom downstream analyses:
dim(fit$Z) # n x K consensus latent scores for ALL subjects
#> [1] 150 3
names(fit$W_list) # one projection matrix per modality
#> [1] "RNA" "Methylation" "Proteomics"
dim(fit$W_list$RNA) # p_rna x K
#> [1] 300 3
# Row names on Z come from row names of your input matrices
# (set rownames on X_list matrices to get named rows in fit$Z)

fit$Z can be passed directly to survival analysis,
clustering, or any downstream model as a low-dimensional representation
of the full cohort.
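As one downstream example, an n x K score matrix can be clustered with base R. The matrix `Z` below is simulated to stand in for fit$Z, and the subject IDs are hypothetical; with a real fit, the row names would come from the input matrices.

```r
set.seed(1)
# Toy stand-in for fit$Z: 150 subjects, 3 latent dimensions, 3 groups
centers <- rbind(c(0, 0, 0), c(4, 0, 0), c(0, 4, 0))
Z <- centers[rep(1:3, each = 50), ] +
  matrix(rnorm(150 * 3, sd = 0.5), 150, 3)
rownames(Z) <- paste0("subj_", seq_len(150))  # hypothetical subject IDs

# k-means on the latent scores; nstart guards against poor initializations
km <- kmeans(Z, centers = 3, nstart = 20)

# Keep subject IDs next to cluster labels for merging with clinical data
clusters <- data.frame(subject = rownames(Z), cluster = km$cluster)
table(clusters$cluster)
```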
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 26.04 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] woven_0.99.0 rmarkdown_2.31
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.7.3 cli_3.6.6 knitr_1.51
#> [4] rlang_1.2.0 xfun_0.59 otel_0.2.0
#> [7] S7_0.2.2 jsonlite_2.0.0 labeling_0.4.3
#> [10] glue_1.8.1 buildtools_1.0.0 htmltools_0.5.9
#> [13] maketools_1.3.2 sys_3.4.3 sass_0.4.10
#> [16] MatrixGenerics_1.25.0 scales_1.4.0 grid_4.6.0
#> [19] evaluate_1.0.5 jquerylib_0.1.4 fastmap_1.2.0
#> [22] yaml_2.3.12 lifecycle_1.0.5 cluster_2.1.8.2
#> [25] compiler_4.6.0 RColorBrewer_1.1-3 farver_2.1.2
#> [28] lattice_0.22-9 digest_0.6.39 R6_2.6.1
#> [31] RANN_2.6.2 parallel_4.6.0 bslib_0.11.0
#> [34] Matrix_1.7-5 withr_3.0.3 gtable_0.3.6
#> [37] tools_4.6.0 matrixStats_1.5.0 ggplot2_4.0.3
#> [40] cachem_1.1.0