--- title: "IntegratedLearner" author: "Nalin Arora, Anupreet Porwal and Himel Mallick" date: "2026-03-10" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{IntegratedLearner} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE, warning=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) java_major_version <- function() { out <- tryCatch( suppressWarnings(system2("java", "-version", stdout = TRUE, stderr = TRUE)), error = function(e) character() ) line1 <- out[1] if (length(line1) == 0 || is.na(line1)) return(NA_integer_) m <- regmatches(line1, regexec('"([0-9]+)(\\.[0-9]+)?', line1))[[1]] if (length(m) < 2) return(NA_integer_) as.integer(m[2]) } has_java_vector_module <- function() { out <- tryCatch( suppressWarnings(system2("java", "--list-modules", stdout = TRUE, stderr = TRUE)), error = function(e) character() ) any(grepl("^jdk.incubator.vector@", out)) } # Set Java flags before any potential bartMachine/rJava namespace load. java_params <- "-Xmx6G" if (has_java_vector_module()) { java_params <- c(java_params, "--add-modules=jdk.incubator.vector") } options(java.parameters = java_params) can_run_sl_bart <- function() { have_bart <- requireNamespace("bartMachine", quietly = TRUE) jv <- java_major_version() have_bart && !is.na(jv) && jv >= 21 && has_java_vector_module() } use_sl_bart <- isTRUE(can_run_sl_bart()) load_il_dataset <- function(dataset_name, envir = parent.frame()) { # Prefer packaged datasets when available. try(utils::data(list = dataset_name, package = "IntegratedLearner", envir = envir), silent = TRUE) if (exists(dataset_name, envir = envir, inherits = FALSE)) { return(invisible(TRUE)) } # Fallback for local/source rendering (for example, knitting from vignettes/). candidates <- c( file.path("data", paste0(dataset_name, ".RData")), file.path("..", "data", paste0(dataset_name, ".RData")) ) for (path in unique(candidates)) { if (file.exists(path)) { load(path, envir = envir) if (exists(dataset_name, envir = envir, inherits = FALSE)) { return(invisible(TRUE)) } } } stop( "Could not load dataset '", dataset_name, "'. ", "Install the latest IntegratedLearner package or ensure source data file exists." ) } ``` This vignette is a practical tutorial for binary, multiclass, continuous, and survival outcome workflows in `IntegratedLearner`. The goal is to show a complete end-to-end pattern you can adapt to your own multi-omics study: 1. Build correctly formatted training/validation inputs. 2. Fit per-layer, stacked, and concatenated learners. 3. Interpret model outputs (AUC, accuracy, balanced accuracy, R2, survival discrimination curves, layer weights, and feature signals). `IntegratedLearner` supports two integration paradigms: - Early fusion: concatenated features across layers. - Late fusion: layer-specific models combined by a meta-learner. Optional feature selection workflow used in this vignette: 1. Filtering first (`filter_method`, `filter_pct`) on training features. 2. Screening second (`run_screening = TRUE`, `screen_pct`) in a fold-safe manner: - selected on fold-training only, - applied to fold-validation, - repeated for each fold, - repeated once on full training data for final model fit. ## Load Packages ```{r chunk-01, warning=FALSE, message=FALSE} # Main package library(IntegratedLearner) # Tutorial dependencies library(dplyr) library(ggplot2) library(SuperLearner) library(caret) library(cowplot) library(bayesplot) library(S4Vectors) library(SummarizedExperiment) library(MultiAssayExperiment) library(survival) if (use_sl_bart) { library(bartMachine) } ``` ```{r chunk-02, include=FALSE} if (!use_sl_bart) { message( "SL.BART sections are in fallback mode: Java >= 21 with bartMachine is required for SL.BART." ) } ``` ## Input Data Contract For the `PCL_*` interface used in this tutorial, each dataset is a list with: - `feature_table`: data frame with features in rows and samples in columns. - `sample_metadata`: data frame with samples in rows. Must include: - one subject identifier column (default name: `subjectID`). - one outcome column (default name: `Y`). - `feature_metadata`: data frame with features in rows. Must include: - `featureID`: unique feature identifier. - `featureType`: layer label (for example, species, metabolites). Required alignments: - `rownames(feature_table) == rownames(feature_metadata)` - `colnames(feature_table) == rownames(sample_metadata)` If you provide a validation set, it must use the same feature set and ordering as training. For survival workflows, include `time` and `event` columns in `sample_metadata` (with `event` coded as 0/1). You can also provide `Y` as a `Surv(time, event)` object. You can keep your own column names and map them in the wrapper: ```r fit <- IntegratedLearner( PCL_train = pcl_train, outcome_col = "disease_status", subject_id_col = "participant_id", family = stats::binomial() ) ``` Automatic coercion in the wrapper: - `family = gaussian()`: outcome is coerced to numeric (errors if conversion fails). - binary `family = binomial()`: two classes are mapped internally to `{0,1}`. - multiclass `family = binomial()`: class labels are retained. ## Alternative Input Mode: MAE (Complete Binary Example) `IntegratedLearner` accepts `MultiAssayExperiment` inputs through `MAE_train`/`MAE_valid`. This is often the cleanest path when each omics layer is already represented as a `SummarizedExperiment`/`TreeSummarizedExperiment`. ```{r chunk-03, eval=TRUE,warning = FALSE, eval=FALSE} library(curatedMetagenomicData) # 1) Download two aligned layers from curatedMetagenomicData asnicar_tax <- curatedMetagenomicData( "DavidLA_2015.relative_abundance", dryrun = FALSE )[[1]] asnicar_path <- curatedMetagenomicData( "DavidLA_2015.pathway_abundance", dryrun = FALSE )[[1]] tax_tse <- as(asnicar_tax, "TreeSummarizedExperiment") path_tse <- as(asnicar_path, "TreeSummarizedExperiment") # 2) Keep common samples in both layers common_samples <- intersect(colnames(tax_tse), colnames(path_tse)) common_samples <- as.character(common_samples) tax_tse <- tax_tse[, common_samples] path_tse <- path_tse[, common_samples] # 3) Build binary outcome and subject IDs inside each experiment Yvec <- ifelse(as.character(colData(tax_tse)$disease) == "healthy", 0L, 1L) SummarizedExperiment::colData(tax_tse)$Y <- Yvec SummarizedExperiment::colData(path_tse)$Y <- Yvec SummarizedExperiment::colData(tax_tse)$subjectID <- common_samples SummarizedExperiment::colData(path_tse)$subjectID <- common_samples # 4) Build top-level MAE colData cd <- S4Vectors::DataFrame( Y = as.integer(Yvec), subjectID = common_samples, row.names = common_samples ) # 5) Build explicit sampleMap smap <- S4Vectors::DataFrame( assay = c(rep("taxonomy", length(common_samples)), rep("pathway", length(common_samples))), primary = c(common_samples, common_samples), colname = c(common_samples, common_samples) ) smap$assay <- as.character(smap$assay) smap$primary <- as.character(smap$primary) smap$colname <- as.character(smap$colname) # 6) Build MAE container mae <- MultiAssayExperiment( experiments = ExperimentList( taxonomy = tax_tse, pathway = path_tse ), colData = cd, sampleMap = smap ) # 7) Stratified train/validation split y <- MultiAssayExperiment::colData(mae)$Y names(y) <- rownames(MultiAssayExperiment::colData(mae)) set.seed(1) i0 <- which(y == 0) i1 <- which(y == 1) train0 <- sample(i0, floor(0.7 * length(i0))) train1 <- sample(i1, floor(0.7 * length(i1))) train_ids <- names(y)[sort(c(train0, train1))] valid_ids <- setdiff(names(y), train_ids) mae_train <- mae[, train_ids] mae_valid <- mae[, valid_ids] # 8) Fit IntegratedLearner in MAE mode fit_mae_bin <- IntegratedLearner( MAE_train = mae_train, MAE_valid = mae_valid, experiment = c("taxonomy", "pathway"), assay.type = c("relative_abundance", "pathway_abundance"), folds = 2, base_learner = "SL.randomForest", meta_learner = "SL.nnls.auc", filter_method = "prevalence", filter_pct = 40, run_screening = TRUE, screen_pct = 30, family = binomial(), verbose = TRUE ) # 10) Results fit_mae_bin$AUC.train fit_mae_bin$AUC.test fit_mae_bin$weights ``` The returned object is the same style as `PCL` mode, so downstream interpretation (`AUC.train`, `R2.train`, `weights`, `plot.learner`) is unchanged. ## Parameter Reference (Conbin, Multiclass, and Survival) ### Common Wrapper Parameters (`IntegratedLearner`) | Parameter | Default | Applies to | Description | |---|---|---|---| | `MAE_train`, `MAE_valid` | `NULL` | all | MAE-mode inputs (training and optional validation). | | `PCL_train`, `PCL_valid` | `NULL` | all | PCL-mode inputs (training and optional validation). | | `experiment` | `NULL` | MAE mode | Selected MAE experiment names/indices; defaults to all experiments. | | `assay.type` | `NULL` | MAE mode | Assay names per selected experiment. | | `outcome_col` | `"Y"` | all | Outcome column name in PCL `sample_metadata` / MAE `colData`. | | `subject_id_col` | `"subjectID"` | all | Subject identifier column name in PCL `sample_metadata` / MAE `colData`. | | `na.rm` | `FALSE` | all | Drop features with missing values after extraction/prep. | | `folds` | `5` | all | Outer CV folds. | | `seed` | `1234` | all | Reproducibility seed. | | `base_learner` | `"SL.BART"` | all | Base learner. Use `SL.*` IDs for continuous/binary, native multiclass IDs (for example `randomforest`, `xgboost`, `mbart`) for multiclass, and explicitly set a supported `surv.*` ID for survival runs. | | `filter_method` | `NULL` | all | Optional feature filtering method: `"prevalence"` or `"variance"`. | | `filter_pct` | `NULL` | all | Optional retention percentage in `(0,100]` for filtering. | | `run_screening` | `FALSE` | all | Enable supervised screening. | | `screen_pct` | `NULL` | all | Retention percentage in `(0,100]` for screening. Required when screening is enabled. | | `prevalence_pct` | `NULL` | all | Deprecated alias for prevalence filtering (`filter_method = "prevalence"`). | | `drop_poor_performing_layers` | `FALSE` | continuous, binary, survival | If `TRUE`, removes layers with poor single-layer performance from early and late fusion only (AUC < 0.5 for binary, R² < 0.5 for continuous, C-index < 0.5 for survival). Single-layer results are still retained. | | `verbose` | `FALSE` | all | Print progress. | | `family` | `gaussian()` | all | Non-survival: `gaussian()`/`binomial()`. Multiclass is auto-detected when `family = binomial()` and outcome has more than two classes. Survival is auto-detected from metadata or family. | | `...` | — | all | Passed to the selected backend (`IL_conbin` or `ILsurv`). | ### Conbin-Specific Parameters (Continuous/Binary Path) | Parameter | Default | Description | |---|---|---| | `base_screener` | `"All"` | Deprecated compatibility parameter. Prefer `run_screening` + `screen_pct`. | | `meta_learner` | `"SL.nnls.auc"` | Stacked meta learner for late fusion. | | `run_stacked` | `TRUE` | Enables late-fusion stacked model. | | `run_concat` | `TRUE` | Enables early-fusion concatenated model. | | `print_learner` | `TRUE` | Prints fit summary. | | `refit.stack` | `FALSE` | Refit stacked learner on full data for final predictions. | ### Multiclass-Specific Parameters (Native Multiclass Path) | Parameter | Default | Description | |---|---|---| | `base_learner` | `"glmnet"` | Native multiclass learner per layer and for concatenated fit. Supported: `glmnet`, `randomforest`, `ranger`, `xgboost`, `mbart`, `multinom`. | | `meta_learner` | `"glmnet"` | Native multiclass learner used for stacked fusion. | | `base_screener` | `"All"` | Deprecated compatibility parameter. Prefer `run_screening` + `screen_pct`. | | `run_stacked` | `TRUE` | Enables late-fusion stacked multiclass model. | | `run_concat` | `TRUE` | Enables early-fusion concatenated multiclass model. | | `folds` | `5` | Subject-level CV folds for OOF multiclass probabilities. | | `run_screening`, `screen_pct` | `FALSE`, `NULL` | Fold-safe multiclass screening (glmnet-based) after optional filtering. | ### Survival-Specific Parameters (via `...`) | Parameter | Default | Description | |---|---|---| | `do_early_fusion` | `TRUE` | Train an early-fusion survival model on all features. | | `weight_method` | `"IBS"` | Late-fusion weighting objective (`"IBS"` or `"COX"`). | | `t_vec`, `t_vec_probs` | `NULL`, quantiles | Time grid used in COX-style weighting summaries. | | `layer_score` | `"sum"` | Aggregation of cumulative hazard increments (`sum`, `mean`, `l2`). | | `weight_lambda` | `0.02` | Regularization strength for COX weighting optimizer. | | `weight_penalty` | `"l2_to_uniform"` or `"entropy"` | Penalty used while learning survival late-fusion weights. | | `weight_cap` | `1.0` | Optional cap on individual layer weights. | | `optim_maxit_cox` | `4000` | Max iterations for COX weighting optimization. | | `optim_maxit_ibs` | `300` | Max iterations for IBS weighting optimization. | | `ibs_shrink_to_uniform` | `0` | Shrink IBS weights toward uniform blend. | ## Supported Models and Fusion Modules ### Supported Models | Path | Supported base models | |---|---| | Continuous/Binary (`IL_conbin`) | Any `SuperLearner`-compatible `SL.*` learner available in your R session. Package wrappers include: `SL.BART`, `SL.LASSO`, `SL.enet`, `SL.glmnet2`, `SL.horseshoe`, `SL.mxBART` (plus standard SuperLearner learners such as `SL.glm`, `SL.randomForest`, etc.). | | Multiclass (`IL_multiclass`) | Native multiclass learner IDs: `glmnet`, `randomforest`, `ranger`, `xgboost`, `mbart`, `multinom`. | | Survival (`ILsurv`) | Built-in survival learner IDs: `surv.coxph`, `surv.glmnet`, `surv.ranger`, `surv.ranger.extratrees`, `surv.ranger.maxstat`, `surv.ranger.C`, `surv.rfsrc`, `surv.coxboost`, `surv.gbm`, `surv.xgboost.cox`, `surv.xgboost.aft`, `surv.mboost`, `surv.bart`. | ### Supported Fusion Outputs | Path | Single-layer | Early fusion | Late fusion | |---|---|---|---| | Continuous/Binary | Yes | `run_concat = TRUE` | `run_stacked = TRUE` with `meta_learner` | | Multiclass | Yes | `run_concat = TRUE` | `run_stacked = TRUE` with native multiclass `meta_learner` | | Survival | Yes (`train_out$single`) | `do_early_fusion = TRUE` | Weighted layer blending (`weight_method = "IBS"` or `"COX"`) | ## Output Reference: What You Get and How to Access It This section summarizes the outputs produced by each integration method and where to find weights/importance values. ### Conbin Outputs (Binary/Continuous) | Method | What it returns | Where to access | |---|---|---| | Single-layer (per omics layer) | Layer-specific predictions and metrics | `fit$yhat.train[, layer_name]`, `fit$yhat.test[, layer_name]` (if validation), `fit$AUC.train` / `fit$AUC.test`, `fit$accuracy.train` / `fit$accuracy.test`, `fit$balanced_accuracy.train` / `fit$balanced_accuracy.test` (binomial), `fit$R2.train` / `fit$R2.test` (gaussian) | | Early fusion (concatenated) | One model on all features concatenated | Enable with `run_concat = TRUE`; outputs in `fit$yhat.train[, "concatenated"]`, `fit$model_fits$model_concat`, `fit$SL_fits$SL_fit_concat` | | Late fusion (stacked) | Meta-model over layer-level predictions | Enable with `run_stacked = TRUE`; outputs in `fit$yhat.train[, "stacked"]`, `fit$model_fits$model_stacked`, `fit$SL_fits$SL_fit_stacked` | | Layer weights (stacked) | Relative contribution of each layer in late fusion | `fit$weights` (available when `meta_learner = "SL.nnls.auc"` and `run_stacked = TRUE`) | | Binary metric table | Per-model AUC, accuracy, and balanced accuracy | `fit$metrics.train` and `fit$metrics.test` (if validation provided) | ### Multiclass Outputs | Method | What it returns | Where to access | |---|---|---| | Single-layer (per omics layer) | Layer-wise multiclass probability and class predictions | `fit$prob.train[[layer_name]]`, `fit$class.train[, layer_name]`, plus validation analogs `fit$prob.test[[layer_name]]`, `fit$class.test[, layer_name]` | | Early fusion (concatenated) | One multiclass model on concatenated features | Enable with `run_concat = TRUE`; outputs in `fit$prob.train$concatenated`, `fit$class.train[, "concatenated"]`, `fit$model_fits$model_concat` | | Late fusion (stacked) | Multiclass meta-model over OOF layer probabilities | Enable with `run_stacked = TRUE`; outputs in `fit$prob.train$stacked`, `fit$class.train[, "stacked"]`, `fit$model_fits$model_stacked` | | Multiclass performance metrics | Accuracy, balanced accuracy, one-vs-rest AUC, and log-loss | `fit$metrics.train` and `fit$metrics.test` (if validation provided) | | Feature-selection metadata | Filtering/screening settings used in fit | `fit$filter_method`, `fit$filter_pct`, `fit$prevalence_pct`, `fit$screening_used`, `fit$screen_method`, `fit$screen_pct` | | Screened feature sets | Features retained by fold-safe screening | `fit$selected_features_by_layer`, `fit$selected_features_concat` | ### Survival Outputs (Single/Early/Late) | Method | Training outputs | Validation outputs | |---|---|---| | Single-layer | `fit$train_out$single$metrics`, `fit$train_out$single$train_risk` | `fit$valid_out$single$valid_cindex`, `fit$valid_out$single$valid_auc`, `fit$valid_out$single$valid_risk` | | Early fusion | `fit$train_out$early$train_cindex`, `fit$train_out$early$train_auc`, `fit$train_out$early$train_risk` | `fit$valid_out$early$valid_cindex`, `fit$valid_out$early$valid_auc`, `fit$valid_out$early$valid_risk` | | Late fusion | `fit$train_out$late$weights`, `fit$train_out$late$train_cindex`, `fit$train_out$late$train_auc`, `fit$train_out$late$train_risk` | `fit$valid_out$late$valid_cindex`, `fit$valid_out$late$valid_auc`, `fit$valid_out$late$valid_risk` | | Survival plotting payload | `fit$surv_plot_data$train` | `fit$surv_plot_data$valid` | ### Importance Outputs (Conbin, Multiclass, and Survival) | Importance type | Where to access | Notes | |---|---|---| | Conbin signed global feature importance | `fit$feature_importance_signed` | Always returned for non-survival fits; named numeric vector sorted by effect magnitude/sign. | | Conbin signed per-layer importance | `fit$feature_importance_signed_by_layer` | List split by `featureType`. | | Multiclass signed global feature importance | `fit$feature_importance_global` | Global score aggregated across multiclass contrasts. | | Multiclass signed importance by class | `fit$feature_importance_signed_by_class` | List with one signed vector per class. | | Multiclass signed importance by layer and class | `fit$feature_importance_signed_by_layer_by_class` | Nested list by layer then class. | | Survival early-fusion combined importance | `fit$train_out$early$combined_importance` | Available when `do_early_fusion = TRUE`. | | Survival late-fusion combined importance | `fit$train_out$late$combined_importance` | Weighted signed importance; names are prefixed like `layer::feature`. | | BART-specific layer importance (optional) | `bartMachine::investigate_var_importance(fit$model_fits$model_layers[[layer]], plot = FALSE)` | Only for BART-based conbin fits (`base_learner = "SL.BART"`). | ### Quick Access Snippets ```{r chunk-04, eval=FALSE} # ---- Conbin: weights + top features ---- fit$weights head(fit$feature_importance_signed, 20) names(fit$feature_importance_signed_by_layer) head(fit$feature_importance_signed_by_layer[[1]], 20) # ---- Multiclass: metrics + class probabilities + importance ---- fit_mc$metrics.train fit_mc$metrics.test head(fit_mc$class.train) head(fit_mc$class.test) head(fit_mc$prob.train$stacked) head(fit_mc$feature_importance_global, 20) head(fit_mc$feature_importance_signed_by_class[[1]], 20) # ---- Survival: late-fusion weights + top combined features ---- fit_surv$train_out$late$weights head(fit_surv$train_out$late$combined_importance, 20) # ---- Survival: inspect all fusion branches ---- fit_surv$train_out$single fit_surv$train_out$early fit_surv$train_out$late ``` ## Example 1: Binary Outcome (IBD Classification) This section uses the PRISM dataset (Franzosa et al., 2019) for classifying IBD status. In these fixtures the binary target is in `sample_metadata$Y` (default `outcome_col` behavior). ### Step 1: Load and Inspect Training and Validation Data ```{r chunk-05} # Training data load_il_dataset("PRISM", envir = environment()) pcl <- PRISM feature_table <- pcl$feature_table sample_metadata <- pcl$sample_metadata feature_metadata <- pcl$feature_metadata rm(pcl) # Quick checks head(feature_table[1:5, 1:5]) head(sample_metadata[1:5, ]) head(feature_metadata[1:5, ]) table(feature_metadata$featureType) table(sample_metadata$Y) all(rownames(feature_table) == rownames(feature_metadata)) all(colnames(feature_table) == rownames(sample_metadata)) # Independent validation data load_il_dataset("NLIBD", envir = environment()) pcl <- NLIBD feature_table_valid <- pcl$feature_table sample_metadata_valid <- pcl$sample_metadata rm(pcl) # Align validation features to training feature set/order (required by IntegratedLearner) if (!identical(rownames(feature_table), rownames(feature_table_valid))) { missing_in_valid <- setdiff(rownames(feature_table), rownames(feature_table_valid)) if (length(missing_in_valid) > 0) { stop("Validation set is missing training features, e.g.: ", paste(head(missing_in_valid, 5), collapse = ", ")) } feature_table_valid <- feature_table_valid[rownames(feature_table), , drop = FALSE] } all(rownames(feature_table) == rownames(feature_table_valid)) all(colnames(feature_table_valid) == rownames(sample_metadata_valid)) ``` ### Step 2: Build PCL Inputs ```{r chunk-06} PCL_train <- list( feature_table = feature_table, sample_metadata = sample_metadata, feature_metadata = feature_metadata ) PCL_valid <- list( feature_table = feature_table_valid, sample_metadata = sample_metadata_valid, feature_metadata = feature_metadata ) ``` ### Step 3: Fit the Model `IntegratedLearner` fits one model per layer (`base_learner`) and then combines layer-level predictions with a meta-learner (`meta_learner`). ```{r chunk-07, warning=FALSE} fit <- IntegratedLearner( PCL_train = PCL_train, PCL_valid = PCL_valid, folds = 2, base_learner = "SL.randomForest", meta_learner = "SL.nnls.auc", filter_method = "prevalence", filter_pct = 40, run_screening = TRUE, screen_pct = 30, verbose = TRUE, family = binomial() ) ``` ### Step 4: Inspect and Interpret Outputs Core outputs for binary tasks include: - `fit$AUC.train` and `fit$AUC.test`: AUC per layer and fusion model. - `fit$accuracy.train` and `fit$accuracy.test`: thresholded accuracy per layer and fusion model. - `fit$balanced_accuracy.train` and `fit$balanced_accuracy.test`: balanced accuracy per layer and fusion model. - `fit$metrics.train` and `fit$metrics.test`: compact metric tables with AUC, accuracy, and balanced accuracy. - `fit$weights`: layer contributions in the stacked model (when `SL.nnls.auc` is used). - `fit$yhat.train` and `fit$yhat.test`: predicted probabilities. ```{r chunk-08} fit$AUC.train fit$AUC.test fit$accuracy.train fit$balanced_accuracy.train fit$metrics.test fit$weights ``` Plot ROC summaries for train and validation sets: ```{r chunk-09, warning=FALSE, fig.height=10, fig.width=7} plot.obj <- IntegratedLearner:::plot.learner(fit) plot.obj$plot ``` In this PRISM setting, you can compare which single layer is strongest and whether stacked fusion outperforms both individual layers and simple concatenation. ## Example 2: Continuous Outcome (Gestational Age) This section uses the pregnancy dataset (Ghaemi et al., 2019), where `Y` is continuous gestational age (default `outcome_col` behavior). ### Step 1: Load and Inspect Data ```{r chunk-10} load_il_dataset("pregnancy", envir = environment()) pcl <- pregnancy feature_table <- pcl$feature_table sample_metadata <- pcl$sample_metadata feature_metadata <- pcl$feature_metadata rm(pcl) head(feature_table[1:5, 1:5]) head(sample_metadata[1:5, ]) head(feature_metadata[1:5, ]) table(feature_metadata$featureType) length(unique(sample_metadata$subjectID)) all(rownames(feature_table) == rownames(feature_metadata)) all(colnames(feature_table) == rownames(sample_metadata)) # Optional speed-up for local experimentation # top_n <- 50 # subsetIDs <- c(1:top_n, (nrow(feature_table) - top_n + 1):nrow(feature_table)) # feature_table <- feature_table[subsetIDs, ] # feature_metadata <- feature_metadata[subsetIDs, ] ``` ### Step 2: Build PCL Input ```{r chunk-11} PCL_train <- list( feature_table = feature_table, sample_metadata = sample_metadata, feature_metadata = feature_metadata ) ``` ### Step 3: Fit Continuous Model For this example, we use BART base learners (`SL.BART`). If you hit: `java.lang.UnsupportedClassVersionError ... class file version 65.0 ... recognizes up to 61.0` your Java runtime is older than the version used by your installed `bartMachine` build (typically Java 17 runtime vs Java 21 bytecode). In that case, either: - upgrade Java runtime to 21 and restart R, or - use a non-Java learner (example fallback shown below). ```{r chunk-12, warning=FALSE, eval=use_sl_bart} fit <- IntegratedLearner( PCL_train = PCL_train, folds = 2, base_learner = "SL.BART", meta_learner = "SL.nnls.auc", filter_method = "variance", filter_pct = 40, run_screening = TRUE, screen_pct = 30, family = gaussian() ) ``` Fallback (non-Java) run: ```{r chunk-13, eval=!use_sl_bart} fit <- IntegratedLearner( PCL_train = PCL_train, folds = 2, base_learner = "SL.randomForest", meta_learner = "SL.nnls.auc", filter_method = "variance", filter_pct = 40, run_screening = TRUE, screen_pct = 30, family = gaussian() ) ``` ### Step 4: Evaluate Predictive Accuracy For continuous outcomes, `IntegratedLearner` reports `R2.train` (and `R2.test` if validation is provided). ```{r chunk-14} fit$R2.train ``` ```{r chunk-21, fig.asp=0.8, fig.width=7} plot.obj <- IntegratedLearner:::plot.learner(fit) plot.obj$plot ``` ### Step 5: Uncertainty and Feature-Level Interpretation (BART) When using `SL.BART`, you can inspect posterior predictive distributions and derive weighted posterior summaries. `r if (!use_sl_bart) "This section is skipped in this session because Java < 21 (or bartMachine is unavailable)." else ""` ```{r chunk-15, eval=use_sl_bart} weights <- fit$weights dataX <- fit$X_train_layers dataY <- fit$Y_train post.samples <- vector("list", length(weights)) names(post.samples) <- names(dataX) for (i in seq_along(post.samples)) { post.samples[[i]] <- bartMachine::bart_machine_get_posterior( fit$model_fits$model_layers[[i]], dataX[[i]] )$y_hat_posterior_samples } weighted.post.samples <- Reduce("+", Map("*", post.samples, weights)) rownames(weighted.post.samples) <- rownames(dataX[[1]]) names(dataY) <- rownames(dataX[[1]]) ``` Visualize 68% and 95% credible intervals for observations: ```{r chunk-22, fig.width=9, fig.height=7, eval=use_sl_bart} ord_names <- names(sort(rowMeans(weighted.post.samples), decreasing = TRUE)) mcmc_intervals(t(weighted.post.samples), prob = 0.68, prob_outer = 0.95) + scale_y_discrete(limits = ord_names) + geom_point(aes(x = dataY[ord_names], y = ord_names), shape = 1, size = 3, color = "black") + coord_flip() + theme_bw() + labs( x = "Gestational age (in months)", y = "Observations" ) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) ``` Layer weights and feature-level inclusion proportions can also be examined for biological interpretation. ```{r chunk-16, eval=use_sl_bart} omicsEye_theme <- function() { angle <- 45 hjust <- 1 ggplot2::theme_bw() + ggplot2::theme( axis.text.x = ggplot2::element_text(size = 8, vjust = 1, hjust = hjust, angle = angle), axis.text.y = ggplot2::element_text(size = 8, hjust = 1), axis.title = ggplot2::element_text(size = 10), plot.title = ggplot2::element_text(size = 10), plot.subtitle = ggplot2::element_text(size = 8), legend.title = ggplot2::element_text(size = 6, face = "bold"), legend.text = ggplot2::element_text(size = 7), axis.line = ggplot2::element_line(colour = "black", linewidth = 0.25), axis.line.x = ggplot2::element_line(colour = "black", linewidth = 0.25), axis.line.y = ggplot2::element_line(colour = "black", linewidth = 0.25), panel.border = ggplot2::element_blank(), panel.grid.major = ggplot2::element_blank(), panel.grid.minor = ggplot2::element_blank() ) } safe_var_importance <- function(model, layer_label) { tryCatch({ qq <- bartMachine::investigate_var_importance(model, plot = FALSE) df <- cbind.data.frame(qq$avg_var_props, qq$sd_var_props) colnames(df) <- c("mean", "sd") df$type <- layer_label df }, error = function(e) { warning(sprintf("Skipping variable importance for %s: %s", layer_label, conditionMessage(e))) data.frame(mean = numeric(), sd = numeric(), type = character()) }) } vimp_stack <- cbind.data.frame(fit$weights) colnames(vimp_stack) <- "mean" vimp_stack$sd <- NA vimp_stack$type <- "stack" layer_names <- names(fit$model_fits$model_layers) vimp_layers <- lapply(layer_names, function(layer_nm) { safe_var_importance(fit$model_fits$model_layers[[layer_nm]], layer_nm) }) vimp_layers <- vimp_layers[lengths(vimp_layers) > 0] vimp_top <- do.call( rbind, lapply(vimp_layers, function(df) head(df[order(-df$mean), , drop = FALSE], 20)) ) VIMP <- as.data.frame(rbind.data.frame(vimp_stack, vimp_top)) VIMP <- tibble::rownames_to_column(VIMP, "ID") p4 <- VIMP %>% dplyr::filter(type == "stack") %>% dplyr::arrange(desc(mean)) %>% ggplot(aes(y = mean, x = reorder(ID, -mean))) + geom_bar(stat = "identity", fill = "darkseagreen") + theme_bw() + omicsEye_theme() + ylab("Layer Weights") + xlab("") p5 <- VIMP %>% dplyr::filter(type != "stack") %>% dplyr::arrange(mean) %>% dplyr::mutate(ID = stringr::str_replace_all(ID, stringr::fixed("_"), " ")) %>% ggplot(aes(reorder(ID, -mean), mean, fill = type)) + facet_wrap(. ~ type, scales = "free") + geom_bar(stat = "identity", fill = "lightsalmon") + geom_errorbar(aes(ymin = ifelse(mean - sd > 0, mean - sd, 0), ymax = mean + sd), width = 0.2, position = position_dodge(0.9)) + theme_bw() + coord_flip() + omicsEye_theme() + theme(strip.background = element_blank()) + ylab("Inclusion proportion") + xlab("") ``` ```{r chunk-23, fig.width=7, fig.height=4, eval=use_sl_bart} plot_grid( p4, ncol = 1, labels = c("Estimated IntegratedLearner Layer Weights"), label_size = 8, vjust = 0.1 ) + theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) ``` ```{r chunk-24, fig.width=9, fig.height=7, eval=use_sl_bart} plot_grid( p5, ncol = 1, labels = c("Top Features by Layer (BART Inclusion Proportions)"), label_size = 8, vjust = 0.1 ) + theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) ``` ## Example 3: Multiclass Outcome (Franzosa MAE with External Validation) This section shows a full multiclass MAE workflow using packaged local fixtures. Here we keep the original outcome column name (`diseaseCat`) and subject ID column (`sample_id`) and pass them through `outcome_col` and `subject_id_col`. ```{r chunk-17} load_il_dataset("FranzosaE_2019_CuratedMetabolome", envir = environment()) load_il_dataset("FranzosaE_2019_CuratedMetadata", envir = environment()) load_il_dataset("FranzosaE_2019_CuratedSpeciesProfile", envir = environment()) load_il_dataset("FranzosaE_2019_Validation_CuratedMetabolome", envir = environment()) load_il_dataset("FranzosaE_2019_Validation_CuratedMetadata", envir = environment()) load_il_dataset("FranzosaE_2019_Validation_CuratedSpeciesProfile", envir = environment()) as_feature_matrix <- function(df, id_col = "X") { ids <- as.character(df[[id_col]]) mat <- as.matrix(df[, setdiff(colnames(df), id_col), drop = FALSE]) storage.mode(mat) <- "numeric" rownames(mat) <- ids t(mat) } prep_sample_metadata <- function(df, id_col = "X") { sm <- as.data.frame(df, stringsAsFactors = FALSE) sm$sample_id <- as.character(sm[[id_col]]) rownames(sm) <- sm$sample_id sm } met_train <- as_feature_matrix(FranzosaE_2019_CuratedMetabolome) met_valid <- as_feature_matrix(FranzosaE_2019_Validation_CuratedMetabolome) species_train <- as_feature_matrix(FranzosaE_2019_CuratedSpeciesProfile) species_valid <- as_feature_matrix(FranzosaE_2019_Validation_CuratedSpeciesProfile) # Enforce exact train/validation feature alignment per layer. met_shared <- intersect(rownames(met_train), rownames(met_valid)) species_shared <- intersect(rownames(species_train), rownames(species_valid)) met_train <- met_train[met_shared, , drop = FALSE] met_valid <- met_valid[met_shared, , drop = FALSE] species_train <- species_train[species_shared, , drop = FALSE] species_valid <- species_valid[species_shared, , drop = FALSE] sm_train <- prep_sample_metadata(FranzosaE_2019_CuratedMetadata) sm_valid <- prep_sample_metadata(FranzosaE_2019_Validation_CuratedMetadata) train_ids <- Reduce(intersect, list(colnames(met_train), colnames(species_train), rownames(sm_train))) valid_ids <- Reduce(intersect, list(colnames(met_valid), colnames(species_valid), rownames(sm_valid))) met_train <- met_train[, train_ids, drop = FALSE] met_valid <- met_valid[, valid_ids, drop = FALSE] species_train <- species_train[, train_ids, drop = FALSE] species_valid <- species_valid[, valid_ids, drop = FALSE] sm_train <- sm_train[train_ids, , drop = FALSE] sm_valid <- sm_valid[valid_ids, , drop = FALSE] class_levels <- sort(unique(as.character(sm_train$diseaseCat))) sm_train$diseaseCat <- factor(sm_train$diseaseCat, levels = class_levels) sm_valid$diseaseCat <- factor(sm_valid$diseaseCat, levels = class_levels) cd_train <- S4Vectors::DataFrame( sample_id = sm_train$sample_id, diseaseCat = sm_train$diseaseCat, row.names = sm_train$sample_id ) cd_valid <- S4Vectors::DataFrame( sample_id = sm_valid$sample_id, diseaseCat = sm_valid$diseaseCat, row.names = sm_valid$sample_id ) MAE_train <- MultiAssayExperiment( experiments = ExperimentList( metabolome = SummarizedExperiment::SummarizedExperiment( assays = list(abundance = met_train), colData = cd_train ), species = SummarizedExperiment::SummarizedExperiment( assays = list(abundance = species_train), colData = cd_train ) ), colData = cd_train ) MAE_valid <- MultiAssayExperiment( experiments = ExperimentList( metabolome = SummarizedExperiment::SummarizedExperiment( assays = list(abundance = met_valid), colData = cd_valid ), species = SummarizedExperiment::SummarizedExperiment( assays = list(abundance = species_valid), colData = cd_valid ) ), colData = cd_valid ) fit <- IntegratedLearner::IntegratedLearner( MAE_train = MAE_train, MAE_valid = MAE_valid, experiment = c("metabolome", "species"), assay.type = c("abundance", "abundance"), outcome_col = "diseaseCat", subject_id_col = "sample_id", family = stats::binomial(), base_learner = "glmnet", meta_learner = "glmnet", run_stacked = TRUE, run_concat = TRUE, filter_method = "variance", filter_pct = 50, run_screening = TRUE, screen_pct = 25, folds = 2, verbose = TRUE ) ``` Useful multiclass outputs: - `fit$metrics.train` - `fit$metrics.test` - `fit$class.train` - `fit$class.test` - `fit$prob.train` - `fit$prob.test` - `fit$feature_importance_signed_by_class` - `fit$filter_method`, `fit$filter_pct` - `fit$screening_used`, `fit$screen_pct` The multiclass metric tables now report accuracy, balanced accuracy, one-vs-rest AUC, and log-loss. The plotting helper also returns a single one-vs-rest ROC figure with all class curves overlaid for each fitted model. ```{r chunk-17b, warning=FALSE, fig.height=9, fig.width=8} plot.obj.mc <- IntegratedLearner:::plot.learner(fit) plot.obj.mc$plot ``` ## Example 4: Survival Outcome (Time-to-event) For survival tasks, `IntegratedLearner` dispatches to `ILsurv` when survival metadata are detected. The expected fields are: - `time`: follow-up time (non-negative). - `event`: event indicator (`0/1`). - optional `Y`: `Surv(time, event)` convenience column. This path uses the package-native survival backend (no `mlr3` dependency required). For plotting, the survival backend now stores: - a time-dependent AUC table evaluated over a denser event-time grid (rather than only a few summary quantiles), and - Kaplan-Meier curve payloads built from predicted risk groups for the best fused survival model. This section provides a complete MAE workflow, followed by an equivalent PCL sketch. ```{r chunk-18} load_il_dataset("gene_all", envir = environment()) load_il_dataset("mir_all", envir = environment()) to_feature_matrix <- function(df, id_col = "patient_id", n_keep = 120L) { drop_cols <- c("patient_id", "OS", "OS.time", "age", "race_white", "stage_i", "stage_ii") d <- as.data.frame(df, stringsAsFactors = FALSE) rownames(d) <- as.character(d[[id_col]]) feature_cols <- setdiff(colnames(d), drop_cols) feature_cols <- feature_cols[seq_len(min(length(feature_cols), n_keep))] mat <- t(as.matrix(d[, feature_cols, drop = FALSE])) storage.mode(mat) <- "numeric" mat } gene_all <- gene_all[order(gene_all$patient_id), , drop = FALSE] mir_all <- mir_all[order(mir_all$patient_id), , drop = FALSE] common_ids <- intersect(as.character(gene_all$patient_id), as.character(mir_all$patient_id)) gene_all <- gene_all[match(common_ids, gene_all$patient_id), , drop = FALSE] mir_all <- mir_all[match(common_ids, mir_all$patient_id), , drop = FALSE] gene_mat <- to_feature_matrix(gene_all, n_keep = 120L) mirna_mat <- to_feature_matrix(mir_all, n_keep = 100L) tcga_metadata <- data.frame( patient_id = as.character(gene_all$patient_id), time = as.numeric(gene_all$OS.time), event = as.numeric(gene_all$OS), stringsAsFactors = FALSE ) rownames(tcga_metadata) <- tcga_metadata$patient_id common_ids <- Reduce(intersect, list(colnames(gene_mat), colnames(mirna_mat), rownames(tcga_metadata))) gene_mat <- gene_mat[, common_ids, drop = FALSE] mirna_mat <- mirna_mat[, common_ids, drop = FALSE] tcga_metadata <- tcga_metadata[common_ids, , drop = FALSE] tcga_metadata$outcome_surv <- I(survival::Surv(tcga_metadata$time, tcga_metadata$event)) set.seed(123) event_ids <- rownames(tcga_metadata)[tcga_metadata$event == 1] censor_ids <- rownames(tcga_metadata)[tcga_metadata$event == 0] train_ids <- c( sample(event_ids, max(1L, floor(0.7 * length(event_ids)))), sample(censor_ids, max(1L, floor(0.7 * length(censor_ids)))) ) train_ids <- sort(unique(train_ids)) valid_ids <- setdiff(rownames(tcga_metadata), train_ids) cd_train <- S4Vectors::DataFrame(tcga_metadata[train_ids, c("patient_id", "time", "event"), drop = FALSE]) cd_train$outcome_surv <- I(survival::Surv(cd_train$time, cd_train$event)) rownames(cd_train) <- cd_train$patient_id cd_valid <- S4Vectors::DataFrame(tcga_metadata[valid_ids, c("patient_id", "time", "event"), drop = FALSE]) cd_valid$outcome_surv <- I(survival::Surv(cd_valid$time, cd_valid$event)) rownames(cd_valid) <- cd_valid$patient_id mae_train <- MultiAssayExperiment( experiments = ExperimentList( gene = SummarizedExperiment::SummarizedExperiment( assays = list(abundance = gene_mat[, train_ids, drop = FALSE]), colData = cd_train ), mirna = SummarizedExperiment::SummarizedExperiment( assays = list(abundance = mirna_mat[, train_ids, drop = FALSE]), colData = cd_train ) ), colData = cd_train ) mae_valid <- MultiAssayExperiment( experiments = ExperimentList( gene = SummarizedExperiment::SummarizedExperiment( assays = list(abundance = gene_mat[, valid_ids, drop = FALSE]), colData = cd_valid ), mirna = SummarizedExperiment::SummarizedExperiment( assays = list(abundance = mirna_mat[, valid_ids, drop = FALSE]), colData = cd_valid ) ), colData = cd_valid ) feature_metadata_surv <- data.frame( featureID = c(rownames(gene_mat), rownames(mirna_mat)), featureType = c(rep("gene", nrow(gene_mat)), rep("mirna", nrow(mirna_mat))), stringsAsFactors = FALSE ) rownames(feature_metadata_surv) <- feature_metadata_surv$featureID PCL_train <- list( feature_table = as.data.frame(rbind( gene_mat[, train_ids, drop = FALSE], mirna_mat[, train_ids, drop = FALSE] )), sample_metadata = as.data.frame(cd_train), feature_metadata = feature_metadata_surv ) PCL_valid <- list( feature_table = as.data.frame(rbind( gene_mat[, valid_ids, drop = FALSE], mirna_mat[, valid_ids, drop = FALSE] )), sample_metadata = as.data.frame(cd_valid), feature_metadata = feature_metadata_surv ) fit_surv_mae <- IntegratedLearner( MAE_train = mae_train, MAE_valid = mae_valid, experiment = c("gene", "mirna"), assay.type = c("abundance", "abundance"), outcome_col = "outcome_surv", subject_id_col = "patient_id", folds = 2, base_learner = "surv.coxph", filter_method = "variance", filter_pct = 40, run_screening = TRUE, screen_pct = 25, weight_method = "COX", # alternative: "IBS" verbose = TRUE ) ``` ### Interpret Survival Outputs ```{r chunk-19} # Single-layer training metrics # fit_surv_mae$train_out$single$metrics # Late fusion (weighted integration) fit_surv_mae$train_out$late$weights fit_surv_mae$train_out$late$train_cindex fit_surv_mae$train_out$late$train_auc # Early fusion summary # fit_surv_mae$train_out$early # Validation metrics fit_surv_mae$valid_out$late$valid_cindex fit_surv_mae$valid_out$late$valid_auc fit_surv_mae$valid_out$single$valid_cindex ``` The `train_auc` and `valid_auc` objects are data frames with `time` and `AUC` columns, so they can be plotted as true time-dependent discrimination curves. ```{r chunk-19b, warning=FALSE, fig.height=9, fig.width=8} plot.obj.surv <- IntegratedLearner:::plot.learner(fit_surv_mae) plot.obj.surv$plot ``` ## Session Information ```{r chunk-20} sessionInfo() ``` ## References Franzosa EA et al. (2019). [Gut microbiome structure and metabolic activity in inflammatory bowel disease]( https://pubmed.ncbi.nlm.nih.gov/30531976/ ). *Nature Microbiology* 4(2):293-305. Ghaemi MS et al. (2019). [Multiomics modeling of the immunome, transcriptome, microbiome, proteome and metabolome adaptations during human pregnancy]( https://pubmed.ncbi.nlm.nih.gov/30561547/ ). *Bioinformatics* 35(1):95-103. ## Citation Mallick et al. (2024). [An integrated Bayesian framework for multi-omics prediction and classification]( https://pubmed.ncbi.nlm.nih.gov/38146838/ ). *Statistics in Medicine* 43(5):983-1002.