---
title: "How to run MOTL: basic example"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{How to run MOTL: basic example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE,
  echo = TRUE
)
```

# Introduction

This is an example of **how to use MOTL** with its basic commands.

Data used in this example are subsets of the original data. The complete learning dataset is available on [Zenodo](https://zenodo.org/records/10848217).

## Toy datasets in MOTL

> **⚠️ WARNING**
>
> Do not use the datasets shipped with the MOTL package for real analyses. They are reduced subsets intended only for this example.

Datasets used in this example are stored in two objects:

- `Lrn`: learning dataset (used for transfer learning)
- `Trg`: target dataset (data to analyse)

For more details, see `?MOTL::Lrn` and `?MOTL::Trg`.

## Main steps

Main steps to perform a transfer learning analysis using MOTL:

1. Initialize the learning dataset ``Lrn``
2. Prepare the target dataset ``Trg``
3. Define parameters for transfer learning ``TL_param``
4. Run `transferLearning_function()`

## Load libraries

Libraries used for input data preparation and transfer learning:

- `MOTL` (@motl) for multiomics data integration using transfer learning
- `MOFA2` (@MOFA2) for multiomics data integration

```{r libraries, message = FALSE, warning = FALSE}
library("MOTL")
library("MOFA2")
```

# Learning dataset ``Lrn``

Load the learning dataset and the corresponding factorization model from the `MOTL` package.

```{r}
data("Lrn", package = "MOTL")
```

For the learning dataset, you need two pieces of data:

- the **metadata**, which contains information about the construction and composition of the learning dataset
- the **learning dataset factorization model**, which was created using ``MOFA2``

## Learning dataset metadata

The learning dataset metadata are stored in ``Lrn$Lrn_meta``.
```{r Lrn_loadData}
expdat_meta_Lrn <- Lrn$Lrn_meta
```

The `expdat_meta_Lrn` object contains information about the learning dataset construction.

```{r Lrn_metadata-2}
names(expdat_meta_Lrn)
```

For example, you can retrieve the mRNA feature names.

```{r Lrn_metadata-3}
expdat_meta_Lrn$ftrs_mRNA[c(1:5)]
```

You can also retrieve the SNV feature names.

```{r Lrn_metadata-4}
expdat_meta_Lrn$ftrs_SNV[c(1:5)]
```

Or you can retrieve the sample names.

```{r Lrn_metadata-5}
expdat_meta_Lrn$smpls[c(1:3)]
```

## Learning dataset factorization model

> **📝 NOTE**
>
> To load a model file, you can use the `load_model()` function from `MOFA2`.
>
> To load an .rds file, you can use the `readRDS()` function.
>
> `expdat_meta_Lrn <- readRDS(file.path(LrnDir, "expdat_meta.rds"))`
>
> `InputModel <- file.path(LrnFctrnDir, "Model.hdf5")`
>
> `Fctrzn <- load_model(file = InputModel)`

The learning dataset factorization model is stored in `Lrn$Fctrzn`.

```{r Lrn_model-1}
Fctrzn <- Lrn$Fctrzn
```

The `Fctrzn` object was created with the `MOFA2` package and is composed of:

- 4 views used for the factorization: `mRNA`, `miRNA`, `DNAme` and `SNV`,
- 1000 features each for the `mRNA`, `DNAme` and `SNV` datasets, and 250 features for `miRNA`,
- a single group defined for the factorization,
- 250 samples, shared across all views,
- 20 factors found during the factorization of the learning dataset.

```{r Lrn_fctrzn-1, message = FALSE, warning = FALSE}
Fctrzn
```

See `?MOFA2::MOFA` for more details about the `MOFA` object.
## Learning dataset initialization

You need to retrieve some information from the learning dataset factorization model:

- `viewsLrn`: views of the learning dataset
- `likelihoodsLrn`: likelihood defined for each view
- `MLrn`: number of views in the learning dataset

```{r init}
viewsLrn <- get_default_data_options(Fctrzn)$views
likelihoodsLrn <- get_default_model_options(Fctrzn)$likelihoods
MLrn <- get_dimensions(Fctrzn)$M
```

```{r init-1}
viewsLrn
likelihoodsLrn
MLrn
```

Then, you need to specify the `CenterTrg` parameter. If it is set to `TRUE`, the target dataset is centered during processing. If it is set to `FALSE`, the target dataset is left uncentered and the estimated learning dataset intercepts are used instead (for normalization).

Here, we will use the estimated learning dataset intercepts.

```{r}
CenterTrg <- FALSE
```

Then, the factorization expectation values need to be initialized.

```{r init_valuesLrn-1, eval = FALSE}
Fctrzn@expectations[["Tau"]] <- Tau_init(viewsLrn, Fctrzn, InputModel)
Fctrzn@expectations[["TauLn"]] <- sapply(viewsLrn, TauLn_calculation,
                                         likelihoodsLrn, Fctrzn, LrnFctrnDir)
Fctrzn@expectations[["WSq"]] <- sapply(viewsLrn, WSq_calculation,
                                       Fctrzn, LrnFctrnDir)
Fctrzn@expectations[["W0"]] <- sapply(viewsLrn, W0_calculation,
                                      CenterTrg, Fctrzn, LrnFctrnDir)
```

The initialized data are stored in the `Lrn$Fctrzn_init` object. The following line replaces the previous 4 lines.

```{r init_valuesLrn}
Fctrzn <- Lrn$Fctrzn_init
```

# Target dataset ``Trg``

> **📝 NOTE**
>
> The target dataset is a list of matrices. You can create it like this:
>
> `YTrg_list <- list(mRNA = expdat_mRNA, miRNA = expdat_miRNA, DNAme = expdat_DNAme, SNV = expdat_SNV)`

**List of matrices**

The target dataset is a list of named matrices. Each matrix corresponds to a view (i.e. one omics data type).

**Features in rows**

Features should be in rows. They will differ between views, but feature names should be consistent with the learning dataset.
The order of the features does not matter.

**Samples in columns**

Samples should be in columns. Sample names (columns) must be the same across views. They will be ordered automatically.

For instance, in this analysis the learning dataset was created using the [TCGA](https://portal.gdc.cancer.gov/) cancer data. So:

- The **mRNA** matrix view contains raw counts, with genes in rows and samples in columns. Gene IDs are **Ensembl IDs** without version (e.g. `ENSG00000000005`). You should add the gene versions that are used in the learning dataset; the `mRNA_addVersion()` function does this.
- The **miRNA** matrix view contains raw counts, with miRNAs in rows and samples in columns. miRNA IDs are **miRBase IDs** (e.g. `hsa-mir-1-1`).
- The **DNAme** matrix view contains DNA methylation M-values, with CpG probes in rows and samples in columns. CpG probe IDs come from the **Illumina 450k or EPIC** arrays (e.g. `cg09364122`).
- The **SNV** matrix view contains the absence or presence of SNV mutations (binary matrix), with genes in rows and samples in columns. Gene IDs are **HGNC symbols** (e.g. `AKAP13`).

In this example, you only need to load the target dataset and the corresponding metadata from the `MOTL` package.

```{r}
data("Trg", package = "MOTL")
```

The target dataset is stored in the `Trg$YTrg_prep` object.

```{r Trg_loadData-1}
YTrg_list <- Trg$YTrg_prep
```

# Transfer learning inputs

Extract the sample names from the target dataset. Then, extract the view names shared between the target and the learning datasets, and the corresponding likelihoods.
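To make the layout requirements and this extraction step concrete, here is a minimal base R sketch on a made-up target list. It does not call MOTL. The object names (`toy_Trg`, `toy_viewsLrn`, `toy_likelihoodsLrn`), the sample names, the matrix values, and the likelihood labels are all hypothetical and chosen only for illustration.

```r
# Toy target dataset: a named list of matrices, features in rows, samples in columns.
# Values, sample names and the choice of features are made up for illustration.
toy_smpls <- c("S1", "S2", "S3")
toy_Trg <- list(
  mRNA = matrix(c(10, 0, 5, 3, 7, 1), nrow = 2,
                dimnames = list(c("ENSG00000000005", "ENSG00000000003"), toy_smpls)),
  SNV = matrix(c(0, 1, 1, 0, 0, 1), nrow = 2,
               dimnames = list(c("AKAP13", "TP53"), toy_smpls))
)

# Hypothetical stand-ins for the views and likelihoods of a learning dataset.
toy_viewsLrn <- c("mRNA", "miRNA", "DNAme", "SNV")
toy_likelihoodsLrn <- c(mRNA = "gaussian", miRNA = "gaussian",
                        DNAme = "gaussian", SNV = "bernoulli")

# Layout checks: every element is a matrix and all views share the same samples.
stopifnot(all(vapply(toy_Trg, is.matrix, logical(1))))
same_samples <- vapply(toy_Trg,
                       function(m) setequal(colnames(m), toy_smpls),
                       logical(1))
stopifnot(all(same_samples))

# Sample names come from the first view; only views shared with the
# learning dataset are kept, together with their likelihoods.
toy_smpls_out <- colnames(toy_Trg[[1]])
toy_viewsTrg <- names(toy_Trg)
toy_views <- toy_viewsLrn[is.element(toy_viewsLrn, toy_viewsTrg)]
toy_likelihoods <- toy_likelihoodsLrn[toy_views]

toy_views
toy_likelihoods
```

Because the toy list only has `mRNA` and `SNV` matrices, only those two views and their likelihoods are retained. In the real analysis, the same three extraction lines operate on `YTrg_list`, `viewsLrn`, and `likelihoodsLrn`.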
```{r Trg_preprocessing-1}
smpls <- colnames(YTrg_list[[1]])
viewsTrg <- names(YTrg_list)
views <- viewsLrn[is.element(viewsLrn, viewsTrg)]
likelihoods <- likelihoodsLrn[views]
```

To prepare the target dataset, use the `TargetDataPreparation()` function, which can:

- normalize and/or transform the data
- harmonize it with `Lrn` (same features, in the same order)
- order sample names consistently between views

Here, we neither transform (`transformation = FALSE`) nor normalize (`normalization = FALSE`) the dataset, because the data are already prepared.

```{r Trg_preprocessing-2, message = FALSE}
YTrg_prep <- TargetDataPreparation(views = views,
                                   YTrg_list = YTrg_list,
                                   Fctrzn = Fctrzn,
                                   smpls = smpls,
                                   normalization = FALSE,
                                   expdat_meta_Lrn = expdat_meta_Lrn,
                                   transformation = FALSE)
```

Prepare the inputs of the transfer learning:

- `YTrg`: list of the target dataset matrices (prepared with `TargetDataPreparation()`)
- `views`: vector of target dataset view names
- `Fctrzn`: the learning dataset factorization model
- `likelihoods`: list of view likelihoods

```{r init-TL-2, message = FALSE}
TL_param <- initTransferLearningParamaters(YTrg = YTrg_prep,
                                           views = views,
                                           Fctrzn = Fctrzn,
                                           likelihoods = likelihoods)
```

```{r init-TL-3}
names(TL_param)
```

# Transfer learning using `MOTL`

Set the parameters of `MOTL`:

- `minFactors`: floor when dropping factors; number of samples in evaluations
- `StartDropFactor`: iteration after which to start dropping factors
- `FreqDropFactor`: how often to drop factors
- `StartELBO`: iteration at which to start checking the ELBO, excluding the initialization iteration
- `FreqELBO`: how often to assess the ELBO
- `DropFactorTH`: the factor with the lowest maximum variance is dropped if that value is below this threshold
- `MaxIterations`: maximum number of iterations
- `MinIterations`: minimum number of iterations; at least 2, excluding the initial setup (`2` is the default in MOFA)
- `ConvergenceIts`: number of consecutive iterations for which the change in ELBO must stay below `ConvergenceTH` (`2` is the default in MOFA)
- `ConvergenceTH`: threshold on the change in ELBO for
  checking convergence (`0.0005` is the default in MOFA and corresponds to the *fast* option)
- `PoisRateCstnt`: amount to add to the Poisson rate function to avoid errors (`1e-04` by default)

```{r MOTL-input}
ss_start_time <- Sys.time()
minFactors <- 13
StartDropFactor <- 1
FreqDropFactor <- 1
StartELBO <- 1
FreqELBO <- 5
DropFactorTH <- 0.01
MaxIterations <- 1000
MinIterations <- 2
ConvergenceIts <- 2
ConvergenceTH <- 0.0005
PoisRateCstnt <- 0.0001
```

```{r MOTL-results, warning = FALSE, message = FALSE, results = "hide"}
TL_data <- transferLearning_function(TL_param = TL_param,
                                     views = views,
                                     likelihoods = likelihoods,
                                     Fctrzn = Fctrzn,
                                     CenterTrg = CenterTrg,
                                     MaxIterations = MaxIterations,
                                     MinIterations = MinIterations,
                                     minFactors = minFactors,
                                     StartDropFactor = StartDropFactor,
                                     FreqDropFactor = FreqDropFactor,
                                     StartELBO = StartELBO,
                                     FreqELBO = FreqELBO,
                                     DropFactorTH = DropFactorTH,
                                     ConvergenceIts = ConvergenceIts,
                                     ConvergenceTH = ConvergenceTH,
                                     ss_start_time = ss_start_time)
```

```{r, echo = FALSE, message = FALSE, warning = FALSE}
file.remove("TL_data.rds")
```

Results are saved as an .rds file in the `outputDir`. You can then access the results:

- `ZMu` is the inferred Z matrix, with samples in rows and factors in columns.
- `Fctrzn_Lrn_W$mRNA` is the mRNA weight matrix, with features in rows and factors in columns.

```{r}
ZMu <- TL_data$ZMu
W_mRNA <- TL_data$Fctrzn_Lrn_W$mRNA
```

```{r}
dim(W_mRNA)
W_mRNA[c(1:5), c(1:3)]
```

The results shown in this example may differ from yours because MOTL relies on random number generation, so two runs can produce different results. To obtain a reproducible analysis, set the seed with `set.seed(NumberYouChose)` before running MOTL.

# Session info
```{r info}
sessionInfo()
```
# References