How to run MOTL: basic example

Introduction

This example shows how to use MOTL with its basic commands. The data used here are subsets of the original data.

The complete learning dataset can be found via this link: zenodo link.

Toy datasets in MOTL

⚠️ WARNING

Do not use the datasets shipped inside the MOTL package for real analyses: they are subsets of the original data, provided for this example.

Datasets used in this example are stored in two objects:

  • Lrn: learning dataset (used for transfer learning)
  • Trg: target dataset (data to analyse)

For more details, see ?MOTL::Lrn and ?MOTL::Trg.

Main steps

Main steps to perform transfer learning analysis using MOTL:

  1. Initialize the learning dataset Lrn
  2. Prepare the target dataset Trg
  3. Define parameters for transfer learning TL_param
  4. Run transferLearning_function()

Load libraries

Libraries used for input data preparation and transfer learning:

  • MOTL (Hirst et al. (2025)) for multiomics data integration using transfer learning
  • MOFA2 (Argelaguet et al. (2018)) for multiomics data integration
library("MOTL")
library("MOFA2")

Learning dataset Lrn

Load the learning dataset and the corresponding factorization model from the MOTL package.

data("Lrn", package = "MOTL")

For the learning dataset, you need two pieces of data:

  • the metadata that contains information about the learning dataset construction and its composition
  • the learning dataset factorization model that was created using MOFA2

Learning dataset metadata

The learning dataset metadata are stored in Lrn$Lrn_meta.

expdat_meta_Lrn <- Lrn$Lrn_meta

The expdat_meta_Lrn object contains information about the learning dataset construction.

names(expdat_meta_Lrn)
#>  [1] "if_vst"            "smpls"             "ftrs_mRNA"        
#>  [4] "ftrs_miRNA"        "ftrs_DNAme"        "ftrs_SNV"         
#>  [7] "PCVarPrcnt_mRNA"   "PCVarPrcnt_miRNA"  "PCVarPrcnt_DNAme" 
#> [10] "PCVarPrcnt_SNV"    "ElbowK_Total"      "ElbowK_mRNA"      
#> [13] "ElbowK_miRNA"      "ElbowK_DNAme"      "ElbowK_SNV"       
#> [16] "GeoMeans_mRNA"     "GeoMeans_miRNA"    "Seed"             
#> [19] "script_start_time" "script_end_time"

For example, you can retrieve the mRNA feature names.

expdat_meta_Lrn$ftrs_mRNA[c(1:5)]
#> [1] "ENSG00000232216.1"  "ENSG00000170561.13" "ENSG00000155011.9" 
#> [4] "ENSG00000128714.6"  "ENSG00000009950.16"

You can also retrieve the SNV feature names.

expdat_meta_Lrn$ftrs_SNV[c(1:5)]
#> [1] "NBEA"   "TRPS1"  "WDR72"  "HEPHL1" "CELSR3"

Or, you can retrieve the sample names.

expdat_meta_Lrn$smpls[c(1:3)]
#> [1] "TCGA-OR-A5LR-01A" "TCGA-AR-A1AJ-01A" "TCGA-75-7027-01A"

Learning dataset factorization model

📝 NOTE

To load the model file, you can use the load_model() function from MOFA2.

To load an .rds file, you can use the readRDS() function.

expdat_meta_Lrn <- readRDS(file.path(LrnDir, "expdat_meta.rds"))
InputModel <- file.path(LrnFctrnDir, "Model.hdf5")
Fctrzn <- load_model(file = InputModel)

The learning dataset factorization model is stored in Lrn$Fctrzn.

Fctrzn <- Lrn$Fctrzn

The Fctrzn object was created using the MOFA2 package and is composed of:

  • 4 views, used for the factorization: mRNA, miRNA, DNAme and SNV,
  • mRNA, DNAme and SNV datasets with 1000 features each and miRNA with 250 features,
  • only one group, defined for the factorization,
  • datasets with 250 samples (shared between each view),
  • 20 factors, found during the factorization of the learning dataset.
Fctrzn
#> Trained MOFA with the following characteristics: 
#>  Number of views: 4 
#>  Views names: mRNA miRNA DNAme SNV 
#>  Number of features (per view): 1000 250 1000 1000 
#>  Number of groups: 1 
#>  Groups names: group0 
#>  Number of samples (per group): 250 
#>  Number of factors: 20

See ?MOFA2::MOFA for more details about the MOFA object.

Learning dataset initialization

You need to retrieve some information from the learning dataset factorization model:

  • viewsLrn: views of the learning dataset
  • likelihoodsLrn: defined likelihoods of each view
  • MLrn: dimension (number of views) of the learning dataset
viewsLrn <- get_default_data_options(Fctrzn)$views
likelihoodsLrn <- get_default_model_options(Fctrzn)$likelihoods
MLrn <- get_dimensions(Fctrzn)$M
viewsLrn
#> [1] "mRNA"  "miRNA" "DNAme" "SNV"
likelihoodsLrn
#>        mRNA       miRNA       DNAme         SNV 
#>  "gaussian"  "gaussian"  "gaussian" "bernoulli"
MLrn
#> [1] 4

Then, you need to specify the CenterTrg parameter. If it is set to TRUE, the target dataset is centered during processing. If it is set to FALSE, the target dataset is left uncentered and the intercepts estimated from the learning dataset are used instead (for normalization).

Here, we will use the estimated learning dataset intercepts.

CenterTrg <- FALSE
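
To make the two options more concrete, the sketch below contrasts them on a toy matrix with features in rows. It is only an illustration in base R: Y_toy and intercepts_lrn are made-up values, and subtracting the learning intercepts is an assumption about what "using the intercepts" amounts to, not a reproduction of MOTL's internal processing.

```r
# Toy target view: 3 features (rows) x 4 samples (columns); values are made up
Y_toy <- matrix(c(1, 2, 3, 4,
                  10, 12, 14, 16,
                  5, 5, 5, 5),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("f", 1:3), paste0("s", 1:4)))

# Hypothetical per-feature intercepts estimated on a learning dataset
intercepts_lrn <- c(f1 = 2, f2 = 11, f3 = 6)

# Analogous to CenterTrg = TRUE: center each feature on its own mean
Y_centered <- sweep(Y_toy, 1, rowMeans(Y_toy), FUN = "-")

# Analogous to CenterTrg = FALSE (assumption): leave the data uncentered and
# offset each feature by the intercept learned on the learning dataset
Y_offset <- sweep(Y_toy, 1, intercepts_lrn[rownames(Y_toy)], FUN = "-")

rowMeans(Y_centered)  # zero for every feature, by construction
rowMeans(Y_offset)    # remaining shift relative to the learning intercepts
```

With few target samples, per-feature means computed on the target alone can be noisy, which is one reason to rely on intercepts from a larger learning dataset.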

Then, the expectation values of the factorization need to be initialized.

Fctrzn@expectations[["Tau"]] <- Tau_init(viewsLrn, Fctrzn, InputModel)
Fctrzn@expectations[["TauLn"]] <- sapply(viewsLrn, TauLn_calculation, likelihoodsLrn, Fctrzn, LrnFctrnDir)
Fctrzn@expectations[["WSq"]] <- sapply(viewsLrn, WSq_calculation, Fctrzn, LrnFctrnDir)
Fctrzn@expectations[["W0"]] <- sapply(viewsLrn, W0_calculation, CenterTrg, Fctrzn, LrnFctrnDir)

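In this example, the initialized model is already stored in the Lrn$Fctrzn_init object, so the following line replaces the four lines above (which require InputModel and LrnFctrnDir, the paths to your own learning dataset files).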

Fctrzn <- Lrn$Fctrzn_init

Target dataset Trg

📝 NOTE

The target dataset is a list of matrices. You can create it like this:

YTrg_list <- list(mRNA = expdat_mRNA,
                  miRNA = expdat_miRNA,
                  DNAme = expdat_DNAme,
                  SNV = expdat_SNV)

List of matrices: the target dataset is a list of named matrices. Each matrix corresponds to a view (i.e. one omics data type).

Features in rows: features should be in rows. They differ between views, but feature names should be consistent with the learning dataset. The order of the features does not matter.

Samples in columns: samples should be in columns. The samples need to be the same in every view. They will be ordered automatically.
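
The structure described above can be checked before going further. The following is a minimal sketch in base R, using made-up toy matrices; check_target_list() is a hypothetical helper written for this illustration and is not part of MOTL.

```r
# Toy views: features in rows, samples in columns (all values are made up)
set.seed(1)
smpl_ids <- c("S1", "S2", "S3")
expdat_mRNA_toy <- matrix(rpois(12, lambda = 20), nrow = 4,
                          dimnames = list(paste0("gene", 1:4), smpl_ids))
# Same samples, deliberately stored in a different column order
expdat_SNV_toy <- matrix(rbinom(6, size = 1, prob = 0.3), nrow = 2,
                         dimnames = list(c("NBEA", "TRPS1"), rev(smpl_ids)))

YTrg_toy <- list(mRNA = expdat_mRNA_toy, SNV = expdat_SNV_toy)

# Hypothetical helper: TRUE when the list is named, every element is a matrix
# and every view has the same set of sample (column) names
check_target_list <- function(Y_list) {
  ref <- sort(colnames(Y_list[[1]]))
  !is.null(names(Y_list)) &&
    all(vapply(Y_list, is.matrix, logical(1))) &&
    all(vapply(Y_list, function(m) identical(sort(colnames(m)), ref),
               logical(1)))
}

check_target_list(YTrg_toy)
#> [1] TRUE
```

The check compares sorted column names, so a different sample order between views is accepted while a missing or extra sample is not.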

For instance, in this analysis the learning dataset was created using the TCGA cancer data. So:

  • The mRNA matrix view contains raw counts with genes in rows and samples in columns. Gene IDs are Ensembl IDs without version (e.g. ENSG00000000005). You should add the gene versions that are used in the learning dataset; the mRNA_addVersion() function is provided for that.
  • The miRNA matrix view contains raw counts with miRNAs in rows and samples in columns. miRNA IDs are miRBase IDs (e.g. hsa-mir-1-1).
  • The DNAme matrix view contains DNA methylation M-values with CpG probes in rows and samples in columns. CpG probe IDs come from the Illumina 450k or EPIC arrays (e.g. cg09364122).
  • The SNV matrix view contains the absence or presence of SNV mutations (binary matrix) with genes in rows and samples in columns. Gene IDs are HGNC symbols (e.g. AKAP13).
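
To illustrate what adding gene versions involves, here is a minimal base R sketch that maps unversioned Ensembl IDs to the versioned IDs of a learning dataset. It is not the implementation of mRNA_addVersion(), whose arguments are not shown in this document; add_version_sketch() and ids_trg are hypothetical names used only for this illustration.

```r
# Versioned IDs as they appear among the learning dataset features
ftrs_lrn <- c("ENSG00000232216.1", "ENSG00000170561.13", "ENSG00000155011.9")

# Unversioned IDs from a hypothetical target dataset; the last one is not
# among the learning features listed above
ids_trg <- c("ENSG00000155011", "ENSG00000232216", "ENSG00000000005")

# Hypothetical helper: look up each unversioned ID among the learning
# features and return the versioned ID, or NA when the gene is absent
add_version_sketch <- function(ids, ftrs_versioned) {
  unversioned <- sub("\\.[0-9]+$", "", ftrs_versioned)
  ftrs_versioned[match(ids, unversioned)]
}

add_version_sketch(ids_trg, ftrs_lrn)
#> [1] "ENSG00000155011.9" "ENSG00000232216.1" NA
```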

In this example, you only need to load the target dataset and the corresponding metadata from the MOTL package.

data("Trg", package = "MOTL")

The target dataset is stored in the Trg$YTrg_prep object.

YTrg_list <- Trg$YTrg_prep

Transfer learning inputs

Extract sample names from the target dataset. Then, extract view names shared between the target and the learning datasets and the corresponding likelihoods.

smpls <- colnames(YTrg_list[[1]])
viewsTrg <- names(YTrg_list)
views <- viewsLrn[is.element(viewsLrn, viewsTrg)]
likelihoods <- likelihoodsLrn[views]
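
The subsetting above matters when the target dataset does not contain every view of the learning dataset. A small self-contained sketch, with a hypothetical target that only has two views (the _toy objects are made up for this illustration):

```r
# Views and likelihoods of the learning dataset, as printed earlier
viewsLrn_toy <- c("mRNA", "miRNA", "DNAme", "SNV")
likelihoodsLrn_toy <- c(mRNA = "gaussian", miRNA = "gaussian",
                        DNAme = "gaussian", SNV = "bernoulli")

# Hypothetical target dataset that only provides two of the four views
viewsTrg_toy <- c("SNV", "mRNA")

# Keep the learning views that are also in the target, in learning order
views_toy <- viewsLrn_toy[is.element(viewsLrn_toy, viewsTrg_toy)]
# Index the named likelihood vector by view name
likelihoods_toy <- likelihoodsLrn_toy[views_toy]

views_toy
#> [1] "mRNA" "SNV"
```

Because the learning views are the ones being filtered, the result follows the order of the learning dataset regardless of the order of the target list.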

To prepare the target dataset, use TargetDataPreparation() to:

  • normalize and/or transform the data
  • harmonize the data with Lrn (same features, in the same order)
  • put the samples in the same order across views

Here, we neither transform (transformation = FALSE) nor normalize (normalization = FALSE) the dataset, because the data are already prepared.

YTrg_prep <- TargetDataPreparation(views = views,
                                   YTrg_list = YTrg_list,
                                   Fctrzn = Fctrzn,
                                   smpls = smpls,
                                   normalization = FALSE,
                                   expdat_meta_Lrn = expdat_meta_Lrn,
                                   transformation = FALSE)
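
The harmonization step (same features, same order as the learning dataset) can be pictured with the base R sketch below. It is an illustration only, not the code of TargetDataPreparation(); harmonize_sketch() is a hypothetical helper, and filling features that are absent from the target with NA is an assumption of this sketch.

```r
# Feature order defined by a hypothetical learning dataset
ftrs_lrn <- c("NBEA", "TRPS1", "WDR72", "HEPHL1")

# Toy target view: different feature order, one extra gene (AKAP13) and one
# learning gene missing (WDR72)
Y_trg <- matrix(c(0, 1, 1,
                  1, 0, 0,
                  0, 0, 1,
                  1, 1, 0),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("HEPHL1", "AKAP13", "NBEA", "TRPS1"),
                                c("S1", "S2", "S3")))

# Hypothetical helper: return a matrix with exactly the learning features, in
# learning order; features absent from the target become rows of NA
harmonize_sketch <- function(Y, ftrs_ref) {
  out <- Y[match(ftrs_ref, rownames(Y)), , drop = FALSE]
  rownames(out) <- ftrs_ref
  out
}

Y_harm <- harmonize_sketch(Y_trg, ftrs_lrn)
rownames(Y_harm)
#> [1] "NBEA"   "TRPS1"  "WDR72"  "HEPHL1"
```

The extra gene AKAP13 is dropped because it has no counterpart in the learning features, and the WDR72 row contains only NA.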

Prepare the inputs for transfer learning:

  • YTrg: list of the target dataset matrices (prepared with TargetDataPreparation())
  • views: vector of target dataset view names
  • Fctrzn: the learning dataset factorization model
  • likelihoods: list of view likelihoods
TL_param <- initTransferLearningParamaters(YTrg = YTrg_prep, 
                                           views = views, 
                                           Fctrzn = Fctrzn, 
                                           likelihoods = likelihoods)
names(TL_param)
#> [1] "YTrg"           "Fctrzn_Lrn_W0"  "Fctrzn_Lrn_W"   "Fctrzn_Lrn_WSq"
#> [5] "Tau"            "TauLn"

Transfer learning using MOTL

Set the MOTL parameters:

  • minFactors: floor on the number of factors when dropping factors (number of samples in evaluations)
  • StartDropFactor: iteration after which factors start to be dropped
  • FreqDropFactor: how often to drop factors
  • StartELBO: first iteration on which the ELBO is checked, excluding the initialization iteration
  • FreqELBO: how often to assess the ELBO
  • DropFactorTH: threshold for dropping factors; the factor with the lowest maximum variance is dropped if that value is below this threshold
  • MaxIterations: maximum number of iterations
  • MinIterations: minimum number of iterations, at least 2 and excluding the initial setup (2 is the default in MOFA)
  • ConvergenceIts: number of consecutive iterations for which the change in ELBO must be below ConvergenceTH (2 is the default in MOFA)
  • ConvergenceTH: threshold on the change in ELBO used to check convergence (0.0005 is the default in MOFA and corresponds to the fast option)
  • PoisRateCstnt: constant added to the Poisson rate function to avoid errors (default 1e-04)
ss_start_time <- Sys.time()
minFactors <- 13 
StartDropFactor <- 1
FreqDropFactor <- 1 
StartELBO <- 1 
FreqELBO <- 5 
DropFactorTH <- 0.01 
MaxIterations <- 1000
MinIterations <- 2 
ConvergenceIts <- 2 
ConvergenceTH <- 0.0005 
PoisRateCstnt <- 0.0001 
TL_data <- transferLearning_function(TL_param = TL_param, 
                                     views = views,
                                     likelihoods = likelihoods,
                                     Fctrzn = Fctrzn,
                                     CenterTrg = CenterTrg,
                                     MaxIterations = MaxIterations, 
                                     MinIterations = MinIterations,
                                     minFactors = minFactors, 
                                     StartDropFactor = StartDropFactor, 
                                     FreqDropFactor = FreqDropFactor, 
                                     StartELBO = StartELBO, 
                                     FreqELBO = FreqELBO, 
                                     DropFactorTH = DropFactorTH,
                                     ConvergenceIts = ConvergenceIts, 
                                     ConvergenceTH = ConvergenceTH,
                                     ss_start_time = ss_start_time)
#> [1] TRUE
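
To show how ConvergenceTH and ConvergenceIts interact, here is a minimal sketch of a stopping rule in base R. It is not MOTL's internal code: has_converged() is a hypothetical helper, the ELBO values are made up, and measuring the change as an absolute relative change between consecutive checks is an assumption of this sketch.

```r
# Hypothetical ELBO values recorded at successive checks (made-up numbers)
elbo_trace <- c(-1000, -900, -899.9, -899.8999, -899.8998)

# Hypothetical helper: TRUE once the last `n_its` changes in ELBO are all
# below `th`. The change is taken as the absolute relative change between
# consecutive checks, which is an assumption of this sketch.
has_converged <- function(elbo, th = 0.0005, n_its = 2) {
  if (length(elbo) < n_its + 1) return(FALSE)
  rel_change <- abs(diff(elbo) / elbo[-length(elbo)])
  all(tail(rel_change, n_its) < th)
}

has_converged(elbo_trace[1:3])  # only one small change so far
#> [1] FALSE
has_converged(elbo_trace)       # the last two changes are both small
#> [1] TRUE
```

Requiring several consecutive small changes avoids stopping on a single check where the ELBO happens to stall.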

Results are saved as .rds files in the outputDir directory.

You can then access the results:

  • ZMu corresponds to the inferred Z matrix that contains samples in rows and factors in columns
  • Fctrzn_Lrn_W$mRNA corresponds to the weight matrix of mRNA, features are in rows and factors in columns.
ZMu <- TL_data$ZMu
W_mRNA <- TL_data$Fctrzn_Lrn_W$mRNA
dim(W_mRNA)
#> [1] 68 19
W_mRNA[c(1:5), c(1:3)]
#>                        Factor1     Factor2    Factor3
#> ENSG00000179477.11  0.21934002  0.14642247 0.04294529
#> ENSG00000129451.12  1.11481450  0.11383030 0.27891928
#> ENSG00000119938.9  -0.28775595 -0.02471034 0.03258801
#> ENSG00000126752.8  -0.03388396 -0.41945300 0.02424950
#> ENSG00000130700.7   0.02040603 -0.01100176 0.50016305
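
A common way to read a weight matrix is to rank features by absolute weight on a given factor. The sketch below does this on a made-up toy matrix; top_features() is a hypothetical helper written for this illustration, not a MOTL function.

```r
# Toy weight matrix: 5 features (rows) x 2 factors (columns), made-up values
W_toy <- matrix(c( 0.22,  0.15,
                   1.11,  0.11,
                  -0.29, -0.02,
                  -0.03, -0.42,
                   0.02, -0.01),
                ncol = 2, byrow = TRUE,
                dimnames = list(paste0("gene", 1:5), c("Factor1", "Factor2")))

# Hypothetical helper: names of the `n` features with the largest absolute
# weight on one factor
top_features <- function(W, factor, n = 2) {
  w <- W[, factor]
  names(sort(abs(w), decreasing = TRUE))[seq_len(min(n, length(w)))]
}

top_features(W_toy, "Factor1")
#> [1] "gene2" "gene3"
top_features(W_toy, "Factor2")
#> [1] "gene4" "gene1"
```

The same helper could be applied to W_mRNA, since it also has features in rows and named factors in columns.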

The results shown in this example may differ from yours because MOTL relies on random number generation, so two runs can produce different results. To make an analysis reproducible, fix the random seed with set.seed(NumberYouChose) before running MOTL.
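
The effect of set.seed() can be seen with a short base R example that is independent of MOTL. The seed value used here is arbitrary.

```r
# Fix the seed, then draw three random numbers
set.seed(2025)        # the seed value is arbitrary
draw_1 <- rnorm(3)

# Reset the same seed before the second run
set.seed(2025)
draw_2 <- rnorm(3)

# Both runs produce exactly the same numbers
identical(draw_1, draw_2)
#> [1] TRUE
```

In the same way, calling set.seed() with the same value immediately before each MOTL run makes the random draws of the runs identical.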

Session info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] MOFA2_1.23.0     MOTL_0.99.1      BiocStyle_2.41.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1            dplyr_1.2.1                
#>  [3] farver_2.1.2                filelock_1.0.3             
#>  [5] S7_0.2.2                    fastmap_1.2.0              
#>  [7] digest_0.6.39               lifecycle_1.0.5            
#>  [9] magrittr_2.0.5              compiler_4.6.0             
#> [11] rlang_1.2.0                 sass_0.4.10                
#> [13] tools_4.6.0                 yaml_2.3.12                
#> [15] corrplot_0.95               knitr_1.51                 
#> [17] S4Arrays_1.13.0             reticulate_1.46.0          
#> [19] DelayedArray_0.39.3         plyr_1.8.9                 
#> [21] RColorBrewer_1.1-3          abind_1.4-8                
#> [23] BiocParallel_1.47.0         HDF5Array_1.41.0           
#> [25] Rtsne_0.17                  purrr_1.2.2                
#> [27] BiocGenerics_0.59.7         sys_3.4.3                  
#> [29] grid_4.6.0                  stats4_4.6.0               
#> [31] Rhdf5lib_2.1.0              ggplot2_4.0.3              
#> [33] scales_1.4.0                SummarizedExperiment_1.43.0
#> [35] cli_3.6.6                   rmarkdown_2.31             
#> [37] generics_0.1.4              otel_0.2.0                 
#> [39] reshape2_1.4.5              cachem_1.1.0               
#> [41] rhdf5_2.57.1                stringr_1.6.0              
#> [43] parallel_4.6.0              BiocManager_1.30.27        
#> [45] XVector_0.53.0              matrixStats_1.5.0          
#> [47] basilisk_1.25.0             vctrs_0.7.3                
#> [49] Matrix_1.7-5                jsonlite_2.0.0             
#> [51] dir.expiry_1.21.0           IRanges_2.47.2             
#> [53] S4Vectors_0.51.3            ggrepel_0.9.8              
#> [55] maketools_1.3.2             h5mread_1.5.0              
#> [57] locfit_1.5-9.12             jquerylib_0.1.4            
#> [59] tidyr_1.3.2                 glue_1.8.1                 
#> [61] codetools_0.2-20            uwot_0.2.4                 
#> [63] cowplot_1.2.0               stringi_1.8.7              
#> [65] gtable_0.3.6                GenomicRanges_1.65.0       
#> [67] tibble_3.3.1                pillar_1.11.1              
#> [69] htmltools_0.5.9             Seqinfo_1.3.0              
#> [71] rhdf5filters_1.25.0         R6_2.6.1                   
#> [73] evaluate_1.0.5              lattice_0.22-9             
#> [75] Biobase_2.73.1              png_0.1-9                  
#> [77] pheatmap_1.0.13             bslib_0.11.0               
#> [79] Rcpp_1.1.1-1.1              SparseArray_1.13.2         
#> [81] DESeq2_1.53.0               xfun_0.59                  
#> [83] MatrixGenerics_1.25.0       forcats_1.0.1              
#> [85] buildtools_1.0.0            pkgconfig_2.0.3

References

Argelaguet, R., B. Velten, D. Arnol, et al. 2018. “Multi-Omics Factor Analysis-a Framework for Unsupervised Integration of Multi-Omics Data Sets.” Mol Syst Biol 14 (6): e8124.
Hirst, David P, Morgane Térézol, Laura Cantini, Paul Villoutreix, Matthieu Vignes, and Anaïs Baudot. 2025. “MOTL: Enhancing Multi-Omics Matrix Factorization with Transfer Learning.” Genome Biology 26 (1): 224.