---
title: "How to run MOTL: basic example"
output: 
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{How to run MOTL: basic example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE,
  echo = TRUE
)
```

# Introduction 

This example shows **how to use MOTL** with its basic commands. The data used here are subsets of the original data.

The complete learning dataset is available on [Zenodo](https://zenodo.org/records/10848217).

## Toy datasets in MOTL

> **⚠️ WARNING**
> 
> The datasets shipped with the MOTL package are small subsets intended for demonstration only. Do not use them for real analyses.

Datasets used in this example are stored in two objects:

- `Lrn`: learning dataset (used for transfer learning)
- `Trg`: target dataset (data to analyse)

For more details, see `?MOTL::Lrn` and `?MOTL::Trg`.

## Main steps

Main steps to perform transfer learning analysis using MOTL:

1. Initialize the learning dataset ``Lrn``
2. Prepare the target dataset ``Trg``
3. Define parameters for transfer learning ``TL_param``
4. Run `transferLearning_function()`

## Load libraries

Libraries used for input data preparation and transfer learning:

- `MOTL` (@motl) for multiomics data integration using transfer learning
- `MOFA2` (@MOFA2) for multiomics data integration

```{r libraries, message = FALSE, warning = FALSE}
library("MOTL")
library("MOFA2")
```

# Learning dataset ``Lrn``

Load the learning dataset and the corresponding factorization model from the `MOTL` package.
```{r}
data("Lrn", package = "MOTL")
```

For the learning dataset, you need two pieces of data:

- the **metadata** that contains information about the learning dataset construction and its composition
- the **learning dataset factorization model** that was created using ``MOFA2``

## Learning dataset metadata

The learning dataset metadata are stored in ``Lrn$Lrn_meta``. 
```{r Lrn_loadData}
expdat_meta_Lrn <- Lrn$Lrn_meta
```

The `expdat_meta_Lrn` object contains information about the learning dataset construction.
```{r Lrn_metadata-2}
names(expdat_meta_Lrn)
```

You can retrieve, for example, the mRNA feature names.
```{r Lrn_metadata-3}
expdat_meta_Lrn$ftrs_mRNA[c(1:5)]
```

You can also retrieve the SNV feature names.
```{r Lrn_metadata-4}
expdat_meta_Lrn$ftrs_SNV[c(1:5)]
```

Or, you can retrieve the sample names.
```{r Lrn_metadata-5}
expdat_meta_Lrn$smpls[c(1:3)]
```

## Learning dataset factorization model

> **📝 NOTE**
>
> To load the model file, you can use the `load_model()` function from `MOFA2`.
> 
> To load an .rds file, you can use the `readRDS()` function.
>
> `expdat_meta_Lrn <- readRDS(file.path(LrnDir, "expdat_meta.rds"))`
> `InputModel <- file.path(LrnFctrnDir, "Model.hdf5")`
> `Fctrzn <- load_model(file = InputModel)`

The learning dataset factorization model is stored in `Lrn$Fctrzn`.
```{r Lrn_model-1}
Fctrzn <- Lrn$Fctrzn
```

`Fctrzn` was created with the `MOFA2` package and consists of:

- 4 views, used for the factorization: `mRNA`, `miRNA`, `DNAme` and `SNV`,
- `mRNA`, `DNAme` and `SNV` datasets with 1000 features each and `miRNA` with 250 features,
- only one group, defined for the factorization,
- datasets with 250 samples (shared between each view),
- 20 factors, found during the factorization of the learning dataset.

```{r Lrn_fctrzn-1, message = FALSE, warning = FALSE}
Fctrzn
```

See `?MOFA2::MOFA` for more details about the `MOFA` object.

## Learning dataset initialization

You need to retrieve some information from the learning dataset factorization model:

- `viewsLrn`: views of the learning dataset
- `likelihoodsLrn`: defined likelihoods of each view
- `MLrn`: dimension (number of views) of the learning dataset

```{r init}
viewsLrn <- get_default_data_options(Fctrzn)$views
likelihoodsLrn <- get_default_model_options(Fctrzn)$likelihoods
MLrn <- get_dimensions(Fctrzn)$M
```

```{r init-1}
viewsLrn
likelihoodsLrn
MLrn
```

Then, you need to specify the `CenterTrg` parameter. If it is set to `TRUE`, the target dataset is centered during processing. If it is set to `FALSE`, the target dataset is left uncentered and the estimated learning dataset intercepts are used instead (for normalization).

Here, we will use the estimated learning dataset intercepts. 
```{r}
CenterTrg <- FALSE
```
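To make the two options more concrete, here is a minimal base R sketch on a toy matrix. It is only an illustration of the idea and not MOTL's internal code: `Y_toy` and `intercepts_Lrn` are made-up names and values, and treating the intercepts as per-feature offsets that are subtracted is an assumption.

```r
# Toy target matrix: 3 features (rows) x 4 samples (columns)
Y_toy <- matrix(c(1, 2, 3, 4,
                  10, 12, 14, 16,
                  5, 5, 5, 5),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("f", 1:3), paste0("s", 1:4)))

# Analogous to CenterTrg = TRUE: center each feature on its own mean
Y_centered <- sweep(Y_toy, 1, rowMeans(Y_toy), FUN = "-")

# Analogous to CenterTrg = FALSE: shift each feature by an intercept
# estimated elsewhere (hypothetical values, assumed to be subtracted)
intercepts_Lrn <- c(f1 = 2, f2 = 11, f3 = 4)
Y_shifted <- sweep(Y_toy, 1, intercepts_Lrn[rownames(Y_toy)], FUN = "-")

rowMeans(Y_centered)  # zero for every feature
rowMeans(Y_shifted)   # generally not zero
```

With self-centering, every feature has mean zero in the target data. With external intercepts, the feature means depend on how far the target data sit from the learning dataset.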

Then, the expectation values of the factorization need to be initialized.
```{r init_valuesLrn-1, eval = FALSE}
Fctrzn@expectations[["Tau"]] <- Tau_init(viewsLrn, Fctrzn, InputModel)
Fctrzn@expectations[["TauLn"]] <- sapply(viewsLrn, TauLn_calculation, likelihoodsLrn, Fctrzn, LrnFctrnDir)
Fctrzn@expectations[["WSq"]] <- sapply(viewsLrn, WSq_calculation, Fctrzn, LrnFctrnDir)
Fctrzn@expectations[["W0"]] <- sapply(viewsLrn, W0_calculation, CenterTrg, Fctrzn, LrnFctrnDir)
```

The initialized factorization is stored in the `Lrn$Fctrzn_init` object, so in this example the following line replaces the four lines above.
```{r init_valuesLrn}
Fctrzn <- Lrn$Fctrzn_init
```

# Target dataset ``Trg``

> **📝 NOTE**
>
> The target dataset is a list of matrices. You can create it like this:
>
> `YTrg_list <- list(mRNA = expdat_mRNA, miRNA = expdat_miRNA,`
> `DNAme = expdat_DNAme, SNV = expdat_SNV)`

**List of matrices**
The target dataset is a named list of matrices. Each matrix corresponds to one view (i.e. one omics data type).

**Features in rows**
Features should be in rows. They differ between views, but feature names must be consistent with those of the learning dataset. The order of the features does not matter.

**Samples in columns**
Samples should be in columns. The same samples must be present in every view. They will be ordered automatically.
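As a rough illustration of this layout, the sketch below builds a toy target list in base R and checks that all views share the same samples. All object names and values (`expdat_mRNA_toy`, `expdat_SNV_toy`, `YTrg_toy`, the gene and sample IDs) are made up for the example.

```r
set.seed(1)
smpl_ids <- paste0("sample", 1:4)

# Features in rows, samples in columns; feature sets differ between views
expdat_mRNA_toy <- matrix(rpois(12, lambda = 20), nrow = 3,
                          dimnames = list(paste0("gene", 1:3), smpl_ids))
expdat_SNV_toy <- matrix(rbinom(8, size = 1, prob = 0.3), nrow = 2,
                         dimnames = list(c("AKAP13", "TP53"), smpl_ids))

# Named list: one matrix per view
YTrg_toy <- list(mRNA = expdat_mRNA_toy, SNV = expdat_SNV_toy)

# Check that every view holds the same set of samples
same_samples <- all(vapply(YTrg_toy,
                           function(Y) setequal(colnames(Y), smpl_ids),
                           logical(1)))
same_samples
```

A check like this is cheap to run on your own data before starting the analysis.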

For instance, in this analysis the learning dataset was created using the [TCGA](https://portal.gdc.cancer.gov/) cancer data. So:

- **mRNA** matrix view will contain raw counts with genes in rows and samples in columns. Gene IDs are **Ensembl IDs** without version (e.g. `ENSG00000000005`). You should add the gene versions used in the learning dataset; the `mRNA_addVersion()` function is provided for that purpose.
- **miRNA** matrix view will contain raw counts with miRNAs in rows and samples in columns. miRNA IDs are **miRBase IDs** (e.g. `hsa-mir-1-1`).
- **DNAme** matrix view will contain DNA methylation M-values with CpG probes in rows and samples in columns. CpG probe IDs come from the **Illumina 450k or EPIC** arrays (e.g. `cg09364122`).
- **SNV** matrix view will contain SNV mutation absence or presence (binary matrix) with genes in rows and samples in columns. Gene IDs are **HGNC symbols** (e.g. `AKAP13`).
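To show what adding gene versions amounts to, here is a small base R sketch. It is not `mRNA_addVersion()`, whose arguments are not described here: `add_version_sketch()` is a hypothetical helper, and the version suffixes in `ftrs_Lrn_toy` are invented for the example.

```r
# Hypothetical versioned IDs, standing in for the learning dataset mRNA features
ftrs_Lrn_toy <- c("ENSG00000000003.15", "ENSG00000000005.6", "ENSG00000000419.13")

# Unversioned IDs as they might appear in a target mRNA matrix
ftrs_Trg_toy <- c("ENSG00000000005", "ENSG00000000419", "ENSG00000999999")

add_version_sketch <- function(ids, versioned_ids) {
  # Strip the ".<version>" suffix to build a lookup key
  keys <- sub("\\.[0-9]+$", "", versioned_ids)
  # Map each ID to its versioned form; IDs absent from the lookup become NA
  versioned_ids[match(ids, keys)]
}

add_version_sketch(ftrs_Trg_toy, ftrs_Lrn_toy)
```

Genes that are absent from the learning dataset come back as `NA` in this sketch, which makes them easy to spot before the harmonization step.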

In this example, you only need to load the target dataset and the corresponding metadata from the `MOTL` package.
```{r}
data("Trg", package = "MOTL")
```

The target dataset is stored in the `Trg$YTrg_prep` object.
```{r Trg_loadData-1}
YTrg_list <- Trg$YTrg_prep
```

# Transfer learning inputs

Extract sample names from the target dataset. Then, extract view names shared between the target and the learning datasets and the corresponding likelihoods.
```{r Trg_preprocessing-1}
smpls <- colnames(YTrg_list[[1]])
viewsTrg <- names(YTrg_list)
views <- viewsLrn[is.element(viewsLrn, viewsTrg)]
likelihoods <- likelihoodsLrn[views]
```
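The view selection above keeps only the views present in both datasets, in the order of the learning dataset. The toy example below illustrates this with made-up view names and likelihood values (`viewsLrn_toy`, `likelihoodsLrn_toy` and `viewsTrg_toy` are hypothetical and do not come from the package data).

```r
# Hypothetical learning views and their likelihoods
viewsLrn_toy <- c("mRNA", "miRNA", "DNAme", "SNV")
likelihoodsLrn_toy <- c(mRNA = "gaussian", miRNA = "gaussian",
                        DNAme = "gaussian", SNV = "bernoulli")

# A target dataset with only two views, listed in a different order
viewsTrg_toy <- c("SNV", "mRNA")

# Shared views, kept in the order of the learning dataset
views_toy <- viewsLrn_toy[is.element(viewsLrn_toy, viewsTrg_toy)]
likelihoods_toy <- likelihoodsLrn_toy[views_toy]

views_toy
likelihoods_toy
```

Here `views_toy` is `c("mRNA", "SNV")`: the target order is ignored and views missing from the target are dropped.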

To prepare the target dataset, use the `TargetDataPreparation()` function, which can:

- normalize and/or transform the data
- harmonize the features with `Lrn` (same features, in the same order)
- order sample names consistently between views

Here, we neither transform (`transformation = FALSE`) nor normalize (`normalization = FALSE`) the dataset, because the data are already prepared.
```{r Trg_preprocessing-2, message = FALSE}
YTrg_prep <- TargetDataPreparation(views = views,
                                   YTrg_list = YTrg_list,
                                   Fctrzn = Fctrzn,
                                   smpls = smpls,
                                   normalization = FALSE,
                                   expdat_meta_Lrn = expdat_meta_Lrn,
                                   transformation = FALSE)
```
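The harmonization step can be pictured with the following base R sketch. It is a simplified illustration and not the code of `TargetDataPreparation()`: the names `ftrs_Lrn_toy` and `Y_toy` are made up, and filling features that are missing from the target with `NA` is an assumption.

```r
# Hypothetical feature order of the learning dataset
ftrs_Lrn_toy <- c("g1", "g2", "g3", "g4")

# Toy target matrix: different order, one extra feature (g9), two missing (g2, g4)
Y_toy <- matrix(1:6, nrow = 3,
                dimnames = list(c("g3", "g1", "g9"), c("s1", "s2")))

# Reorder rows to follow the learning features; unmatched features give NA rows
idx <- match(ftrs_Lrn_toy, rownames(Y_toy))
Y_harmonized <- Y_toy[idx, , drop = FALSE]
rownames(Y_harmonized) <- ftrs_Lrn_toy

Y_harmonized
```

After this step the matrix has exactly the learning features, in the learning order, and features unknown to the learning dataset (`g9` here) are dropped.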

Prepare inputs of the transfer learning:

- `YTrg`: list of the target dataset matrices (prepared with `TargetDataPreparation()`)
- `views`: vector of target dataset view names
- `Fctrzn`: the learning dataset factorization model
- `likelihoods`: list of view likelihoods

```{r init-TL-2, message = FALSE}
TL_param <- initTransferLearningParamaters(YTrg = YTrg_prep, 
                                           views = views, 
                                           Fctrzn = Fctrzn, 
                                           likelihoods = likelihoods)
```

```{r init-TL-3}
names(TL_param)
```

# Transfer learning using `MOTL`

Set the parameters of `MOTL`:

- `minFactors`: floor when dropping factors - number of samples in evaluations
- `StartDropFactor`: after which iteration to start dropping factors
- `FreqDropFactor`: how often to drop factors
- `StartELBO`: iteration at which to start checking the ELBO, excluding the initialization iteration
- `FreqELBO`: how often to assess the ELBO
- `DropFactorTH`: the factor with the lowest maximum variance is dropped if that variance is below this threshold
- `MaxIterations`: maximum number of iterations
- `MinIterations`: minimum number of iterations, at least 2 and excluding the initial setup (`2` is the default in MOFA)
- `ConvergenceIts`: number of consecutive iterations for which the change in ELBO must be below `ConvergenceTH` (`2` is the default in MOFA)
- `ConvergenceTH`: threshold on the change in ELBO used to check convergence (`0.0005` is the default in MOFA and corresponds to the *fast* option)
- `PoisRateCstnt`: amount added to the Poisson rate function to avoid errors (`1e-04` by default)

```{r MOTL-input}
ss_start_time <- Sys.time()
minFactors <- 13 
StartDropFactor <- 1
FreqDropFactor <- 1 
StartELBO <- 1 
FreqELBO <- 5 
DropFactorTH <- 0.01 
MaxIterations <- 1000
MinIterations <- 2 
ConvergenceIts <- 2 
ConvergenceTH <- 0.0005 
PoisRateCstnt <- 0.0001 
```
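To give an intuition of how `ConvergenceIts` and `ConvergenceTH` interact, here is a minimal sketch of a convergence check. It is an assumption-laden illustration, not MOTL's implementation: `elbo_converged()` is a hypothetical helper, and the change in ELBO is treated here as a relative change, whereas the package may define it differently (for example as an absolute or percentage change).

```r
elbo_converged <- function(elbo, ConvergenceIts = 2, ConvergenceTH = 0.0005) {
  # ConvergenceIts changes require ConvergenceIts + 1 ELBO values
  if (length(elbo) < ConvergenceIts + 1) return(FALSE)
  # Relative change between consecutive ELBO evaluations (assumed definition)
  rel_change <- abs(diff(elbo) / elbo[-length(elbo)])
  # Converged when the last ConvergenceIts changes are all below the threshold
  all(tail(rel_change, ConvergenceIts) < ConvergenceTH)
}

elbo_toy <- c(-1000, -900, -899.9, -899.85)  # made-up ELBO trajectory

elbo_converged(elbo_toy[1:3])  # FALSE: the change from -1000 to -900 is too large
elbo_converged(elbo_toy)       # TRUE: the last two changes are both small
```

Requiring several consecutive small changes avoids stopping on a single flat step.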

```{r MOTL-results, warning = FALSE, message = FALSE, results = "hide"}
TL_data <- transferLearning_function(TL_param = TL_param, 
                                     views = views,
                                     likelihoods = likelihoods,
                                     Fctrzn = Fctrzn,
                                     CenterTrg = CenterTrg,
                                     MaxIterations = MaxIterations, 
                                     MinIterations = MinIterations,
                                     minFactors = minFactors, 
                                     StartDropFactor = StartDropFactor, 
                                     FreqDropFactor = FreqDropFactor, 
                                     StartELBO = StartELBO, 
                                     FreqELBO = FreqELBO, 
                                     DropFactorTH = DropFactorTH,
                                     ConvergenceIts = ConvergenceIts, 
                                     ConvergenceTH = ConvergenceTH,
                                     ss_start_time = ss_start_time)
```

```{r, echo = FALSE, message = FALSE, warning = FALSE}
file.remove("TL_data.rds")
```

Results are saved as an `.rds` file in the `outputDir`.

You can then access the results:

- `ZMu` corresponds to the inferred Z matrix that contains samples in rows and factors in columns
- `Fctrzn_Lrn_W$mRNA` corresponds to the weight matrix of mRNA, features are in rows and factors in columns.

```{r}
ZMu <- TL_data$ZMu
W_mRNA <- TL_data$Fctrzn_Lrn_W$mRNA
```

```{r}
dim(W_mRNA)
W_mRNA[c(1:5), c(1:3)]
```
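A common next step with a weight matrix is to look at the features that contribute most to a factor. The sketch below does this on a toy matrix; `W_toy` and `top_features()` are made up for the illustration and are not part of `MOTL`.

```r
# Toy weight matrix: 4 features (rows) x 2 factors (columns)
W_toy <- matrix(c( 0.90, -0.10,
                  -0.20,  0.80,
                   0.05, -1.20,
                  -1.10,  0.30),
                ncol = 2, byrow = TRUE,
                dimnames = list(paste0("gene", 1:4), c("Factor1", "Factor2")))

top_features <- function(W, factor, n = 2) {
  w <- W[, factor]
  # Rank features by absolute weight, largest first
  names(sort(abs(w), decreasing = TRUE))[seq_len(min(n, length(w)))]
}

top_features(W_toy, "Factor1")  # "gene4" "gene1"
top_features(W_toy, "Factor2")  # "gene3" "gene2"
```

The same idea applies to a real weight matrix such as `W_mRNA`, provided its row and column names are set.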

The results shown in this example may differ from yours because MOTL relies on random number generation, so two runs can produce different results. To make the analysis reproducible, call `set.seed()` with a number of your choice before running MOTL.
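The effect of fixing the seed can be checked on any random draw in base R. The seed value `123` below is arbitrary.

```r
set.seed(123)
run1 <- rnorm(3)

set.seed(123)      # same seed again
run2 <- rnorm(3)

run3 <- rnorm(3)   # no reseeding: the generator continues from its current state

identical(run1, run2)  # TRUE
identical(run1, run3)  # FALSE
```

The seed has to be set immediately before the code that draws random numbers, and again before each run that should match.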

# Session info

<details>
  <summary>**Session Info**</summary>
```{r info}
sessionInfo()
```
</details>

# References