---
title: >
  The `EMMA` User's Guide - Enrichment Method MAtters
author:
- name: Najla Abassi
  affiliation: 
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Mainz
  email: abassina@uni-mainz.de
- name: Annekathrin Nedwed
  affiliation: 
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Mainz
  email: anneludt@uni-mainz.de
- name: Federico Marini
  affiliation: 
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Mainz
  - Research Center for Immunotherapy (FZI), Mainz
  email: marinif@uni-mainz.de
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('EMMA')`"
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
    code_folding: show
    code_download: yes
vignette: >
  %\VignetteIndexEntry{Introduction_to_EMMA}
  %\VignetteEncoding{UTF-8}  
  %\VignettePackage{EMMA}
  %\VignetteKeywords{Pathways, GeneSetEnrichment, Annotation, ReproducibleResearch, Transcriptomics}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
bibliography: EMMA_bibliography.bib
---


```{r knitr, include = FALSE}
knitr::opts_chunk$set(
  comment   = "#>",
  error     = FALSE,
  warning   = FALSE,
  eval      = TRUE,
  message   = FALSE
)
```

# Introduction {#introduction}

Functional Enrichment Analysis (FEA) is a key downstream step in omics workflows,
commonly applied after differential expression analysis to support biological
interpretation and generate pathway-level hypotheses. A wide range of tools and
methods are available, mainly Over-Representation Analysis (ORA) [@Khatri2012] 
and Gene Set Enrichment Analysis (GSEA) [@Subramanian2005], leading
to substantial heterogeneity in analytical choices and reported results.

Despite its widespread use, FEA is often insufficiently documented [@Wijesooriya2022].
Critical parameters such as background gene sets or multiple testing correction
methods are frequently missing or inconsistently reported in scientific papers,
limiting reproducibility and interpretability. Currently, no standardized framework
exists to ensure transparent and reproducible documentation of FEA workflows,
comparable to the MIAME guidelines [@Brazma2001].

Similar approaches have been developed outside the Bioconductor ecosystem to
improve transparency and reproducibility in other analytical contexts.
For example, `tidylog` records operations performed with `dplyr` and `tidyr`,
while [`omicslog`](https://github.com/tidyomics/omicslog) tracks transformations
applied to omics-oriented objects.
However, no tool specifically addresses the metadata and provenance requirements
of FEA workflows.

To address this gap, we introduce `r BiocStyle::Biocpkg("EMMA")` (standing for 
**E**nrichment **M**ethods **MA**tters), a framework that automatically captures key 
analytical parameters and provenance information
during the execution of FEA methods.

This vignette demonstrates how `EMMA` integrates with existing tools
(e.g. `r BiocStyle::Biocpkg("clusterProfiler")`,
`r BiocStyle::Biocpkg("topGO")`, `r BiocStyle::Biocpkg("Enrichr")`, 
`r BiocStyle::Biocpkg("gprofiler2")`) to execute enrichment
analyses while systematically capturing analysis parameters and provenance
information during runtime, returning enrichment results
**without altering their original format**, alongside structured and reusable metadata. 

## What do you get with `EMMA`?

Using `EMMA` allows you to:

* Record the exact function call and parameters used for FEA
* Automatically track annotation sources (e.g. organism, gene set database)
* Retain provenance directly within the results object
* Generate reproducible summaries of the analysis (e.g. Methods sections)
* Facilitate sharing of results together with their analysis context

## How does `EMMA` work?

`EMMA` works by wrapping an enrichment call, executing it, and capturing relevant
provenance information and parameters during runtime. The recorded metadata is then
attached directly to the result object using R’s **attribute system**.

This approach enables `EMMA` to preserve provenance information
**without modifying the original result structure or introducing new classes**, 
allowing users to continue working seamlessly with standard outputs from existing tools.

While Bioconductor provides dedicated metadata slots for certain S4 classes
(e.g. via `metadata()`), these are not consistently available across all enrichment
result types. By relying on attributes, provenance information can be attached
to any result object regardless of its underlying class.

![](EMMA_workflow.png)

# Getting started {#gettingstarted}

To install this package, start R and enter:

```{r install, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")}

BiocManager::install("EMMA")
```

Once installed, the package can be loaded and attached to the current workspace
as follows:

```{r setup}
library("EMMA")
```

# Usage example of `EMMA`: the `macrophage` dataset

## About the data

In the remainder of this vignette, we will illustrate the main features of `r BiocStyle::Biocpkg("EMMA")` on a publicly available dataset from
Alasoo, et al. "Shared genetic effects on chromatin and gene expression
indicate a role for enhancer priming in immune response", 
published in Nature Genetics, January 2018 
[@Alasoo2018].

The data is made available via the `r BiocStyle::Biocpkg("macrophage")`
Bioconductor package, which contains the files output from the Salmon
quantification (version 0.12.0, with GENCODE v29 reference), as well as the
values summarized at the gene level, which we will use to exemplify.

In the `macrophage` experimental setting, the samples are available from 6
different donors, in 4 different conditions (naive, treated with Interferon
gamma, with SL1344, or with a combination of Interferon gamma and SL1344).

Let's start by loading all the necessary packages:

```{r loadlibraries}
library("EMMA")

library("macrophage")
library("DESeq2")
library("org.Hs.eg.db")
library("clusterProfiler")
library("mosdef")
library("topGO")
library("GO.db")
```

We will show an example of how  `r BiocStyle::Biocpkg("EMMA")` fits into a
regular bulk RNA-seq data analysis workflow.

## Getting a list of Differentially Expressed Genes

For this, we will load the `macrophage` data and perform Differential Expression
Analysis with `r BiocStyle::Biocpkg("DESeq2")`

```{r differential_expression}
# load data
data(gse, "macrophage")
# set up design
dds_macrophage <- DESeqDataSet(gse, design = ~ line + condition)
# preprocess
rownames(dds_macrophage) <- substr(rownames(dds_macrophage), 1, 15)
keep <- rowSums(counts(dds_macrophage) >= 10) >= 6
dds_macrophage <- dds_macrophage[keep, ]

# set seed for reproducibility
set.seed(2711)
# sample randomly for 2k genes
selected_genes <- sample(rownames(dds_macrophage), 2000)

dds_macrophage <- dds_macrophage[selected_genes, ]

# run DESeq
dds_macrophage <- DESeq(dds_macrophage)

# get de res for 1st contrast
IFNg_vs_naive <- results(dds_macrophage,
                         contrast = c("condition", "IFNg", "naive"),
                         lfcThreshold = 1, alpha = 0.05)
IFNg_vs_naive <- lfcShrink(dds_macrophage, coef = "condition_IFNg_vs_naive",
                           res = IFNg_vs_naive,
                           type = "apeglm")
IFNg_vs_naive$SYMBOL <- rowData(dds_macrophage)$SYMBOL

# get de res for 2st contrast
SL1344_vs_naive <- results(dds_macrophage,
                         contrast = c("condition", "SL1344", "naive"),
                         lfcThreshold = 1, alpha = 0.05)
SL1344_vs_naive <- lfcShrink(dds_macrophage, coef = "condition_SL1344_vs_naive",
                           res = SL1344_vs_naive,
                           type = "apeglm")
SL1344_vs_naive$SYMBOL <- rowData(dds_macrophage)$SYMBOL

# sort by adjusted p value
de_list <- list(
  IFNg_vs_naive = IFNg_vs_naive,
  SL1344_vs_naive = SL1344_vs_naive
)

de_list <- lapply(de_list, function(df) {
  df <- df[order(df$padj), ]
  df <- df[!is.na(df$padj) & df$padj <= 0.05, ]
  df
})
# set background gene list
gene_universe <- rownames(dds_macrophage)
```

## Performing Functional Enrichment Analysis (FEA) 

Now that we have a list of DE genes for this contrast, we can perform Functional
Enrichment Analysis.  
We do this first *without* the functionality provided by `EMMA`.

In the following example, we will use the
function `enrichGO()` from `r BiocStyle::Biocpkg("clusterProfiler")`

```{r withoutemma}
de_res <- de_list$IFNg_vs_naive

fea_res <- enrichGO(gene = rownames(de_res),
                    universe = gene_universe,
                    keyType = "ENSEMBL",
                    OrgDb = org.Hs.eg.db,
                    ont = "BP",
                    pAdjustMethod = "BH",
                    pvalueCutoff = 0.05,
                    qvalueCutoff = 0.1)
```

# Entering the `EMMA` framework

From now on, we will complement the existing workflow with the functionality 
provided by `EMMA`.


## `EMMA_run()`: Capturing the recorded information

`EMMA_run()` accepts a function call (e.g. `enrichGO(...)`) and executes it as
it is, while capturing the associated parameters and provenance information:

```{r EMMA_run_1}
# perform FEA, but with EMMA!
fea_res <- enrichGO(gene = rownames(de_res),
                    universe = gene_universe,
                    keyType = "ENSEMBL",
                    OrgDb = org.Hs.eg.db,
                    ont = "BP",
                    pAdjustMethod = "BH",
                    pvalueCutoff = 0.05,
                    qvalueCutoff = 0.1) |> 
  EMMA_run() # simply pipe your call to EMMA_run()
# check res
fea_res
```

... or you can simply wrap `EMMA_run()` around your call:

```{r EMMA_run_2}
# you can also pass the function name and its namespace
# e.g. `clusterProfiler::enrichGO(...)`
fea_res_nobg <- EMMA_run(
  clusterProfiler::enrichGO(
    # no universe set
    gene = rownames(de_res),
    keyType = "ENSEMBL",
    OrgDb = org.Hs.eg.db,
    ont = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff = 0.05,
    qvalueCutoff = 0.1,
    readable = TRUE
  )
)
```

As you can see, `EMMA` returns the FEA results in their **native/standard** format.
`EMMA` also warns you about good practices when performing FEA, like in this
example, we didn't define a list of background genes (which can influence the 
results leading e.g. to a larger amount of false discoveries, as a consequence 
of over-optimistic smaller p-values [@Wijesooriya2022]), so we get warned about that.

## `EMMA_show()`: Summarizing the recorded information

To get a quick summary of what `EMMA` captured while we ran the analysis, we use
`EMMA_show()`:

```{r EMMA_show}
EMMA_show(fea_res)
```

`EMMA` attaches the captured metadata to the attributes of the
results object. That's why it is always a good practice to save the original
results, and not only the subsets of interest.

## `EMMA_get_record()`: Retrieving the recorded information

To access the full recorded information, we use `EMMA_get_record()`:

```{css}
/*| echo: false */

.scrollout {
max-height: 500px;
overflow-y: scroll;
}
```

```{r EMMA_get_record}
#| class-output: scrollout

emma_record <- EMMA_get_record(fea_res)

# get all the record
emma_record
```

`EMMA` structures the `EMMA_record` attribute (i.e. the recorded provenance
information) into a list of elements:

```{bash EMMA_record_str, eval=FALSE}
├── EMMA_record
│   ├── method         # how the analysis was performed
│   │   ├── call
│   │   ├── function_name
│   │   ├── package_name
│   │   ├── package_version
│   │   ├── wrapped_function
│   │   ├── wrapped_package
│   │   └── wrapper
│   ├── input         # inputs used for the analysis
│   │   └── arguments
│   ├── annotation    # annotation context
│   │   ├── organism
│   │   ├── gene_set_db
│   │   └── gene_set_db_version
│   ├── timestamp     # when the analysis was run
│   ├── session_info  # R session information
│   ├── extra         # user-defined additions
│   └── emma_version
```

```{r EMMA_method}
# get the method record
emma_record$method
```

With `EMMA_run()`, we can decide whether we want to save the value of arguments
used in our call or not. This can be useful, for example, to avoid unnecessarily
increasing the size of the result object. For this, we can use the argument `args_form`:

```{r argument_form}
#| class-output: scrollout

fea_res_no_param <- enrichGO(gene = rownames(de_res),
                             universe = gene_universe,
                             keyType = "ENSEMBL",
                             OrgDb = org.Hs.eg.db,
                             ont = "BP",
                             pAdjustMethod = "BH",
                             pvalueCutoff = 0.05,
                             qvalueCutoff = 0.1,
                             readable = TRUE) |> 
  EMMA_run(args_form = "unevaluated") # when we don't want the values stored
                                      # else set to evaluated (default)

# check
EMMA_get_record(fea_res_no_param)
```

We can also choose whether to save the R session information with the record
using the argument `store_session_info`, which defaults to `TRUE`.

## `EMMA_explain()`: Summarizing recorded information into text

`EMMA_explain()` generates a human-readable description of the FEA, similar
to a Materials and Methods section of a paper, by summarizing the executed call,
the parameters, software context, and reference databases used.

```{r EMMA_explain, results = 'asis', message=TRUE}
EMMA_explain(fea_res, get_citation = TRUE)
```

## `EMMA` with custom/wrapper functions

You can also use a custom function that you developed, or a wrapper function
(from packages such as `r BiocStyle::Biocpkg("mosdef")`). In this case,
`EMMA_run()` will attempt to capture as much metadata as possible:

```{r mosdef_eg}
#| class-output: scrollout

mosdef_fea_res <- mosdef::run_goseq(de_genes = rownames(de_res),
                                    bg_genes = gene_universe,
                                    mapping = "org.Hs.eg.db",
                                    id = "ensGene",
                                    genome = "hg19") |> 
  EMMA_run(store_session_info = FALSE,
           args_form = "unevaluated")

# quick inspection
EMMA_get_record(mosdef_fea_res)
```

```{r custom_eg}
#| class-output: scrollout

# a custom function (not from a package)
my_custom_function <- function(gene, universe = NULL,
                               ontology = "BP", id_type = "ENTREZID",
                               org_db_name = "org.Hs.eg.db", 
                               organism = "hsapiens") {
  res1 <- mosdef::run_topGO(de_genes = gene,
                            bg_genes = gene_universe,
                            ontology = ontology,
                            gene_id = id_type,
                            mapping = org_db_name,
                            add_gene_to_terms = TRUE)

  res2 <- gprofiler2::gost(query = gene,
                           organism = organism,
                           custom_bg = gene_universe)

  return(list(topGO_res = res1,
              gost_res = res2
  ))
}

# run analysis with EMMA
frankenstein_fea <- my_custom_function(
  gene = rownames(de_res),
  universe = gene_universe,
  ontology = "BP",
  id_type = "ENSEMBL",
  org_db_name = "org.Hs.eg.db",
  organism = "hsapiens"
  ) |> EMMA_run(store_session_info = FALSE,
                args_form = "unevaluated") 

# quick inspection
EMMA_get_record(frankenstein_fea)
```

## `EMMA_add_custom_metadata()`: Adding extra information

The user can always attach extra metadata that `EMMA` might not be able to
capture automatically. To keep everything organized, we can use
`EMMA_add_custom_metadata()` function

```{r add_custom_metadata}
#| class-output: scrollout

frankenstein_fea2 <- EMMA_add_custom_metadata(res = frankenstein_fea,
                                              extra = list(
                                                wrapped_function_topGO = "runTest",
                                                notes = "any other meaningful info"))

EMMA_get_record(frankenstein_fea2)$extra
```

Since the `EMMA_record` is attached as attribute to the original results objects,
it can be preserved when integrating results into structured containers
such as `r BiocStyle::Biocpkg("DeeDeeExperiment")` [@Abassi2026].
This enables both FEA results and their associated provenance information
to be stored and managed together, facilitating reproducibility, organization,
and sharing of complex omics analyses.

```{r emma_and_dde}
#| class-output: scrollout

dde <- DeeDeeExperiment::DeeDeeExperiment(sce = dds_macrophage,
                                          de_results = IFNg_vs_naive,
                                          enrich_results =  list(
                                            IFNg_vs_naive = fea_res_no_param)
                                          )

fea <- DeeDeeExperiment::getFEA(dde, format = "original")

EMMA_get_record(fea)
```

## `EMMA_freeze()`: Recording the Analysis Environment

`EMMA_freeze()` records the R environment at the time of analysis by generating
a lockfile using `renv`. By default, the snapshot is created with `force = TRUE`,
allowing the environment to be recorded even when inconsistencies (e.g. version
mismatches) are present.

This behavior reflects the goal of preserving the analysis environment as it was
used in practice, rather than attempting to enforce a fully consistent state.

```{r EMMA_freeze}
if (requireNamespace("renv", quietly = TRUE)) {
  project_path <- tempdir()
  dir.create(project_path)
  EMMA_freeze(project = project_path,
              file = "analysis.lock",
              pkgs = loadedNamespaces(),
              prompt = FALSE,
              force = TRUE)
}
```

## `EMMA` with iterative workflows

In real-world analyses, multiple FEA results from different contrasts
are often analyzed in parallel This can be conveniently handled using
iterative approaches such as `lapply()`.

```{r emma_with_lapply}
fea_res_list <- lapply(de_list, function(dea){
  #### perform fea with EMMA_run ####
  enrichGO(gene = rownames(dea),
           universe = gene_universe,
           keyType = "ENSEMBL",
           OrgDb = org.Hs.eg.db,
           ont = "BP",
           pAdjustMethod = "BH",
           pvalueCutoff = 0.05,
           qvalueCutoff = 0.1,
           readable = TRUE) |> 
    EMMA_run()
})

#### print the metadata summary for each contrast ####
invisible(lapply(names(fea_res_list), function(nm) {
  cat("\n###", nm)
  EMMA_show(fea_res_list[[nm]])
}))

#### add more info to the metadata ####
fea_res_list <- setNames(lapply(names(fea_res_list), function(nm) {
  EMMA_add_custom_metadata(fea_res_list[[nm]], extra = list(contrast_name = nm))
  }), 
names(fea_res_list))

#### get the raw metadata ####
record_list <- list()

record_list <- setNames(lapply(names(fea_res_list), function(mn){
  EMMA_get_record(fea_res_list[[mn]])
  }),
names(fea_res_list))

record_list$SL1344_vs_naive$extra
record_list$SL1344_vs_naive$method$package_name
```

# Supported methods and future extensions

`EMMA` currently provides metadata capture support for several commonly used
FEA tools, including methods from `r BiocStyle::Biocpkg("clusterProfiler")`,
`r BiocStyle::Biocpkg("gprofiler2")`, and related wrappers/custom functions
such as `r BiocStyle::Biocpkg("mosdef")`.

`EMMA` is designed to be extensible, and support for additional FEA methods
and packages will continue to expand over time.

# Session info {.unnumbered}

```{css}
/*| echo: false */

.smaller {
  font-size: 10px
}
```

```{r sessioinfo}
#| class-output: smaller

sessionInfo()
```

# References {.unnumbered}