---
title: "LIPIDIFy: A Complete Lipidomics Analysis Workflow"
author: "Fayrouz Hammal"
date: "`r Sys.Date()`"
package: LIPIDIFy
output:
  rmarkdown::html_document:
    toc: true
    toc_float: true
    toc_depth: 3
    theme: flatly
    highlight: tango
    fig_width: 9
    fig_height: 5
vignette: >
  %\VignetteIndexEntry{LIPIDIFy: A Complete Lipidomics Analysis Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  fig.width = 9,
  fig.height = 5,
  fig.align = "center",
  warning   = FALSE,
  message   = FALSE
)
```

## Overview

**LIPIDIFy** provides an end-to-end pipeline for mass-spectrometry lipidomics data:

| Step | What it does | Key function(s) |
|------|-------------|-----------------|
| 1 | Load data | `generate_example_data()`, `load_lipidomics_data_from_df()` |
| 2 | Classify lipids | `classify_lipids()` |
| 3 | Visualize raw data | `visualize_raw_data_improved()` |
| 4 | Compare and apply normalization | `apply_normalizations()`, `create_pipeline_plot()` |
| 4.5 | Impute missing values | `impute_missing_values()` |
| 4.6 | Correct batch effects | `correct_batch_effects()` |
| 5 | Quality control (PCA) | `perform_pca()`, `create_pca_plot_with_ellipses()` |
| 6 | Differential analysis | `perform_differential_analysis()` |
| 7 | Visualize results | `create_volcano_plot_labeled()`, `create_heatmap_robust()` |
| 8 | Lipid expression plots | `create_lipid_expression_barplot()` |
| 9 | Enrichment analysis | `perform_enrichment_analysis()` |

The same workflow is available without any coding via the Shiny application:

```{r launch-app, eval=FALSE}
library(LIPIDIFy)
launch_lipidomics_app()
```

***

## Step 1 - Load the Example Dataset

`generate_example_data()` creates a reproducible synthetic dataset with
realistic lipid names (PC, PE, TG, SM, Cer, etc.) and four sample groups
(GroupA, GroupB, GroupC, GroupD; 5 replicates each; 301 lipids total).

```{r load-data}
library(LIPIDIFy)

# Generate the built-in synthetic dataset (seed = 42 for reproducibility)
raw_df   <- generate_example_data()
data_obj <- load_lipidomics_data_from_df(raw_df)

cat("Samples  :", nrow(data_obj$numeric_data), "\n")
cat("Lipids   :", ncol(data_obj$numeric_data), "\n")
cat("Groups   :", paste(unique(data_obj$metadata$`Sample Group`),
                        collapse = ", "), "\n")
```

`load_lipidomics_data_from_df()` returns a **named list** with three elements:

- `$numeric_data` - matrix of lipid abundances (samples x lipids)
- `$metadata` - data frame of sample metadata
- `$data` - the original full data frame

```{r peek-data}
# Metadata of the first 4 samples
head(data_obj$metadata[, c("Sample Name", "Sample Group", "Tumour ID")], 4)

# First 3 lipids of the first 4 samples
data_obj$numeric_data[1:4, 1:3]
```

***

## Step 2 - Lipid Classification

`classify_lipids()` parses lipid names and assigns three classification levels:

- **LipidGroup** - broad category (e.g., Glycerophospholipids, Sphingolipids)
- **LipidType** - specific subclass (e.g., Phosphatidylcholine)
- **Saturation** - SFA (0 double bonds), MUFA (1), PUFA (> 1)

```{r classify}
lipid_names    <- colnames(data_obj$numeric_data)
classification <- classify_lipids(lipid_names)

head(classification, 8)
```

```{r classify-summary}
cat("Lipid groups:\n")
print(sort(table(classification$LipidGroup), decreasing = TRUE))

cat("\nFatty-acid saturation:\n")
print(table(classification$Saturation))
```


***

## Step 3 - Raw Data Visualization

Examining distributions **before normalization** helps detect outlier samples,
batch effects, and scaling differences.

```{r raw-boxplot, fig.cap="Per-sample intensity distributions before normalization. Boxes are coloured by group."}
raw_list <- list(numeric_data = data_obj$numeric_data,
                 metadata     = data_obj$metadata)

visualize_raw_data_improved(
  raw_list,
  plot_type    = "boxplot",
  view_mode    = "sample",
  metadata     = data_obj$metadata,
  group_column = "Sample Group"
)
```

```{r raw-density, fig.cap="Intensity density curves before normalization, one curve per sample."}
visualize_raw_data_improved(
  raw_list,
  plot_type    = "density",
  view_mode    = "sample",
  metadata     = data_obj$metadata,
  group_column = "Sample Group"
)
```

***

## Step 4 - Normalization

### What methods are available?

```{r norm-methods}
get_normalization_methods()
```

### Compare two pipelines side-by-side

It is good practice to compare normalization strategies visually before
committing to one. Below we compare TIC+Log2 (the most common starting point)
against PQN+Log2 (more robust when large fold-changes are expected).

```{r norm-compare, fig.cap="Side-by-side comparison of two normalization pipelines applied to the same raw data.", fig.height=9}
pipeline1 <- apply_normalizations(data_obj$numeric_data, c("TIC", "Log2"))
pipeline2 <- apply_normalizations(data_obj$numeric_data, c("PQN", "Log2"))

p1 <- create_pipeline_plot(pipeline1,
                            title     = "Pipeline 1: TIC + Log2",
                            metadata  = data_obj$metadata,
                            plot_type = "boxplot")

p2 <- create_pipeline_plot(pipeline2,
                            title     = "Pipeline 2: PQN + Log2",
                            metadata  = data_obj$metadata,
                            plot_type = "boxplot")

gridExtra::grid.arrange(p1, p2, ncol = 1)
```

### Apply the chosen pipeline

```{r normalize}
norm_data <- apply_normalizations(
  data_obj$numeric_data,
  methods = c("TIC", "Log2")
)

dim(norm_data)  # samples x lipids - same dimensions as raw data
```

```{r norm-boxplot, fig.cap="Per-sample distributions after TIC + Log2 normalization. Medians are now aligned across groups."}
norm_list <- list(numeric_data = norm_data, metadata = data_obj$metadata)

visualize_raw_data_improved(
  norm_list,
  plot_type    = "boxplot",
  view_mode    = "sample",
  metadata     = data_obj$metadata,
  group_column = "Sample Group"
)
```

***

## Step 4.5 - Missing Value Imputation

Imputing **after** normalisation ensures that imputed values sit on the same
scale as observed values, and that KNN imputation operates on data where
samples are already comparable.

```{r impute}
# Inspect missingness in the normalised data (none here -- this is synthetic)
n_miss <- sum(is.na(norm_data))
cat("Missing values in normalised data:", n_miss, "\n")

# Available methods
cat("Imputation methods available:\n")
print(get_imputation_methods())
```

```{r impute-apply}
# For demonstration: introduce 5% artificial missing values
set.seed(42)
demo_norm <- norm_data
demo_norm[sample(length(demo_norm), size = round(0.05 * length(demo_norm)))] <- NA
cat("Missingness introduced:", sum(is.na(demo_norm)),
    sprintf("(%.1f%%)\n", 100 * mean(is.na(demo_norm))))

# Apply half-minimum imputation on normalised data
imputed_norm <- impute_missing_values(demo_norm, method = "half_min")
cat("Missing after imputation:", sum(is.na(imputed_norm)), "\n")
```

**Choosing a method:**

| Method | Best for |
|--------|----------|
| `half_min` | Most lipidomics data; models the LOD on normalised scale |
| `knn` | Data missing at random (MAR); requires `impute` Bioconductor package |
| `zero` | Only when absence of signal truly means zero |
| `mean` / `median` | Small proportions of missing values |

For the remainder of this vignette we use the complete normalised data without
artificial missingness.

***

## Step 4.6 - Batch Effect Correction

When samples were processed across multiple runs, instruments, or operators,
technical batch effects can obscure biological differences. Batch correction
should be applied after normalisation and imputation, and verified with PCA.

Two methods are available:

| Method | When to use |
|--------|------------|
| `"limma"` | Default; fast and reliable for balanced designs |
| `"combat"` | Better for large or unbalanced batch effects; requires `sva` package |

```{r batch-correction, eval=FALSE}
# Simulate batch labels (real data would have this in the metadata file)
data_obj$metadata$Batch <- rep(c("Run1", "Run2"), each = 10)

# Apply limma batch correction after normalisation
norm_corrected <- correct_batch_effects(
  data_matrix  = norm_data,
  metadata     = data_obj$metadata,
  batch_column = "Batch",
  group_column = "Sample Group",
  method       = "limma"
)

# Verify: PCA should now show clustering by biology, not by batch
pca_corrected <- perform_pca(norm_corrected, data_obj$metadata, "Sample Group")
create_pca_plot_with_ellipses(
  pca_data           = pca_corrected$pca_data,
  variance_explained = pca_corrected$variance_explained,
  ellipse_type       = "confidence",
  title              = "PCA after Batch Correction"
)
```

For the remainder of this vignette we proceed with the uncorrected
normalised data (the example dataset has no batch structure).

***

## Step 5 - Quality Control with PCA

PCA on normalized data should show samples clustering by biological group
rather than by technical artefact.

```{r pca, fig.cap="PCA of TIC + Log2 normalized data. Ellipses are 95% confidence regions."}
pca_res <- perform_pca(norm_data, data_obj$metadata, "Sample Group")

create_pca_plot_with_ellipses(
  pca_data           = pca_res$pca_data,
  variance_explained = pca_res$variance_explained,
  ellipse_type       = "confidence",
  show_sample_labels = TRUE,
  title              = "PCA - TIC + Log2 Normalized Data"
)
```

***

## Step 6 - Differential Analysis

### Automatic pairwise contrasts

`create_default_contrasts()` generates all pairwise comparisons from the
groups present in your data. You can also specify contrasts manually.

```{r contrasts}
groups    <- unique(data_obj$metadata$`Sample Group`)
contrasts <- create_default_contrasts(groups)

cat("All pairwise contrasts (", length(contrasts), "total ):\n")
print(contrasts)
```

### Run limma

```{r diff-analysis}
diff_res <- perform_differential_analysis(
  data_matrix    = norm_data,
  metadata       = data_obj$metadata,
  group_column   = "Sample Group",
  contrasts_list = contrasts,
  method         = "limma"
)
```

```{r diff-summary}
cat(sprintf("%-28s  %5s  %5s  %4s  %5s\n",
            "Contrast", "Total", "Sig", "Up", "Down"))
cat(strrep("-", 55), "\n")

for (nm in names(diff_res$results)) {
  res    <- diff_res$results[[nm]]
  n_sig  <- sum(res$adj.P.Val < 0.05, na.rm = TRUE)
  n_up   <- sum(res$adj.P.Val < 0.05 & res$logFC > 0, na.rm = TRUE)
  n_down <- sum(res$adj.P.Val < 0.05 & res$logFC < 0, na.rm = TRUE)
  cat(sprintf("%-28s  %5d  %5d  %4d  %5d\n", nm, nrow(res), n_sig, n_up, n_down))
}
```

### Top results from the first contrast

```{r top-results}
first_contrast <- names(diff_res$results)[1]
res_df <- diff_res$results[[first_contrast]]
res_df <- res_df[order(res_df$adj.P.Val), ]

top10 <- utils::head(res_df[, c("logFC", "AveExpr", "adj.P.Val")], 10)
top10$logFC     <- round(top10$logFC, 3)
top10$AveExpr   <- round(top10$AveExpr, 2)
top10$adj.P.Val <- signif(top10$adj.P.Val, 3)

cat("Top 10 features in:", first_contrast, "\n\n")
print(top10)
```

***

## Step 7 - Results Visualization

### Volcano plot

Significant lipids (|logFC| > 1 and adj.P.Val < 0.05) are coloured by lipid
class. The top 10 most significant are labelled.

```{r volcano, fig.cap="Volcano plot coloured by lipid class. Vertical dashed lines mark logFC = +/-1; horizontal dashed line marks adj.P.Val = 0.05."}
create_volcano_plot_labeled(
  results             = diff_res$results[[first_contrast]],
  title               = paste("Volcano:", first_contrast),
  logfc_threshold     = 1,
  pval_threshold      = 0.05,
  top_labels          = 10,
  classification_data = classification,
  color_by            = "LipidGroup"
)
```

### Heatmap of significant features

```{r heatmap, fig.cap="Heatmap of up to 30 significant lipids. Rows (lipids) and columns (samples) are hierarchically clustered. Colour scale is z-scored per lipid.", fig.height=7}
sig_lipids <- rownames(res_df)[
  !is.na(res_df$adj.P.Val) &
  res_df$adj.P.Val < 0.05 &
  abs(res_df$logFC) > 1
]

if (length(sig_lipids) >= 5) {
  top_n  <- min(30, length(sig_lipids))
  avail  <- intersect(sig_lipids[seq_len(top_n)], colnames(norm_data))
  hm_mat <- t(norm_data[, avail, drop = FALSE])

  create_heatmap_robust(
    data_matrix  = hm_mat,
    metadata     = data_obj$metadata,
    group_column = "Sample Group",
    top_n        = top_n,
    title        = paste("Significant lipids:", first_contrast)
  )
} else {
  message("Fewer than 5 significant lipids at these thresholds.")
}
```

***

## Step 8 - Lipid Expression

`create_lipid_expression_barplot()` shows per-sample abundance of individual
lipids, sorted by group. It returns a single ggplot object when one lipid is
selected, or a named list of ggplot objects when multiple lipids are selected.

```{r lipid-expression, fig.cap="Per-sample abundance of the 3 most significant lipids. Samples are automatically sorted by group."}
# Take the 3 most significantly changed lipids
top_lipids <- utils::head(
  rownames(res_df)[!is.na(res_df$adj.P.Val)], 3
)

cat("Selected lipids:\n")
print(top_lipids)

expr_result <- create_lipid_expression_barplot(
  data_matrix     = norm_data,
  metadata        = data_obj$metadata,
  selected_lipids = top_lipids,
  group_column    = "Sample Group",
  data_type       = "normalized"
)

# The function returns a single ggplot (1 lipid) or a named list (multiple)
if (inherits(expr_result, "ggplot")) {
  print(expr_result)
} else {
  for (lipid_plot in expr_result) print(lipid_plot)
}
```

***

## Step 9 - Enrichment Analysis

Enrichment analysis tests whether lipids in a particular class are collectively
more increased or decreased than expected by chance, using the fgsea algorithm.

```{r enrichment}
enrich_res <- perform_enrichment_analysis(
  results_list        = diff_res$results,
  classification_data = classification,
  min_set_size        = 3,
  max_set_size        = 500
)

cat("Contrasts with enrichment results:", length(enrich_res), "\n")
cat("Categories per contrast          :", paste(names(enrich_res[[1]]),
                                                collapse = ", "), "\n")
```

```{r enrichment-table}
# Show top enrichment results for the first contrast, by LipidGroup
grp_enrich <- enrich_res[[first_contrast]][["LipidGroup"]]

if (!is.null(grp_enrich) && nrow(grp_enrich) > 0) {
  grp_enrich <- grp_enrich[order(grp_enrich$pval), ]

  sig <- grp_enrich[grp_enrich$padj < 0.25, ]

  if (nrow(sig) > 0) {
    cat("Significantly enriched lipid groups (padj < 0.25) in:", first_contrast, "\n\n")
    out        <- sig[, c("pathway", "NES", "pval", "padj", "size")]
    out$NES    <- round(out$NES, 3)
    out$pval   <- signif(out$pval, 3)
    out$padj   <- signif(out$padj, 3)
    print(out, row.names = FALSE)
  } else {
    cat("No lipid groups reach padj < 0.25 in this contrast.\n")
    cat("Top 5 by p-value (", first_contrast, "):\n\n")
    out        <- utils::head(grp_enrich[, c("pathway", "NES", "pval", "padj", "size")], 5)
    out$NES    <- round(out$NES, 3)
    out$pval   <- signif(out$pval, 3)
    out$padj   <- signif(out$padj, 3)
    print(out, row.names = FALSE)
  }
} else {
  cat("No enrichment results available.\n")
}
```

```{r enrichment-plot, fig.cap="Enrichment dot plot for the first contrast. Dot size reflects set size; colour reflects statistical significance."}
if (!is.null(grp_enrich) && nrow(grp_enrich) > 0) {
  create_enrichment_dotplot(
    enrichment_data = grp_enrich,
    title           = paste("Lipid Group Enrichment:", first_contrast),
    max_pathways    = 10
  )
}
```

***

## Session Information

```{r session-info}
sessionInfo()
```