LIPIDIFy: A Complete Lipidomics Analysis Workflow

Overview

LIPIDIFy provides an end-to-end pipeline for mass-spectrometry lipidomics data:

Step	What it does	Key function(s)
1	Load data	`generate_example_data()`, `load_lipidomics_data_from_df()`
2	Classify lipids	`classify_lipids()`
3	Visualize raw data	`visualize_raw_data_improved()`
4	Compare and apply normalization	`apply_normalizations()`, `create_pipeline_plot()`
4.5	Impute missing values	`impute_missing_values()`
4.6	Correct batch effects	`correct_batch_effects()`
5	Quality control (PCA)	`perform_pca()`, `create_pca_plot_with_ellipses()`
6	Differential analysis	`perform_differential_analysis()`
7	Visualize results	`create_volcano_plot_labeled()`, `create_heatmap_robust()`
8	Lipid expression plots	`create_lipid_expression_barplot()`
9	Enrichment analysis	`perform_enrichment_analysis()`

The same workflow is available without any coding via the Shiny application:

library(LIPIDIFy)
launch_lipidomics_app()

Step 1 - Load the Example Dataset

generate_example_data() creates a reproducible synthetic dataset with realistic lipid names (PC, PE, TG, SM, Cer, etc.) and four sample groups (GroupA, GroupB, GroupC, GroupD; 5 replicates each; 301 lipids total).

library(LIPIDIFy)

# Generate the built-in synthetic dataset (seed = 42 for reproducibility)
raw_df   <- generate_example_data()
data_obj <- load_lipidomics_data_from_df(raw_df)

cat("Samples  :", nrow(data_obj$numeric_data), "\n")
#> Samples  : 20
cat("Lipids   :", ncol(data_obj$numeric_data), "\n")
#> Lipids   : 301
cat("Groups   :", paste(unique(data_obj$metadata$`Sample Group`),
                        collapse = ", "), "\n")
#> Groups   : GroupA, GroupB, GroupC, GroupD

load_lipidomics_data_from_df() returns a named list with three elements:

$numeric_data - matrix of lipid abundances (samples x lipids)
$metadata - data frame of sample metadata
$data - the original full data frame

# Metadata of the first 4 samples
head(data_obj$metadata[, c("Sample Name", "Sample Group", "Tumour ID")], 4)
#>          Sample Name Sample Group Tumour ID
#> Sample_1    Sample_1       GroupA         1
#> Sample_2    Sample_2       GroupA         1
#> Sample_3    Sample_3       GroupA         2
#> Sample_4    Sample_4       GroupA         2

# First 3 lipids of the first 4 samples
data_obj$numeric_data[1:4, 1:3]
#>          PC 18:1_24:1 PC 21:0_24:6 PC 14:0_18:4
#> Sample_1      663.143     5815.674     2227.466
#> Sample_2    10415.648    10528.302     7180.788
#> Sample_3     3275.850     3662.961     7618.975
#> Sample_4     2993.241    12752.211     1606.933

Step 2 - Lipid Classification

classify_lipids() parses lipid names and assigns three classification levels:

LipidGroup - broad category (e.g., Glycerophospholipids, Sphingolipids)
LipidType - specific subclass (e.g., Phosphatidylcholine)
Saturation - SFA (0 double bonds), MUFA (1), PUFA (> 1)

lipid_names    <- colnames(data_obj$numeric_data)
classification <- classify_lipids(lipid_names)

head(classification, 8)
#>          Lipid           LipidGroup           LipidType Saturation
#> 1 PC 18:1_24:1 Glycerophospholipids Phosphatidylcholine       PUFA
#> 2 PC 21:0_24:6 Glycerophospholipids Phosphatidylcholine       PUFA
#> 3 PC 14:0_18:4 Glycerophospholipids Phosphatidylcholine       PUFA
#> 4 PC 22:2_24:1 Glycerophospholipids Phosphatidylcholine       PUFA
#> 5 PC 16:2_20:1 Glycerophospholipids Phosphatidylcholine       PUFA
#> 6 PC 21:1_20:2 Glycerophospholipids Phosphatidylcholine       PUFA
#> 7 PC 17:2_16:4 Glycerophospholipids Phosphatidylcholine       PUFA
#> 8 PC 14:1_14:4 Glycerophospholipids Phosphatidylcholine       PUFA

cat("Lipid groups:\n")
#> Lipid groups:
print(sort(table(classification$LipidGroup), decreasing = TRUE))
#> 
#> Glycerophospholipids        Glycerolipids        Sphingolipids 
#>                  117                   65                   50 
#>    Lysophospholipids        Sterol Lipids 
#>                   44                   25

cat("\nFatty-acid saturation:\n")
#> 
#> Fatty-acid saturation:
print(table(classification$Saturation))
#> 
#> MUFA PUFA  SFA 
#>   43  211   47

Step 3 - Raw Data Visualization

Examining distributions before normalization helps detect outlier samples, batch effects, and scaling differences.

raw_list <- list(numeric_data = data_obj$numeric_data,
                 metadata     = data_obj$metadata)

visualize_raw_data_improved(
  raw_list,
  plot_type    = "boxplot",
  view_mode    = "sample",
  metadata     = data_obj$metadata,
  group_column = "Sample Group"
)

Per-sample intensity distributions before normalization. Boxes are coloured by group.

visualize_raw_data_improved(
  raw_list,
  plot_type    = "density",
  view_mode    = "sample",
  metadata     = data_obj$metadata,
  group_column = "Sample Group"
)

Intensity density curves before normalization, one curve per sample.

Step 4 - Normalization

What methods are available?

get_normalization_methods()
#>  [1] "TIC"        "PQN"        "Quantile"   "Log2Median" "Median"    
#>  [6] "Mean"       "Log2"       "Log10"      "Sqrt"       "None"

Compare two pipelines side-by-side

It is good practice to compare normalization strategies visually before committing to one. Below we compare TIC+Log2 (the most common starting point) against PQN+Log2 (more robust when large fold-changes are expected).

pipeline1 <- apply_normalizations(data_obj$numeric_data, c("TIC", "Log2"))
pipeline2 <- apply_normalizations(data_obj$numeric_data, c("PQN", "Log2"))

p1 <- create_pipeline_plot(pipeline1,
                            title     = "Pipeline 1: TIC + Log2",
                            metadata  = data_obj$metadata,
                            plot_type = "boxplot")

p2 <- create_pipeline_plot(pipeline2,
                            title     = "Pipeline 2: PQN + Log2",
                            metadata  = data_obj$metadata,
                            plot_type = "boxplot")

gridExtra::grid.arrange(p1, p2, ncol = 1)

Side-by-side comparison of two normalization pipelines applied to the same raw data.

Apply the chosen pipeline

norm_data <- apply_normalizations(
  data_obj$numeric_data,
  methods = c("TIC", "Log2")
)

dim(norm_data)  # samples x lipids - same dimensions as raw data
#> [1]  20 301

norm_list <- list(numeric_data = norm_data, metadata = data_obj$metadata)

visualize_raw_data_improved(
  norm_list,
  plot_type    = "boxplot",
  view_mode    = "sample",
  metadata     = data_obj$metadata,
  group_column = "Sample Group"
)

Per-sample distributions after TIC + Log2 normalization. Medians are now aligned across groups.

Step 4.5 - Missing Value Imputation

Imputing after normalisation ensures that imputed values sit on the same scale as observed values, and that KNN imputation operates on data where samples are already comparable.

# Inspect missingness in the normalised data (none here -- this is synthetic)
n_miss <- sum(is.na(norm_data))
cat("Missing values in normalised data:", n_miss, "\n")
#> Missing values in normalised data: 0

# Available methods
cat("Imputation methods available:\n")
#> Imputation methods available:
print(get_imputation_methods())
#> [1] "half_min" "min"      "zero"     "mean"     "median"   "knn"

# For demonstration: introduce 5% artificial missing values
set.seed(42)
demo_norm <- norm_data
demo_norm[sample(length(demo_norm), size = round(0.05 * length(demo_norm)))] <- NA
cat("Missingness introduced:", sum(is.na(demo_norm)),
    sprintf("(%.1f%%)\n", 100 * mean(is.na(demo_norm))))
#> Missingness introduced: 301 (5.0%)

# Apply half-minimum imputation on normalised data
imputed_norm <- impute_missing_values(demo_norm, method = "half_min")
cat("Missing after imputation:", sum(is.na(imputed_norm)), "\n")
#> Missing after imputation: 0

Choosing a method:

Method	Best for
`half_min`	Most lipidomics data; models the LOD on normalised scale
`knn`	Data missing at random (MAR); requires `impute` Bioconductor package
`zero`	Only when absence of signal truly means zero
`mean` / `median`	Small proportions of missing values

For the remainder of this vignette we use the complete normalised data without artificial missingness.

Step 4.6 - Batch Effect Correction

When samples were processed across multiple runs, instruments, or operators, technical batch effects can obscure biological differences. Batch correction should be applied after normalisation and imputation, and verified with PCA.

Two methods are available:

Method	When to use
`"limma"`	Default; fast and reliable for balanced designs
`"combat"`	Better for large or unbalanced batch effects; requires `sva` package

# Simulate batch labels (real data would have this in the metadata file)
data_obj$metadata$Batch <- rep(c("Run1", "Run2"), each = 10)

# Apply limma batch correction after normalisation
norm_corrected <- correct_batch_effects(
  data_matrix  = norm_data,
  metadata     = data_obj$metadata,
  batch_column = "Batch",
  group_column = "Sample Group",
  method       = "limma"
)

# Verify: PCA should now show clustering by biology, not by batch
pca_corrected <- perform_pca(norm_corrected, data_obj$metadata, "Sample Group")
create_pca_plot_with_ellipses(
  pca_data           = pca_corrected$pca_data,
  variance_explained = pca_corrected$variance_explained,
  ellipse_type       = "confidence",
  title              = "PCA after Batch Correction"
)

For the remainder of this vignette we proceed with the uncorrected normalised data (the example dataset has no batch structure).

Step 5 - Quality Control with PCA

PCA on normalized data should show samples clustering by biological group rather than by technical artefact.

pca_res <- perform_pca(norm_data, data_obj$metadata, "Sample Group")

create_pca_plot_with_ellipses(
  pca_data           = pca_res$pca_data,
  variance_explained = pca_res$variance_explained,
  ellipse_type       = "confidence",
  show_sample_labels = TRUE,
  title              = "PCA - TIC + Log2 Normalized Data"
)

PCA of TIC + Log2 normalized data. Ellipses are 95% confidence regions.

Step 6 - Differential Analysis

Automatic pairwise contrasts

create_default_contrasts() generates all pairwise comparisons from the groups present in your data. You can also specify contrasts manually.

groups    <- unique(data_obj$metadata$`Sample Group`)
contrasts <- create_default_contrasts(groups)

cat("All pairwise contrasts (", length(contrasts), "total ):\n")
#> All pairwise contrasts ( 6 total ):
print(contrasts)
#> [1] "GroupB - GroupA" "GroupC - GroupA" "GroupD - GroupA" "GroupC - GroupB"
#> [5] "GroupD - GroupB" "GroupD - GroupC"

Run limma

diff_res <- perform_differential_analysis(
  data_matrix    = norm_data,
  metadata       = data_obj$metadata,
  group_column   = "Sample Group",
  contrasts_list = contrasts,
  method         = "limma"
)

cat(sprintf("%-28s  %5s  %5s  %4s  %5s\n",
            "Contrast", "Total", "Sig", "Up", "Down"))
#> Contrast                      Total    Sig    Up   Down
cat(strrep("-", 55), "\n")
#> -------------------------------------------------------

for (nm in names(diff_res$results)) {
  res    <- diff_res$results[[nm]]
  n_sig  <- sum(res$adj.P.Val < 0.05, na.rm = TRUE)
  n_up   <- sum(res$adj.P.Val < 0.05 & res$logFC > 0, na.rm = TRUE)
  n_down <- sum(res$adj.P.Val < 0.05 & res$logFC < 0, na.rm = TRUE)
  cat(sprintf("%-28s  %5d  %5d  %4d  %5d\n", nm, nrow(res), n_sig, n_up, n_down))
}
#> GroupB - GroupA                 301      0     0      0
#> GroupC - GroupA                 301      1     1      0
#> GroupD - GroupA                 301      0     0      0
#> GroupC - GroupB                 301      2     2      0
#> GroupD - GroupB                 301      2     2      0
#> GroupD - GroupC                 301      0     0      0

Top results from the first contrast

first_contrast <- names(diff_res$results)[1]
res_df <- diff_res$results[[first_contrast]]
res_df <- res_df[order(res_df$adj.P.Val), ]

top10 <- utils::head(res_df[, c("logFC", "AveExpr", "adj.P.Val")], 10)
top10$logFC     <- round(top10$logFC, 3)
top10$AveExpr   <- round(top10$AveExpr, 2)
top10$adj.P.Val <- signif(top10$adj.P.Val, 3)

cat("Top 10 features in:", first_contrast, "\n\n")
#> Top 10 features in: GroupB - GroupA
print(top10)
#>               logFC AveExpr adj.P.Val
#> PC 16:0_19:4  3.204   12.10    0.0966
#> TG 20:0_20:3 -3.013   11.86    0.0966
#> DG 15:2_23:4  2.961   11.76    0.0966
#> TG 17:0_21:0  2.926   11.67    0.0966
#> PI 15:2_18:2 -2.743   11.31    0.1370
#> TG 15:0_24:6 -2.724   11.37    0.1370
#> PC 16:0_24:2 -2.602   11.30    0.1640
#> CE 18:2_24:6 -2.591   11.29    0.1640
#> TG 20:2_16:4  2.528   11.83    0.1690
#> PE 21:1_20:3  2.505   12.20    0.1690

Step 7 - Results Visualization

Volcano plot

Significant lipids (|logFC| > 1 and adj.P.Val < 0.05) are coloured by lipid class. The top 10 most significant are labelled.

create_volcano_plot_labeled(
  results             = diff_res$results[[first_contrast]],
  title               = paste("Volcano:", first_contrast),
  logfc_threshold     = 1,
  pval_threshold      = 0.05,
  top_labels          = 10,
  classification_data = classification,
  color_by            = "LipidGroup"
)

Volcano plot coloured by lipid class. Vertical dashed lines mark logFC = +/-1; horizontal dashed line marks adj.P.Val = 0.05.

Heatmap of significant features

sig_lipids <- rownames(res_df)[
  !is.na(res_df$adj.P.Val) &
  res_df$adj.P.Val < 0.05 &
  abs(res_df$logFC) > 1
]

if (length(sig_lipids) >= 5) {
  top_n  <- min(30, length(sig_lipids))
  avail  <- intersect(sig_lipids[seq_len(top_n)], colnames(norm_data))
  hm_mat <- t(norm_data[, avail, drop = FALSE])

  create_heatmap_robust(
    data_matrix  = hm_mat,
    metadata     = data_obj$metadata,
    group_column = "Sample Group",
    top_n        = top_n,
    title        = paste("Significant lipids:", first_contrast)
  )
} else {
  message("Fewer than 5 significant lipids at these thresholds.")
}

Step 8 - Lipid Expression

create_lipid_expression_barplot() shows per-sample abundance of individual lipids, sorted by group. It returns a single ggplot object when one lipid is selected, or a named list of ggplot objects when multiple lipids are selected.

# Take the 3 most significantly changed lipids
top_lipids <- utils::head(
  rownames(res_df)[!is.na(res_df$adj.P.Val)], 3
)

cat("Selected lipids:\n")
#> Selected lipids:
print(top_lipids)
#> [1] "PC 16:0_19:4" "TG 20:0_20:3" "DG 15:2_23:4"

expr_result <- create_lipid_expression_barplot(
  data_matrix     = norm_data,
  metadata        = data_obj$metadata,
  selected_lipids = top_lipids,
  group_column    = "Sample Group",
  data_type       = "normalized"
)

# The function returns a single ggplot (1 lipid) or a named list (multiple)
if (inherits(expr_result, "ggplot")) {
  print(expr_result)
} else {
  for (lipid_plot in expr_result) print(lipid_plot)
}

Per-sample abundance of the 3 most significant lipids. Samples are automatically sorted by group.

Step 9 - Enrichment Analysis

Enrichment analysis tests whether lipids in a particular class are collectively more increased or decreased than expected by chance, using the fgsea algorithm.

enrich_res <- perform_enrichment_analysis(
  results_list        = diff_res$results,
  classification_data = classification,
  min_set_size        = 3,
  max_set_size        = 500
)

cat("Contrasts with enrichment results:", length(enrich_res), "\n")
#> Contrasts with enrichment results: 6
cat("Categories per contrast          :", paste(names(enrich_res[[1]]),
                                                collapse = ", "), "\n")
#> Categories per contrast          : LipidGroup, LipidType, Saturation

# Show top enrichment results for the first contrast, by LipidGroup
grp_enrich <- enrich_res[[first_contrast]][["LipidGroup"]]

if (!is.null(grp_enrich) && nrow(grp_enrich) > 0) {
  grp_enrich <- grp_enrich[order(grp_enrich$pval), ]

  sig <- grp_enrich[grp_enrich$padj < 0.25, ]

  if (nrow(sig) > 0) {
    cat("Significantly enriched lipid groups (padj < 0.25) in:", first_contrast, "\n\n")
    out        <- sig[, c("pathway", "NES", "pval", "padj", "size")]
    out$NES    <- round(out$NES, 3)
    out$pval   <- signif(out$pval, 3)
    out$padj   <- signif(out$padj, 3)
    print(out, row.names = FALSE)
  } else {
    cat("No lipid groups reach padj < 0.25 in this contrast.\n")
    cat("Top 5 by p-value (", first_contrast, "):\n\n")
    out        <- utils::head(grp_enrich[, c("pathway", "NES", "pval", "padj", "size")], 5)
    out$NES    <- round(out$NES, 3)
    out$pval   <- signif(out$pval, 3)
    out$padj   <- signif(out$padj, 3)
    print(out, row.names = FALSE)
  }
} else {
  cat("No enrichment results available.\n")
}
#> No lipid groups reach padj < 0.25 in this contrast.
#> Top 5 by p-value ( GroupB - GroupA ):
#> 
#>               pathway    NES  pval  padj  size
#>                <char>  <num> <num> <num> <int>
#>         Glycerolipids  1.210 0.128 0.609    65
#>  Glycerophospholipids -1.119 0.244 0.609   117
#>         Sterol Lipids  0.878 0.662 0.985    25
#>         Sphingolipids -0.770 0.862 0.985    50
#>     Lysophospholipids -0.625 0.985 0.985    44

if (!is.null(grp_enrich) && nrow(grp_enrich) > 0) {
  create_enrichment_dotplot(
    enrichment_data = grp_enrich,
    title           = paste("Lipid Group Enrichment:", first_contrast),
    max_pathways    = 10
  )
}

Enrichment dot plot for the first contrast. Dot size reflects set size; colour reflects statistical significance.

Session Information

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] rmarkdown_2.31   LIPIDIFy_0.99.0  BiocStyle_2.41.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.10          generics_0.1.4       tidyr_1.3.2         
#>  [4] fgsea_1.39.0         lattice_0.22-9       stringi_1.8.7       
#>  [7] digest_0.6.39        magrittr_2.0.5       evaluate_1.0.5      
#> [10] grid_4.6.0           estimability_1.5.1   RColorBrewer_1.1-3  
#> [13] mvtnorm_1.4-1        fastmap_1.2.0        Matrix_1.7-5        
#> [16] jsonlite_2.0.0       ggrepel_0.9.8        limma_3.69.2        
#> [19] gridExtra_2.3        BiocManager_1.30.27  purrr_1.2.2         
#> [22] scales_1.4.0         codetools_0.2-20     jquerylib_0.1.4     
#> [25] cli_3.6.6            rlang_1.2.0          scatterplot3d_0.3-45
#> [28] cowplot_1.2.0        leaps_3.2            withr_3.0.2         
#> [31] cachem_1.1.0         yaml_2.3.12          otel_0.2.0          
#> [34] FactoMineR_2.14      parallel_4.6.0       tools_4.6.0         
#> [37] multcompView_0.1-11  BiocParallel_1.47.0  dplyr_1.2.1         
#> [40] ggplot2_4.0.3        fastmatch_1.1-8      DT_0.34.0           
#> [43] flashClust_1.1-4     buildtools_1.0.0     vctrs_0.7.3         
#> [46] R6_2.6.1             lifecycle_1.0.5      emmeans_2.0.3       
#> [49] stringr_1.6.0        htmlwidgets_1.6.4    MASS_7.3-65         
#> [52] cluster_2.1.8.2      pkgconfig_2.0.3      pillar_1.11.1       
#> [55] bslib_0.11.0         gtable_0.3.6         data.table_1.18.4   
#> [58] Rcpp_1.1.1-1.1       glue_1.8.1           statmod_1.5.2       
#> [61] xfun_0.58            tibble_3.3.1         tidyselect_1.2.1    
#> [64] sys_3.4.3            knitr_1.51           farver_2.1.2        
#> [67] xtable_1.8-8         htmltools_0.5.9      maketools_1.3.2     
#> [70] labeling_0.4.3       compiler_4.6.0       S7_0.2.2