LIPIDIFy provides an end-to-end pipeline for mass-spectrometry lipidomics data:
| Step | What it does | Key function(s) |
|---|---|---|
| 1 | Load data | generate_example_data(),
load_lipidomics_data_from_df() |
| 2 | Classify lipids | classify_lipids() |
| 3 | Visualize raw data | visualize_raw_data_improved() |
| 4 | Compare and apply normalization | apply_normalizations(),
create_pipeline_plot() |
| 4.5 | Impute missing values | impute_missing_values() |
| 4.6 | Correct batch effects | correct_batch_effects() |
| 5 | Quality control (PCA) | perform_pca(),
create_pca_plot_with_ellipses() |
| 6 | Differential analysis | perform_differential_analysis() |
| 7 | Visualize results | create_volcano_plot_labeled(),
create_heatmap_robust() |
| 8 | Lipid expression plots | create_lipid_expression_barplot() |
| 9 | Enrichment analysis | perform_enrichment_analysis() |
The same workflow is available without any coding via the Shiny application:
generate_example_data() creates a reproducible synthetic
dataset with realistic lipid names (PC, PE, TG, SM, Cer, etc.) and four
sample groups (GroupA, GroupB, GroupC, GroupD; 5 replicates each; 301
lipids total).
library(LIPIDIFy)
# Generate the built-in synthetic dataset (seed = 42 for reproducibility)
raw_df <- generate_example_data()
data_obj <- load_lipidomics_data_from_df(raw_df)
cat("Samples :", nrow(data_obj$numeric_data), "\n")
#> Samples : 20
cat("Lipids :", ncol(data_obj$numeric_data), "\n")
#> Lipids : 301
cat("Groups :", paste(unique(data_obj$metadata$`Sample Group`),
collapse = ", "), "\n")
#> Groups : GroupA, GroupB, GroupC, GroupDload_lipidomics_data_from_df() returns a named
list with three elements:
$numeric_data - matrix of lipid abundances (samples x
lipids)$metadata - data frame of sample metadata$data - the original full data frame# Metadata of the first 4 samples
head(data_obj$metadata[, c("Sample Name", "Sample Group", "Tumour ID")], 4)
#> Sample Name Sample Group Tumour ID
#> Sample_1 Sample_1 GroupA 1
#> Sample_2 Sample_2 GroupA 1
#> Sample_3 Sample_3 GroupA 2
#> Sample_4 Sample_4 GroupA 2
# First 3 lipids of the first 4 samples
data_obj$numeric_data[1:4, 1:3]
#> PC 18:1_24:1 PC 21:0_24:6 PC 14:0_18:4
#> Sample_1 663.143 5815.674 2227.466
#> Sample_2 10415.648 10528.302 7180.788
#> Sample_3 3275.850 3662.961 7618.975
#> Sample_4 2993.241 12752.211 1606.933classify_lipids() parses lipid names and assigns three
classification levels:
lipid_names <- colnames(data_obj$numeric_data)
classification <- classify_lipids(lipid_names)
head(classification, 8)
#> Lipid LipidGroup LipidType Saturation
#> 1 PC 18:1_24:1 Glycerophospholipids Phosphatidylcholine PUFA
#> 2 PC 21:0_24:6 Glycerophospholipids Phosphatidylcholine PUFA
#> 3 PC 14:0_18:4 Glycerophospholipids Phosphatidylcholine PUFA
#> 4 PC 22:2_24:1 Glycerophospholipids Phosphatidylcholine PUFA
#> 5 PC 16:2_20:1 Glycerophospholipids Phosphatidylcholine PUFA
#> 6 PC 21:1_20:2 Glycerophospholipids Phosphatidylcholine PUFA
#> 7 PC 17:2_16:4 Glycerophospholipids Phosphatidylcholine PUFA
#> 8 PC 14:1_14:4 Glycerophospholipids Phosphatidylcholine PUFAcat("Lipid groups:\n")
#> Lipid groups:
print(sort(table(classification$LipidGroup), decreasing = TRUE))
#>
#> Glycerophospholipids Glycerolipids Sphingolipids
#> 117 65 50
#> Lysophospholipids Sterol Lipids
#> 44 25
cat("\nFatty-acid saturation:\n")
#>
#> Fatty-acid saturation:
print(table(classification$Saturation))
#>
#> MUFA PUFA SFA
#> 43 211 47Examining distributions before normalization helps detect outlier samples, batch effects, and scaling differences.
raw_list <- list(numeric_data = data_obj$numeric_data,
metadata = data_obj$metadata)
visualize_raw_data_improved(
raw_list,
plot_type = "boxplot",
view_mode = "sample",
metadata = data_obj$metadata,
group_column = "Sample Group"
)Per-sample intensity distributions before normalization. Boxes are coloured by group.
visualize_raw_data_improved(
raw_list,
plot_type = "density",
view_mode = "sample",
metadata = data_obj$metadata,
group_column = "Sample Group"
)Intensity density curves before normalization, one curve per sample.
It is good practice to compare normalization strategies visually before committing to one. Below we compare TIC+Log2 (the most common starting point) against PQN+Log2 (more robust when large fold-changes are expected).
pipeline1 <- apply_normalizations(data_obj$numeric_data, c("TIC", "Log2"))
pipeline2 <- apply_normalizations(data_obj$numeric_data, c("PQN", "Log2"))
p1 <- create_pipeline_plot(pipeline1,
title = "Pipeline 1: TIC + Log2",
metadata = data_obj$metadata,
plot_type = "boxplot")
p2 <- create_pipeline_plot(pipeline2,
title = "Pipeline 2: PQN + Log2",
metadata = data_obj$metadata,
plot_type = "boxplot")
gridExtra::grid.arrange(p1, p2, ncol = 1)Side-by-side comparison of two normalization pipelines applied to the same raw data.
norm_data <- apply_normalizations(
data_obj$numeric_data,
methods = c("TIC", "Log2")
)
dim(norm_data) # samples x lipids - same dimensions as raw data
#> [1] 20 301norm_list <- list(numeric_data = norm_data, metadata = data_obj$metadata)
visualize_raw_data_improved(
norm_list,
plot_type = "boxplot",
view_mode = "sample",
metadata = data_obj$metadata,
group_column = "Sample Group"
)Per-sample distributions after TIC + Log2 normalization. Medians are now aligned across groups.
Imputing after normalisation ensures that imputed values sit on the same scale as observed values, and that KNN imputation operates on data where samples are already comparable.
# Inspect missingness in the normalised data (none here -- this is synthetic)
n_miss <- sum(is.na(norm_data))
cat("Missing values in normalised data:", n_miss, "\n")
#> Missing values in normalised data: 0
# Available methods
cat("Imputation methods available:\n")
#> Imputation methods available:
print(get_imputation_methods())
#> [1] "half_min" "min" "zero" "mean" "median" "knn"# For demonstration: introduce 5% artificial missing values
set.seed(42)
demo_norm <- norm_data
demo_norm[sample(length(demo_norm), size = round(0.05 * length(demo_norm)))] <- NA
cat("Missingness introduced:", sum(is.na(demo_norm)),
sprintf("(%.1f%%)\n", 100 * mean(is.na(demo_norm))))
#> Missingness introduced: 301 (5.0%)
# Apply half-minimum imputation on normalised data
imputed_norm <- impute_missing_values(demo_norm, method = "half_min")
cat("Missing after imputation:", sum(is.na(imputed_norm)), "\n")
#> Missing after imputation: 0Choosing a method:
| Method | Best for |
|---|---|
half_min |
Most lipidomics data; models the LOD on normalised scale |
knn |
Data missing at random (MAR); requires impute
Bioconductor package |
zero |
Only when absence of signal truly means zero |
mean / median |
Small proportions of missing values |
For the remainder of this vignette we use the complete normalised data without artificial missingness.
When samples were processed across multiple runs, instruments, or operators, technical batch effects can obscure biological differences. Batch correction should be applied after normalisation and imputation, and verified with PCA.
Two methods are available:
| Method | When to use |
|---|---|
"limma" |
Default; fast and reliable for balanced designs |
"combat" |
Better for large or unbalanced batch effects; requires
sva package |
# Simulate batch labels (real data would have this in the metadata file)
data_obj$metadata$Batch <- rep(c("Run1", "Run2"), each = 10)
# Apply limma batch correction after normalisation
norm_corrected <- correct_batch_effects(
data_matrix = norm_data,
metadata = data_obj$metadata,
batch_column = "Batch",
group_column = "Sample Group",
method = "limma"
)
# Verify: PCA should now show clustering by biology, not by batch
pca_corrected <- perform_pca(norm_corrected, data_obj$metadata, "Sample Group")
create_pca_plot_with_ellipses(
pca_data = pca_corrected$pca_data,
variance_explained = pca_corrected$variance_explained,
ellipse_type = "confidence",
title = "PCA after Batch Correction"
)For the remainder of this vignette we proceed with the uncorrected normalised data (the example dataset has no batch structure).
PCA on normalized data should show samples clustering by biological group rather than by technical artefact.
pca_res <- perform_pca(norm_data, data_obj$metadata, "Sample Group")
create_pca_plot_with_ellipses(
pca_data = pca_res$pca_data,
variance_explained = pca_res$variance_explained,
ellipse_type = "confidence",
show_sample_labels = TRUE,
title = "PCA - TIC + Log2 Normalized Data"
)PCA of TIC + Log2 normalized data. Ellipses are 95% confidence regions.
create_default_contrasts() generates all pairwise
comparisons from the groups present in your data. You can also specify
contrasts manually.
groups <- unique(data_obj$metadata$`Sample Group`)
contrasts <- create_default_contrasts(groups)
cat("All pairwise contrasts (", length(contrasts), "total ):\n")
#> All pairwise contrasts ( 6 total ):
print(contrasts)
#> [1] "GroupB - GroupA" "GroupC - GroupA" "GroupD - GroupA" "GroupC - GroupB"
#> [5] "GroupD - GroupB" "GroupD - GroupC"diff_res <- perform_differential_analysis(
data_matrix = norm_data,
metadata = data_obj$metadata,
group_column = "Sample Group",
contrasts_list = contrasts,
method = "limma"
)cat(sprintf("%-28s %5s %5s %4s %5s\n",
"Contrast", "Total", "Sig", "Up", "Down"))
#> Contrast Total Sig Up Down
cat(strrep("-", 55), "\n")
#> -------------------------------------------------------
for (nm in names(diff_res$results)) {
res <- diff_res$results[[nm]]
n_sig <- sum(res$adj.P.Val < 0.05, na.rm = TRUE)
n_up <- sum(res$adj.P.Val < 0.05 & res$logFC > 0, na.rm = TRUE)
n_down <- sum(res$adj.P.Val < 0.05 & res$logFC < 0, na.rm = TRUE)
cat(sprintf("%-28s %5d %5d %4d %5d\n", nm, nrow(res), n_sig, n_up, n_down))
}
#> GroupB - GroupA 301 0 0 0
#> GroupC - GroupA 301 1 1 0
#> GroupD - GroupA 301 0 0 0
#> GroupC - GroupB 301 2 2 0
#> GroupD - GroupB 301 2 2 0
#> GroupD - GroupC 301 0 0 0first_contrast <- names(diff_res$results)[1]
res_df <- diff_res$results[[first_contrast]]
res_df <- res_df[order(res_df$adj.P.Val), ]
top10 <- utils::head(res_df[, c("logFC", "AveExpr", "adj.P.Val")], 10)
top10$logFC <- round(top10$logFC, 3)
top10$AveExpr <- round(top10$AveExpr, 2)
top10$adj.P.Val <- signif(top10$adj.P.Val, 3)
cat("Top 10 features in:", first_contrast, "\n\n")
#> Top 10 features in: GroupB - GroupA
print(top10)
#> logFC AveExpr adj.P.Val
#> PC 16:0_19:4 3.204 12.10 0.0966
#> TG 20:0_20:3 -3.013 11.86 0.0966
#> DG 15:2_23:4 2.961 11.76 0.0966
#> TG 17:0_21:0 2.926 11.67 0.0966
#> PI 15:2_18:2 -2.743 11.31 0.1370
#> TG 15:0_24:6 -2.724 11.37 0.1370
#> PC 16:0_24:2 -2.602 11.30 0.1640
#> CE 18:2_24:6 -2.591 11.29 0.1640
#> TG 20:2_16:4 2.528 11.83 0.1690
#> PE 21:1_20:3 2.505 12.20 0.1690Significant lipids (|logFC| > 1 and adj.P.Val < 0.05) are coloured by lipid class. The top 10 most significant are labelled.
create_volcano_plot_labeled(
results = diff_res$results[[first_contrast]],
title = paste("Volcano:", first_contrast),
logfc_threshold = 1,
pval_threshold = 0.05,
top_labels = 10,
classification_data = classification,
color_by = "LipidGroup"
)Volcano plot coloured by lipid class. Vertical dashed lines mark logFC = +/-1; horizontal dashed line marks adj.P.Val = 0.05.
sig_lipids <- rownames(res_df)[
!is.na(res_df$adj.P.Val) &
res_df$adj.P.Val < 0.05 &
abs(res_df$logFC) > 1
]
if (length(sig_lipids) >= 5) {
top_n <- min(30, length(sig_lipids))
avail <- intersect(sig_lipids[seq_len(top_n)], colnames(norm_data))
hm_mat <- t(norm_data[, avail, drop = FALSE])
create_heatmap_robust(
data_matrix = hm_mat,
metadata = data_obj$metadata,
group_column = "Sample Group",
top_n = top_n,
title = paste("Significant lipids:", first_contrast)
)
} else {
message("Fewer than 5 significant lipids at these thresholds.")
}create_lipid_expression_barplot() shows per-sample
abundance of individual lipids, sorted by group. It returns a single
ggplot object when one lipid is selected, or a named list of ggplot
objects when multiple lipids are selected.
# Take the 3 most significantly changed lipids
top_lipids <- utils::head(
rownames(res_df)[!is.na(res_df$adj.P.Val)], 3
)
cat("Selected lipids:\n")
#> Selected lipids:
print(top_lipids)
#> [1] "PC 16:0_19:4" "TG 20:0_20:3" "DG 15:2_23:4"
expr_result <- create_lipid_expression_barplot(
data_matrix = norm_data,
metadata = data_obj$metadata,
selected_lipids = top_lipids,
group_column = "Sample Group",
data_type = "normalized"
)
# The function returns a single ggplot (1 lipid) or a named list (multiple)
if (inherits(expr_result, "ggplot")) {
print(expr_result)
} else {
for (lipid_plot in expr_result) print(lipid_plot)
}Per-sample abundance of the 3 most significant lipids. Samples are automatically sorted by group.
Per-sample abundance of the 3 most significant lipids. Samples are automatically sorted by group.
Per-sample abundance of the 3 most significant lipids. Samples are automatically sorted by group.
Enrichment analysis tests whether lipids in a particular class are collectively more increased or decreased than expected by chance, using the fgsea algorithm.
enrich_res <- perform_enrichment_analysis(
results_list = diff_res$results,
classification_data = classification,
min_set_size = 3,
max_set_size = 500
)
cat("Contrasts with enrichment results:", length(enrich_res), "\n")
#> Contrasts with enrichment results: 6
cat("Categories per contrast :", paste(names(enrich_res[[1]]),
collapse = ", "), "\n")
#> Categories per contrast : LipidGroup, LipidType, Saturation# Show top enrichment results for the first contrast, by LipidGroup
grp_enrich <- enrich_res[[first_contrast]][["LipidGroup"]]
if (!is.null(grp_enrich) && nrow(grp_enrich) > 0) {
grp_enrich <- grp_enrich[order(grp_enrich$pval), ]
sig <- grp_enrich[grp_enrich$padj < 0.25, ]
if (nrow(sig) > 0) {
cat("Significantly enriched lipid groups (padj < 0.25) in:", first_contrast, "\n\n")
out <- sig[, c("pathway", "NES", "pval", "padj", "size")]
out$NES <- round(out$NES, 3)
out$pval <- signif(out$pval, 3)
out$padj <- signif(out$padj, 3)
print(out, row.names = FALSE)
} else {
cat("No lipid groups reach padj < 0.25 in this contrast.\n")
cat("Top 5 by p-value (", first_contrast, "):\n\n")
out <- utils::head(grp_enrich[, c("pathway", "NES", "pval", "padj", "size")], 5)
out$NES <- round(out$NES, 3)
out$pval <- signif(out$pval, 3)
out$padj <- signif(out$padj, 3)
print(out, row.names = FALSE)
}
} else {
cat("No enrichment results available.\n")
}
#> No lipid groups reach padj < 0.25 in this contrast.
#> Top 5 by p-value ( GroupB - GroupA ):
#>
#> pathway NES pval padj size
#> <char> <num> <num> <num> <int>
#> Glycerolipids 1.210 0.128 0.609 65
#> Glycerophospholipids -1.119 0.244 0.609 117
#> Sterol Lipids 0.878 0.662 0.985 25
#> Sphingolipids -0.770 0.862 0.985 50
#> Lysophospholipids -0.625 0.985 0.985 44if (!is.null(grp_enrich) && nrow(grp_enrich) > 0) {
create_enrichment_dotplot(
enrichment_data = grp_enrich,
title = paste("Lipid Group Enrichment:", first_contrast),
max_pathways = 10
)
}Enrichment dot plot for the first contrast. Dot size reflects set size; colour reflects statistical significance.
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] rmarkdown_2.31 LIPIDIFy_0.99.0 BiocStyle_2.41.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.10 generics_0.1.4 tidyr_1.3.2
#> [4] fgsea_1.39.0 lattice_0.22-9 stringi_1.8.7
#> [7] digest_0.6.39 magrittr_2.0.5 evaluate_1.0.5
#> [10] grid_4.6.0 estimability_1.5.1 RColorBrewer_1.1-3
#> [13] mvtnorm_1.4-1 fastmap_1.2.0 Matrix_1.7-5
#> [16] jsonlite_2.0.0 ggrepel_0.9.8 limma_3.69.2
#> [19] gridExtra_2.3 BiocManager_1.30.27 purrr_1.2.2
#> [22] scales_1.4.0 codetools_0.2-20 jquerylib_0.1.4
#> [25] cli_3.6.6 rlang_1.2.0 scatterplot3d_0.3-45
#> [28] cowplot_1.2.0 leaps_3.2 withr_3.0.2
#> [31] cachem_1.1.0 yaml_2.3.12 otel_0.2.0
#> [34] FactoMineR_2.14 parallel_4.6.0 tools_4.6.0
#> [37] multcompView_0.1-11 BiocParallel_1.47.0 dplyr_1.2.1
#> [40] ggplot2_4.0.3 fastmatch_1.1-8 DT_0.34.0
#> [43] flashClust_1.1-4 buildtools_1.0.0 vctrs_0.7.3
#> [46] R6_2.6.1 lifecycle_1.0.5 emmeans_2.0.3
#> [49] stringr_1.6.0 htmlwidgets_1.6.4 MASS_7.3-65
#> [52] cluster_2.1.8.2 pkgconfig_2.0.3 pillar_1.11.1
#> [55] bslib_0.11.0 gtable_0.3.6 data.table_1.18.4
#> [58] Rcpp_1.1.1-1.1 glue_1.8.1 statmod_1.5.2
#> [61] xfun_0.58 tibble_3.3.1 tidyselect_1.2.1
#> [64] sys_3.4.3 knitr_1.51 farver_2.1.2
#> [67] xtable_1.8-8 htmltools_0.5.9 maketools_1.3.2
#> [70] labeling_0.4.3 compiler_4.6.0 S7_0.2.2