LIPIDIFy: Comprehensive Lipidomics Data Analysis

Introduction

LIPIDIFy is a comprehensive R package designed for the analysis of lipidomics data. It provides tools for:

  • Data preprocessing and quality control
  • Multiple normalization strategies
  • Differential abundance analysis (limma and EdgeR)
  • Enrichment analysis with flexible classification systems
  • Extensive visualization capabilities
  • Interactive Shiny interface for non-programmers

Installation

# From BioConductor (once submitted)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("LIPIDIFy")

# Or from GitHub
# devtools::install_github("yourusername/LIPIDIFy")
library(LIPIDIFy)

Quick Start

Using the Shiny Interface

For biologists and those preferring a graphical interface:

launch_lipidomics_app()

This launches an interactive web application with tabs for each analysis step.

Using R Functions

For bioinformaticians and reproducible analyses:

# Load data
data <- load_lipidomics_data("path/to/your/data.csv")

# Classify lipids
classification <- get_lipid_classification(colnames(data$numeric_data))

# Normalize
normalized <- normalize_lipidomics_data(data$numeric_data, 
                                        methods = c("TIC", "Log2"))

# Differential analysis
diff_results <- perform_differential_analysis(normalized, data$metadata,
                                              method = "limma")

# Enrichment analysis
enrichment <- perform_enrichment_analysis(diff_results$results, 
                                          classification)

Data Input and Format

Required Data Structure

Your input CSV file should have:

  1. Metadata columns (first few columns):
    • Sample Name (required)
    • Sample Group (required)
    • Additional metadata (optional: Tumour ID, Weight, etc.)
  2. Lipid abundance columns (remaining columns):
    • Column names = Lipid names (e.g., “PC 16:0_18:1”)
    • Values = Abundance/intensity measurements

Example Data

The package includes example data:

example_data <- generate_example_data()
head(example_data[, 1:8])
#>          Sample Name Sample Group Tumour ID Weight (mg) PC 21:1_20:3
#> Sample_1    Sample_1       GroupA         1    44.13763     2997.316
#> Sample_2    Sample_2       GroupA         1    73.50917     4304.851
#> Sample_3    Sample_3       GroupA         2    39.81207     1058.812
#> Sample_4    Sample_4       GroupA         2    37.48295     1772.537
#> Sample_5    Sample_5       GroupA         3    49.59876     3576.774
#> Sample_6    Sample_6       GroupB         3    31.53722     1523.911
#>          PC 19:0_16:0 PC 21:2_22:0 PC 22:2_24:1
#> Sample_1     7669.117    1832.0385    3599.9755
#> Sample_2     2019.737     825.7991    4123.0720
#> Sample_3     3242.445    4097.9638     694.2239
#> Sample_4     6193.583    3328.1584    3298.3352
#> Sample_5     2529.651    5679.0787   22619.5831
#> Sample_6     5554.072   18798.9605    2951.1282

This generates synthetic data with 4 groups (GroupA-D) and realistic lipid names.

Loading Your Own Data

# Load from CSV file
my_data <- load_lipidomics_data("path/to/mydata.csv",
                                metadata_columns = c("Sample Name", 
                                                     "Sample Group",
                                                     "Treatment"))

# Components returned:
# $data - full data frame
# $metadata - metadata only
# $numeric_data - lipid abundances only

Lipid Classification

Automatic Classification

Lipids are automatically classified based on their names:

# Using example data
lipid_names <- colnames(example_data)[5:ncol(example_data)]
classification <- get_lipid_classification(lipid_names[1:10])
print(classification)
#>           Lipid           LipidGroup           LipidType Saturation
#> 1  PC 21:1_20:3 Glycerophospholipids Phosphatidylcholine       PUFA
#> 2  PC 19:0_16:0 Glycerophospholipids Phosphatidylcholine        SFA
#> 3  PC 21:2_22:0 Glycerophospholipids Phosphatidylcholine       PUFA
#> 4  PC 22:2_24:1 Glycerophospholipids Phosphatidylcholine       PUFA
#> 5  PC 15:1_19:6 Glycerophospholipids Phosphatidylcholine       PUFA
#> 6  PC 21:2_21:4 Glycerophospholipids Phosphatidylcholine       PUFA
#> 7  PC 16:1_21:6 Glycerophospholipids Phosphatidylcholine       PUFA
#> 8  PC 17:0_23:2 Glycerophospholipids Phosphatidylcholine       PUFA
#> 9  PC 20:0_24:6 Glycerophospholipids Phosphatidylcholine       PUFA
#> 10 PC 21:1_21:3 Glycerophospholipids Phosphatidylcholine       PUFA

Classification includes: - LipidGroup: Main category (Glycerophospholipids, Sphingolipids, etc.) - LipidType: Specific class (Phosphatidylcholine, Ceramide, etc.) - Saturation: Fatty acid saturation (SFA, MUFA, PUFA)

Custom Classification

You can provide your own classification with flexible hierarchy:

# Create custom classification CSV:
# Lipid, MainClass, SubClass, Function
# PC 16:0_18:1, Phospholipids, Choline-containing, Membrane
# ...

custom_class <- load_custom_classification("my_classification.csv")

Export Classification

# Save classification for later use or sharing
export_classification(classification, "lipid_classification.csv")

Data Visualization

Raw Data Exploration

# Sample-wise boxplot
plot <- visualize_raw_data_improved(list(numeric_data = example_data[, 5:ncol(example_data)]),
                                    plot_type = "boxplot",
                                    view_mode = "sample")
print(plot)

# Lipid-wise distribution for top variable lipids
plot2 <- visualize_raw_data_improved(list(numeric_data = example_data[, 5:ncol(example_data)]),
                                     plot_type = "boxplot",
                                     view_mode = "lipid",
                                     top_n = 30)
print(plot2)

Individual Lipid Expression

Visualize specific lipids across samples:

# Prepare data
data_matrix <- as.matrix(example_data[, 5:ncol(example_data)])
rownames(data_matrix) <- example_data$`Sample Name`
metadata <- example_data[, 1:4]

# Create barplot for selected lipids
plots <- create_lipid_expression_barplot(
  data_matrix = data_matrix,
  metadata = metadata,
  selected_lipids = c("PC 16:0_18:1", "PE 18:0_20:4"),
  data_type = "raw"
)

# Display plots
if (is.list(plots)) {
  print(plots[[1]])
  print(plots[[2]])
} else {
  print(plots)
}

Normalization

Missing Value Imputation

Missing values in lipidomics data arise when a lipid’s signal falls below the instrument’s limit of detection. LIPIDIFy provides six imputation strategies via impute_missing_values(). Imputing after normalisation ensures that imputed values are on the same scale as observed values.

Available Methods

print(get_imputation_methods())
#> [1] "half_min" "min"      "zero"     "mean"     "median"   "knn"
# Check missingness in normalised data
normalized <- normalize_lipidomics_data(data$numeric_data, methods = c("TIC", "Log2"))
cat("Missing values:", sum(is.na(normalized)), "\n")

# Half-minimum imputation on the normalised matrix
imputed <- impute_missing_values(normalized, method = "half_min")

# KNN imputation (requires Bioconductor impute package)
# imputed <- impute_missing_values(normalized, method = "knn", k = 5)
Method Best for
half_min MS lipidomics (models LOD on normalised scale) – recommended default
knn Data missing at random; requires impute Bioconductor package
zero Only when absence of signal truly means zero
mean / median Small proportions of missing values

Batch Effect Correction

Technical batch effects arise from differences in sample preparation or instrument performance across analytical runs. correct_batch_effects() removes these effects while preserving biological signal.

# Assumes your metadata has a "Batch" column
corrected <- correct_batch_effects(
  data_matrix  = normalized,
  metadata     = data$metadata,
  batch_column = "Batch",
  group_column = "Sample Group",
  method       = "limma"           # or "combat" (requires sva package)
)

# Verify with PCA: samples should cluster by biology, not batch
pca_post <- perform_pca(corrected, data$metadata, "Sample Group")
create_pca_plot_with_ellipses(pca_post$pca_data, pca_post$variance_explained,
                              ellipse_type = "confidence",
                              title = "PCA after Batch Correction")

Multiple normalization methods are available:

# Get list of available methods
methods <- get_normalization_methods()
print(methods)
#>  [1] "TIC"        "PQN"        "Quantile"   "Log2Median" "Median"    
#>  [6] "Mean"       "Log2"       "Log10"      "Sqrt"       "None"

# Apply normalization pipeline
normalized <- normalize_lipidomics_data(
  data = as.matrix(example_data[, 5:ncol(example_data)]),
  methods = c("TIC", "Log2")
)

# Compare different pipelines
pipeline1 <- normalize_lipidomics_data(example_data[, 5:ncol(example_data)], 
                                       c("TIC", "Log2"))
pipeline2 <- normalize_lipidomics_data(example_data[, 5:ncol(example_data)], 
                                       c("PQN", "Log2"))

Normalization Methods Available

  • TIC: Total Ion Current
  • PQN: Probabilistic Quotient Normalization
  • Quantile: Quantile normalization
  • Log2Median: Log2 transformation followed by median centering
  • Median/Mean: Centering methods
  • Log2/Log10: Log transformations
  • Sqrt: Square root transformation

Principal Component Analysis

PCA with optional confidence ellipses:

# Prepare data
metadata <- example_data[, 1:4]
data_matrix <- t(as.matrix(example_data[, 5:ncol(example_data)]))

# Perform PCA
pca_result <- perform_pca(data_matrix, metadata, group_column = "Sample Group")

# Plot without ellipses
plot1 <- create_pca_plot_with_ellipses(
  pca_data = pca_result$pca_data,
  variance_explained = pca_result$variance_explained,
  ellipse_type = "none"
)

# Plot with confidence ellipses (95%)
plot2 <- create_pca_plot_with_ellipses(
  pca_data = pca_result$pca_data,
  variance_explained = pca_result$variance_explained,
  ellipse_type = "confidence",
  confidence_level = 0.95
)

# Plot with visual circles
plot3 <- create_pca_plot_with_ellipses(
  pca_data = pca_result$pca_data,
  variance_explained = pca_result$variance_explained,
  ellipse_type = "visual"
)

Ellipse Types: - none: Raw points only - confidence: Statistical confidence regions (default 95%) - visual: Simple visual circles to aid group identification

Differential Abundance Analysis

Using limma

# Perform differential analysis with limma
diff_results_limma <- perform_differential_analysis(
  data_matrix = normalized,
  metadata = metadata,
  group_column = "Sample Group",
  contrasts_list = c("GroupB - GroupA", "GroupC - GroupA"),
  method = "limma"
)

# View results for first contrast
head(diff_results_limma$results[[1]])

Using EdgeR

# Perform differential analysis with EdgeR
diff_results_edger <- perform_differential_analysis(
  data_matrix = normalized,
  metadata = metadata,
  group_column = "Sample Group",
  contrasts_list = c("GroupB - GroupA", "GroupC - GroupA"),
  method = "edger"
)

# View results
head(diff_results_edger$results[[1]])

Comparing Methods

Both methods provide similar outputs: - logFC: Log fold change - P.Value: Raw p-value - adj.P.Val: Adjusted p-value (FDR) - AveExpr: Average expression

When to use which: - limma: Default choice for lipidomics, fast and robust - EdgeR: Alternative method, originally for count data but adaptable

Contrast Selection

Specify only the contrasts of interest:

# Create all pairwise contrasts
groups <- c("GroupA", "GroupB", "GroupC", "GroupD")
all_contrasts <- create_default_contrasts(groups)
print(all_contrasts)
#> [1] "GroupB - GroupA" "GroupC - GroupA" "GroupD - GroupA" "GroupC - GroupB"
#> [5] "GroupD - GroupB" "GroupD - GroupC"

# Or specify custom contrasts
custom_contrasts <- c("GroupB - GroupA", 
                     "GroupC - GroupA", 
                     "GroupD - GroupA")

Visualization of Results

Volcano Plots

# Create volcano plot
volcano <- create_volcano_plot_labeled(
  results = diff_results_limma$results[[1]],
  title = "GroupB vs GroupA",
  top_labels = 10,
  pval_threshold = 0.05,
  logfc_threshold = 1
)
print(volcano)

Heatmaps

# create_heatmap_robust expects features as rows, samples as columns
heatmap <- create_heatmap_robust(
  data_matrix  = t(normalized[, sig_lipids, drop = FALSE]),
  metadata     = metadata,
  group_column = "Sample Group",
  top_n        = length(sig_lipids),
  title        = "Significant lipids"
)

Enrichment Analysis

Standard Enrichment

Using automatic lipid classification:

enrichment_results <- perform_enrichment_analysis(
  results_list = diff_results_limma$results,
  classification_data = classification,
  min_set_size = 5,
  max_set_size = 500
)

# View results for first contrast, lipid group enrichment
head(enrichment_results[[1]]$LipidGroup)

Custom Enrichment Sets

Define your own functional sets:

# Create custom sets CSV:
# Lipid,Set_Name
# PC 16:0_18:1,Membrane_Lipids
# PE 18:0_20:4,Membrane_Lipids
# TG 16:0_18:1_20:4,Storage_Lipids
# ...

custom_sets <- load_custom_enrichment_sets("my_sets.csv")

# Run enrichment with custom sets
enrichment_custom <- perform_enrichment_analysis(
  results_list = diff_results_limma$results,
  classification_data = classification,
  custom_sets = custom_sets
)

Enrichment Visualization

# Dotplot
dotplot <- create_enrichment_dotplot(
  enrichment_data = enrichment_results[[1]]$LipidGroup,
  title = "Lipid Group Enrichment"
)

# Barplot
barplot <- create_enrichment_barplot(
  enrichment_data = enrichment_results[[1]]$LipidGroup,
  title = "Lipid Group Enrichment",
  top_n = 15
)

Complete Analysis Workflow

Here’s a complete example workflow:

# 1. Load data
data <- load_lipidomics_data("your_data.csv")

# 2. Classify lipids
classification <- get_lipid_classification(colnames(data$numeric_data))
export_classification(classification, "classification.csv")

# 3. Visualize raw data
raw_plot <- visualize_raw_data_improved(data, plot_type = "boxplot")

# 4. Normalize
normalized <- normalize_lipidomics_data(data$numeric_data, 
                                        methods = c("TIC", "Log2"))

# 5. PCA
pca_result <- perform_pca(t(normalized), data$metadata)
pca_plot <- create_pca_plot_with_ellipses(pca_result$pca_data,
                                          pca_result$variance_explained,
                                          ellipse_type = "confidence")

# 6. Differential analysis
diff_results <- perform_differential_analysis(
  normalized, data$metadata,
  contrasts_list = c("Treatment - Control"),
  method = "limma"
)

# 7. Volcano plot
volcano <- create_volcano_plot_labeled(diff_results$results[[1]])

# 8. Enrichment
enrichment <- perform_enrichment_analysis(diff_results$results, 
                                          classification)

# 9. Enrichment visualization
enrich_plot <- create_enrichment_dotplot(enrichment[[1]]$LipidGroup)

# 10. Save results
write.csv(diff_results$results[[1]], "differential_results.csv")
write.csv(enrichment[[1]]$LipidGroup, "enrichment_results.csv")

Advanced Topics

Batch Effect Correction

If you have batch information, incorporate it into your design:

# Add batch to metadata
data$metadata$Batch <- rep(c("Batch1", "Batch2"), each = 10)

# Include in differential analysis design
# This requires custom design matrix - see limma documentation

Integration with Other Tools

Export results for use in other software:

# Export for MetaboAnalyst
write.csv(normalized, "normalized_for_metaboanalyst.csv")

# Export for Cytoscape
write.csv(enrichment[[1]]$LipidGroup, "enrichment_for_cytoscape.csv")

Quality Control

# Check for outliers in PCA
pca_result <- perform_pca(t(normalized), data$metadata)

# Samples with PC1 > 3 SD might be outliers
outliers <- pca_result$pca_data[abs(pca_result$pca_data$PC1) > 
                                3 * sd(pca_result$pca_data$PC1), ]
print(outliers)

# Check coefficient of variation
cv <- apply(data$numeric_data, 2, function(x) sd(x) / mean(x) * 100)
high_cv_lipids <- names(cv)[cv > 50]

Session Information

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] LIPIDIFy_0.99.0  BiocStyle_2.41.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.7.3         cli_3.6.6           knitr_1.51         
#>  [4] rlang_1.2.0         xfun_0.58           stringi_1.8.7      
#>  [7] otel_0.2.0          jsonlite_2.0.0      glue_1.8.1         
#> [10] buildtools_1.0.0    htmltools_0.5.9     maketools_1.3.2    
#> [13] sys_3.4.3           sass_0.4.10         rmarkdown_2.31     
#> [16] evaluate_1.0.5      jquerylib_0.1.4     fastmap_1.2.0      
#> [19] yaml_2.3.12         lifecycle_1.0.5     BiocManager_1.30.27
#> [22] stringr_1.6.0       compiler_4.6.0      digest_0.6.39      
#> [25] R6_2.6.1            magrittr_2.0.5      bslib_0.11.0       
#> [28] tools_4.6.0         cachem_1.1.0

References

  • Ritchie ME, et al. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research.
  • Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics.
  • Korotkevich G, et al. (2021). Fast gene set enrichment analysis. bioRxiv.

Appendix: Supported Lipid Nomenclature

The package recognizes standard lipid nomenclature:

  • Glycerophospholipids: PC, PE, PI, PS, PG, PA
  • Lysophospholipids: LPC, LPE, LPI, LPS, LPG, LPA
  • Sphingolipids: SM, Cer, dhCer, HexCer, Gangliosides
  • Glycerolipids: DG, TG, MG
  • Sterol Lipids: CE, Cholesterol
  • Acylcarnitines: CAR

Format examples: - PC 16:0_18:1 (individual fatty acids) - PC 34:1 (total carbon:double bonds) - LPC 18:2 (single fatty acid) - TG 16:0_18:1_20:4 (three fatty acids)