--- title: "LIPIDIFy: A Complete Lipidomics Analysis Workflow" author: "Fayrouz Hammal" date: "`r Sys.Date()`" package: LIPIDIFy output: rmarkdown::html_document: toc: true toc_float: true toc_depth: 3 theme: flatly highlight: tango fig_width: 9 fig_height: 5 vignette: > %\VignetteIndexEntry{LIPIDIFy: A Complete Lipidomics Analysis Workflow} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 9, fig.height = 5, fig.align = "center", warning = FALSE, message = FALSE ) ``` ## Overview **LIPIDIFy** provides an end-to-end pipeline for mass-spectrometry lipidomics data: | Step | What it does | Key function(s) | |------|-------------|-----------------| | 1 | Load data | `generate_example_data()`, `load_lipidomics_data_from_df()` | | 2 | Classify lipids | `classify_lipids()` | | 3 | Visualize raw data | `visualize_raw_data_improved()` | | 4 | Compare and apply normalization | `apply_normalizations()`, `create_pipeline_plot()` | | 4.5 | Impute missing values | `impute_missing_values()` | | 4.6 | Correct batch effects | `correct_batch_effects()` | | 5 | Quality control (PCA) | `perform_pca()`, `create_pca_plot_with_ellipses()` | | 6 | Differential analysis | `perform_differential_analysis()` | | 7 | Visualize results | `create_volcano_plot_labeled()`, `create_heatmap_robust()` | | 8 | Lipid expression plots | `create_lipid_expression_barplot()` | | 9 | Enrichment analysis | `perform_enrichment_analysis()` | The same workflow is available without any coding via the Shiny application: ```{r launch-app, eval=FALSE} library(LIPIDIFy) launch_lipidomics_app() ``` *** ## Step 1 - Load the Example Dataset `generate_example_data()` creates a reproducible synthetic dataset with realistic lipid names (PC, PE, TG, SM, Cer, etc.) and four sample groups (GroupA, GroupB, GroupC, GroupD; 5 replicates each; 301 lipids total). ```{r load-data} library(LIPIDIFy) # Generate the built-in synthetic dataset (seed = 42 for reproducibility) raw_df <- generate_example_data() data_obj <- load_lipidomics_data_from_df(raw_df) cat("Samples :", nrow(data_obj$numeric_data), "\n") cat("Lipids :", ncol(data_obj$numeric_data), "\n") cat("Groups :", paste(unique(data_obj$metadata$`Sample Group`), collapse = ", "), "\n") ``` `load_lipidomics_data_from_df()` returns a **named list** with three elements: - `$numeric_data` - matrix of lipid abundances (samples x lipids) - `$metadata` - data frame of sample metadata - `$data` - the original full data frame ```{r peek-data} # Metadata of the first 4 samples head(data_obj$metadata[, c("Sample Name", "Sample Group", "Tumour ID")], 4) # First 3 lipids of the first 4 samples data_obj$numeric_data[1:4, 1:3] ``` *** ## Step 2 - Lipid Classification `classify_lipids()` parses lipid names and assigns three classification levels: - **LipidGroup** - broad category (e.g., Glycerophospholipids, Sphingolipids) - **LipidType** - specific subclass (e.g., Phosphatidylcholine) - **Saturation** - SFA (0 double bonds), MUFA (1), PUFA (> 1) ```{r classify} lipid_names <- colnames(data_obj$numeric_data) classification <- classify_lipids(lipid_names) head(classification, 8) ``` ```{r classify-summary} cat("Lipid groups:\n") print(sort(table(classification$LipidGroup), decreasing = TRUE)) cat("\nFatty-acid saturation:\n") print(table(classification$Saturation)) ``` *** ## Step 3 - Raw Data Visualization Examining distributions **before normalization** helps detect outlier samples, batch effects, and scaling differences. ```{r raw-boxplot, fig.cap="Per-sample intensity distributions before normalization. Boxes are coloured by group."} raw_list <- list(numeric_data = data_obj$numeric_data, metadata = data_obj$metadata) visualize_raw_data_improved( raw_list, plot_type = "boxplot", view_mode = "sample", metadata = data_obj$metadata, group_column = "Sample Group" ) ``` ```{r raw-density, fig.cap="Intensity density curves before normalization, one curve per sample."} visualize_raw_data_improved( raw_list, plot_type = "density", view_mode = "sample", metadata = data_obj$metadata, group_column = "Sample Group" ) ``` *** ## Step 4 - Normalization ### What methods are available? ```{r norm-methods} get_normalization_methods() ``` ### Compare two pipelines side-by-side It is good practice to compare normalization strategies visually before committing to one. Below we compare TIC+Log2 (the most common starting point) against PQN+Log2 (more robust when large fold-changes are expected). ```{r norm-compare, fig.cap="Side-by-side comparison of two normalization pipelines applied to the same raw data.", fig.height=9} pipeline1 <- apply_normalizations(data_obj$numeric_data, c("TIC", "Log2")) pipeline2 <- apply_normalizations(data_obj$numeric_data, c("PQN", "Log2")) p1 <- create_pipeline_plot(pipeline1, title = "Pipeline 1: TIC + Log2", metadata = data_obj$metadata, plot_type = "boxplot") p2 <- create_pipeline_plot(pipeline2, title = "Pipeline 2: PQN + Log2", metadata = data_obj$metadata, plot_type = "boxplot") gridExtra::grid.arrange(p1, p2, ncol = 1) ``` ### Apply the chosen pipeline ```{r normalize} norm_data <- apply_normalizations( data_obj$numeric_data, methods = c("TIC", "Log2") ) dim(norm_data) # samples x lipids - same dimensions as raw data ``` ```{r norm-boxplot, fig.cap="Per-sample distributions after TIC + Log2 normalization. Medians are now aligned across groups."} norm_list <- list(numeric_data = norm_data, metadata = data_obj$metadata) visualize_raw_data_improved( norm_list, plot_type = "boxplot", view_mode = "sample", metadata = data_obj$metadata, group_column = "Sample Group" ) ``` *** ## Step 4.5 - Missing Value Imputation Imputing **after** normalisation ensures that imputed values sit on the same scale as observed values, and that KNN imputation operates on data where samples are already comparable. ```{r impute} # Inspect missingness in the normalised data (none here -- this is synthetic) n_miss <- sum(is.na(norm_data)) cat("Missing values in normalised data:", n_miss, "\n") # Available methods cat("Imputation methods available:\n") print(get_imputation_methods()) ``` ```{r impute-apply} # For demonstration: introduce 5% artificial missing values set.seed(42) demo_norm <- norm_data demo_norm[sample(length(demo_norm), size = round(0.05 * length(demo_norm)))] <- NA cat("Missingness introduced:", sum(is.na(demo_norm)), sprintf("(%.1f%%)\n", 100 * mean(is.na(demo_norm)))) # Apply half-minimum imputation on normalised data imputed_norm <- impute_missing_values(demo_norm, method = "half_min") cat("Missing after imputation:", sum(is.na(imputed_norm)), "\n") ``` **Choosing a method:** | Method | Best for | |--------|----------| | `half_min` | Most lipidomics data; models the LOD on normalised scale | | `knn` | Data missing at random (MAR); requires `impute` Bioconductor package | | `zero` | Only when absence of signal truly means zero | | `mean` / `median` | Small proportions of missing values | For the remainder of this vignette we use the complete normalised data without artificial missingness. *** ## Step 4.6 - Batch Effect Correction When samples were processed across multiple runs, instruments, or operators, technical batch effects can obscure biological differences. Batch correction should be applied after normalisation and imputation, and verified with PCA. Two methods are available: | Method | When to use | |--------|------------| | `"limma"` | Default; fast and reliable for balanced designs | | `"combat"` | Better for large or unbalanced batch effects; requires `sva` package | ```{r batch-correction, eval=FALSE} # Simulate batch labels (real data would have this in the metadata file) data_obj$metadata$Batch <- rep(c("Run1", "Run2"), each = 10) # Apply limma batch correction after normalisation norm_corrected <- correct_batch_effects( data_matrix = norm_data, metadata = data_obj$metadata, batch_column = "Batch", group_column = "Sample Group", method = "limma" ) # Verify: PCA should now show clustering by biology, not by batch pca_corrected <- perform_pca(norm_corrected, data_obj$metadata, "Sample Group") create_pca_plot_with_ellipses( pca_data = pca_corrected$pca_data, variance_explained = pca_corrected$variance_explained, ellipse_type = "confidence", title = "PCA after Batch Correction" ) ``` For the remainder of this vignette we proceed with the uncorrected normalised data (the example dataset has no batch structure). *** ## Step 5 - Quality Control with PCA PCA on normalized data should show samples clustering by biological group rather than by technical artefact. ```{r pca, fig.cap="PCA of TIC + Log2 normalized data. Ellipses are 95% confidence regions."} pca_res <- perform_pca(norm_data, data_obj$metadata, "Sample Group") create_pca_plot_with_ellipses( pca_data = pca_res$pca_data, variance_explained = pca_res$variance_explained, ellipse_type = "confidence", show_sample_labels = TRUE, title = "PCA - TIC + Log2 Normalized Data" ) ``` *** ## Step 6 - Differential Analysis ### Automatic pairwise contrasts `create_default_contrasts()` generates all pairwise comparisons from the groups present in your data. You can also specify contrasts manually. ```{r contrasts} groups <- unique(data_obj$metadata$`Sample Group`) contrasts <- create_default_contrasts(groups) cat("All pairwise contrasts (", length(contrasts), "total ):\n") print(contrasts) ``` ### Run limma ```{r diff-analysis} diff_res <- perform_differential_analysis( data_matrix = norm_data, metadata = data_obj$metadata, group_column = "Sample Group", contrasts_list = contrasts, method = "limma" ) ``` ```{r diff-summary} cat(sprintf("%-28s %5s %5s %4s %5s\n", "Contrast", "Total", "Sig", "Up", "Down")) cat(strrep("-", 55), "\n") for (nm in names(diff_res$results)) { res <- diff_res$results[[nm]] n_sig <- sum(res$adj.P.Val < 0.05, na.rm = TRUE) n_up <- sum(res$adj.P.Val < 0.05 & res$logFC > 0, na.rm = TRUE) n_down <- sum(res$adj.P.Val < 0.05 & res$logFC < 0, na.rm = TRUE) cat(sprintf("%-28s %5d %5d %4d %5d\n", nm, nrow(res), n_sig, n_up, n_down)) } ``` ### Top results from the first contrast ```{r top-results} first_contrast <- names(diff_res$results)[1] res_df <- diff_res$results[[first_contrast]] res_df <- res_df[order(res_df$adj.P.Val), ] top10 <- utils::head(res_df[, c("logFC", "AveExpr", "adj.P.Val")], 10) top10$logFC <- round(top10$logFC, 3) top10$AveExpr <- round(top10$AveExpr, 2) top10$adj.P.Val <- signif(top10$adj.P.Val, 3) cat("Top 10 features in:", first_contrast, "\n\n") print(top10) ``` *** ## Step 7 - Results Visualization ### Volcano plot Significant lipids (|logFC| > 1 and adj.P.Val < 0.05) are coloured by lipid class. The top 10 most significant are labelled. ```{r volcano, fig.cap="Volcano plot coloured by lipid class. Vertical dashed lines mark logFC = +/-1; horizontal dashed line marks adj.P.Val = 0.05."} create_volcano_plot_labeled( results = diff_res$results[[first_contrast]], title = paste("Volcano:", first_contrast), logfc_threshold = 1, pval_threshold = 0.05, top_labels = 10, classification_data = classification, color_by = "LipidGroup" ) ``` ### Heatmap of significant features ```{r heatmap, fig.cap="Heatmap of up to 30 significant lipids. Rows (lipids) and columns (samples) are hierarchically clustered. Colour scale is z-scored per lipid.", fig.height=7} sig_lipids <- rownames(res_df)[ !is.na(res_df$adj.P.Val) & res_df$adj.P.Val < 0.05 & abs(res_df$logFC) > 1 ] if (length(sig_lipids) >= 5) { top_n <- min(30, length(sig_lipids)) avail <- intersect(sig_lipids[seq_len(top_n)], colnames(norm_data)) hm_mat <- t(norm_data[, avail, drop = FALSE]) create_heatmap_robust( data_matrix = hm_mat, metadata = data_obj$metadata, group_column = "Sample Group", top_n = top_n, title = paste("Significant lipids:", first_contrast) ) } else { message("Fewer than 5 significant lipids at these thresholds.") } ``` *** ## Step 8 - Lipid Expression `create_lipid_expression_barplot()` shows per-sample abundance of individual lipids, sorted by group. It returns a single ggplot object when one lipid is selected, or a named list of ggplot objects when multiple lipids are selected. ```{r lipid-expression, fig.cap="Per-sample abundance of the 3 most significant lipids. Samples are automatically sorted by group."} # Take the 3 most significantly changed lipids top_lipids <- utils::head( rownames(res_df)[!is.na(res_df$adj.P.Val)], 3 ) cat("Selected lipids:\n") print(top_lipids) expr_result <- create_lipid_expression_barplot( data_matrix = norm_data, metadata = data_obj$metadata, selected_lipids = top_lipids, group_column = "Sample Group", data_type = "normalized" ) # The function returns a single ggplot (1 lipid) or a named list (multiple) if (inherits(expr_result, "ggplot")) { print(expr_result) } else { for (lipid_plot in expr_result) print(lipid_plot) } ``` *** ## Step 9 - Enrichment Analysis Enrichment analysis tests whether lipids in a particular class are collectively more increased or decreased than expected by chance, using the fgsea algorithm. ```{r enrichment} enrich_res <- perform_enrichment_analysis( results_list = diff_res$results, classification_data = classification, min_set_size = 3, max_set_size = 500 ) cat("Contrasts with enrichment results:", length(enrich_res), "\n") cat("Categories per contrast :", paste(names(enrich_res[[1]]), collapse = ", "), "\n") ``` ```{r enrichment-table} # Show top enrichment results for the first contrast, by LipidGroup grp_enrich <- enrich_res[[first_contrast]][["LipidGroup"]] if (!is.null(grp_enrich) && nrow(grp_enrich) > 0) { grp_enrich <- grp_enrich[order(grp_enrich$pval), ] sig <- grp_enrich[grp_enrich$padj < 0.25, ] if (nrow(sig) > 0) { cat("Significantly enriched lipid groups (padj < 0.25) in:", first_contrast, "\n\n") out <- sig[, c("pathway", "NES", "pval", "padj", "size")] out$NES <- round(out$NES, 3) out$pval <- signif(out$pval, 3) out$padj <- signif(out$padj, 3) print(out, row.names = FALSE) } else { cat("No lipid groups reach padj < 0.25 in this contrast.\n") cat("Top 5 by p-value (", first_contrast, "):\n\n") out <- utils::head(grp_enrich[, c("pathway", "NES", "pval", "padj", "size")], 5) out$NES <- round(out$NES, 3) out$pval <- signif(out$pval, 3) out$padj <- signif(out$padj, 3) print(out, row.names = FALSE) } } else { cat("No enrichment results available.\n") } ``` ```{r enrichment-plot, fig.cap="Enrichment dot plot for the first contrast. Dot size reflects set size; colour reflects statistical significance."} if (!is.null(grp_enrich) && nrow(grp_enrich) > 0) { create_enrichment_dotplot( enrichment_data = grp_enrich, title = paste("Lipid Group Enrichment:", first_contrast), max_pathways = 10 ) } ``` *** ## Session Information ```{r session-info} sessionInfo() ```