--- title: "LIPIDIFy: Comprehensive Lipidomics Data Analysis" author: "Your Name" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_float: true toc_depth: 3 number_sections: true code_folding: show vignette: > %\VignetteIndexEntry{LIPIDIFy: Comprehensive Lipidomics Data Analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 6, warning = FALSE, message = FALSE ) ``` # Introduction **LIPIDIFy** is a comprehensive R package designed for the analysis of lipidomics data. It provides tools for: - Data preprocessing and quality control - Multiple normalization strategies - Differential abundance analysis (limma and EdgeR) - Enrichment analysis with flexible classification systems - Extensive visualization capabilities - Interactive Shiny interface for non-programmers ## Installation ```{r installation, eval=FALSE} # From BioConductor (once submitted) if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("LIPIDIFy") # Or from GitHub # devtools::install_github("yourusername/LIPIDIFy") ``` ```{r load_package} library(LIPIDIFy) ``` # Quick Start ## Using the Shiny Interface For biologists and those preferring a graphical interface: ```{r launch_shiny, eval=FALSE} launch_lipidomics_app() ``` This launches an interactive web application with tabs for each analysis step. ## Using R Functions For bioinformaticians and reproducible analyses: ```{r example_workflow, eval=FALSE} # Load data data <- load_lipidomics_data("path/to/your/data.csv") # Classify lipids classification <- get_lipid_classification(colnames(data$numeric_data)) # Normalize normalized <- normalize_lipidomics_data(data$numeric_data, methods = c("TIC", "Log2")) # Differential analysis diff_results <- perform_differential_analysis(normalized, data$metadata, method = "limma") # Enrichment analysis enrichment <- perform_enrichment_analysis(diff_results$results, classification) ``` # Data Input and Format ## Required Data Structure Your input CSV file should have: 1. **Metadata columns** (first few columns): - `Sample Name` (required) - `Sample Group` (required) - Additional metadata (optional: Tumour ID, Weight, etc.) 2. **Lipid abundance columns** (remaining columns): - Column names = Lipid names (e.g., "PC 16:0_18:1") - Values = Abundance/intensity measurements ## Example Data The package includes example data: ```{r generate_example} example_data <- generate_example_data() head(example_data[, 1:8]) ``` This generates synthetic data with 4 groups (GroupA-D) and realistic lipid names. ## Loading Your Own Data ```{r load_data, eval=FALSE} # Load from CSV file my_data <- load_lipidomics_data("path/to/mydata.csv", metadata_columns = c("Sample Name", "Sample Group", "Treatment")) # Components returned: # $data - full data frame # $metadata - metadata only # $numeric_data - lipid abundances only ``` # Lipid Classification ## Automatic Classification Lipids are automatically classified based on their names: ```{r classification} # Using example data lipid_names <- colnames(example_data)[5:ncol(example_data)] classification <- get_lipid_classification(lipid_names[1:10]) print(classification) ``` Classification includes: - **LipidGroup**: Main category (Glycerophospholipids, Sphingolipids, etc.) - **LipidType**: Specific class (Phosphatidylcholine, Ceramide, etc.) - **Saturation**: Fatty acid saturation (SFA, MUFA, PUFA) ## Custom Classification You can provide your own classification with flexible hierarchy: ```{r custom_classification, eval=FALSE} # Create custom classification CSV: # Lipid, MainClass, SubClass, Function # PC 16:0_18:1, Phospholipids, Choline-containing, Membrane # ... custom_class <- load_custom_classification("my_classification.csv") ``` ## Export Classification ```{r export_classification, eval=FALSE} # Save classification for later use or sharing export_classification(classification, "lipid_classification.csv") ``` # Data Visualization ## Raw Data Exploration ```{r raw_viz, eval=FALSE} # Sample-wise boxplot plot <- visualize_raw_data_improved(list(numeric_data = example_data[, 5:ncol(example_data)]), plot_type = "boxplot", view_mode = "sample") print(plot) # Lipid-wise distribution for top variable lipids plot2 <- visualize_raw_data_improved(list(numeric_data = example_data[, 5:ncol(example_data)]), plot_type = "boxplot", view_mode = "lipid", top_n = 30) print(plot2) ``` ## Individual Lipid Expression Visualize specific lipids across samples: ```{r lipid_expression, eval=FALSE} # Prepare data data_matrix <- as.matrix(example_data[, 5:ncol(example_data)]) rownames(data_matrix) <- example_data$`Sample Name` metadata <- example_data[, 1:4] # Create barplot for selected lipids plots <- create_lipid_expression_barplot( data_matrix = data_matrix, metadata = metadata, selected_lipids = c("PC 16:0_18:1", "PE 18:0_20:4"), data_type = "raw" ) # Display plots if (is.list(plots)) { print(plots[[1]]) print(plots[[2]]) } else { print(plots) } ``` # Normalization # Missing Value Imputation Missing values in lipidomics data arise when a lipid's signal falls below the instrument's limit of detection. LIPIDIFy provides six imputation strategies via `impute_missing_values()`. Imputing **after normalisation** ensures that imputed values are on the same scale as observed values. ## Available Methods ```{r imputation_methods} print(get_imputation_methods()) ``` ```{r imputation_apply, eval=FALSE} # Check missingness in normalised data normalized <- normalize_lipidomics_data(data$numeric_data, methods = c("TIC", "Log2")) cat("Missing values:", sum(is.na(normalized)), "\n") # Half-minimum imputation on the normalised matrix imputed <- impute_missing_values(normalized, method = "half_min") # KNN imputation (requires Bioconductor impute package) # imputed <- impute_missing_values(normalized, method = "knn", k = 5) ``` | Method | Best for | |--------|----------| | `half_min` | MS lipidomics (models LOD on normalised scale) -- recommended default | | `knn` | Data missing at random; requires `impute` Bioconductor package | | `zero` | Only when absence of signal truly means zero | | `mean` / `median` | Small proportions of missing values | # Batch Effect Correction Technical batch effects arise from differences in sample preparation or instrument performance across analytical runs. `correct_batch_effects()` removes these effects while preserving biological signal. ```{r batch_correction_apply, eval=FALSE} # Assumes your metadata has a "Batch" column corrected <- correct_batch_effects( data_matrix = normalized, metadata = data$metadata, batch_column = "Batch", group_column = "Sample Group", method = "limma" # or "combat" (requires sva package) ) # Verify with PCA: samples should cluster by biology, not batch pca_post <- perform_pca(corrected, data$metadata, "Sample Group") create_pca_plot_with_ellipses(pca_post$pca_data, pca_post$variance_explained, ellipse_type = "confidence", title = "PCA after Batch Correction") ``` Multiple normalization methods are available: ```{r normalization} # Get list of available methods methods <- get_normalization_methods() print(methods) # Apply normalization pipeline normalized <- normalize_lipidomics_data( data = as.matrix(example_data[, 5:ncol(example_data)]), methods = c("TIC", "Log2") ) # Compare different pipelines pipeline1 <- normalize_lipidomics_data(example_data[, 5:ncol(example_data)], c("TIC", "Log2")) pipeline2 <- normalize_lipidomics_data(example_data[, 5:ncol(example_data)], c("PQN", "Log2")) ``` ## Normalization Methods Available - **TIC**: Total Ion Current - **PQN**: Probabilistic Quotient Normalization - **Quantile**: Quantile normalization - **Log2Median**: Log2 transformation followed by median centering - **Median/Mean**: Centering methods - **Log2/Log10**: Log transformations - **Sqrt**: Square root transformation # Principal Component Analysis PCA with optional confidence ellipses: ```{r pca, eval=FALSE} # Prepare data metadata <- example_data[, 1:4] data_matrix <- t(as.matrix(example_data[, 5:ncol(example_data)])) # Perform PCA pca_result <- perform_pca(data_matrix, metadata, group_column = "Sample Group") # Plot without ellipses plot1 <- create_pca_plot_with_ellipses( pca_data = pca_result$pca_data, variance_explained = pca_result$variance_explained, ellipse_type = "none" ) # Plot with confidence ellipses (95%) plot2 <- create_pca_plot_with_ellipses( pca_data = pca_result$pca_data, variance_explained = pca_result$variance_explained, ellipse_type = "confidence", confidence_level = 0.95 ) # Plot with visual circles plot3 <- create_pca_plot_with_ellipses( pca_data = pca_result$pca_data, variance_explained = pca_result$variance_explained, ellipse_type = "visual" ) ``` **Ellipse Types:** - **none**: Raw points only - **confidence**: Statistical confidence regions (default 95%) - **visual**: Simple visual circles to aid group identification # Differential Abundance Analysis ## Using limma ```{r differential_limma, eval=FALSE} # Perform differential analysis with limma diff_results_limma <- perform_differential_analysis( data_matrix = normalized, metadata = metadata, group_column = "Sample Group", contrasts_list = c("GroupB - GroupA", "GroupC - GroupA"), method = "limma" ) # View results for first contrast head(diff_results_limma$results[[1]]) ``` ## Using EdgeR ```{r differential_edger, eval=FALSE} # Perform differential analysis with EdgeR diff_results_edger <- perform_differential_analysis( data_matrix = normalized, metadata = metadata, group_column = "Sample Group", contrasts_list = c("GroupB - GroupA", "GroupC - GroupA"), method = "edger" ) # View results head(diff_results_edger$results[[1]]) ``` ## Comparing Methods Both methods provide similar outputs: - `logFC`: Log fold change - `P.Value`: Raw p-value - `adj.P.Val`: Adjusted p-value (FDR) - `AveExpr`: Average expression **When to use which:** - **limma**: Default choice for lipidomics, fast and robust - **EdgeR**: Alternative method, originally for count data but adaptable ## Contrast Selection Specify only the contrasts of interest: ```{r contrasts} # Create all pairwise contrasts groups <- c("GroupA", "GroupB", "GroupC", "GroupD") all_contrasts <- create_default_contrasts(groups) print(all_contrasts) # Or specify custom contrasts custom_contrasts <- c("GroupB - GroupA", "GroupC - GroupA", "GroupD - GroupA") ``` # Visualization of Results ## Volcano Plots ```{r volcano, eval=FALSE} # Create volcano plot volcano <- create_volcano_plot_labeled( results = diff_results_limma$results[[1]], title = "GroupB vs GroupA", top_labels = 10, pval_threshold = 0.05, logfc_threshold = 1 ) print(volcano) ``` ## Heatmaps ```{r heatmap, eval=FALSE} # create_heatmap_robust expects features as rows, samples as columns heatmap <- create_heatmap_robust( data_matrix = t(normalized[, sig_lipids, drop = FALSE]), metadata = metadata, group_column = "Sample Group", top_n = length(sig_lipids), title = "Significant lipids" ) ``` # Enrichment Analysis ## Standard Enrichment Using automatic lipid classification: ```{r enrichment, eval=FALSE} enrichment_results <- perform_enrichment_analysis( results_list = diff_results_limma$results, classification_data = classification, min_set_size = 5, max_set_size = 500 ) # View results for first contrast, lipid group enrichment head(enrichment_results[[1]]$LipidGroup) ``` ## Custom Enrichment Sets Define your own functional sets: ```{r custom_enrichment, eval=FALSE} # Create custom sets CSV: # Lipid,Set_Name # PC 16:0_18:1,Membrane_Lipids # PE 18:0_20:4,Membrane_Lipids # TG 16:0_18:1_20:4,Storage_Lipids # ... custom_sets <- load_custom_enrichment_sets("my_sets.csv") # Run enrichment with custom sets enrichment_custom <- perform_enrichment_analysis( results_list = diff_results_limma$results, classification_data = classification, custom_sets = custom_sets ) ``` ## Enrichment Visualization ```{r enrichment_viz, eval=FALSE} # Dotplot dotplot <- create_enrichment_dotplot( enrichment_data = enrichment_results[[1]]$LipidGroup, title = "Lipid Group Enrichment" ) # Barplot barplot <- create_enrichment_barplot( enrichment_data = enrichment_results[[1]]$LipidGroup, title = "Lipid Group Enrichment", top_n = 15 ) ``` # Complete Analysis Workflow Here's a complete example workflow: ```{r complete_workflow, eval=FALSE} # 1. Load data data <- load_lipidomics_data("your_data.csv") # 2. Classify lipids classification <- get_lipid_classification(colnames(data$numeric_data)) export_classification(classification, "classification.csv") # 3. Visualize raw data raw_plot <- visualize_raw_data_improved(data, plot_type = "boxplot") # 4. Normalize normalized <- normalize_lipidomics_data(data$numeric_data, methods = c("TIC", "Log2")) # 5. PCA pca_result <- perform_pca(t(normalized), data$metadata) pca_plot <- create_pca_plot_with_ellipses(pca_result$pca_data, pca_result$variance_explained, ellipse_type = "confidence") # 6. Differential analysis diff_results <- perform_differential_analysis( normalized, data$metadata, contrasts_list = c("Treatment - Control"), method = "limma" ) # 7. Volcano plot volcano <- create_volcano_plot_labeled(diff_results$results[[1]]) # 8. Enrichment enrichment <- perform_enrichment_analysis(diff_results$results, classification) # 9. Enrichment visualization enrich_plot <- create_enrichment_dotplot(enrichment[[1]]$LipidGroup) # 10. Save results write.csv(diff_results$results[[1]], "differential_results.csv") write.csv(enrichment[[1]]$LipidGroup, "enrichment_results.csv") ``` # Advanced Topics ## Batch Effect Correction If you have batch information, incorporate it into your design: ```{r batch_correction, eval=FALSE} # Add batch to metadata data$metadata$Batch <- rep(c("Batch1", "Batch2"), each = 10) # Include in differential analysis design # This requires custom design matrix - see limma documentation ``` ## Integration with Other Tools Export results for use in other software: ```{r export, eval=FALSE} # Export for MetaboAnalyst write.csv(normalized, "normalized_for_metaboanalyst.csv") # Export for Cytoscape write.csv(enrichment[[1]]$LipidGroup, "enrichment_for_cytoscape.csv") ``` ## Quality Control ```{r qc, eval=FALSE} # Check for outliers in PCA pca_result <- perform_pca(t(normalized), data$metadata) # Samples with PC1 > 3 SD might be outliers outliers <- pca_result$pca_data[abs(pca_result$pca_data$PC1) > 3 * sd(pca_result$pca_data$PC1), ] print(outliers) # Check coefficient of variation cv <- apply(data$numeric_data, 2, function(x) sd(x) / mean(x) * 100) high_cv_lipids <- names(cv)[cv > 50] ``` # Session Information ```{r session_info} sessionInfo() ``` # References - Ritchie ME, et al. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. - Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. - Korotkevich G, et al. (2021). Fast gene set enrichment analysis. bioRxiv. # Appendix: Supported Lipid Nomenclature The package recognizes standard lipid nomenclature: - **Glycerophospholipids**: PC, PE, PI, PS, PG, PA - **Lysophospholipids**: LPC, LPE, LPI, LPS, LPG, LPA - **Sphingolipids**: SM, Cer, dhCer, HexCer, Gangliosides - **Glycerolipids**: DG, TG, MG - **Sterol Lipids**: CE, Cholesterol - **Acylcarnitines**: CAR Format examples: - `PC 16:0_18:1` (individual fatty acids) - `PC 34:1` (total carbon:double bonds) - `LPC 18:2` (single fatty acid) - `TG 16:0_18:1_20:4` (three fatty acids)