---
title: "LIPIDIFy: Comprehensive Lipidomics Data Analysis"
author: "Your Name"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    toc_depth: 3
    number_sections: true
    code_folding: show
vignette: >
  %\VignetteIndexEntry{LIPIDIFy: Comprehensive Lipidomics Data Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  warning = FALSE,
  message = FALSE
)
```

# Introduction

**LIPIDIFy** is a comprehensive R package designed for the analysis of lipidomics data. It provides tools for:

- Data preprocessing and quality control
- Multiple normalization strategies
- Differential abundance analysis (limma and EdgeR)
- Enrichment analysis with flexible classification systems
- Extensive visualization capabilities
- Interactive Shiny interface for non-programmers

## Installation

```{r installation, eval=FALSE}
# From BioConductor (once submitted)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("LIPIDIFy")

# Or from GitHub
# devtools::install_github("yourusername/LIPIDIFy")
```

```{r load_package}
library(LIPIDIFy)
```

# Quick Start

## Using the Shiny Interface

For biologists and those preferring a graphical interface:

```{r launch_shiny, eval=FALSE}
launch_lipidomics_app()
```

This launches an interactive web application with tabs for each analysis step.

## Using R Functions

For bioinformaticians and reproducible analyses:

```{r example_workflow, eval=FALSE}
# Load data
data <- load_lipidomics_data("path/to/your/data.csv")

# Classify lipids
classification <- get_lipid_classification(colnames(data$numeric_data))

# Normalize
normalized <- normalize_lipidomics_data(data$numeric_data, 
                                        methods = c("TIC", "Log2"))

# Differential analysis
diff_results <- perform_differential_analysis(normalized, data$metadata,
                                              method = "limma")

# Enrichment analysis
enrichment <- perform_enrichment_analysis(diff_results$results, 
                                          classification)
```

# Data Input and Format

## Required Data Structure

Your input CSV file should have:

1. **Metadata columns** (first few columns):
   - `Sample Name` (required)
   - `Sample Group` (required)
   - Additional metadata (optional: Tumour ID, Weight, etc.)

2. **Lipid abundance columns** (remaining columns):
   - Column names = Lipid names (e.g., "PC 16:0_18:1")
   - Values = Abundance/intensity measurements

## Example Data

The package includes example data:

```{r generate_example}
example_data <- generate_example_data()
head(example_data[, 1:8])
```

This generates synthetic data with 4 groups (GroupA-D) and realistic lipid names.

## Loading Your Own Data

```{r load_data, eval=FALSE}
# Load from CSV file
my_data <- load_lipidomics_data("path/to/mydata.csv",
                                metadata_columns = c("Sample Name", 
                                                     "Sample Group",
                                                     "Treatment"))

# Components returned:
# $data - full data frame
# $metadata - metadata only
# $numeric_data - lipid abundances only
```

# Lipid Classification

## Automatic Classification

Lipids are automatically classified based on their names:

```{r classification}
# Using example data
lipid_names <- colnames(example_data)[5:ncol(example_data)]
classification <- get_lipid_classification(lipid_names[1:10])
print(classification)
```

Classification includes:
- **LipidGroup**: Main category (Glycerophospholipids, Sphingolipids, etc.)
- **LipidType**: Specific class (Phosphatidylcholine, Ceramide, etc.)
- **Saturation**: Fatty acid saturation (SFA, MUFA, PUFA)

## Custom Classification

You can provide your own classification with flexible hierarchy:

```{r custom_classification, eval=FALSE}
# Create custom classification CSV:
# Lipid, MainClass, SubClass, Function
# PC 16:0_18:1, Phospholipids, Choline-containing, Membrane
# ...

custom_class <- load_custom_classification("my_classification.csv")
```

## Export Classification

```{r export_classification, eval=FALSE}
# Save classification for later use or sharing
export_classification(classification, "lipid_classification.csv")
```

# Data Visualization

## Raw Data Exploration

```{r raw_viz, eval=FALSE}
# Sample-wise boxplot
plot <- visualize_raw_data_improved(list(numeric_data = example_data[, 5:ncol(example_data)]),
                                    plot_type = "boxplot",
                                    view_mode = "sample")
print(plot)

# Lipid-wise distribution for top variable lipids
plot2 <- visualize_raw_data_improved(list(numeric_data = example_data[, 5:ncol(example_data)]),
                                     plot_type = "boxplot",
                                     view_mode = "lipid",
                                     top_n = 30)
print(plot2)
```

## Individual Lipid Expression

Visualize specific lipids across samples:

```{r lipid_expression, eval=FALSE}
# Prepare data
data_matrix <- as.matrix(example_data[, 5:ncol(example_data)])
rownames(data_matrix) <- example_data$`Sample Name`
metadata <- example_data[, 1:4]

# Create barplot for selected lipids
plots <- create_lipid_expression_barplot(
  data_matrix = data_matrix,
  metadata = metadata,
  selected_lipids = c("PC 16:0_18:1", "PE 18:0_20:4"),
  data_type = "raw"
)

# Display plots
if (is.list(plots)) {
  print(plots[[1]])
  print(plots[[2]])
} else {
  print(plots)
}
```

# Normalization

# Missing Value Imputation

Missing values in lipidomics data arise when a lipid's signal falls below
the instrument's limit of detection. LIPIDIFy provides six imputation
strategies via `impute_missing_values()`. Imputing **after normalisation**
ensures that imputed values are on the same scale as observed values.

## Available Methods

```{r imputation_methods}
print(get_imputation_methods())
```

```{r imputation_apply, eval=FALSE}
# Check missingness in normalised data
normalized <- normalize_lipidomics_data(data$numeric_data, methods = c("TIC", "Log2"))
cat("Missing values:", sum(is.na(normalized)), "\n")

# Half-minimum imputation on the normalised matrix
imputed <- impute_missing_values(normalized, method = "half_min")

# KNN imputation (requires Bioconductor impute package)
# imputed <- impute_missing_values(normalized, method = "knn", k = 5)
```

| Method | Best for |
|--------|----------|
| `half_min` | MS lipidomics (models LOD on normalised scale) -- recommended default |
| `knn` | Data missing at random; requires `impute` Bioconductor package |
| `zero` | Only when absence of signal truly means zero |
| `mean` / `median` | Small proportions of missing values |

# Batch Effect Correction

Technical batch effects arise from differences in sample preparation or
instrument performance across analytical runs. `correct_batch_effects()`
removes these effects while preserving biological signal.

```{r batch_correction_apply, eval=FALSE}
# Assumes your metadata has a "Batch" column
corrected <- correct_batch_effects(
  data_matrix  = normalized,
  metadata     = data$metadata,
  batch_column = "Batch",
  group_column = "Sample Group",
  method       = "limma"           # or "combat" (requires sva package)
)

# Verify with PCA: samples should cluster by biology, not batch
pca_post <- perform_pca(corrected, data$metadata, "Sample Group")
create_pca_plot_with_ellipses(pca_post$pca_data, pca_post$variance_explained,
                              ellipse_type = "confidence",
                              title = "PCA after Batch Correction")
```

Multiple normalization methods are available:

```{r normalization}
# Get list of available methods
methods <- get_normalization_methods()
print(methods)

# Apply normalization pipeline
normalized <- normalize_lipidomics_data(
  data = as.matrix(example_data[, 5:ncol(example_data)]),
  methods = c("TIC", "Log2")
)

# Compare different pipelines
pipeline1 <- normalize_lipidomics_data(example_data[, 5:ncol(example_data)], 
                                       c("TIC", "Log2"))
pipeline2 <- normalize_lipidomics_data(example_data[, 5:ncol(example_data)], 
                                       c("PQN", "Log2"))
```

## Normalization Methods Available

- **TIC**: Total Ion Current
- **PQN**: Probabilistic Quotient Normalization
- **Quantile**: Quantile normalization
- **Log2Median**: Log2 transformation followed by median centering
- **Median/Mean**: Centering methods
- **Log2/Log10**: Log transformations
- **Sqrt**: Square root transformation

# Principal Component Analysis

PCA with optional confidence ellipses:

```{r pca, eval=FALSE}
# Prepare data
metadata <- example_data[, 1:4]
data_matrix <- t(as.matrix(example_data[, 5:ncol(example_data)]))

# Perform PCA
pca_result <- perform_pca(data_matrix, metadata, group_column = "Sample Group")

# Plot without ellipses
plot1 <- create_pca_plot_with_ellipses(
  pca_data = pca_result$pca_data,
  variance_explained = pca_result$variance_explained,
  ellipse_type = "none"
)

# Plot with confidence ellipses (95%)
plot2 <- create_pca_plot_with_ellipses(
  pca_data = pca_result$pca_data,
  variance_explained = pca_result$variance_explained,
  ellipse_type = "confidence",
  confidence_level = 0.95
)

# Plot with visual circles
plot3 <- create_pca_plot_with_ellipses(
  pca_data = pca_result$pca_data,
  variance_explained = pca_result$variance_explained,
  ellipse_type = "visual"
)
```

**Ellipse Types:**
- **none**: Raw points only
- **confidence**: Statistical confidence regions (default 95%)
- **visual**: Simple visual circles to aid group identification

# Differential Abundance Analysis

## Using limma

```{r differential_limma, eval=FALSE}
# Perform differential analysis with limma
diff_results_limma <- perform_differential_analysis(
  data_matrix = normalized,
  metadata = metadata,
  group_column = "Sample Group",
  contrasts_list = c("GroupB - GroupA", "GroupC - GroupA"),
  method = "limma"
)

# View results for first contrast
head(diff_results_limma$results[[1]])
```

## Using EdgeR

```{r differential_edger, eval=FALSE}
# Perform differential analysis with EdgeR
diff_results_edger <- perform_differential_analysis(
  data_matrix = normalized,
  metadata = metadata,
  group_column = "Sample Group",
  contrasts_list = c("GroupB - GroupA", "GroupC - GroupA"),
  method = "edger"
)

# View results
head(diff_results_edger$results[[1]])
```

## Comparing Methods

Both methods provide similar outputs:
- `logFC`: Log fold change
- `P.Value`: Raw p-value
- `adj.P.Val`: Adjusted p-value (FDR)
- `AveExpr`: Average expression

**When to use which:**
- **limma**: Default choice for lipidomics, fast and robust
- **EdgeR**: Alternative method, originally for count data but adaptable

## Contrast Selection

Specify only the contrasts of interest:

```{r contrasts}
# Create all pairwise contrasts
groups <- c("GroupA", "GroupB", "GroupC", "GroupD")
all_contrasts <- create_default_contrasts(groups)
print(all_contrasts)

# Or specify custom contrasts
custom_contrasts <- c("GroupB - GroupA", 
                     "GroupC - GroupA", 
                     "GroupD - GroupA")
```

# Visualization of Results

## Volcano Plots

```{r volcano, eval=FALSE}
# Create volcano plot
volcano <- create_volcano_plot_labeled(
  results = diff_results_limma$results[[1]],
  title = "GroupB vs GroupA",
  top_labels = 10,
  pval_threshold = 0.05,
  logfc_threshold = 1
)
print(volcano)
```

## Heatmaps

```{r heatmap, eval=FALSE}
# create_heatmap_robust expects features as rows, samples as columns
heatmap <- create_heatmap_robust(
  data_matrix  = t(normalized[, sig_lipids, drop = FALSE]),
  metadata     = metadata,
  group_column = "Sample Group",
  top_n        = length(sig_lipids),
  title        = "Significant lipids"
)
```

# Enrichment Analysis

## Standard Enrichment

Using automatic lipid classification:

```{r enrichment, eval=FALSE}
enrichment_results <- perform_enrichment_analysis(
  results_list = diff_results_limma$results,
  classification_data = classification,
  min_set_size = 5,
  max_set_size = 500
)

# View results for first contrast, lipid group enrichment
head(enrichment_results[[1]]$LipidGroup)
```

## Custom Enrichment Sets

Define your own functional sets:

```{r custom_enrichment, eval=FALSE}
# Create custom sets CSV:
# Lipid,Set_Name
# PC 16:0_18:1,Membrane_Lipids
# PE 18:0_20:4,Membrane_Lipids
# TG 16:0_18:1_20:4,Storage_Lipids
# ...

custom_sets <- load_custom_enrichment_sets("my_sets.csv")

# Run enrichment with custom sets
enrichment_custom <- perform_enrichment_analysis(
  results_list = diff_results_limma$results,
  classification_data = classification,
  custom_sets = custom_sets
)
```

## Enrichment Visualization

```{r enrichment_viz, eval=FALSE}
# Dotplot
dotplot <- create_enrichment_dotplot(
  enrichment_data = enrichment_results[[1]]$LipidGroup,
  title = "Lipid Group Enrichment"
)

# Barplot
barplot <- create_enrichment_barplot(
  enrichment_data = enrichment_results[[1]]$LipidGroup,
  title = "Lipid Group Enrichment",
  top_n = 15
)
```

# Complete Analysis Workflow

Here's a complete example workflow:

```{r complete_workflow, eval=FALSE}
# 1. Load data
data <- load_lipidomics_data("your_data.csv")

# 2. Classify lipids
classification <- get_lipid_classification(colnames(data$numeric_data))
export_classification(classification, "classification.csv")

# 3. Visualize raw data
raw_plot <- visualize_raw_data_improved(data, plot_type = "boxplot")

# 4. Normalize
normalized <- normalize_lipidomics_data(data$numeric_data, 
                                        methods = c("TIC", "Log2"))

# 5. PCA
pca_result <- perform_pca(t(normalized), data$metadata)
pca_plot <- create_pca_plot_with_ellipses(pca_result$pca_data,
                                          pca_result$variance_explained,
                                          ellipse_type = "confidence")

# 6. Differential analysis
diff_results <- perform_differential_analysis(
  normalized, data$metadata,
  contrasts_list = c("Treatment - Control"),
  method = "limma"
)

# 7. Volcano plot
volcano <- create_volcano_plot_labeled(diff_results$results[[1]])

# 8. Enrichment
enrichment <- perform_enrichment_analysis(diff_results$results, 
                                          classification)

# 9. Enrichment visualization
enrich_plot <- create_enrichment_dotplot(enrichment[[1]]$LipidGroup)

# 10. Save results
write.csv(diff_results$results[[1]], "differential_results.csv")
write.csv(enrichment[[1]]$LipidGroup, "enrichment_results.csv")
```

# Advanced Topics

## Batch Effect Correction

If you have batch information, incorporate it into your design:

```{r batch_correction, eval=FALSE}
# Add batch to metadata
data$metadata$Batch <- rep(c("Batch1", "Batch2"), each = 10)

# Include in differential analysis design
# This requires custom design matrix - see limma documentation
```

## Integration with Other Tools

Export results for use in other software:

```{r export, eval=FALSE}
# Export for MetaboAnalyst
write.csv(normalized, "normalized_for_metaboanalyst.csv")

# Export for Cytoscape
write.csv(enrichment[[1]]$LipidGroup, "enrichment_for_cytoscape.csv")
```

## Quality Control

```{r qc, eval=FALSE}
# Check for outliers in PCA
pca_result <- perform_pca(t(normalized), data$metadata)

# Samples with PC1 > 3 SD might be outliers
outliers <- pca_result$pca_data[abs(pca_result$pca_data$PC1) > 
                                3 * sd(pca_result$pca_data$PC1), ]
print(outliers)

# Check coefficient of variation
cv <- apply(data$numeric_data, 2, function(x) sd(x) / mean(x) * 100)
high_cv_lipids <- names(cv)[cv > 50]
```

# Session Information

```{r session_info}
sessionInfo()
```

# References

- Ritchie ME, et al. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research.
- Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics.
- Korotkevich G, et al. (2021). Fast gene set enrichment analysis. bioRxiv.

# Appendix: Supported Lipid Nomenclature

The package recognizes standard lipid nomenclature:

- **Glycerophospholipids**: PC, PE, PI, PS, PG, PA
- **Lysophospholipids**: LPC, LPE, LPI, LPS, LPG, LPA
- **Sphingolipids**: SM, Cer, dhCer, HexCer, Gangliosides
- **Glycerolipids**: DG, TG, MG
- **Sterol Lipids**: CE, Cholesterol
- **Acylcarnitines**: CAR

Format examples:
- `PC 16:0_18:1` (individual fatty acids)
- `PC 34:1` (total carbon:double bonds)
- `LPC 18:2` (single fatty acid)
- `TG 16:0_18:1_20:4` (three fatty acids)