---
title: "Decomposing pseudo-bulk DE into transcriptional and compositional
  components with scCompoundDE"
author:
  - name: "Subhadip Jana"
    email: "subhadipjana1409@gmail.com"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Decomposing pseudo-bulk DE with scCompoundDE}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    collapse  = TRUE,
    comment   = "#>",
    fig.width = 7,
    fig.height = 5,
    message   = TRUE,
    warning   = FALSE
)
```

# Introduction

## The hidden confound in pseudo-bulk DE

Multi-donor single-cell RNA-seq studies routinely use **pseudo-bulk
differential expression** to find disease or treatment-associated genes.
The standard approach aggregates counts per donor and fits a linear model
across donors. This is statistically sound — but it contains a critical
biological blind spot.

Consider T cells in a lupus dataset. Disease donors have mostly exhausted
T cells (high PDCD1, LAG3, TIGIT). Healthy donors have mostly naive T
cells (high IL7R, CCR7, SELL). A pseudo-bulk DE model will declare PDCD1
and IL7R as the top DE genes — and both would have small p-values and
large fold-changes. But:

> **Are these genes DE because T cells changed their transcriptome
> (transcriptional change)?**
>
> **Or because the relative abundance of exhausted vs naive T cells
> shifted between conditions (compositional change)?**

Standard tools cannot answer this question. They treat the pseudo-bulk
as a homogeneous entity. `scCompoundDE` decomposes every DE gene into
these two orthogonal components.

## Integration with Bioconductor workflows

`scCompoundDE` is designed to fit into standard Bioconductor single-cell
workflows without introducing a package-specific input object. The main
function, `compoundDE`, accepts a `SingleCellExperiment` object, reads raw
counts from an assay such as `assay(sce, "counts")`, and uses sample,
condition, broad cell type, and subtype annotations stored in `colData`.
This allows users to pass the same object through quality control,
annotation, pseudo-bulk differential expression, and decomposition steps.

A typical workflow is to perform batch-aware cell-level quality control
with `scBatchQC`, run donor-aware pseudo-bulk differential expression with
`scFastDE` or another Bioconductor DE workflow, and then use
`scCompoundDE` to decide whether the observed broad pseudo-bulk signal is
primarily transcriptional or compositional. The result is returned as a
`CDEResult` S4 object containing a `S4Vectors::DataFrame` of per-gene
statistics, a subtype proportion matrix, and per-subtype DE tables, so the
output can be filtered, plotted, or passed to downstream Bioconductor-style
analyses.

## The TC_ratio score

For each gene, `compoundDE` returns a `TC_ratio` in \[0, 1\]:

| TC_ratio | Interpretation |
|---|---|
| ≈ 1.0 | Transcriptional — cells intrinsically changed expression |
| ≈ 0.0 | Compositional — proportion of subtypes shifted |
| ≈ 0.5 | Mixed — both mechanisms contribute |

Genes with low TC_ratio should be treated with caution: they may be
DE in a standard analysis but reflect **composition** rather than
**biology**.

# Installation

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scCompoundDE")
```

# Simulated example: purely transcriptional signal

We first demonstrate the tool on a dataset where the DE signal is
deliberately **transcriptional**: the same proportions of TypeA and
TypeB cells appear in both conditions, but TypeA cells intrinsically
upregulate genes 1–10 in treatment.

```{r sim_transcriptional}
library(scCompoundDE)
library(SingleCellExperiment)

set.seed(42)
n_genes  <- 150
n_donors <- 6

counts <- matrix(rpois(n_genes * 240L, 8L),
                 nrow = n_genes, ncol = 240L)
rownames(counts) <- paste0("Gene", seq_len(n_genes))
colnames(counts) <- paste0("Cell", seq_len(240L))

# First 10 genes × 4 in TypeA treatment cells only
# TypeA cells: cols 61–120 in treatment
counts[seq_len(10), 61:120] <- counts[seq_len(10), 61:120] * 5L

sce <- SingleCellExperiment(assays = list(counts = counts))
sce$donor     <- rep(rep(paste0("D", seq_len(n_donors)), each = 10L), 4L)
sce$cell_type <- "T_cell"
sce$subtype   <- rep(rep(c("TypeA","TypeB"),
                         each = n_donors * 10L), 2L)
sce$condition <- rep(c("ctrl","treat"), each = 120L)

# Check the design
table(sce$subtype, sce$condition)
```

## Running compoundDE

```{r run_compoundDE}
result <- compoundDE(
    sce,
    broad_type   = "T_cell",
    subtype_col  = "subtype",
    broad_col    = "cell_type",
    donor        = "donor",
    condition    = "condition",
    min_cells    = 3L,
    min_subtypes = 2L,
    min_donors   = 2L
)
result
```

## Inspecting results

The `deTable` accessor returns one row per tested gene. In addition to the
broad pseudo-bulk DE statistics (`logFC`, `P.Value`, and `adj.P.Val`), it
contains the decomposition scores used by `scCompoundDE`: `T_score` for the
transcriptional component, `C_score` for the compositional component,
`TC_ratio` for their relative contribution, and `source` for the final
classification.

```{r inspect_results}
dt <- as.data.frame(deTable(result))
dt_sorted <- dt[order(dt$adj.P.Val), ]
head(dt_sorted[, c("gene","logFC","adj.P.Val",
                   "T_score","C_score","TC_ratio","source")], 15)
```

The injected genes (Gene1–Gene10) should have high TC_ratio values,
confirming they were detected as transcriptionally driven. In a real
analysis, genes classified as `transcriptional` are the strongest
candidates for cell-intrinsic biology, genes classified as `compositional`
should be interpreted as possible subtype-abundance effects, and genes
classified as `mixed` require inspection of both scores.

```{r tc_of_injected}
injected <- dt[dt$gene %in% paste0("Gene", seq_len(10)), ]
cat("Median TC_ratio of injected genes:",
    round(median(injected$TC_ratio, na.rm = TRUE), 3), "\n")
cat("All injected genes classified as transcriptional:",
    all(injected$source == "transcriptional", na.rm = TRUE), "\n")
```

## Visualizations

The plotting functions provide complementary diagnostics for the
decomposition. `plotDecomposition` shows whether significant genes are
driven mainly by the transcriptional or compositional axis, `plotTCRatio`
summarizes the distribution of gene classifications, and `plotProportion`
checks whether subtype proportions differ between conditions enough to
explain compositional signals.

```{r plot_decomp, fig.cap="Scatter of transcriptional vs compositional z-scores. Blue = transcriptional, red = compositional, orange = mixed. Injected genes appear with high T and low C scores."}
plotDecomposition(result, fdr_thresh = 0.05, top_n = 10)
```

In the transcriptional simulation, the top injected genes should have large
transcriptional scores and relatively small compositional scores. Points
near the compositional axis would instead indicate genes whose broad
pseudo-bulk fold-change is explained by subtype abundance shifts.

```{r plot_tc, fig.cap="Distribution of TC_ratio scores. Vertical lines show classification thresholds at 0.2 and 0.8."}
plotTCRatio(result, fdr_thresh = 0.05)
```

The TC_ratio histogram is useful as a quick summary of the experiment. A
distribution concentrated near 1 suggests that most significant genes are
cell-intrinsic, whereas a distribution concentrated near 0 suggests that
the broad DE result is strongly affected by composition.

```{r plot_prop, fig.cap="Subtype proportions per condition. Balanced design — equal TypeA and TypeB in both conditions."}
plotProportion(result)
```

The proportion plot should be read together with the decomposition plot. If
one subtype expands or contracts between conditions, genes expressed mainly
in that subtype are more likely to receive low TC_ratio values.

# Simulated example: purely compositional signal

Now we demonstrate the tool on a dataset where the **apparent** DE is
caused entirely by a subtype proportion shift. Gene1 is a marker of
TypeB cells. In control, most cells are TypeA (low Gene1). In
treatment, most cells become TypeB (high Gene1). **A standard DE tool
would call Gene1 as significantly upregulated — but it is a compositional
artefact.**

```{r sim_compositional}
set.seed(99)
n_genes <- 150

# Unequal proportions: ctrl = 75% TypeA, treat = 75% TypeB
n_a_ctrl  <- 72L;  n_b_ctrl  <- 24L
n_a_treat <- 24L;  n_b_treat <- 72L
n_cells   <- n_a_ctrl + n_b_ctrl + n_a_treat + n_b_treat

counts2 <- matrix(rpois(n_genes * n_cells, 5L),
                  nrow = n_genes, ncol = n_cells)
rownames(counts2) <- paste0("Gene", seq_len(n_genes))
colnames(counts2) <- paste0("Cell", seq_len(n_cells))

# Gene1 is a marker of TypeB only — no transcriptional change
typeB_idx <- c(
    seq(n_a_ctrl + 1L, n_a_ctrl + n_b_ctrl),
    seq(n_a_ctrl + n_b_ctrl + n_a_treat + 1L, n_cells)
)
counts2[1L, typeB_idx] <- counts2[1L, typeB_idx] * 8L

sce2 <- SingleCellExperiment(assays = list(counts = counts2))
sce2$donor <- c(
    rep(paste0("D", 1:6), each = 12L),
    rep(paste0("D", 1:6), each = 4L),
    rep(paste0("D", 1:6), each = 4L),
    rep(paste0("D", 1:6), each = 12L)
)
sce2$cell_type <- "T_cell"
sce2$subtype   <- c(rep("TypeA", n_a_ctrl), rep("TypeB", n_b_ctrl),
                    rep("TypeA", n_a_treat), rep("TypeB", n_b_treat))
sce2$condition <- c(rep("ctrl",  n_a_ctrl + n_b_ctrl),
                    rep("treat", n_a_treat + n_b_treat))

table(sce2$subtype, sce2$condition)
```

```{r run_compositional}
result2 <- compoundDE(
    sce2,
    broad_type   = "T_cell",
    subtype_col  = "subtype",
    broad_col    = "cell_type",
    donor        = "donor",
    condition    = "condition",
    min_cells    = 3L,
    min_subtypes = 2L,
    min_donors   = 2L
)
result2
```

```{r compositional_check}
dt2 <- as.data.frame(deTable(result2))
gene1 <- dt2[dt2$gene == "Gene1", c("logFC","adj.P.Val","TC_ratio","source")]
cat("Gene1 (compositional marker):\n")
print(gene1)
```

Gene1 will have a significant p-value (broad DE is strong) but a low
TC_ratio, correctly identifying it as a compositional artefact. This is the
case where `scCompoundDE` is most useful: the broad pseudo-bulk test
detects a real difference between conditions, but the decomposition shows
that the difference is caused by the changing mixture of subtypes rather
than a cell-intrinsic expression shift.

```{r plot_comp_decomp, fig.cap="With compositional signal, Gene1 sits at high C_score / low T_score."}
plotDecomposition(result2, fdr_thresh = 0.05, top_n = 5)
```

Here, Gene1 should move toward the compositional side of the plot. The
interpretation is different from an ordinary DE table: a small adjusted
p-value alone is not enough to call the gene transcriptionally altered.

```{r plot_comp_prop, fig.cap="Clear proportion shift: TypeB increases dramatically in treatment."}
plotProportion(result2)
```

The subtype proportion plot provides the biological reason for the low
TC_ratio. TypeB cells are more common in treatment, and Gene1 is a TypeB
marker, so the broad pseudo-bulk signal is explained by composition.

# Filtering by source

After running `compoundDE`, use `filterGenesBySource` to extract gene
lists by classification:

```{r filter_genes}
# Genes with transcriptionally-driven DE (real biology)
transcriptional_genes <- filterGenesBySource(result,
    source     = "transcriptional",
    fdr_thresh = 0.05)
cat("Transcriptional genes (FDR < 0.05):", nrow(transcriptional_genes), "\n")
head(transcriptional_genes[, c("gene","logFC","adj.P.Val","TC_ratio")])

# Genes that are artefactual (compositional)
compositional_genes <- filterGenesBySource(result,
    source     = "compositional",
    fdr_thresh = 1)    # include all, regardless of significance
cat("\nCompositional genes (any FDR):", nrow(compositional_genes), "\n")
```

# Recommended workflow

The same pattern applies to real Bioconductor workflows. `scBatchQC` can be
used first to add batch-aware QC calls to `colData`, `scFastDE` can then
perform fast donor-weighted pseudo-bulk DE on the cleaned
`SingleCellExperiment`, and `scCompoundDE` can decompose the resulting
broad signal for a selected cell population. Users may also substitute
other Bioconductor QC, annotation, or pseudo-bulk tools, provided the input
object contains counts and the required sample annotations.

```{r workflow, eval=FALSE}
# Step 1: QC with scBatchQC
library(scBatchQC)
sce <- batchAwareQCMetrics(sce, batch = "batch")
sce_clean <- sce[, !sce$scBatchQC_outlier]

# Step 2: Fast DE with scFastDE (or any pseudo-bulk tool)
library(scFastDE)
de_result <- fastDE(sce_clean,
                    donor       = "donor",
                    cell_type   = "cell_type",
                    condition   = "condition",
                    target_type = "T_cell")

# Step 3: Decompose with scCompoundDE
cde_result <- compoundDE(
    sce_clean,
    broad_type  = "T_cell",
    subtype_col = "cell_subtype",
    broad_col   = "cell_type",
    donor       = "donor",
    condition   = "condition"
)

# Step 4: Retain only transcriptionally-driven genes
real_biology <- filterGenesBySource(cde_result,
    source = "transcriptional", fdr_thresh = 0.05)
```

# Session information

```{r session}
sessionInfo()
```