--- title: "Decomposing pseudo-bulk DE into transcriptional and compositional components with scCompoundDE" author: - name: "Subhadip Jana" email: "subhadipjana1409@gmail.com" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_float: true number_sections: true vignette: > %\VignetteIndexEntry{Decomposing pseudo-bulk DE with scCompoundDE} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, message = TRUE, warning = FALSE ) ``` # Introduction ## The hidden confound in pseudo-bulk DE Multi-donor single-cell RNA-seq studies routinely use **pseudo-bulk differential expression** to find disease or treatment-associated genes. The standard approach aggregates counts per donor and fits a linear model across donors. This is statistically sound — but it contains a critical biological blind spot. Consider T cells in a lupus dataset. Disease donors have mostly exhausted T cells (high PDCD1, LAG3, TIGIT). Healthy donors have mostly naive T cells (high IL7R, CCR7, SELL). A pseudo-bulk DE model will declare PDCD1 and IL7R as the top DE genes — and both would have small p-values and large fold-changes. But: > **Are these genes DE because T cells changed their transcriptome > (transcriptional change)?** > > **Or because the relative abundance of exhausted vs naive T cells > shifted between conditions (compositional change)?** Standard tools cannot answer this question. They treat the pseudo-bulk as a homogeneous entity. `scCompoundDE` decomposes every DE gene into these two orthogonal components. ## Integration with Bioconductor workflows `scCompoundDE` is designed to fit into standard Bioconductor single-cell workflows without introducing a package-specific input object. The main function, `compoundDE`, accepts a `SingleCellExperiment` object, reads raw counts from an assay such as `assay(sce, "counts")`, and uses sample, condition, broad cell type, and subtype annotations stored in `colData`. This allows users to pass the same object through quality control, annotation, pseudo-bulk differential expression, and decomposition steps. A typical workflow is to perform batch-aware cell-level quality control with `scBatchQC`, run donor-aware pseudo-bulk differential expression with `scFastDE` or another Bioconductor DE workflow, and then use `scCompoundDE` to decide whether the observed broad pseudo-bulk signal is primarily transcriptional or compositional. The result is returned as a `CDEResult` S4 object containing a `S4Vectors::DataFrame` of per-gene statistics, a subtype proportion matrix, and per-subtype DE tables, so the output can be filtered, plotted, or passed to downstream Bioconductor-style analyses. ## The TC_ratio score For each gene, `compoundDE` returns a `TC_ratio` in \[0, 1\]: | TC_ratio | Interpretation | |---|---| | ≈ 1.0 | Transcriptional — cells intrinsically changed expression | | ≈ 0.0 | Compositional — proportion of subtypes shifted | | ≈ 0.5 | Mixed — both mechanisms contribute | Genes with low TC_ratio should be treated with caution: they may be DE in a standard analysis but reflect **composition** rather than **biology**. # Installation ```{r install, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("scCompoundDE") ``` # Simulated example: purely transcriptional signal We first demonstrate the tool on a dataset where the DE signal is deliberately **transcriptional**: the same proportions of TypeA and TypeB cells appear in both conditions, but TypeA cells intrinsically upregulate genes 1–10 in treatment. ```{r sim_transcriptional} library(scCompoundDE) library(SingleCellExperiment) set.seed(42) n_genes <- 150 n_donors <- 6 counts <- matrix(rpois(n_genes * 240L, 8L), nrow = n_genes, ncol = 240L) rownames(counts) <- paste0("Gene", seq_len(n_genes)) colnames(counts) <- paste0("Cell", seq_len(240L)) # First 10 genes × 4 in TypeA treatment cells only # TypeA cells: cols 61–120 in treatment counts[seq_len(10), 61:120] <- counts[seq_len(10), 61:120] * 5L sce <- SingleCellExperiment(assays = list(counts = counts)) sce$donor <- rep(rep(paste0("D", seq_len(n_donors)), each = 10L), 4L) sce$cell_type <- "T_cell" sce$subtype <- rep(rep(c("TypeA","TypeB"), each = n_donors * 10L), 2L) sce$condition <- rep(c("ctrl","treat"), each = 120L) # Check the design table(sce$subtype, sce$condition) ``` ## Running compoundDE ```{r run_compoundDE} result <- compoundDE( sce, broad_type = "T_cell", subtype_col = "subtype", broad_col = "cell_type", donor = "donor", condition = "condition", min_cells = 3L, min_subtypes = 2L, min_donors = 2L ) result ``` ## Inspecting results The `deTable` accessor returns one row per tested gene. In addition to the broad pseudo-bulk DE statistics (`logFC`, `P.Value`, and `adj.P.Val`), it contains the decomposition scores used by `scCompoundDE`: `T_score` for the transcriptional component, `C_score` for the compositional component, `TC_ratio` for their relative contribution, and `source` for the final classification. ```{r inspect_results} dt <- as.data.frame(deTable(result)) dt_sorted <- dt[order(dt$adj.P.Val), ] head(dt_sorted[, c("gene","logFC","adj.P.Val", "T_score","C_score","TC_ratio","source")], 15) ``` The injected genes (Gene1–Gene10) should have high TC_ratio values, confirming they were detected as transcriptionally driven. In a real analysis, genes classified as `transcriptional` are the strongest candidates for cell-intrinsic biology, genes classified as `compositional` should be interpreted as possible subtype-abundance effects, and genes classified as `mixed` require inspection of both scores. ```{r tc_of_injected} injected <- dt[dt$gene %in% paste0("Gene", seq_len(10)), ] cat("Median TC_ratio of injected genes:", round(median(injected$TC_ratio, na.rm = TRUE), 3), "\n") cat("All injected genes classified as transcriptional:", all(injected$source == "transcriptional", na.rm = TRUE), "\n") ``` ## Visualizations The plotting functions provide complementary diagnostics for the decomposition. `plotDecomposition` shows whether significant genes are driven mainly by the transcriptional or compositional axis, `plotTCRatio` summarizes the distribution of gene classifications, and `plotProportion` checks whether subtype proportions differ between conditions enough to explain compositional signals. ```{r plot_decomp, fig.cap="Scatter of transcriptional vs compositional z-scores. Blue = transcriptional, red = compositional, orange = mixed. Injected genes appear with high T and low C scores."} plotDecomposition(result, fdr_thresh = 0.05, top_n = 10) ``` In the transcriptional simulation, the top injected genes should have large transcriptional scores and relatively small compositional scores. Points near the compositional axis would instead indicate genes whose broad pseudo-bulk fold-change is explained by subtype abundance shifts. ```{r plot_tc, fig.cap="Distribution of TC_ratio scores. Vertical lines show classification thresholds at 0.2 and 0.8."} plotTCRatio(result, fdr_thresh = 0.05) ``` The TC_ratio histogram is useful as a quick summary of the experiment. A distribution concentrated near 1 suggests that most significant genes are cell-intrinsic, whereas a distribution concentrated near 0 suggests that the broad DE result is strongly affected by composition. ```{r plot_prop, fig.cap="Subtype proportions per condition. Balanced design — equal TypeA and TypeB in both conditions."} plotProportion(result) ``` The proportion plot should be read together with the decomposition plot. If one subtype expands or contracts between conditions, genes expressed mainly in that subtype are more likely to receive low TC_ratio values. # Simulated example: purely compositional signal Now we demonstrate the tool on a dataset where the **apparent** DE is caused entirely by a subtype proportion shift. Gene1 is a marker of TypeB cells. In control, most cells are TypeA (low Gene1). In treatment, most cells become TypeB (high Gene1). **A standard DE tool would call Gene1 as significantly upregulated — but it is a compositional artefact.** ```{r sim_compositional} set.seed(99) n_genes <- 150 # Unequal proportions: ctrl = 75% TypeA, treat = 75% TypeB n_a_ctrl <- 72L; n_b_ctrl <- 24L n_a_treat <- 24L; n_b_treat <- 72L n_cells <- n_a_ctrl + n_b_ctrl + n_a_treat + n_b_treat counts2 <- matrix(rpois(n_genes * n_cells, 5L), nrow = n_genes, ncol = n_cells) rownames(counts2) <- paste0("Gene", seq_len(n_genes)) colnames(counts2) <- paste0("Cell", seq_len(n_cells)) # Gene1 is a marker of TypeB only — no transcriptional change typeB_idx <- c( seq(n_a_ctrl + 1L, n_a_ctrl + n_b_ctrl), seq(n_a_ctrl + n_b_ctrl + n_a_treat + 1L, n_cells) ) counts2[1L, typeB_idx] <- counts2[1L, typeB_idx] * 8L sce2 <- SingleCellExperiment(assays = list(counts = counts2)) sce2$donor <- c( rep(paste0("D", 1:6), each = 12L), rep(paste0("D", 1:6), each = 4L), rep(paste0("D", 1:6), each = 4L), rep(paste0("D", 1:6), each = 12L) ) sce2$cell_type <- "T_cell" sce2$subtype <- c(rep("TypeA", n_a_ctrl), rep("TypeB", n_b_ctrl), rep("TypeA", n_a_treat), rep("TypeB", n_b_treat)) sce2$condition <- c(rep("ctrl", n_a_ctrl + n_b_ctrl), rep("treat", n_a_treat + n_b_treat)) table(sce2$subtype, sce2$condition) ``` ```{r run_compositional} result2 <- compoundDE( sce2, broad_type = "T_cell", subtype_col = "subtype", broad_col = "cell_type", donor = "donor", condition = "condition", min_cells = 3L, min_subtypes = 2L, min_donors = 2L ) result2 ``` ```{r compositional_check} dt2 <- as.data.frame(deTable(result2)) gene1 <- dt2[dt2$gene == "Gene1", c("logFC","adj.P.Val","TC_ratio","source")] cat("Gene1 (compositional marker):\n") print(gene1) ``` Gene1 will have a significant p-value (broad DE is strong) but a low TC_ratio, correctly identifying it as a compositional artefact. This is the case where `scCompoundDE` is most useful: the broad pseudo-bulk test detects a real difference between conditions, but the decomposition shows that the difference is caused by the changing mixture of subtypes rather than a cell-intrinsic expression shift. ```{r plot_comp_decomp, fig.cap="With compositional signal, Gene1 sits at high C_score / low T_score."} plotDecomposition(result2, fdr_thresh = 0.05, top_n = 5) ``` Here, Gene1 should move toward the compositional side of the plot. The interpretation is different from an ordinary DE table: a small adjusted p-value alone is not enough to call the gene transcriptionally altered. ```{r plot_comp_prop, fig.cap="Clear proportion shift: TypeB increases dramatically in treatment."} plotProportion(result2) ``` The subtype proportion plot provides the biological reason for the low TC_ratio. TypeB cells are more common in treatment, and Gene1 is a TypeB marker, so the broad pseudo-bulk signal is explained by composition. # Filtering by source After running `compoundDE`, use `filterGenesBySource` to extract gene lists by classification: ```{r filter_genes} # Genes with transcriptionally-driven DE (real biology) transcriptional_genes <- filterGenesBySource(result, source = "transcriptional", fdr_thresh = 0.05) cat("Transcriptional genes (FDR < 0.05):", nrow(transcriptional_genes), "\n") head(transcriptional_genes[, c("gene","logFC","adj.P.Val","TC_ratio")]) # Genes that are artefactual (compositional) compositional_genes <- filterGenesBySource(result, source = "compositional", fdr_thresh = 1) # include all, regardless of significance cat("\nCompositional genes (any FDR):", nrow(compositional_genes), "\n") ``` # Recommended workflow The same pattern applies to real Bioconductor workflows. `scBatchQC` can be used first to add batch-aware QC calls to `colData`, `scFastDE` can then perform fast donor-weighted pseudo-bulk DE on the cleaned `SingleCellExperiment`, and `scCompoundDE` can decompose the resulting broad signal for a selected cell population. Users may also substitute other Bioconductor QC, annotation, or pseudo-bulk tools, provided the input object contains counts and the required sample annotations. ```{r workflow, eval=FALSE} # Step 1: QC with scBatchQC library(scBatchQC) sce <- batchAwareQCMetrics(sce, batch = "batch") sce_clean <- sce[, !sce$scBatchQC_outlier] # Step 2: Fast DE with scFastDE (or any pseudo-bulk tool) library(scFastDE) de_result <- fastDE(sce_clean, donor = "donor", cell_type = "cell_type", condition = "condition", target_type = "T_cell") # Step 3: Decompose with scCompoundDE cde_result <- compoundDE( sce_clean, broad_type = "T_cell", subtype_col = "cell_subtype", broad_col = "cell_type", donor = "donor", condition = "condition" ) # Step 4: Retain only transcriptionally-driven genes real_biology <- filterGenesBySource(cde_result, source = "transcriptional", fdr_thresh = 0.05) ``` # Session information ```{r session} sessionInfo() ```