Single-Cell RNA Analysis Pipeline in sciNOME

Introduction

The sciNOME package provides an ultra-fast, lightweight, and robust pipeline for processing Single-Cell RNA-seq data. This vignette demonstrates the complete standard workflow, from creating the RNA object to trajectory inference, using a lightweight simulated dataset.

library(sciNOME)

1. Data Preparation

To ensure this vignette compiles rapidly without requiring external downloads, we simulate a small single-cell expression matrix (50 genes x 50 cells) and matching metadata.

We artificially increase the expression of specific genes in the first 25 cells to simulate distinct cell populations.

set.seed(42)
# Simulate a 50x50 count matrix
mock_counts <- matrix(rpois(2500, lambda = 5), nrow = 50, ncol = 50)

# Inject artificial differential expression for distinct clusters
mock_counts[1:10, 1:25] <- mock_counts[1:10, 1:25] + 15 

rownames(mock_counts) <- c("MT-ND1", "MT-ND2", paste0("Gene_", 3:50))
colnames(mock_counts) <- paste0("Cell_", 1:50)

# Simulate basic metadata
mock_meta <- data.frame(
  Cell_ID = paste0("Cell_", 1:50),
  Batch = rep(c("Batch_1", "Batch_2"), times = 25)
)

2. Object Construction and QC

We build the RNA object and perform quality control (QC) and normalization in just two steps. We set the filtering thresholds to 0 here to accommodate the tiny mock dataset.

# Build the object
rna_obj <- Build_RNAObject(
  expr_mat = mock_counts,
  meta_data = mock_meta,
  meta_id_col = "Cell_ID",
  min_cells = 0,
  min_features = 0,
  mt_pattern = "^MT-"
)

# Process QC and LogNormalization (Skip scaling for speed in this example)
rna_obj <- ProcessQC_RNA(
  obj = rna_obj,
  mt_pattern = "^MT-",
  min_nCount = 0, 
  min_nFeature = 0, 
  max_mt = 100,
  norm_method = "LogNormalize",
  do_scale = FALSE 
)

# Check processed metadata
head(rna_obj$filter_meta.data, 3)
#>        nCount_RNA nFeature_RNA percent_mt   Batch
#> Cell_1        440           49   10.68182 Batch_1
#> Cell_2        376           48   10.10638 Batch_2
#> Cell_3        415           49    9.39759 Batch_1

3. Dimensionality Reduction

Next, we identify Highly Variable Genes (HVGs) and perform Principal Component Analysis (PCA).

rna_obj <- RunDimReduction_RNA(
  obj = rna_obj,
  method = "PCA",
  layer_name = "data",
  n_hvg = 20,       # Use top 20 HVGs for this tiny dataset
  pca_rank = 5      # Compute top 5 PCs
)

# Preview PCA coordinates
head(rna_obj$reductions$pca, 3)
#>             PC_1       PC_2      PC_3       PC_4       PC_5
#> Cell_1 0.2637178 -1.1234396 3.2673977  1.8825843  0.9826002
#> Cell_2 5.0227921  0.3069669 1.8647245 -4.5375292  0.9639599
#> Cell_3 2.0648461  1.8569399 0.1649815 -0.9681444 -2.4084489

4. Unsupervised Clustering

We use hierarchical clustering (which is extremely fast and stable for small data) to identify cell clusters based on the PCA space.

rna_obj <- RunClustering_RNA(
  obj = rna_obj,
  reduction = "pca",
  method = "hierarchical",
  cluster_k = 2      # We expect 2 clusters due to our artificial injection
)

# Check cluster distribution
table(rna_obj$filter_meta.data$Auto_Cluster)
#> 
#> Cluster_1 Cluster_2 
#>        27        23

5. Differential Expression Analysis (DEA)

We perform an extremely fast Wilcoxon Rank Sum Test to find marker genes between the identified clusters.

dea_res <- RunDEA_RNA(
  obj = rna_obj,
  layer_name = "data",
  group_col = "Auto_Cluster",
  ident_1 = "Cluster_1",
  ident_2 = "Cluster_2",
  min_pct = 0,          # Relaxed for mock data
  logfc_thresh = 0      # Relaxed for mock data
)

# Display top 5 differentially expressed genes
head(dea_res, 5)
#>     gene        p_val avg_log2FC pct.1 pct.2    p_val_adj   cluster
#> 1 MT-ND1 1.298735e-05 -0.8493024     1     1 0.0003077857 Cluster_1
#> 2 Gene_7 1.418637e-05 -0.9819662     1     1 0.0003077857 Cluster_1
#> 3 Gene_3 1.846714e-05 -0.8751727     1     1 0.0003077857 Cluster_1
#> 4 Gene_8 2.505259e-05 -0.8310393     1     1 0.0003131574 Cluster_1
#> 5 Gene_4 5.144484e-05 -0.7676672     1     1 0.0004287070 Cluster_1
#>               comparison
#> 1 Cluster_1 vs Cluster_2
#> 2 Cluster_1 vs Cluster_2
#> 3 Cluster_1 vs Cluster_2
#> 4 Cluster_1 vs Cluster_2
#> 5 Cluster_1 vs Cluster_2

6. Trajectory Inference (Pseudotime)

Finally, we infer the developmental trajectory (Pseudotime) using the PCA coordinates, starting from Cluster_1.

# We wrap this in a check to ensure 'princurve' is available
if (requireNamespace("princurve", quietly = TRUE)) {
  rna_obj <- RunPseudotime_RNA(
    obj = rna_obj,
    reduction = "pca",
    group_col = "Auto_Cluster",
    start_clus = "Cluster_1",
    algorithm = "cluster"
  )
  
  # Check assigned pseudotime values
  head(rna_obj$filter_meta.data$Pseudotime)
}
#> [1] 0.4990516 1.0000000 0.7495275 0.5378360 0.5072763 0.5969383

Session Information

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_4.0.3     dplyr_1.2.1       data.table_1.18.4 sciNOME_0.99.0   
#> [5] BiocStyle_2.41.0 
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6         xfun_0.58            bslib_0.11.0        
#>  [4] ggrepel_0.9.8        rstatix_0.7.3        lattice_0.22-9      
#>  [7] vctrs_0.7.3          tools_4.6.0          generics_0.1.4      
#> [10] stats4_4.6.0         parallel_4.6.0       tibble_3.3.1        
#> [13] pkgconfig_2.0.3      Matrix_1.7-5         RColorBrewer_1.1-3  
#> [16] S7_0.2.2             S4Vectors_0.51.3     ggridges_0.5.7      
#> [19] lifecycle_1.0.5      compiler_4.6.0       farver_2.1.2        
#> [22] Seqinfo_1.3.0        carData_3.0-6        htmltools_0.5.9     
#> [25] sys_3.4.3            buildtools_1.0.0     sass_0.4.10         
#> [28] yaml_2.3.12          Formula_1.2-5        pillar_1.11.1       
#> [31] car_3.1-5            ggpubr_0.6.3         jquerylib_0.1.4     
#> [34] tidyr_1.3.2          MASS_7.3-65          cachem_1.1.0        
#> [37] abind_1.4-8          princurve_2.1.6      nlme_3.1-169        
#> [40] tidyselect_1.2.1     digest_0.6.39        purrr_1.2.2         
#> [43] splines_4.6.0        maketools_1.3.2      labeling_0.4.3      
#> [46] fastmap_1.2.0        grid_4.6.0           cli_3.6.6           
#> [49] magrittr_2.0.5       patchwork_1.3.2      broom_1.0.13        
#> [52] withr_3.0.2          scales_1.4.0         backports_1.5.1     
#> [55] rmarkdown_2.31       igraph_2.3.2         otel_0.2.0          
#> [58] ggsignif_0.6.4       pbapply_1.7-4        evaluate_1.0.5      
#> [61] knitr_1.51           GenomicRanges_1.65.0 IRanges_2.47.2      
#> [64] irlba_2.3.7          viridisLite_0.4.3    mgcv_1.9-4          
#> [67] rlang_1.2.0          Rcpp_1.1.1-1.1       glue_1.8.1          
#> [70] BiocManager_1.30.27  BiocGenerics_0.59.7  jsonlite_2.0.0      
#> [73] R6_2.6.1