karioCaS: Kraken Confidence Scores Exploration

Introduction

Metagenomic classification using k-mer based tools like Kraken2 is incredibly fast, but it is notoriously prone to false-positive assignments. The standard approach of applying a single, global Confidence Score (CS) threshold across an entire sample is fundamentally flawed: one size does not fit all. Biological domains (Bacteria, Archaea, Viruses, and Eukaryota) have vastly different database representations, genome sizes, and k-mer uniqueness.

karioCaS (Kraken Confidence Scores for Reliable Domain-Specific Microbiota Inference) is an R package designed to solve this problem. It provides a robust analytical framework to evaluate taxonomic stability across multiple stringency levels, allowing researchers to find the exact inflection point where statistical noise is removed without losing true biological signal.

library(karioCaS)

1. Setting up the Environment

karioCaS relies on a strict, automated directory architecture. To analyze your data, you must place your Kraken2/Metaphlan-style outputs inside a base directory, specifically within a folder named 000_mpa_original.

Files must follow the naming convention: SAMPLENAME_CSXX.mpa (e.g., SAMPLE01_CS09.mpa for a Confidence Score of 0.9).

For this vignette, we will use the mock data bundled with the package, copying it to a temporary directory to comply with CRAN/Bioconductor file system policies.

# 1. Create a temporary project directory
proj_dir <- file.path(tempdir(), "karioCaS_Tutorial")
dir.create(file.path(proj_dir, "000_mpa_original"), recursive = TRUE)

# 2. Locate and copy the bundled mock data
mock_data_path <- system.file("extdata/your_project_name/000_mpa_original", package = "karioCaS")
mock_files <- list.files(mock_data_path, full.names = TRUE)
file.copy(mock_files, file.path(proj_dir, "000_mpa_original"))
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE

print(paste("Project Directory initialized at:", proj_dir))
#> [1] "Project Directory initialized at: /tmp/RtmpTs8t6o/karioCaS_Tutorial"

2. Data Harmonization

The first step is to import the multiple stringency outputs into a cohesive Bioconductor TreeSummarizedExperiment (TSE) object. This step parses the taxonomic strings, infers the phylogenetic ranks, and standardizes the matrices.

# Import data and generate the TSE object (Step 000)
tse_object <- import_karioCaS(project_dir = proj_dir)

# The object is now ready for Bioconductor-native downstream analysis if desired
tse_object
#> class: TreeSummarizedExperiment 
#> dim: 2030 6 
#> metadata(0):
#> assays(1): counts
#> rownames(2030): d__Archaea d__Archaea|k__Methanobacteriati ...
#>   d__Bacteria|p__Pseudomonadota|c__Betaproteobacteria|o__Burkholderiales|f__Comamonadaceae|g__Hydrogenophaga|s__Hydrogenophaga
#>   crassostreae
#>   d__Bacteria|p__Pseudomonadota|c__Alphaproteobacteria|o__Sphingomonadales|f__Sphingomonadaceae|g__Novosphingobium|s__Novosphingobium
#>   sp. KCTC 2891
#> rowData names(10): Taxonomy_Full Domain ... Species Rank
#> colnames(6): SAMPLE01_CS00 SAMPLE01_CS20 ... SAMPLE01_CS80
#>   SAMPLE01_CS90
#> colData names(2): Sample_ID Confidence_Score
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL

Note: This function automatically creates a 001_imported_matrix folder containing the .rds object and a human-readable .tsv audit trail.

3. Visual Exploration (Steps 001 to 005)

karioCaS shines in its ability to generate publication-ready PDFs to visually inspect the behavior of taxa as stringency increases. All plotting functions automatically read from the harmonized data and save their outputs into dynamically generated subfolders.

By default, taxa_retention() and reads_per_taxa() produce a single group overlay per biological group rather than one set of PDFs per sample: every sample of the group is drawn as a faint line while the group mean (± SD) is highlighted, faceted by Domain. This keeps the output compact and makes group-level trends obvious. Groups are inferred from sample names by stripping trailing digits (e.g. SAMPLE33 and SAMPLE34 both belong to group SAMPLE).

# Note: In an active R session, these commands will generate high-quality
# PDF panels in your project directory.

# Group-overlay retention curves + optimal CS (see Section 4)
taxa_retention(project_dir = proj_dir)

# Need a closer look? Render detailed per-sample panels into a per_sample/
# subfolder. Use "all", or a comma-separated list of sample names.
taxa_retention(project_dir = proj_dir, detail_samples = "SAMPLE01")

# Saturation analysis: also computes the optimal MINIMUM READS per domain
# (elbow of the saturation curve) and writes Reads_Audit_Species.tsv/.rds
reads_per_taxa(project_dir = proj_dir)

# Parent-to-Child resolution at a chosen CS (the default source is the final
# mosaic from Section 5, so during exploration pass CS = ... explicitly)
taxa_resolution(project_dir = proj_dir, CS = 40, parent_level = "Genus", child_level = "Species")

# Generate abundance heatmaps showing taxa extinction patterns at a given CS
heatmaps_karioCaS(project_dir = proj_dir, analysis_rank = "Species", confidence_score = 40)

4. Objective Thresholding: The Stability Index (SI)

While visual exploration is crucial, choosing the optimal Confidence Score manually introduces subjectivity. So taxa_retention() (Step 001) does double duty: alongside the retention overlay it runs a Multi-Strategy Stability Index (SI) engine that analyzes the taxa-retention decay curve to mathematically identify the optimal threshold (Primary SI) for each domain, marks it on the group plot, and writes the SI_Audit_<rank> tables. Strategies via method=:

Kneedle (Default): Parameter-free elbow detection. Locates the inflection point where the steep noise-removal phase gives way to the stable signal floor, consistently across domains even when a curve begins with a plateau.
Post-cliff: A more conservative threshold that lands deeper in the stable plateau (the first stabilised CS after the steepest drop).
Segmented: A Broken-Stick regression model that finds regime shifts (e.g. ecology investigations and discovering biological dark matter).
Dynamic / Manual: First CS whose step-wise loss is within tail noise, or an expert-defined loss toll per domain.

taxa_retention() returns the SI audit invisibly, so you can inspect the chosen thresholds directly:

# taxa_retention() computes the Stability Index (default Kneedle) and returns
# the audit. SI_Audit_Species.tsv/.rds are also written to 002_taxa_retention/.
audit_df <- taxa_retention(project_dir = proj_dir, tax_level = "Species")

# View the mathematical assessment for the Primary SIs
head(audit_df[audit_df$SI_Type == "Primary_SI", c("Sample", "Domain", "CS", "Taxa_Count", "Step_Loss_Pct")])
#> # A tibble: 6 × 5
#>   Sample   Domain      CS Taxa_Count Step_Loss_Pct
#>   <chr>    <chr>    <int>      <int>         <dbl>
#> 1 <NA>     <NA>        NA         NA          NA  
#> 2 <NA>     <NA>        NA         NA          NA  
#> 3 <NA>     <NA>        NA         NA          NA  
#> 4 SAMPLE01 Bacteria    60         48          22.1
#> 5 <NA>     <NA>        NA         NA          NA  
#> 6 <NA>     <NA>        NA         NA          NA

taxa_retention() always saves an objective audit trail (SI_Audit_<rank>.rds and .tsv) covering every sample, and marks each domain’s median Primary SI as a dashed reference line on the group overlay.

The same idea applies to read depth: reads_per_taxa() finds the optimal minimum reads (the elbow of the saturation curve) per domain and writes a Reads_Audit, so the final mosaic can use both thresholds automatically.

# Optimal minimum reads per domain (writes Reads_Audit_Species.tsv/.rds)
reads_audit <- reads_per_taxa(project_dir = proj_dir, analysis_level = "Species")

5. The Ultimate Biological Mosaic

Once the optimal thresholds are defined mathematically, we extract the surviving taxa. Since Archaea might require a CS of 0.20 while Bacteria requires a CS of 0.40, karioCaS allows you to create a “Mosaic” .mpa file using specific thresholds for each domain.

Both knobs are data-driven. Setting a CS_* argument to "auto" pulls the optimal Primary SI from the audit written by taxa_retention(); setting a reads_min_* argument to "auto" pulls the optimal minimum reads from the Reads_Audit written by reads_per_taxa(), looked up at that domain’s resolved CS. Either can still be a manual number.

# Generate the final mosaic combining optimal CS and optimal minimum reads
retrieve_selected_taxa(
    project_dir = proj_dir,
    tax_level = "Species",
    CS_B = "auto", reads_min_B = "auto", # optimal CS + optimal reads for Bacteria
    CS_A = "secondary", reads_min_A = "auto", # conservative CS, optimal reads
    CS_E = 40, reads_min_E = 10, # manual overrides for Eukaryota
    CS_V = 0, reads_min_V = 0 # keep everything for Viruses
)

The resulting SAMPLENAME_karioCaS_Mosaic.mpa file, saved in the 004_final_mosaic folder, is a highly refined, high-confidence taxonomic profile, scrubbed of statistical noise and ready for downstream ecological or clinical analysis.

Finally, taxa_resolution(project_dir = proj_dir) (with no CS) inspects the Parent-to-Child resolution of this final mosaic. The mosaic always retains every taxonomic rank, so this works out of the box.

When a study has several samples per biological group (named with a shared prefix, e.g. CONTROL01, CONTROL02), group_upset() compares the taxa across the samples of each group and writes both an UpSet plot and a membership table separating core taxa (present in every sample) from unique/rare taxa (one or a few samples) - the expected signature of pathogens and false positives. Like taxa_resolution(), it uses the final mosaic by default, or a single CS:

# Needs >= 2 samples per group; uses the final mosaic by default.
group_upset(project_dir = proj_dir)

Session Info

#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] karioCaS_0.99.15 rmarkdown_2.31  
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1                dplyr_1.2.1                    
#>  [3] farver_2.1.2                    Biostrings_2.81.3              
#>  [5] S7_0.2.2                        fastmap_1.2.0                  
#>  [7] SingleCellExperiment_1.35.1     lazyeval_0.2.3                 
#>  [9] digest_0.6.39                   lifecycle_1.0.5                
#> [11] tidytree_0.4.7                  magrittr_2.0.5                 
#> [13] compiler_4.6.0                  rlang_1.2.0                    
#> [15] sass_0.4.10                     tools_4.6.0                    
#> [17] utf8_1.2.6                      yaml_2.3.12                    
#> [19] knitr_1.51                      S4Arrays_1.13.0                
#> [21] labeling_0.4.3                  bit_4.6.0                      
#> [23] DelayedArray_0.39.3             plyr_1.8.9                     
#> [25] xml2_1.5.2                      RColorBrewer_1.1-3             
#> [27] TreeSummarizedExperiment_2.21.0 abind_1.4-8                    
#> [29] BiocParallel_1.47.0             withr_3.0.3                    
#> [31] purrr_1.2.2                     BiocGenerics_0.59.7            
#> [33] sys_3.4.3                       grid_4.6.0                     
#> [35] stats4_4.6.0                    ggplot2_4.0.3                  
#> [37] scales_1.4.0                    SummarizedExperiment_1.43.0    
#> [39] cli_3.6.6                       UpSetR_1.4.1                   
#> [41] crayon_1.5.3                    treeio_1.37.0                  
#> [43] generics_0.1.4                  otel_0.2.0                     
#> [45] tzdb_0.5.0                      commonmark_2.0.0               
#> [47] ape_5.8-1                       cachem_1.1.0                   
#> [49] stringr_1.6.0                   parallel_4.6.0                 
#> [51] XVector_0.53.0                  matrixStats_1.5.0              
#> [53] yulab.utils_0.2.4               vctrs_0.7.3                    
#> [55] Matrix_1.7-5                    jsonlite_2.0.0                 
#> [57] litedown_0.9                    IRanges_2.47.2                 
#> [59] hms_1.1.4                       patchwork_1.3.2                
#> [61] S4Vectors_0.51.3                bit64_4.8.2                    
#> [63] maketools_1.3.2                 tidyr_1.3.2                    
#> [65] jquerylib_0.1.4                 glue_1.8.1                     
#> [67] codetools_0.2-20                ggtext_0.1.2                   
#> [69] stringi_1.8.7                   gtable_0.3.6                   
#> [71] GenomicRanges_1.65.0            tibble_3.3.1                   
#> [73] pillar_1.11.1                   rappdirs_0.3.4                 
#> [75] htmltools_0.5.9                 Seqinfo_1.3.0                  
#> [77] R6_2.6.1                        vroom_1.7.1                    
#> [79] evaluate_1.0.5                  lattice_0.22-9                 
#> [81] Biobase_2.73.1                  markdown_2.0                   
#> [83] readr_2.2.0                     gridtext_0.1.6                 
#> [85] bslib_0.11.0                    Rcpp_1.1.1-1.1                 
#> [87] gridExtra_2.3                   SparseArray_1.13.2             
#> [89] nlme_3.1-169                    xfun_0.59                      
#> [91] fs_2.1.0                        MatrixGenerics_1.25.0          
#> [93] forcats_1.0.1                   buildtools_1.0.0               
#> [95] pkgconfig_2.0.3