Metagenomic classification using k-mer based tools like Kraken2 is incredibly fast, but it is notoriously prone to false-positive assignments. The standard approach of applying a single, global Confidence Score (CS) threshold across an entire sample is fundamentally flawed: one size does not fit all. Biological domains (Bacteria, Archaea, Viruses, and Eukaryota) have vastly different database representations, genome sizes, and k-mer uniqueness.
karioCaS (Kraken Confidence Scores for Reliable Domain-Specific Microbiota Inference) is an R package designed to solve this problem. It provides a robust analytical framework to evaluate taxonomic stability across multiple stringency levels, allowing researchers to find the exact inflection point where statistical noise is removed without losing true biological signal.
karioCaS relies on a strict, automated directory architecture. To
analyze your data, you must place your Kraken2/Metaphlan-style outputs
inside a base directory, specifically within a folder named
000_mpa_original.
Files must follow the naming convention:
SAMPLENAME_CSXX.mpa (e.g., SAMPLE01_CS09.mpa
for a Confidence Score of 0.9).
For this vignette, we will use the mock data bundled with the package, copying it to a temporary directory to comply with CRAN/Bioconductor file system policies.
# 1. Create a temporary project directory
proj_dir <- file.path(tempdir(), "karioCaS_Tutorial")
dir.create(file.path(proj_dir, "000_mpa_original"), recursive = TRUE)
# 2. Locate and copy the bundled mock data
mock_data_path <- system.file("extdata/your_project_name/000_mpa_original", package = "karioCaS")
mock_files <- list.files(mock_data_path, full.names = TRUE)
file.copy(mock_files, file.path(proj_dir, "000_mpa_original"))
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE
print(paste("Project Directory initialized at:", proj_dir))
#> [1] "Project Directory initialized at: /tmp/RtmpTs8t6o/karioCaS_Tutorial"The first step is to import the multiple stringency outputs into a
cohesive Bioconductor TreeSummarizedExperiment (TSE)
object. This step parses the taxonomic strings, infers the phylogenetic
ranks, and standardizes the matrices.
# Import data and generate the TSE object (Step 000)
tse_object <- import_karioCaS(project_dir = proj_dir)
# The object is now ready for Bioconductor-native downstream analysis if desired
tse_object
#> class: TreeSummarizedExperiment
#> dim: 2030 6
#> metadata(0):
#> assays(1): counts
#> rownames(2030): d__Archaea d__Archaea|k__Methanobacteriati ...
#> d__Bacteria|p__Pseudomonadota|c__Betaproteobacteria|o__Burkholderiales|f__Comamonadaceae|g__Hydrogenophaga|s__Hydrogenophaga
#> crassostreae
#> d__Bacteria|p__Pseudomonadota|c__Alphaproteobacteria|o__Sphingomonadales|f__Sphingomonadaceae|g__Novosphingobium|s__Novosphingobium
#> sp. KCTC 2891
#> rowData names(10): Taxonomy_Full Domain ... Species Rank
#> colnames(6): SAMPLE01_CS00 SAMPLE01_CS20 ... SAMPLE01_CS80
#> SAMPLE01_CS90
#> colData names(2): Sample_ID Confidence_Score
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULLNote: This function automatically creates a
001_imported_matrix folder containing the .rds
object and a human-readable .tsv audit trail.
karioCaS shines in its ability to generate publication-ready PDFs to visually inspect the behavior of taxa as stringency increases. All plotting functions automatically read from the harmonized data and save their outputs into dynamically generated subfolders.
By default, taxa_retention() and
reads_per_taxa() produce a single group
overlay per biological group rather than one set of PDFs per
sample: every sample of the group is drawn as a faint line while the
group mean (± SD) is highlighted, faceted by Domain. This keeps the
output compact and makes group-level trends obvious. Groups are inferred
from sample names by stripping trailing digits
(e.g. SAMPLE33 and SAMPLE34 both belong to
group SAMPLE).
# Note: In an active R session, these commands will generate high-quality
# PDF panels in your project directory.
# Group-overlay retention curves + optimal CS (see Section 4)
taxa_retention(project_dir = proj_dir)
# Need a closer look? Render detailed per-sample panels into a per_sample/
# subfolder. Use "all", or a comma-separated list of sample names.
taxa_retention(project_dir = proj_dir, detail_samples = "SAMPLE01")
# Saturation analysis: also computes the optimal MINIMUM READS per domain
# (elbow of the saturation curve) and writes Reads_Audit_Species.tsv/.rds
reads_per_taxa(project_dir = proj_dir)
# Parent-to-Child resolution at a chosen CS (the default source is the final
# mosaic from Section 5, so during exploration pass CS = ... explicitly)
taxa_resolution(project_dir = proj_dir, CS = 40, parent_level = "Genus", child_level = "Species")
# Generate abundance heatmaps showing taxa extinction patterns at a given CS
heatmaps_karioCaS(project_dir = proj_dir, analysis_rank = "Species", confidence_score = 40)While visual exploration is crucial, choosing the optimal Confidence
Score manually introduces subjectivity. So taxa_retention()
(Step 001) does double duty: alongside the retention overlay it runs a
Multi-Strategy Stability Index (SI) engine that
analyzes the taxa-retention decay curve to mathematically identify the
optimal threshold (Primary SI) for each domain, marks it on the group
plot, and writes the SI_Audit_<rank> tables.
Strategies via method=:
taxa_retention() returns the SI audit invisibly, so you
can inspect the chosen thresholds directly:
# taxa_retention() computes the Stability Index (default Kneedle) and returns
# the audit. SI_Audit_Species.tsv/.rds are also written to 002_taxa_retention/.
audit_df <- taxa_retention(project_dir = proj_dir, tax_level = "Species")
# View the mathematical assessment for the Primary SIs
head(audit_df[audit_df$SI_Type == "Primary_SI", c("Sample", "Domain", "CS", "Taxa_Count", "Step_Loss_Pct")])
#> # A tibble: 6 × 5
#> Sample Domain CS Taxa_Count Step_Loss_Pct
#> <chr> <chr> <int> <int> <dbl>
#> 1 <NA> <NA> NA NA NA
#> 2 <NA> <NA> NA NA NA
#> 3 <NA> <NA> NA NA NA
#> 4 SAMPLE01 Bacteria 60 48 22.1
#> 5 <NA> <NA> NA NA NA
#> 6 <NA> <NA> NA NA NAtaxa_retention() always saves an objective audit trail
(SI_Audit_<rank>.rds and .tsv) covering
every sample, and marks each domain’s median Primary SI
as a dashed reference line on the group overlay.
The same idea applies to read depth: reads_per_taxa()
finds the optimal minimum reads (the elbow of the
saturation curve) per domain and writes a Reads_Audit, so
the final mosaic can use both thresholds automatically.
# Optimal minimum reads per domain (writes Reads_Audit_Species.tsv/.rds)
reads_audit <- reads_per_taxa(project_dir = proj_dir, analysis_level = "Species")Once the optimal thresholds are defined mathematically, we extract
the surviving taxa. Since Archaea might require a CS of 0.20 while
Bacteria requires a CS of 0.40, karioCaS allows you to
create a “Mosaic” .mpa file using specific thresholds for
each domain.
Both knobs are data-driven. Setting a CS_* argument to
"auto" pulls the optimal Primary SI from the audit
written by taxa_retention(); setting a
reads_min_* argument to "auto" pulls the
optimal minimum reads from the Reads_Audit written by
reads_per_taxa(), looked up at that domain’s resolved CS.
Either can still be a manual number.
# Generate the final mosaic combining optimal CS and optimal minimum reads
retrieve_selected_taxa(
project_dir = proj_dir,
tax_level = "Species",
CS_B = "auto", reads_min_B = "auto", # optimal CS + optimal reads for Bacteria
CS_A = "secondary", reads_min_A = "auto", # conservative CS, optimal reads
CS_E = 40, reads_min_E = 10, # manual overrides for Eukaryota
CS_V = 0, reads_min_V = 0 # keep everything for Viruses
)The resulting SAMPLENAME_karioCaS_Mosaic.mpa file, saved
in the 004_final_mosaic folder, is a highly refined,
high-confidence taxonomic profile, scrubbed of statistical noise and
ready for downstream ecological or clinical analysis.
Finally, taxa_resolution(project_dir = proj_dir) (with
no CS) inspects the Parent-to-Child resolution of this
final mosaic. The mosaic always retains every taxonomic rank, so this
works out of the box.
When a study has several samples per biological group (named with a
shared prefix, e.g. CONTROL01, CONTROL02),
group_upset() compares the taxa across the samples
of each group and writes both an UpSet plot and a membership
table separating core taxa (present in every sample)
from unique/rare taxa (one or a few samples) - the
expected signature of pathogens and false positives. Like
taxa_resolution(), it uses the final mosaic by default, or
a single CS:
# Needs >= 2 samples per group; uses the final mosaic by default.
group_upset(project_dir = proj_dir)#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] karioCaS_0.99.15 rmarkdown_2.31
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.2.1
#> [3] farver_2.1.2 Biostrings_2.81.3
#> [5] S7_0.2.2 fastmap_1.2.0
#> [7] SingleCellExperiment_1.35.1 lazyeval_0.2.3
#> [9] digest_0.6.39 lifecycle_1.0.5
#> [11] tidytree_0.4.7 magrittr_2.0.5
#> [13] compiler_4.6.0 rlang_1.2.0
#> [15] sass_0.4.10 tools_4.6.0
#> [17] utf8_1.2.6 yaml_2.3.12
#> [19] knitr_1.51 S4Arrays_1.13.0
#> [21] labeling_0.4.3 bit_4.6.0
#> [23] DelayedArray_0.39.3 plyr_1.8.9
#> [25] xml2_1.5.2 RColorBrewer_1.1-3
#> [27] TreeSummarizedExperiment_2.21.0 abind_1.4-8
#> [29] BiocParallel_1.47.0 withr_3.0.3
#> [31] purrr_1.2.2 BiocGenerics_0.59.7
#> [33] sys_3.4.3 grid_4.6.0
#> [35] stats4_4.6.0 ggplot2_4.0.3
#> [37] scales_1.4.0 SummarizedExperiment_1.43.0
#> [39] cli_3.6.6 UpSetR_1.4.1
#> [41] crayon_1.5.3 treeio_1.37.0
#> [43] generics_0.1.4 otel_0.2.0
#> [45] tzdb_0.5.0 commonmark_2.0.0
#> [47] ape_5.8-1 cachem_1.1.0
#> [49] stringr_1.6.0 parallel_4.6.0
#> [51] XVector_0.53.0 matrixStats_1.5.0
#> [53] yulab.utils_0.2.4 vctrs_0.7.3
#> [55] Matrix_1.7-5 jsonlite_2.0.0
#> [57] litedown_0.9 IRanges_2.47.2
#> [59] hms_1.1.4 patchwork_1.3.2
#> [61] S4Vectors_0.51.3 bit64_4.8.2
#> [63] maketools_1.3.2 tidyr_1.3.2
#> [65] jquerylib_0.1.4 glue_1.8.1
#> [67] codetools_0.2-20 ggtext_0.1.2
#> [69] stringi_1.8.7 gtable_0.3.6
#> [71] GenomicRanges_1.65.0 tibble_3.3.1
#> [73] pillar_1.11.1 rappdirs_0.3.4
#> [75] htmltools_0.5.9 Seqinfo_1.3.0
#> [77] R6_2.6.1 vroom_1.7.1
#> [79] evaluate_1.0.5 lattice_0.22-9
#> [81] Biobase_2.73.1 markdown_2.0
#> [83] readr_2.2.0 gridtext_0.1.6
#> [85] bslib_0.11.0 Rcpp_1.1.1-1.1
#> [87] gridExtra_2.3 SparseArray_1.13.2
#> [89] nlme_3.1-169 xfun_0.59
#> [91] fs_2.1.0 MatrixGenerics_1.25.0
#> [93] forcats_1.0.1 buildtools_1.0.0
#> [95] pkgconfig_2.0.3