--- title: "karioCaS: Kraken Confidence Scores Exploration" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{karioCaS: Kraken Confidence Scores Exploration} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup_knitr, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE ) ``` *** ## Introduction Metagenomic classification using k-mer based tools like Kraken2 is incredibly fast, but it is notoriously prone to false-positive assignments. The standard approach of applying a single, global Confidence Score (CS) threshold across an entire sample is fundamentally flawed: **one size does not fit all**. Biological domains (Bacteria, Archaea, Viruses, and Eukaryota) have vastly different database representations, genome sizes, and k-mer uniqueness. **karioCaS** (Kraken Confidence Scores for Reliable Domain-Specific Microbiota Inference) is an R package designed to solve this problem. It provides a robust analytical framework to evaluate taxonomic stability across multiple stringency levels, allowing researchers to find the exact inflection point where statistical noise is removed without losing true biological signal. ```{r setup} library(karioCaS) ``` *** ## 1. Setting up the Environment karioCaS relies on a strict, automated directory architecture. To analyze your data, you must place your Kraken2/Metaphlan-style outputs inside a base directory, specifically within a folder named `000_mpa_original`. Files must follow the naming convention: `SAMPLENAME_CSXX.mpa` (e.g., `SAMPLE01_CS09.mpa` for a Confidence Score of 0.9). For this vignette, we will use the mock data bundled with the package, copying it to a temporary directory to comply with CRAN/Bioconductor file system policies. ```{r prep_data} # 1. Create a temporary project directory proj_dir <- file.path(tempdir(), "karioCaS_Tutorial") dir.create(file.path(proj_dir, "000_mpa_original"), recursive = TRUE) # 2. Locate and copy the bundled mock data mock_data_path <- system.file("extdata/your_project_name/000_mpa_original", package = "karioCaS") mock_files <- list.files(mock_data_path, full.names = TRUE) file.copy(mock_files, file.path(proj_dir, "000_mpa_original")) print(paste("Project Directory initialized at:", proj_dir)) ``` *** ## 2. Data Harmonization The first step is to import the multiple stringency outputs into a cohesive Bioconductor `TreeSummarizedExperiment` (TSE) object. This step parses the taxonomic strings, infers the phylogenetic ranks, and standardizes the matrices. ```{r import_data} # Import data and generate the TSE object (Step 000) tse_object <- import_karioCaS(project_dir = proj_dir) # The object is now ready for Bioconductor-native downstream analysis if desired tse_object ``` *Note: This function automatically creates a `001_imported_matrix` folder containing the `.rds` object and a human-readable `.tsv` audit trail.* *** ## 3. Visual Exploration (Steps 001 to 005) karioCaS shines in its ability to generate publication-ready PDFs to visually inspect the behavior of taxa as stringency increases. All plotting functions automatically read from the harmonized data and save their outputs into dynamically generated subfolders. By default, `taxa_retention()` and `reads_per_taxa()` produce a single **group overlay** per biological group rather than one set of PDFs per sample: every sample of the group is drawn as a faint line while the group mean (± SD) is highlighted, faceted by Domain. This keeps the output compact and makes group-level trends obvious. Groups are inferred from sample names by stripping trailing digits (e.g. `SAMPLE33` and `SAMPLE34` both belong to group `SAMPLE`). ```{r visual_exploration, eval=FALSE} # Note: In an active R session, these commands will generate high-quality # PDF panels in your project directory. # Group-overlay retention curves + optimal CS (see Section 4) taxa_retention(project_dir = proj_dir) # Need a closer look? Render detailed per-sample panels into a per_sample/ # subfolder. Use "all", or a comma-separated list of sample names. taxa_retention(project_dir = proj_dir, detail_samples = "SAMPLE01") # Saturation analysis: also computes the optimal MINIMUM READS per domain # (elbow of the saturation curve) and writes Reads_Audit_Species.tsv/.rds reads_per_taxa(project_dir = proj_dir) # Parent-to-Child resolution at a chosen CS (the default source is the final # mosaic from Section 5, so during exploration pass CS = ... explicitly) taxa_resolution(project_dir = proj_dir, CS = 40, parent_level = "Genus", child_level = "Species") # Generate abundance heatmaps showing taxa extinction patterns at a given CS heatmaps_karioCaS(project_dir = proj_dir, analysis_rank = "Species", confidence_score = 40) ``` *** ## 4. Objective Thresholding: The Stability Index (SI) While visual exploration is crucial, choosing the optimal Confidence Score manually introduces subjectivity. So `taxa_retention()` (Step 001) does double duty: alongside the retention overlay it runs a **Multi-Strategy Stability Index (SI)** engine that analyzes the taxa-retention decay curve to mathematically identify the optimal threshold (Primary SI) for each domain, marks it on the group plot, and writes the `SI_Audit_` tables. Strategies via `method=`: * **Kneedle (Default):** Parameter-free elbow detection. Locates the inflection point where the steep noise-removal phase gives way to the stable signal floor, consistently across domains even when a curve begins with a plateau. * **Post-cliff:** A more conservative threshold that lands deeper in the stable plateau (the first stabilised CS after the steepest drop). * **Segmented:** A Broken-Stick regression model that finds regime shifts (e.g. ecology investigations and discovering biological dark matter). * **Dynamic / Manual:** First CS whose step-wise loss is within tail noise, or an expert-defined loss toll per domain. `taxa_retention()` returns the SI audit invisibly, so you can inspect the chosen thresholds directly: ```{r optimize_cs} # taxa_retention() computes the Stability Index (default Kneedle) and returns # the audit. SI_Audit_Species.tsv/.rds are also written to 002_taxa_retention/. audit_df <- taxa_retention(project_dir = proj_dir, tax_level = "Species") # View the mathematical assessment for the Primary SIs head(audit_df[audit_df$SI_Type == "Primary_SI", c("Sample", "Domain", "CS", "Taxa_Count", "Step_Loss_Pct")]) ``` `taxa_retention()` always saves an objective audit trail (`SI_Audit_.rds` and `.tsv`) covering every sample, and marks each domain's **median Primary SI** as a dashed reference line on the group overlay. The same idea applies to read depth: `reads_per_taxa()` finds the optimal **minimum reads** (the elbow of the saturation curve) per domain and writes a `Reads_Audit`, so the final mosaic can use both thresholds automatically. ```{r reads_audit} # Optimal minimum reads per domain (writes Reads_Audit_Species.tsv/.rds) reads_audit <- reads_per_taxa(project_dir = proj_dir, analysis_level = "Species") ``` *** ## 5. The Ultimate Biological Mosaic Once the optimal thresholds are defined mathematically, we extract the surviving taxa. Since Archaea might require a CS of 0.20 while Bacteria requires a CS of 0.40, `karioCaS` allows you to create a "Mosaic" `.mpa` file using specific thresholds for each domain. Both knobs are data-driven. Setting a `CS_*` argument to `"auto"` pulls the optimal *Primary SI* from the audit written by `taxa_retention()`; setting a `reads_min_*` argument to `"auto"` pulls the optimal minimum reads from the `Reads_Audit` written by `reads_per_taxa()`, looked up at that domain's resolved CS. Either can still be a manual number. ```{r retrieve_taxa} # Generate the final mosaic combining optimal CS and optimal minimum reads retrieve_selected_taxa( project_dir = proj_dir, tax_level = "Species", CS_B = "auto", reads_min_B = "auto", # optimal CS + optimal reads for Bacteria CS_A = "secondary", reads_min_A = "auto", # conservative CS, optimal reads CS_E = 40, reads_min_E = 10, # manual overrides for Eukaryota CS_V = 0, reads_min_V = 0 # keep everything for Viruses ) ``` The resulting `SAMPLENAME_karioCaS_Mosaic.mpa` file, saved in the `004_final_mosaic` folder, is a highly refined, high-confidence taxonomic profile, scrubbed of statistical noise and ready for downstream ecological or clinical analysis. Finally, `taxa_resolution(project_dir = proj_dir)` (with no `CS`) inspects the Parent-to-Child resolution of this final mosaic. The mosaic always retains every taxonomic rank, so this works out of the box. When a study has several samples per biological group (named with a shared prefix, e.g. `CONTROL01`, `CONTROL02`), `group_upset()` compares the taxa **across the samples of each group** and writes both an UpSet plot and a membership table separating **core** taxa (present in every sample) from **unique**/rare taxa (one or a few samples) - the expected signature of pathogens and false positives. Like `taxa_resolution()`, it uses the final mosaic by default, or a single `CS`: ```{r group_upset, eval=FALSE} # Needs >= 2 samples per group; uses the final mosaic by default. group_upset(project_dir = proj_dir) ``` ```{r cleanup, include=FALSE} # Clean up temporary directory unlink(proj_dir, recursive = TRUE) ``` *** ## Session Info ```{r sessionInfo, echo=FALSE} sessionInfo() ```