| Title: | Kraken Confidence Scores for Reliable Domain-Specific Microbiota Inference and Discovery |
|---|---|
| Description: | Provides a comprehensive framework for analyzing Kraken2 metagenomic classification using multiple confidence scores. karioCaS implements a robust comparative approach to evaluate taxonomic stability across different stringency levels, facilitating the identification of reliable microbiota components and the discovery of potentially novel or unrepresented species. A key innovation is its domain-specific analysis, separately processing Bacteria, Archaea, Eukaryota, and Viruses to ensure balanced visibility of all biological domains. Includes high-quality visualization tools for retention rates, shared taxa (Upset plots), and stability heatmaps. |
| Authors: | Thiago Parente [aut, cre] (ORCID: <https://orcid.org/0000-0001-7501-726X>), Fiocruz [fnd], IOC-Fiocruz [fnd], CAPES [fnd] |
| Maintainer: | Thiago Parente <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.99.15 |
| Built: | 2026-06-22 03:09:48 UTC |
| Source: | https://github.com/BiocStaging/karioCaS |
Single Source of Truth for visual elements in karioCaS. Follows Nature/Science publication standards.
.kariocas_internal_colors.kariocas_internal_colors
An object of class list of length 5.
For each biological group (inferred from sample names by stripping trailing
digits, e.g. SAMPLE33, SAMPLE34 -> SAMPLE), draws an
UpSet plot comparing which taxa (at tax_level) are present across the
samples of the group, per Domain. This separates the core taxa
(present in every sample) from unique/rare taxa (present in one or a
few samples) - the expected pattern for pathogens and false positives. A
membership TSV (presence matrix plus N_Samples and a
Core/Shared/Unique Category) is written alongside each plot.
group_upset(project_dir, tax_level = "Species", CS = NULL)group_upset(project_dir, tax_level = "Species", CS = NULL)
project_dir |
Path to the project root. |
tax_level |
Taxonomic rank to compare (default: |
CS |
Confidence Score to analyse. |
By default the analysis uses the final mosaic from
retrieve_selected_taxa(); set CS to compare at a single
Confidence Score from the imported data instead.
Invisibly returns a data.frame with the full membership table.
UpSet PDFs and membership TSVs are saved per group to
<project_dir>/008_taxa_intersections_across_samples/.
toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # Core vs unique species across samples, from the final mosaic # group_upset(project_dir = toy_project) # Compare at a single Confidence Score instead # group_upset(project_dir = toy_project, CS = 40)toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # Core vs unique species across samples, from the final mosaic # group_upset(project_dir = toy_project) # Compare at a single Confidence Score instead # group_upset(project_dir = toy_project, CS = 40)
Creates Relative Abundance ( Confidence Score. "Elite" survivors are clustered by similarity, while lost taxa are aggregated into "Loss Groups" at the bottom. Handles domains with zero survivors at the target CS (showing only loss groups).
heatmaps_karioCaS( project_dir, analysis_rank = NULL, confidence_score = NULL, top_n = 20 )heatmaps_karioCaS( project_dir, analysis_rank = NULL, confidence_score = NULL, top_n = 20 )
project_dir |
Path to the project root. |
analysis_rank |
Taxonomic rank to analyze. Defaults to |
confidence_score |
Target CS to define survivors (e.g., 90). Defaults to the highest available CS. |
top_n |
Number of top survivors to display individually (default: 20). |
Invisibly returns NULL. PDF plots are saved to
<project_dir>/006_relative_abundance_across_CS/.
toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # Basic usage (defaults to Genus, highest CS, top 20 taxa) # heatmaps_karioCaS(project_dir = toy_project) # Advanced: Species level at CS 50, top 30 taxa # heatmaps_karioCaS( # project_dir = toy_project, # analysis_rank = "Species", # confidence_score = 50, # top_n = 30 # )toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # Basic usage (defaults to Genus, highest CS, top 20 taxa) # heatmaps_karioCaS(project_dir = toy_project) # Advanced: Species level at CS 50, top 30 taxa # heatmaps_karioCaS( # project_dir = toy_project, # analysis_rank = "Species", # confidence_score = 50, # top_n = 30 # )
Reads Kraken2 MPA-style reports from the 000_mpa_original folder.
Parses filenames like SAMPLE_CSXX.mpa (e.g., PILO_CS09.mpa).
Generates a detailed log file for traceability and saves a
TreeSummarizedExperiment object for downstream analysis.
import_karioCaS(project_dir)import_karioCaS(project_dir)
project_dir |
Path to the project root. Must contain a
|
Invisibly returns a TreeSummarizedExperiment object.
toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") import_karioCaS(project_dir = toy_project)toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") import_karioCaS(project_dir = toy_project)
Saturation analysis: progressively raises a per-taxon read-count cutoff and
tracks how many taxa survive, on a log read axis. By default it draws one
group overlay per biological group (every sample a faint line, group
mean highlighted) and marks each domain's median optimal minimum
reads - the elbow of the saturation curve, found with the same engine used
for the optimal CS - as a dashed line. The per-sample optimal-reads values are
written to Reads_Audit_<rank>.tsv/.rds, giving a quantitative
threshold for excluding low-abundance background/false-positive taxa.
reads_per_taxa( project_dir, analysis_level = "Species", method = c("kneedle", "postcliff", "segmented"), detail_samples = NULL )reads_per_taxa( project_dir, analysis_level = "Species", method = c("kneedle", "postcliff", "segmented"), detail_samples = NULL )
project_dir |
Path to the project root. |
analysis_level |
Taxonomic rank to analyze (default: |
method |
Elbow strategy for the optimal reads. One of |
detail_samples |
Which samples to also render as detailed per-sample
saturation panels. |
The optimal reads is computed per Confidence Score, since the saturation curve changes with CS; read it off at your chosen optimal CS (Step 002).
Invisibly returns a data.frame with the optimal-reads audit.
PDF plots and Reads_Audit_<rank> files are saved to
<project_dir>/003_reads_saturation/.
toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # Group saturation overlays + optimal minimum reads (default Kneedle) # reads_per_taxa(project_dir = toy_project)toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # Group saturation overlays + optimal minimum reads (default Kneedle) # reads_per_taxa(project_dir = toy_project)
Creates a "biological mosaic" for each sample using the
karioCaS_TSE.rds object from Step 001 and, optionally, the optimal
thresholds computed earlier: the optimal Confidence Score (Stability Index
audit from taxa_retention(), Step 002) and the optimal minimum reads
(Reads_Audit from reads_per_taxa(), Step 003). Both the
CS_* and reads_min_* arguments accept "auto",
"secondary", or a manual numeric value per domain. The optimal reads is
looked up at each domain's resolved CS, so the mosaic combines both data-driven
thresholds automatically. The mosaic always retains all taxonomic
ranks present in the input (as an MPA profile does); enforces a strict
> 0 read filter.
retrieve_selected_taxa( project_dir, tax_level = NULL, CS_A = "auto", reads_min_A = 0, CS_B = "auto", reads_min_B = 0, CS_E = "auto", reads_min_E = 0, CS_V = "auto", reads_min_V = 0 )retrieve_selected_taxa( project_dir, tax_level = NULL, CS_A = "auto", reads_min_A = 0, CS_B = "auto", reads_min_B = 0, CS_E = "auto", reads_min_E = 0, CS_V = "auto", reads_min_V = 0 )
project_dir |
Path to the project root. |
tax_level |
Which rank's optimization audit the |
CS_A |
Character or numeric. CS for Archaea:
|
reads_min_A |
Minimum reads for Archaea: |
CS_B |
Character or numeric. CS for Bacteria. Default: |
reads_min_B |
Minimum reads for Bacteria. Default: 0. |
CS_E |
Character or numeric. CS for Eukaryota. Default: |
reads_min_E |
Minimum reads for Eukaryota. Default: 0. |
CS_V |
Character or numeric. CS for Viruses. Default: |
reads_min_V |
Minimum reads for Viruses. Default: 0. |
Invisibly returns TRUE. Mosaic files are saved to
<project_dir>/004_final_mosaic/, with .mpa files under
mpa/ and .tsv files under tsv/.
toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # Fully data-driven mosaic: optimal CS and optimal min-reads per domain # retrieve_selected_taxa( # project_dir = toy_project, # tax_level = "Species", # CS_B = "auto", reads_min_B = "auto", # CS_A = "auto", reads_min_A = "auto", # CS_E = 40, reads_min_E = 10, # CS_V = 0, reads_min_V = 0 # )toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # Fully data-driven mosaic: optimal CS and optimal min-reads per domain # retrieve_selected_taxa( # project_dir = toy_project, # tax_level = "Species", # CS_B = "auto", reads_min_B = "auto", # CS_A = "auto", reads_min_A = "auto", # CS_E = 40, reads_min_E = 10, # CS_V = 0, reads_min_V = 0 # )
Creates stacked bar plots showing resolution efficiency between a Parent Rank
and a Child Rank (how much of each parent clade's reads are resolved down to
the child rank versus remaining parent-exclusive). Uses max() for
parent aggregation to prevent double-counting of children in cumulative data.
taxa_resolution( project_dir, parent_level = "Genus", child_level = "Species", CS = NULL, top_n = 10 )taxa_resolution( project_dir, parent_level = "Genus", child_level = "Species", CS = NULL, top_n = 10 )
project_dir |
Path to the project root. |
parent_level |
Name of the parent rank (default: |
child_level |
Name of the child rank (default: |
CS |
Confidence Score to analyse. |
top_n |
Number of top taxa to display per domain (default: 10). |
By default the analysis runs on the final mosaic produced by
retrieve_selected_taxa (004_final_mosaic/), i.e. the
data-driven high-confidence selection - one figure per sample. Alternatively,
set CS to analyse the raw imported data at a single Confidence Score.
Invisibly returns NULL. PDF plots are saved to
<project_dir>/007_taxa_resolution/.
toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # Default: resolution of the final mosaic (Genus vs Species, top 10) # taxa_resolution(project_dir = toy_project) # Analyse the imported data at a single Confidence Score # taxa_resolution(project_dir = toy_project, CS = 40)toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # Default: resolution of the final mosaic (Genus vs Species, top 10) # taxa_resolution(project_dir = toy_project) # Analyse the imported data at a single Confidence Score # taxa_resolution(project_dir = toy_project, CS = 40)
Executes taxa retention analysis based on Confidence Score (Kraken/Bracken)
and, in the same step, computes the objective optimal CS (Stability Index, SI)
for each domain. By default it produces a single, low-clutter group
overlay per biological group: every sample of the group is drawn as a faint
line with the group mean (SD) highlighted and each domain's median
optimal CS marked with a dashed line, faceted by Domain. It also writes the
machine-readable SI audit (SI_Audit_<rank>.tsv/.rds) used by
retrieve_selected_taxa. Detailed per-sample panels (all ranks,
taxa vs reads) are written only on request.
taxa_retention( project_dir, tax_level = "Species", method = c("kneedle", "postcliff", "segmented", "dynamic", "manual"), manual_toll = 1, detail_samples = NULL )taxa_retention( project_dir, tax_level = "Species", method = c("kneedle", "postcliff", "segmented", "dynamic", "manual"), manual_toll = 1, detail_samples = NULL )
project_dir |
Root path of the project. |
tax_level |
Taxonomic rank used for the group overlay and SI
(default: |
method |
Optimal-CS strategy. One of |
manual_toll |
Numeric or named list. Acceptable step-wise loss percentage,
used only when |
detail_samples |
Which samples to also render as detailed per-sample
panels. |
Groups are inferred from sample names by stripping trailing digits
(e.g. SAMPLE33, SAMPLE34 both belong to group SAMPLE).
The optimal CS is found with a multi-strategy engine: "kneedle"
(default, parameter-free elbow detection), "postcliff" (a more
conservative threshold past the steepest drop), "segmented"
(broken-stick regression), "dynamic", or "manual".
Invisibly returns a data.frame with the full SI audit trail.
PDF plots and SI_Audit_<rank> files are saved to
<project_dir>/002_taxa_retention/.
toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # Group retention overlay + optimal CS (default Kneedle method) # taxa_retention(project_dir = toy_project)toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # Group retention overlay + optimal CS (default Kneedle method) # taxa_retention(project_dir = toy_project)
"karioCaS never are upset!" Generates UpSet plots showing taxon persistence across Confidence Score levels, for a single taxonomic rank, with detailed logging. One plot per sample and domain is written to a per-sample subfolder.
upset_kariocas(project_dir, tax_level = "Species")upset_kariocas(project_dir, tax_level = "Species")
project_dir |
Path to the project root. |
tax_level |
Taxonomic rank to analyze (default: |
Invisibly returns NULL. PDF plots are saved to
<project_dir>/005_taxa_intersections_across_CS/.
toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # upset_kariocas(project_dir = toy_project) # upset_kariocas(project_dir = toy_project, tax_level = "Genus")toy_project <- system.file("extdata", "your_project_name", package = "karioCaS") # upset_kariocas(project_dir = toy_project) # upset_kariocas(project_dir = toy_project, tax_level = "Genus")