---
title: "Generating Consensus TADs with generate_tad_consensus"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Generating Consensus TADs with generate_tad_consensus}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(consensusTADs)
```

## Introduction

Topologically Associating Domains (TADs) are fundamental units of chromatin organization that play crucial roles in gene regulation. Multiple computational tools have been developed to predict TAD boundaries from Hi-C data, but their results often vary significantly. The `generate_tad_consensus` function provides a method to integrate predictions from multiple tools and generate a high-confidence consensus TAD set.

## Function Overview

`generate_tad_consensus` creates consensus TADs through an iterative threshold approach that selects optimal non-overlapping TADs representing agreement across different prediction methods. It uses the Measure of Concordance (MoC) score to quantify the level of agreement between predictions from different tools.

## Parameters

```r
generate_tad_consensus(
  df_tools,
  threshold = 0,
  step = -0.05
)
```

* **df_tools**: A data frame containing TAD information with the following required columns:
    * `chr`: Chromosome name
    * `start`: TAD start position
    * `end`: TAD end position
    * `meta.tool`: Identifier for the prediction tool
* **threshold**: A numeric value giving the minimum MoC threshold for filtering. The default is 0. Higher thresholds require stronger agreement between different tools.
* **step**: A numeric value used to generate the threshold sequence. The default is -0.05. The sequence starts at 1 and descends in increments of `step` until it reaches `threshold`.

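
As a rough illustration of how `threshold` and `step` interact, the threshold sequence can be sketched with base R's `seq()`. This is only a sketch: it assumes the function builds its sequence in a way equivalent to `seq(1, threshold, by = step)`, which is not confirmed here, and `thresholds` is an illustrative variable name.

```r
# Illustrative only: assumes the threshold series is equivalent to
# seq(1, threshold, by = step); the package internals may differ.
threshold <- 0.3
step <- -0.1

# Descending sequence from 1 toward the minimum threshold
thresholds <- seq(1, threshold, by = step)
print(thresholds)
```

Under this assumption, a smaller step magnitude yields more candidate thresholds between 1 and `threshold`, while a larger magnitude yields a coarser series.
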
## Return Value

The function returns a data frame with the following columns:

* **chr**: Chromosome name
* **start**: TAD start position
* **end**: TAD end position
* **score_source**: A string containing information about the tools that contributed to this TAD and their individual MoC scores
* **threshold**: The MoC threshold value at which this TAD was selected during the iterative selection process

## Usage Examples

The following examples demonstrate how to use the `generate_tad_consensus` function:

```{r}
# Prepare input data with predictions from multiple tools
tad_data <- data.frame(
  chr = rep("chr1", 6),
  start = c(10000, 20000, 50000, 12000, 22000, 48000),
  end = c(30000, 45000, 65000, 32000, 43000, 67000),
  meta.tool = c(rep("tool1", 3), rep("tool2", 3))
)

# Generate consensus TADs with default parameters
consensus_results <- generate_tad_consensus(tad_data)
print(consensus_results)

# Generate consensus TADs with custom threshold values
custom_consensus <- generate_tad_consensus(
  tad_data,
  threshold = 0.3,
  step = -0.1
)
print(custom_consensus)
```

## How It Works

The `generate_tad_consensus` function follows these steps:

1. **Input validation**: Check whether the input contains data from multiple prediction tools. If only one tool is present, the function returns the original data.
2. **Data preparation**: Split the input data by chromosome.
3. **Threshold sequence generation**: Create a sequence of threshold values from 1 down to the specified threshold parameter using the step size.
4. **Iterative TAD selection**: For each chromosome, apply the `select_tads_by_threshold_series` function, which:
    - Iterates through the threshold sequence from high to low
    - For each threshold, calculates MoC scores between TADs using `moc_score_filter`
    - Keeps the TADs that meet the current threshold
    - Uses dynamic programming (`select_global_optimal_tads`) to select an optimal set of non-overlapping TADs that maximizes the total score
    - Records the threshold at which each TAD was selected
5. **Result compilation**: Combine results from all chromosomes and return a data frame with the consensus TADs.

## The Measure of Concordance (MoC) Score

The MoC score quantifies the agreement between two TAD predictions and is calculated as:

$$MoC = \frac{(intersection\_width)^2}{width1 \times width2}$$

Where:

- `intersection_width` is the length of the overlap between two TADs
- `width1` and `width2` are the lengths of the two TADs being compared

The score ranges from 0 (no overlap) to 1 (identical TADs). Higher MoC scores indicate stronger agreement between predictions.

## Dynamic Programming for Optimal TAD Selection

The algorithm uses dynamic programming to select a set of non-overlapping TADs that maximizes the total MoC score. This ensures that the consensus TADs represent regions with the strongest evidence across multiple prediction tools while avoiding contradictory overlapping boundaries.

## Important Notes

- Input data should contain predictions from at least two different tools (identified by the `meta.tool` column); with a single tool, the function returns the input unchanged
- The threshold parameter defines the minimum required MoC score and can be adjusted based on analysis needs
- The returned consensus TADs are guaranteed to be non-overlapping

```{r}
sessionInfo()
```
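
As a worked illustration of the MoC formula from the section on the Measure of Concordance, the sketch below computes the score for one pair of TADs taken from the usage example. `pairwise_moc` is a hypothetical helper written for this illustration, not a function exported by the package, and it assumes widths are computed as `end - start`; the package may count widths differently (for example, inclusively).

```r
# Hypothetical helper, not part of consensusTADs: pairwise MoC following
# MoC = intersection_width^2 / (width1 * width2).
# Assumption: widths are end - start.
pairwise_moc <- function(start1, end1, start2, end2) {
  # Overlap length; zero when the two intervals do not intersect
  intersection_width <- max(0, min(end1, end2) - max(start1, start2))
  width1 <- end1 - start1
  width2 <- end2 - start2
  intersection_width^2 / (width1 * width2)
}

# tool1 TAD chr1:10000-30000 versus tool2 TAD chr1:12000-32000
moc <- pairwise_moc(10000, 30000, 12000, 32000)
print(moc)  # 18000^2 / (20000 * 20000) = 0.81
```

Under these assumptions the pair would pass any threshold up to 0.81, so it would first become eligible for selection once the descending threshold series reaches that value.
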