---
title: "CMEnt Configuration"
author: "CMEnt Package"
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{CMEnt Configuration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
suppressPackageStartupMessages({
    library(CMEnt)
})
knitr::opts_chunk$set(
    echo = TRUE,
    warning = FALSE,
    message = FALSE
)
```

# Overview

This vignette describes the configuration options available in CMEnt: the function parameters (as passed to `buildDMRs()` in the examples below) and the global package options. Understanding them will help you tune the package for your specific analysis needs.

# Function Parameters

## Core Input Parameters

### `beta`
**Type:** Character, matrix, BetaHandler object, or BED file  
**Required:** Yes  
**Description:** Input methylation data. Can be:

- Path to a beta value file (tab-separated). The file is loaded into memory if its size is below `getOption("CMEnt.beta_in_mem_threshold_mb")` megabytes; otherwise, if samtools tabix is installed, it is genomically sorted and converted to tabix for faster access.
- Path to a tabix-indexed file (.bed.gz with .tbi index)
- A beta matrix with site IDs as rownames and sample IDs as column names.
- A BetaHandler object (see `?BetaHandler`)
- A BED file with columns `bed_chrom_col` and `bed_start_col`, followed by sample columns whose names match the row names of the provided pheno data

**Example:**
```{r beta_example}
loadExampleInputDataChr5And11("beta")
```
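
For readers assembling their own input, the matrix form described above can be sketched as follows. The site and sample IDs are invented for illustration and are not part of the example dataset; the object is named `toy_beta` so that it does not overwrite the example `beta`.

```{r sketch_toy_beta}
# Hypothetical site and sample IDs, used only to show the expected layout
site_ids <- c("cg00000029", "cg00000108", "cg00000109")
sample_ids <- c("S1", "S2", "S3", "S4")

set.seed(1)
toy_beta <- matrix(
    runif(length(site_ids) * length(sample_ids)),  # beta values lie in [0, 1]
    nrow = length(site_ids),
    dimnames = list(site_ids, sample_ids)          # rows = sites, columns = samples
)
toy_beta
```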

### `seeds`
**Type:** Character or data.frame  
**Required:** Yes  
**Description:** Sites to use as seeds for DMR detection. Can be:

- Path to a file with line-separated site IDs (one per line)
- A data.frame with DMP information

**Format requirements:**

- Site IDs should be in the row names, the first column, or the column given by `seeds_id_col`
- The site IDs must match those in the beta data, either as Illumina IDs or genomic coordinates (chr:start). The latter is required when using BED files for beta.

**Example:**
```{r seeds_example}
loadExampleInputDataChr5And11("dmps")
seeds <- dmps
```
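
The two accepted forms of `seeds` can be sketched with toy data. All IDs, the `site_id` column name, and the statistics below are made up for illustration; only the `chr:start` ID format and the `seeds_id_col` parameter come from the description above.

```{r sketch_toy_seeds}
# Form 1: a plain text file with one site ID per line (IDs are hypothetical)
seed_ids <- c("chr5:1000", "chr5:1450", "chr11:20300")
seeds_file <- tempfile(fileext = ".txt")
writeLines(seed_ids, seeds_file)
readLines(seeds_file)

# Form 2: a data.frame whose ID column would be named via `seeds_id_col`
toy_seeds <- data.frame(
    site_id = seed_ids,            # hypothetical column name
    pval = c(1e-6, 3e-5, 2e-4),    # made-up DMP statistics
    stringsAsFactors = FALSE
)
toy_seeds
```

With a layout like this, the ID column would be selected with `seeds_id_col = "site_id"`.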

### `pheno`
**Type:** Data.frame  
**Required:** Yes  
**Description:** Sample phenotype information.

**Format requirements:**

- Row names should match column names in beta data
- Must contain sample group information column (specified by `sample_group_col`)
- May contain case/control status column (specified by `casecontrol_col`)

**Example:**
```{r pheno_example}
loadExampleInputDataChr5And11("pheno")
```
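
A minimal hand-built phenotype table might look like the sketch below. The sample names and group labels are hypothetical; the point is only that the row names must line up with the beta column names and that the group column uses the default `Sample_Group` name.

```{r sketch_toy_pheno}
toy_samples <- c("S1", "S2", "S3", "S4")   # hypothetical beta column names
toy_pheno <- data.frame(
    Sample_Group = c("control", "control", "treated", "treated"),
    row.names = toy_samples
)
# Every beta column should have a matching pheno row
stopifnot(all(toy_samples %in% rownames(toy_pheno)))
toy_pheno
```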

## Sample Grouping Parameters

### `sample_group_col`
**Type:** Character  
**Default:** `"Sample_Group"`  
**Description:** Column name in `pheno` that specifies sample groups (e.g., "case" vs "control", "treated" vs "untreated"). More than two groups are supported.

### `casecontrol_col`
**Type:** Character  
**Default:** `NULL`  
**Description:** Column name in `pheno` holding case (TRUE/1) vs control (FALSE/0) status, used for delta beta computations. If `NULL`, controls are assumed to be the first level found in `sample_group_col`.
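
To avoid relying on which group happens to be the first level, an explicit status column can be added to the phenotype table. The sketch below uses made-up group labels and a hypothetical column name `is_case`.

```{r sketch_casecontrol}
toy_groups <- data.frame(
    Sample_Group = c("untreated", "treated", "treated", "untreated"),
    row.names = c("S1", "S2", "S3", "S4")
)
# Hypothetical column name; TRUE marks cases, FALSE marks controls
toy_groups$is_case <- toy_groups$Sample_Group == "treated"
toy_groups
# It would then be passed as: buildDMRs(..., casecontrol_col = "is_case")
```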


### `ignored_sample_groups`
**Type:** Character vector  
**Default:** `NULL`  
**Description:** Sample groups to exclude while considering connection and expansion. Can also be "case" or "control".

**Example:**
```{r ignored_groups_example, eval=FALSE}
dmrs <- buildDMRs(
    beta = beta,
    seeds = seeds,
    pheno = pheno,
    ignored_sample_groups = c("excluded_group1", "excluded_group2")
)
```

## Array and Genome Parameters

### `array`
**Type:** Character  
**Options:** `"450K"`, `"27K"`, `"EPIC"`, `"EPICv2"`  
**Default:** `"450K"`  
**Description:** Type of methylation array platform used. Ignored when using mouse genomes or when beta is provided as a BED file.


### `genome`
**Type:** Character  
**Options:** `"hg19"`, `"hg38"`, `"hs1"`, `"mm10"`, `"mm39"`  
**Description:** Reference genome version. If not specified, it will be inferred from the input data; if input is 450K or EPIC array data, the genome will be set to "hg19" by default, otherwise to "hg38".

## Filtering Parameters

### `ext_site_delta_beta`
**Type:** Numeric  
**Default:** `0.2`  
**Range:** 0 to 1, or `NA` to disable  
**Description:** Minimum absolute delta beta required for a neighboring site to be included in a DMR during the **second stage** of DMR extension, without considering correlation.

**Recommendation:** Keep 0.2 for balanced precision/recall. Use `NA` to disable the shortcut entirely. Set to 0 only when you intentionally want any proximal site with a non-missing case/control delta beta to be eligible for force-connection.
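
The threshold check can be illustrated on a single site. This is a sketch under the assumption that delta beta is the difference between mean case and mean control beta values; the values are made up, and CMEnt's internal computation may differ.

```{r sketch_delta_beta}
ext_site_delta_beta <- 0.2
case_beta <- c(0.71, 0.65, 0.80)      # made-up beta values at one site
control_beta <- c(0.40, 0.45, 0.38)

# Assumption: delta beta = mean(case) - mean(control)
delta_beta <- mean(case_beta) - mean(control_beta)

# isTRUE() makes an NA threshold evaluate to FALSE, i.e. the shortcut is disabled
passes <- isTRUE(abs(delta_beta) >= ext_site_delta_beta)
delta_beta
passes
```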

### `min_seeds`
**Type:** Integer  
**Default:** `1`  
**Description:** Minimum number of connected seeds required in a DMR.

**Recommendation:** Increase this value (e.g., to 3 or 4) for higher-confidence DMRs.

### `min_adj_seeds`
**Type:** Integer  
**Default:** `2`  
**Description:** Minimum number of seeds in a DMR after adjusting for site density. Minimum 2. This accounts for regions with different site coverage and is mostly applicable to array-based data. Only used when `min_seeds < min_adj_seeds`.


### `min_sites`
**Type:** Integer  
**Default:** `50`  
**Description:** Minimum number of sites required in a DMR after region expansion. Minimum 2.

**Recommendation:** Lower this value (e.g., 3-5) for array-based data, keep higher (50+) for WGBS data.


## Region Building Parameters

### `max_lookup_dist`
**Type:** Integer  
**Default:** `10000`  
**Unit:** Base pairs  
**Description:** Maximum genomic distance between adjacent seeds to be considered part of the same DMR.

**Recommendation:**

- 1,000-5,000 bp for tightly connected regions
- 10,000 bp (default) for moderate spacing
- 20,000+ bp for broader regions
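
The effect of the distance cutoff alone can be sketched in base R. The positions are made up, and this only shows which seeds are close enough to be candidates for the same DMR; it leaves out the correlation testing that CMEnt applies on top.

```{r sketch_lookup_dist}
max_lookup_dist <- 10000
seed_pos <- c(1000, 4000, 12000, 30000, 36000)  # made-up, sorted, one chromosome

# Start a new candidate group whenever the gap to the previous seed is too large
gaps <- diff(seed_pos)
group <- cumsum(c(TRUE, gaps > max_lookup_dist))
split(seed_pos, group)
```

Here the 18,000 bp gap between the third and fourth seed exceeds the cutoff, so the seeds fall into two candidate groups.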


### `expansion_window`
**Type:** Numeric  
**Default:** `1e6`  
**Description:** Stage 2 connectivity is computed only around seed-derived Stage 1 neighborhoods, using this total window width in base pairs.

Set to `<= 0` to compute connectivity genome-wide.

### `max_bridge_seeds_gaps`
**Type:** Integer  
**Default:** `1`  
**Description:** In Stage 1 seed connectivity, allows bridging up to this many consecutive p-value-driven failed edges when both flanking edges are connected.

### `max_bridge_extension_gaps`
**Type:** Integer  
**Default:** `1`  
**Description:** In Stage 2 DMR extension, allows bridging up to this many consecutive p-value-driven failed edges when both flanking edges are connected.
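
One plausible reading of the two bridging parameters is sketched below. The helper `bridge_gaps()` is hypothetical and not part of CMEnt; it treats every `FALSE` edge as a p-value-driven failure and bridges interior runs of failed edges that are no longer than the allowed gap.

```{r sketch_bridge_gaps}
# Hypothetical helper: `connected` is a logical vector of consecutive edges
bridge_gaps <- function(connected, max_gap = 1L) {
    runs <- rle(connected)
    n <- length(runs$values)
    for (i in seq_len(n)) {
        is_short_gap <- !runs$values[i] && runs$lengths[i] <= max_gap
        # rle() alternates values, so an interior FALSE run has TRUE on both sides
        is_flanked <- i > 1 && i < n
        if (is_short_gap && is_flanked) runs$values[i] <- TRUE
    }
    inverse.rle(runs)
}

edges <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
bridge_gaps(edges, max_gap = 1)
```

With `max_gap = 1`, the single failed edge in second position is bridged, while the run of two failed edges and the trailing failed edge (which has no connected edge on its right) are left as they are.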

## Statistical Parameters

### `max_pval`
**Type:** Numeric  
**Default:** `0.05`  
**Range:** 0 to 1  
**Description:** Maximum p-value threshold for considering correlation between seeds as significant during the first stage of connectivity testing, and between proximal sites during the second stage of DMR extension. Under `strong` entanglement, a Bonferroni correction is applied based on the number of sample groups (number of tests per site).

### `entanglement`
**Type:** Character  
**Options:** `"strong"`, `"weak"`  
**Default:** `"strong"`  
**Description:** Strategy for determining connectivity between sites across sample groups:

- `"strong"`: Requires all sample groups to show significant correlation for two sites to be considered connected. This is more conservative and ensures consistent methylation patterns across all groups.
- `"weak"`: Requires at least one sample group to show significant correlation. This is more permissive and may identify DMRs that are specific to certain groups.

**Recommendation:** Use `"strong"` (default) for most cases to ensure robust, reproducible DMRs. Use `"weak"` when you want to capture group-specific methylation patterns or when working with heterogeneous sample groups.

**Example:**
```{r entanglement_example, eval=FALSE}
dmrs <- buildDMRs(
    beta = beta,
    seeds = seeds,
    pheno = pheno,
    entanglement = "weak"
)
```
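
The difference between the two strategies can be sketched as a decision rule over per-group p-values. The function `is_connected()` and the p-values are hypothetical; the Bonferroni step under `"strong"` follows the `max_pval` description, and the rest is a simplified reading rather than CMEnt's actual implementation.

```{r sketch_entanglement}
is_connected <- function(group_pvals, max_pval = 0.05,
                         entanglement = c("strong", "weak")) {
    entanglement <- match.arg(entanglement)
    if (entanglement == "strong") {
        # Bonferroni: one test per sample group, and all groups must pass
        all(group_pvals <= max_pval / length(group_pvals))
    } else {
        # Any single significant group is enough
        any(group_pvals <= max_pval)
    }
}

pvals <- c(groupA = 0.001, groupB = 0.03, groupC = 0.2)  # made-up p-values
is_connected(pvals, entanglement = "strong")
is_connected(pvals, entanglement = "weak")
```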

### `testing_mode`
**Type:** Character  
**Options:** `"parametric"`, `"empirical"`, `"auto"`  
**Default:** `"auto"`  
**Description:** Method for calculating p-values during connectivity testing:

- `"parametric"`: Uses t-based correlation p-values (faster, assumes normal distribution)
- `"empirical"`: Uses permutation-based p-values (slower, no distribution assumptions)
- `"auto"`: Evaluates correlation test assumptions per sample group and chooses `"parametric"` only when diagnostics are acceptable; otherwise switches to `"empirical"`

**Recommendation:** Use `"auto"` when you want robust defaults across heterogeneous datasets. Use `"parametric"` when assumptions are known to hold and runtime is critical, and `"empirical"` when assumptions are clearly questionable.
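
To make the contrast concrete, the sketch below computes both kinds of p-value for one pair of sites using base R only. It assumes a Pearson correlation and a standard add-one Monte Carlo estimator; the data are simulated, and CMEnt's internal statistic and estimator may differ.

```{r sketch_testing_mode}
set.seed(42)
x <- rnorm(12)                  # made-up values for site 1 in one sample group
y <- x + rnorm(12, sd = 0.5)    # site 2, correlated with site 1

# Parametric: t-based p-value for the Pearson correlation
p_parametric <- cor.test(x, y)$p.value

# Empirical: shuffle one site and count permuted correlations at least as extreme
ntries <- 200
r_obs <- cor(x, y)
r_null <- replicate(ntries, cor(x, sample(y)))
p_empirical <- (1 + sum(abs(r_null) >= abs(r_obs))) / (ntries + 1)

c(parametric = p_parametric, empirical = p_empirical)
```

Note that the empirical p-value cannot go below `1 / (ntries + 1)` with this estimator, which is one reason the number of tries matters for small thresholds.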

### `empirical_strategy`
**Type:** Character  
**Options:** `"auto"`, `"montecarlo"`, `"permutations"`  
**Default:** `"auto"`  
**Description:** Strategy for empirical p-value calculation (only applies when `testing_mode = "empirical"`):

- `"auto"`: Uses Monte Carlo for groups <6 samples, permutations for groups ≥6 samples
- `"montecarlo"`: Always uses Monte Carlo simulation
- `"permutations"`: Always uses exact permutations

### `ntries`
**Type:** Integer  
**Default:** `200`  
**Description:** Number of permutations/simulations when `testing_mode = "empirical"`. The number has an upper bound of `factorial(n)` where `n` is the size of the smallest sample group. If `ntries` exceeds this bound, it will be reduced to `factorial(n)`.
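
The upper bound is easy to check by hand. The helper name `cap_ntries()` is hypothetical; it simply applies the `factorial(n)` cap described above.

```{r sketch_ntries_cap}
cap_ntries <- function(ntries, smallest_group_size) {
    min(ntries, factorial(smallest_group_size))
}
cap_ntries(200, 4)   # 4! = 24 is below 200, so the cap applies
cap_ntries(200, 6)   # 6! = 720 exceeds 200, so ntries is unchanged
```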

**Recommendation:**

- 100-500: Faster, less precise
- 1,000-10,000: Slower, more precise

### `mid_p`
**Type:** Logical  
**Default:** `FALSE`  
**Description:** Whether to use mid-p correction in empirical p-value calculation.
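
As a reminder of what a mid-p correction does, the sketch below uses the textbook definition, in which null statistics tied with the observed value count half instead of fully. The function and values are illustrative, and the exact estimator used by CMEnt is not shown here.

```{r sketch_mid_p}
empirical_pval <- function(obs, null, mid_p = FALSE) {
    greater <- sum(null > obs)
    ties <- sum(null == obs)
    if (mid_p) {
        (greater + 0.5 * ties) / length(null)   # ties count half
    } else {
        (greater + ties) / length(null)         # ties count fully
    }
}

null_stats <- c(0.1, 0.3, 0.5, 0.5, 0.9)   # made-up null statistics
empirical_pval(0.5, null_stats)
empirical_pval(0.5, null_stats, mid_p = TRUE)
```

Ties are common when only a few distinct permutations exist, which is why the correction is mostly relevant for small sample groups.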

### `aggfun`
**Type:** Character or function  
**Options:** `"median"`, `"mean"`, or a custom function  
**Default:** `"median"`  
**Description:** Aggregation function for calculating DMR-level statistics (delta beta, p-values).

**Recommendation:** Median is more robust to outliers; mean may be more sensitive.
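
The sensitivity to outliers can be seen on a toy vector. The custom aggregator below is a hypothetical example and assumes that `aggfun` accepts any function that maps a numeric vector to a single number.

```{r sketch_aggfun}
site_delta_beta <- c(0.10, 0.12, 0.11, 0.90)  # made-up values; 0.90 is an outlier
median(site_delta_beta)
mean(site_delta_beta)

# Hypothetical custom aggregator: a 25% trimmed mean
trimmed_mean <- function(x) mean(x, trim = 0.25, na.rm = TRUE)
trimmed_mean(site_delta_beta)
```

Under that assumption, it would be supplied as `aggfun = trimmed_mean`.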

## Performance Parameters

### `njobs`
**Type:** Integer  
**Default:** `getOption("CMEnt.njobs", .defaultNJobs())`  
**Description:** Number of parallel jobs to use for computation.

**Recommendation:**

- Use `-1` to automatically use all available cores minus 1
- Limit to avoid overwhelming system resources
- Consider memory requirements when increasing parallelization

### `verbose`
**Type:** Integer  
**Range:** 0 to 5  
**Default:** `getOption("CMEnt.verbose", 1)`  
**Description:** Level of verbosity for logging messages:

- `0`: No messages
- `1`: Essential messages only
- `2`: Standard progress information
- `3`: Detailed progress information
- `4-5`: Very detailed debugging information

## Input/Output Parameters

### `seeds_id_col`
**Type:** Character or integer  
**Default:** `NULL`  
**Description:** Column name or index for seed identifiers in the seeds file. If `NULL`, uses row names if present, otherwise the first column.

### `output_prefix`
**Type:** Character  
**Default:** `NULL`  
**Description:** Prefix for output files. If provided, results will be saved to files with this prefix. If `NULL`, no files are saved.

### `beta_row_names_file`
**Type:** Character  
**Default:** `NULL`  
**Description:** Path to a file containing row names for beta values. Useful for large beta files where reading row names separately is more efficient.

## BED File Parameters

### `bed_provided`
**Type:** Logical  
**Default:** `FALSE`  
**Description:** Whether the beta file is provided as a BED file. Automatically set to `TRUE` if the input file has a `.bed` extension.


### `bed_chrom_col`
**Type:** Character  
**Default:** `"chrom"`  
**Description:** Column name for chromosome in BED files.


### `bed_start_col`
**Type:** Character  
**Default:** `"start"`  
**Description:** Column name for start position in BED files.
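
A BED-style input using the default column names might be laid out as in the sketch below. The coordinates, sample names, and beta values are made up, and writing a header row is an assumption based on the columns being referenced by name.

```{r sketch_bed_layout}
toy_bed <- data.frame(
    chrom = c("chr5", "chr5", "chr11"),
    start = c(1000L, 1450L, 20300L),
    S1 = c(0.82, 0.79, 0.15),     # made-up beta values per sample
    S2 = c(0.31, 0.28, 0.64)
)
bed_file <- tempfile(fileext = ".bed")
write.table(toy_bed, bed_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Seed IDs for BED input use genomic coordinates in chr:start form
paste0(toy_bed$chrom, ":", toy_bed$start)
```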

## Annotation Parameters

### `annotate_with_genes`
**Type:** Logical  
**Default:** `TRUE`  
**Description:** Whether to annotate DMRs with overlapping genes.

### `.score_dmrs`
**Type:** Logical  
**Default:** `TRUE`  
**Description:** Whether to add complementary SVM-based discrimination scores to DMRs. When enabled, each DMR is evaluated for its ability to separate sample groups using stratified k-fold cross-validation with an RBF kernel SVM. The resulting `score` and `cv_accuracy` values summarize sample-level discriminative strength and should be read alongside DMR `pval`, `qval`, and effect-size columns, not as replacements for them.

**Details:**

- Uses stratified k-fold cross-validation (default: 5-fold)
- Number of folds can be controlled with `options(CMEnt.scoring_nfold = 5)`
- Reproducible fold assignments can be obtained with `set.seed(...)` before calling `scoreDMRs()`
- Higher `score` and `cv_accuracy` values indicate stronger discriminative power
- Requires the `e1071` package for SVM classification
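
The idea behind stratified folds and the role of `set.seed()` can be sketched without any SVM. The helper `stratified_folds()` is hypothetical and only illustrates how samples of each group can be spread evenly over the folds; it is not CMEnt's scoring code.

```{r sketch_stratified_folds}
stratified_folds <- function(groups, k = 5) {
    folds <- integer(length(groups))
    for (g in unique(groups)) {
        idx <- which(groups == g)
        # Spread each group's samples as evenly as possible over the k folds
        folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
    }
    folds
}

set.seed(123)   # makes the fold assignment reproducible
toy_labels <- rep(c("case", "control"), each = 10)
table(toy_labels, stratified_folds(toy_labels, k = 5))
```

Every fold ends up with the same number of cases and controls, so each held-out fold reflects the overall group balance.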

## Advanced Parameters

### `.load_debug`
**Type:** Logical  
**Default:** `FALSE`  
**Description:** Enable debug mode for loading intermediate files, through short-circuiting. For internal development use only.

# Global Package Options

CMEnt uses several global options that can be set using the `options()` function. These persist across function calls in your R session.
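
Because these settings persist for the whole session, it can be useful to change one temporarily and restore it afterwards. The sketch below relies only on base R behavior of `options()` and `getOption()`; the option name `CMEnt.some_unset_option` is a placeholder.

```{r sketch_options_restore}
# options() returns the previous values, which can be used to restore them
old_opts <- options(CMEnt.verbose = 3)
getOption("CMEnt.verbose")
options(old_opts)

# getOption() accepts a fallback for options that have never been set
getOption("CMEnt.some_unset_option", default = "fallback")
```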

## Parallelism

### Option: `CMEnt.njobs`
**Type:** Integer  
**Default:** `min(8, parallel::detectCores(logical = TRUE) - 1)`  
**Description:** Number of parallel jobs (defaults to the minimum of 8 and one less than the number of available CPU cores).

```{r njobs_option}
options("CMEnt.njobs" = 4)
```
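
The documented default can be reproduced by hand to see what it yields on a given machine. The `NA` fallback and the floor of one job are additions made here for safety (`parallel::detectCores()` may return `NA` on some platforms) and are not claimed to match `.defaultNJobs()`.

```{r sketch_njobs_default}
cores <- parallel::detectCores(logical = TRUE)
if (is.na(cores)) cores <- 2L          # fallback when the core count is unknown
njobs <- max(1L, min(8L, cores - 1L))  # documented default, floored at 1 job
njobs
```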

## Verbosity

### Option: `CMEnt.verbose`
**Type:** Integer  
**Default:** `1`  
**Description:** Default verbosity level.

```{r verbose_option}
options("CMEnt.verbose" = 2)
```

## Memory Management

### Option: `CMEnt.beta_in_mem_threshold_mb`
**Type:** Integer  
**Default:** `500`  
**Description:** Maximum size (in Megabytes) of beta files to load into memory. Files larger than this will be processed using disk-based methods.

```{r beta_in_memory_option}
options("CMEnt.beta_in_mem_threshold_mb" = 200)
```
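
To anticipate whether a file would fall under the threshold, its size can be compared against the option value. This is a sketch that assumes 1 MB equals 1024^2 bytes; the temporary file stands in for a real beta file, and CMEnt's own size check may use a different convention.

```{r sketch_mem_threshold}
threshold_mb <- getOption("CMEnt.beta_in_mem_threshold_mb", 500)

beta_path <- tempfile(fileext = ".tsv")          # stand-in for a real beta file
writeLines("id\tS1\tS2\ncg1\t0.1\t0.9", beta_path)

size_mb <- file.size(beta_path) / 1024^2         # assumption: 1 MB = 1024^2 bytes
size_mb < threshold_mb                           # TRUE suggests in-memory loading
```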

## Caching

### Option: `CMEnt.use_annotation_cache`
**Type:** Logical  
**Default:** `TRUE`  
**Description:** Enable caching of gene annotations.

```{r annotation_cache_option}
options("CMEnt.use_annotation_cache" = TRUE)
```

### Option: `CMEnt.annotation_cache_dir`
**Type:** Character  
**Default:** `USER_CACHE_DIR/R/CMEnt/annotation_cache`  
**Description:** Directory for annotation cache.

```{r annotation_cache_dir_option}
options("CMEnt.annotation_cache_dir" = "/path/to/cache")
```

### Option: `CMEnt.jaspar_cache_dir`
**Type:** Character  
**Default:** `USER_CACHE_DIR/R/CMEnt/jaspar_cache`  
**Description:** Directory for JASPAR motif database cache.

```{r jaspar_cache_dir_option}
options("CMEnt.jaspar_cache_dir" = "/path/to/cache")
```

## Motif Analysis

### Option: `CMEnt.jaspar_version`
**Type:** Integer  
**Default:** `2024`  
**Description:** JASPAR database version to use for motif analysis.

```{r jaspar_version_option}
options("CMEnt.jaspar_version" = 2024)
```

### Option: `CMEnt.jaspar_tax_group`
**Type:** Character  
**Default:** `"vertebrates"`  
**Description:** Taxonomic group for JASPAR motif filtering.

```{r jaspar_tax_group_option}
options("CMEnt.jaspar_tax_group" = "vertebrates")
```

### Option: `CMEnt.min_motif_similarity`
**Type:** Numeric  
**Default:** `0.8`  
**Description:** Minimum motif similarity threshold for DMR interaction analysis.

```{r min_motif_similarity_option}
options("CMEnt.min_motif_similarity" = 0.75)
```

### Option: `CMEnt.jaspar_corr_threshold`
**Type:** Numeric  
**Default:** `0.9`  
**Description:** Correlation threshold for JASPAR motif similarity.

```{r jaspar_corr_threshold_option}
options("CMEnt.jaspar_corr_threshold" = 0.85)
```

## Debugging

### Option: `CMEnt.make_debug_dir`
**Type:** Logical  
**Default:** `FALSE`  
**Description:** Create debug directory for troubleshooting.

```{r make_debug_dir_option}
options("CMEnt.make_debug_dir" = TRUE)
```

## DMR scoring

### Option: `CMEnt.scoring_nfold`
**Type:** Integer  
**Default:** `5`  
**Description:** Number of folds for cross-validation when scoring DMRs.

```{r scoring_nfold_option}
options("CMEnt.scoring_nfold" = 3)
```

# Configuration Examples

## Example 1: High-Confidence DMRs with Strict Filtering

```{r strict_filtering_example, eval=FALSE}
dmrs <- buildDMRs(
    beta = beta,
    seeds = seeds,
    pheno = pheno,
    sample_group_col = "Sample_Group",
    array = "EPIC",
    genome = "hg38",
    ext_site_delta_beta = 0.2,
    min_seeds = 3,
    min_sites = 5,
    max_lookup_dist = 5000,
    max_pval = 0.01,
    njobs = 4
)
```

## Example 2: Broad Region Detection with Relaxed Parameters

```{r broad_regions_example, eval=FALSE}
dmrs <- buildDMRs(
    beta = beta,
    seeds = seeds,
    pheno = pheno,
    sample_group_col = "Sample_Group",
    min_seeds = 2,
    min_sites = 3,
    max_lookup_dist = 20000,
    max_pval = 0.05,
    njobs = 8
)
```

## Example 3: Empirical P-values for Small Sample Sizes

```{r empirical_pvalues_example, eval=FALSE}
dmrs <- buildDMRs(
    beta = beta,
    seeds = seeds,
    pheno = pheno,
    sample_group_col = "Sample_Group",
    testing_mode = "empirical",
    empirical_strategy = "montecarlo",
    ntries = 5000,
    mid_p = TRUE,
    njobs = 4
)
```

# Best Practices

1. **Start with default parameters** and adjust based on your specific needs.

2. **For array data** (450K, EPIC), use lower `min_sites` values (3-5) since site coverage is sparse.

3. **For WGBS data**, keep `min_sites` higher (50+) to ensure robust regions.

4. **Avoid heavy pre-filtering** of seeds based on effect size. Let CMEnt handle filtering internally.

5. **Use empirical p-values** for small sample sizes (<10 per group) or when normality assumptions are questionable.

6. **Use parallel processing** (`njobs > 1`) for faster computation, but be mindful of memory requirements.

7. **Save intermediate results** using `output_prefix` for large analyses.

8. **Document your configuration** by saving parameter settings for reproducibility.
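
One lightweight way to document a configuration is to keep the non-data arguments in a list and save it next to the results. The parameter values below are illustrative, and the `do.call()` line is shown as a comment because it needs the package and the input objects.

```{r sketch_save_config}
# Collect the non-data arguments in one list (values are illustrative)
dmr_params <- list(
    sample_group_col = "Sample_Group",
    min_seeds = 3,
    min_sites = 5,
    max_lookup_dist = 5000,
    max_pval = 0.01
)
config_file <- tempfile(fileext = ".rds")
saveRDS(dmr_params, config_file)

# Later: reload the settings and confirm they round-trip unchanged
identical(readRDS(config_file), dmr_params)
# They could then be reused via:
# do.call(buildDMRs, c(list(beta = beta, seeds = seeds, pheno = pheno), dmr_params))
```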

# Troubleshooting

## Issue: Out of Memory Errors

**Solution:**

- Decrease `njobs`
- Lower the `CMEnt.beta_in_mem_threshold_mb` option (default 500) so that large beta files are processed with disk-based methods
- Use tabix-indexed files for very large datasets
- Enable caching options

## Issue: DMRs Too Small

**Solution:**

- Increase `max_lookup_dist`. This will allow seeds that are farther apart to be connected, leading to larger DMRs.
- Increase `max_pval`. This will make connectivity testing less stringent, allowing more sites to be connected and thus larger DMRs.
- Decrease `ext_site_delta_beta`. This will allow more sites to be included in DMRs during the second stage of extension, leading to larger DMRs.


## Issue: Too Many DMRs

**Solution:**

- Increase `min_seeds`. This will require more seeds to be connected to form a DMR, leading to fewer total DMRs.
- Increase `min_sites`. This will require more sites to be included in a DMR, leading to fewer total DMRs.
- Decrease `max_pval`. This will make connectivity testing more stringent, leading to fewer connected sites and thus fewer DMRs.
- Increase `max_lookup_dist`. This will join more seeds into the same DMRs, reducing the total number of DMRs.
- Decrease `ext_site_delta_beta`. This will allow more sites to be included in DMRs during the second stage of extension, leading to more merging of nearby DMRs and thus fewer total DMRs.

## Issue: Slow Performance

**Solution:**

- Increase `njobs` for parallel processing
- CMEnt derives connectivity chunk sizes from available RAM automatically.
- Use `testing_mode = "parametric"` instead of `"empirical"`
- Enable caching options
- Consider using tabix-indexed files

# Session Info

```{r sessionInfo}
sessionInfo()
```