CMEnt Configuration

Overview

This vignette provides a comprehensive guide to the configuration options available in CMEnt. Understanding these parameters will help you optimize the package for your specific analysis needs.

Function Parameters

Core Input Parameters

beta

Type: Character, matrix, BetaHandler object, or BED file
Required: Yes
Description: Input methylation data. Can be:

  • Path to a beta value file (tab-separated). The file is loaded into memory if its size is below getOption("CMEnt.beta_in_mem_threshold_mb") megabytes; otherwise it is genomically sorted and converted to tabix for faster access, provided samtools tabix is installed.
  • Path to a tabix-indexed file (.bed.gz with .tbi index)
  • A beta matrix with site IDs as rownames and sample IDs as column names.
  • A BetaHandler object (see ?BetaHandler)
  • A BED file with the columns named by bed_chrom_col and bed_start_col, followed by one column per sample. The sample column names must exist as row names in the provided pheno data.

Example:

loadExampleInputDataChr5And11("beta")

seeds

Type: Character or data.frame
Required: Yes
Description: Sites to use as seeds for DMR detection. Can be:

  • Path to a file with line-separated site IDs
  • A data.frame with DMP information

Format requirements:

  • Row names, the first column, or the column given by seeds_id_col should contain site IDs
  • The site IDs must match those in the beta data, either as Illumina IDs or genomic coordinates (chr:start). The latter is required when using BED files for beta.

Example:

loadExampleInputDataChr5And11("dmps")
seeds <- dmps
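
The two accepted seed forms can be sketched with toy data. The site IDs below are made up, and the pval and delta_beta columns are illustrative assumptions rather than columns CMEnt is known to require.

```r
# Made-up site IDs; real IDs must match those in the beta data.
site_ids <- c("cg_toy_0001", "cg_toy_0002", "cg_toy_0003")

# Form 1: a plain-text file with one site ID per line.
seeds_file <- tempfile(fileext = ".txt")
writeLines(site_ids, seeds_file)

# Form 2: a data.frame whose row names carry the site IDs.
# The pval and delta_beta columns are illustrative assumptions.
seeds_df <- data.frame(
    pval = c(1e-6, 3e-5, 2e-4),
    delta_beta = c(0.31, -0.27, 0.22),
    row.names = site_ids
)

# Both forms describe the same set of seeds.
stopifnot(identical(readLines(seeds_file), rownames(seeds_df)))
```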

pheno

Type: Data.frame
Required: Yes
Description: Sample phenotype information.

Format requirements:

  • Row names should match column names in beta data
  • Must contain sample group information column (specified by sample_group_col)
  • May contain case/control status column (specified by casecontrol_col)

Example:

loadExampleInputDataChr5And11("pheno")
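
As a minimal sketch of the format requirements, the following builds a toy beta matrix and a matching phenotype table and checks that they line up. The object names beta_toy and pheno_toy and all values are hypothetical.

```r
set.seed(42)
samples <- paste0("S", 1:6)
sites <- paste0("site", 1:4)

# Beta matrix: sites in rows, samples in columns, values in [0, 1].
beta_toy <- matrix(
    runif(length(sites) * length(samples)),
    nrow = length(sites),
    dimnames = list(sites, samples)
)

# Phenotype table: row names are sample IDs, plus a group column.
pheno_toy <- data.frame(
    Sample_Group = rep(c("control", "case"), each = 3),
    row.names = samples
)

# Every beta column needs a matching pheno row.
missing_samples <- setdiff(colnames(beta_toy), rownames(pheno_toy))
stopifnot(length(missing_samples) == 0)
```

Running a check like this before calling buildDMRs() can surface sample ID mismatches early.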

Sample Grouping Parameters

sample_group_col

Type: Character
Default: "Sample_Group"
Description: Column name in pheno that specifies sample groups (e.g., “case” vs “control”, “treated” vs “untreated”). More than two groups are supported.

casecontrol_col

Type: Character
Default: NULL
Description: Column name in pheno for case (TRUE/1) vs control (FALSE/0) status, used for delta beta computations. If NULL, controls are assumed to be the first level found in sample_group_col.
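
With more than two groups, an explicit case/control column removes any ambiguity about which group acts as control. The sketch below uses a hypothetical is_case column, and it assumes that "first level" refers to default factor ordering, which may not be how CMEnt resolves it.

```r
pheno_toy <- data.frame(
    Sample_Group = c("healthy", "healthy", "tumor", "tumor", "relapse"),
    row.names = paste0("S", 1:5)
)

# Explicit case/control flag: everything that is not "healthy" is a case.
pheno_toy$is_case <- pheno_toy$Sample_Group != "healthy"

# Without such a column, the first group level would act as control.
# Assuming default (alphabetical) factor ordering, that level is:
first_level <- levels(factor(pheno_toy$Sample_Group))[1]
```

The new column would then be passed as casecontrol_col = "is_case".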

ignored_sample_groups

Type: Character vector
Default: NULL
Description: Sample groups to exclude when evaluating seed connection and region expansion. Can also be “case” or “control”.

Example:

dmrs <- buildDMRs(
    beta = beta,
    seeds = seeds,
    pheno = pheno,
    ignored_sample_groups = c("excluded_group1", "excluded_group2")
)

Array and Genome Parameters

array

Type: Character
Options: "450K", "27K", "EPIC", "EPICv2"
Default: "450K"
Description: Type of methylation array platform used. Ignored when using mouse genomes or when beta is provided as a BED file.

genome

Type: Character
Options: "hg19", "hg38", "hs1", "mm10", "mm39"
Description: Reference genome version. If not specified, it will be inferred from the input data; if input is 450K or EPIC array data, the genome will be set to “hg19” by default, otherwise to “hg38”.

Filtering Parameters

ext_site_delta_beta

Type: Numeric
Default: 0.2
Range: 0 to 1, or NA to disable
Description: Minimum absolute delta beta for a neighboring site to be included in a DMR during the second stage of DMR extension, regardless of correlation.

Recommendation: Keep 0.2 for balanced precision/recall. Use NA to disable the shortcut entirely. Set to 0 only when you intentionally want any proximal site with a non-missing case/control delta beta to be eligible for force-connection.
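
To make the threshold concrete, here is a small sketch of a per-site delta beta check. It assumes delta beta is the difference of group means and that the comparison is inclusive (>=); CMEnt's internal aggregation and comparison may differ.

```r
# Toy beta values: 3 sites x 4 samples (2 controls, then 2 cases).
beta_toy <- matrix(
    c(0.10, 0.12, 0.45, 0.50,
      0.80, 0.78, 0.75, 0.79,
      0.30, 0.35, 0.10, 0.05),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("site", 1:3), paste0("S", 1:4))
)
is_case <- c(FALSE, FALSE, TRUE, TRUE)

# Delta beta per site: mean of cases minus mean of controls (assumption).
delta_beta <- rowMeans(beta_toy[, is_case]) - rowMeans(beta_toy[, !is_case])

# Sites whose absolute delta beta reaches the threshold are eligible.
ext_site_delta_beta <- 0.2
eligible <- !is.na(delta_beta) & abs(delta_beta) >= ext_site_delta_beta
eligible   # site1 and site3 pass, site2 does not
```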

min_seeds

Type: Integer
Default: 1
Description: Minimum number of connected seeds required in a DMR.

Recommendation: Increase this value (e.g., to 3 or 4) for higher-confidence DMRs.

min_adj_seeds

Type: Integer
Default: 2
Description: Minimum number of seeds in a DMR after adjusting for site density. Minimum 2. This accounts for regions with different site coverage, and it is mostly applicable to array-based data. Only used when min_seeds < min_adj_seeds.

min_sites

Type: Integer
Default: 50
Description: Minimum number of sites required in a DMR after region expansion. Minimum 2.

Recommendation: Lower this value (e.g., 3-5) for array-based data, keep higher (50+) for WGBS data.

Region Building Parameters

max_lookup_dist

Type: Integer
Default: 10000
Unit: Base pairs
Description: Maximum genomic distance between adjacent seeds to be considered part of the same DMR.

Recommendation:

  • 1,000-5,000 bp for tightly connected regions
  • 10,000 bp (default) for moderate spacing
  • 20,000+ bp for broader regions
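
The effect of this distance can be illustrated by grouping sorted seed positions purely by their gaps. This is a simplified sketch: it ignores the correlation testing that CMEnt also applies, and the positions are toy values.

```r
# Sorted seed positions on one chromosome (toy values, in bp).
seed_pos <- c(1000, 4000, 9000, 25000, 27000, 60000)
max_lookup_dist <- 10000

# A gap larger than max_lookup_dist starts a new candidate cluster.
gaps <- diff(seed_pos)
cluster_id <- cumsum(c(TRUE, gaps > max_lookup_dist))

# Three candidate clusters: {1000, 4000, 9000}, {25000, 27000}, {60000}.
split(seed_pos, cluster_id)
```

Raising max_lookup_dist to 20000 in this toy case would merge the first two clusters.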

expansion_window

Type: Numeric
Default: 1e6
Description: Stage 2 connectivity is computed only around seed-derived Stage 1 neighborhoods, using this total window width in base pairs.

Set to <= 0 to compute connectivity genome-wide.

max_bridge_seeds_gaps

Type: Integer
Default: 1
Description: In Stage 1 seed connectivity, allows bridging up to this many consecutive p-value-driven failed edges when both flanking edges are connected.

max_bridge_extension_gaps

Type: Integer
Default: 1
Description: In Stage 2 DMR extension, allows bridging up to this many consecutive p-value-driven failed edges when both flanking edges are connected.
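
The bridging idea shared by both parameters can be sketched on a logical vector of edge results. The helper bridge_gaps() is hypothetical, and it treats every failed edge as p-value-driven, which is a simplification of what the parameters describe.

```r
# Hypothetical helper: connect short interior runs of failed edges.
bridge_gaps <- function(connected, max_gap = 1L) {
    runs <- rle(connected)
    n <- length(runs$values)
    for (i in seq_len(n)) {
        # An interior run of failures is always flanked by connected runs.
        is_interior <- i > 1 && i < n
        if (!runs$values[i] && is_interior && runs$lengths[i] <= max_gap) {
            runs$values[i] <- TRUE
        }
    }
    inverse.rle(runs)
}

# TRUE = edge passed the connectivity test, FALSE = edge failed.
edges <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
bridge_gaps(edges, max_gap = 1L)
# The single failure at position 2 is bridged; the run of two failures
# and the trailing failure (no connected edge after it) are kept.
```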

Statistical Parameters

max_pval

Type: Numeric
Default: 0.05
Range: 0 to 1
Description: Maximum p-value threshold for considering correlation as significant, both between seeds during the first stage of connectivity testing and between proximal sites during the second stage of DMR extension. Under strong entanglement, a Bonferroni correction is applied based on the number of sample groups (number of tests per site).
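
As a worked example of the correction, the sketch below assumes the classic Bonferroni form, in which the threshold is divided by the number of sample groups. The group names and p-values are toy values.

```r
max_pval <- 0.05
n_groups <- 3   # number of sample groups, i.e. tests per site pair

# Classic Bonferroni: divide the threshold by the number of tests.
adjusted_threshold <- max_pval / n_groups   # about 0.0167

# Toy per-group correlation p-values for one pair of sites.
pvals <- c(groupA = 0.004, groupB = 0.012, groupC = 0.030)

# groupC passes the raw 0.05 threshold but not the adjusted one.
pvals <= adjusted_threshold
```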

entanglement

Type: Character
Options: "strong", "weak"
Default: "strong"
Description: Strategy for determining connectivity between sites across sample groups:

  • "strong": Requires all sample groups to show significant correlation for two sites to be considered connected. This is more conservative and ensures consistent methylation patterns across all groups.
  • "weak": Requires at least one sample group to show significant correlation. This is more permissive and may identify DMRs that are specific to certain groups.

Recommendation: Use "strong" (default) for most cases to ensure robust, reproducible DMRs. Use "weak" when you want to capture group-specific methylation patterns or when working with heterogeneous sample groups.

Example:

dmrs <- buildDMRs(
    beta = beta,
    seeds = seeds,
    pheno = pheno,
    entanglement = "weak"
)
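
The difference between the two strategies can be sketched with per-group correlation tests on toy data. This illustrates the decision rule only, using stats::cor.test() and an assumed Bonferroni adjustment for the strong case; it is not CMEnt's implementation.

```r
n <- 12

# Group A: two sites that move together closely.
a_site1 <- seq(0.2, 0.8, length.out = n)
a_site2 <- a_site1 + 0.02 * rep(c(-1, 1), n / 2)

# Group B: the second site is a symmetric function of the first,
# so their linear correlation is essentially zero.
b_site1 <- seq_len(n) / (n + 1)
b_site2 <- (b_site1 - mean(b_site1))^2

pvals <- c(
    A = cor.test(a_site1, a_site2)$p.value,
    B = cor.test(b_site1, b_site2)$p.value
)

max_pval <- 0.05
# Strong: every group must be significant (Bonferroni-adjusted).
connected_strong <- all(pvals <= max_pval / length(pvals))
# Weak: one significant group is enough.
connected_weak <- any(pvals <= max_pval)

c(strong = connected_strong, weak = connected_weak)   # FALSE, TRUE
```

Here only group A supports the connection, so the pair is linked under "weak" but not under "strong".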

testing_mode

Type: Character
Options: "parametric", "empirical", "auto"
Default: "auto"
Description: Method for calculating p-values during connectivity testing:

  • "parametric": Uses t-based correlation p-values (faster, assumes normal distribution)
  • "empirical": Uses permutation-based p-values (slower, no distribution assumptions)
  • "auto": Evaluates correlation test assumptions per sample group and chooses "parametric" only when diagnostics are acceptable; otherwise switches to "empirical"

Recommendation: Use "auto" when you want robust defaults across heterogeneous datasets. Use "parametric" when assumptions are known to hold and runtime is critical, and "empirical" when assumptions are clearly questionable.

empirical_strategy

Type: Character
Options: "auto", "montecarlo", "permutations"
Default: "auto"
Description: Strategy for empirical p-value calculation (only applies when testing_mode = "empirical"):

  • "auto": Uses Monte Carlo for groups <6 samples, permutations for groups ≥6 samples
  • "montecarlo": Always uses Monte Carlo simulation
  • "permutations": Always uses exact permutations

ntries

Type: Integer
Default: 200
Description: Number of permutations/simulations when testing_mode = "empirical". The number has an upper bound of factorial(n) where n is the size of the smallest sample group. If ntries exceeds this bound, it will be reduced to factorial(n).

Recommendation:

  • 100-500: Faster, less precise
  • 1,000-10,000: Slower, more precise
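
The "auto" strategy rule and the factorial(n) bound described above can be written out as two small helpers. Both function names are hypothetical and only mirror the rules as documented here.

```r
# Rule documented for empirical_strategy = "auto".
choose_strategy <- function(group_size) {
    if (group_size < 6) "montecarlo" else "permutations"
}

# factorial(n) counts the distinct orderings of the smallest group,
# so ntries is capped at that value.
cap_ntries <- function(ntries, group_sizes) {
    min(ntries, factorial(min(group_sizes)))
}

group_sizes <- c(case = 5, control = 8)
choose_strategy(min(group_sizes))   # "montecarlo"
cap_ntries(200, group_sizes)        # factorial(5) = 120, so capped at 120
cap_ntries(200, c(7, 9))            # factorial(7) = 5040, so stays at 200
```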

mid_p

Type: Logical
Default: FALSE
Description: Whether to use mid-p correction in empirical p-value calculation.
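
For orientation, a common textbook form of the mid-p correction counts ties with the observed statistic at half weight. The helper below is a hypothetical two-sided sketch of that form; CMEnt's exact formula (for example, any +1 smoothing) may differ.

```r
# Hypothetical helper: two-sided empirical p-value from null statistics.
empirical_pval <- function(observed, null_stats, mid_p = FALSE) {
    greater <- sum(abs(null_stats) > abs(observed))
    equal <- sum(abs(null_stats) == abs(observed))
    if (mid_p) {
        # Mid-p: ties with the observed statistic count half.
        (greater + 0.5 * equal) / length(null_stats)
    } else {
        # Conventional: ties count fully.
        (greater + equal) / length(null_stats)
    }
}

null_stats <- c(0.1, -0.2, 0.3, 0.3, -0.5, 0.05, 0.3, -0.1)
empirical_pval(0.3, null_stats)                 # (1 + 3) / 8 = 0.5
empirical_pval(0.3, null_stats, mid_p = TRUE)   # (1 + 1.5) / 8 = 0.3125
```

The correction matters most when the number of tries is small and ties are frequent.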

aggfun

Type: Character or function
Options: "median", "mean", or a custom function
Default: "median"
Description: Aggregation function for calculating DMR-level statistics (delta beta, p-values).

Recommendation: Median is more robust to outliers; mean may be more sensitive.
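
A small numeric example shows why the choice matters when one site in a region is an outlier. The wrapper aggregate_dmr() is hypothetical, and the custom function signature shown is an assumption about what aggfun accepts.

```r
# Per-site delta beta values within one DMR; the last site is an outlier.
site_delta_beta <- c(0.21, 0.25, 0.23, 0.22, 0.80)

# Hypothetical wrapper: match.fun() accepts a function name or a function.
aggregate_dmr <- function(x, aggfun = "median") {
    match.fun(aggfun)(x, na.rm = TRUE)
}

aggregate_dmr(site_delta_beta, "median")   # 0.23
aggregate_dmr(site_delta_beta, "mean")     # 0.342, pulled up by the outlier

# A custom function, here a 20% trimmed mean.
trimmed <- function(x, ...) mean(x, trim = 0.2, ...)
aggregate_dmr(site_delta_beta, trimmed)
```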

Performance Parameters

njobs

Type: Integer
Default: getOption("CMEnt.njobs", .defaultNJobs())
Description: Number of parallel jobs to use for computation.

Recommendation:

  • Use -1 to automatically use all available cores minus 1
  • Limit to avoid overwhelming system resources
  • Consider memory requirements when increasing parallelization

verbose

Type: Integer
Range: 0 to 5
Default: getOption("CMEnt.verbose", 1)
Description: Level of verbosity for logging messages:

  • 0: No messages
  • 1: Essential messages only
  • 2: Standard progress information
  • 3: Detailed progress information
  • 4-5: Very detailed debugging information

Input/Output Parameters

seeds_id_col

Type: Character or integer
Default: NULL
Description: Column name or index for seed identifiers in the seeds file. If NULL, uses row names if present, otherwise the first column.

output_prefix

Type: Character
Default: NULL
Description: Prefix for output files. If provided, results will be saved to files with this prefix. If NULL, no files are saved.

beta_row_names_file

Type: Character
Default: NULL
Description: Path to a file containing row names for beta values. Useful for large beta files where reading row names separately is more efficient.

BED File Parameters

bed_provided

Type: Logical
Default: FALSE
Description: Whether the beta file is provided as a BED file. Automatically set to TRUE if the input file has a .bed extension.

bed_chrom_col

Type: Character
Default: "chrom"
Description: Column name for chromosome in BED files.

bed_start_col

Type: Character
Default: "start"
Description: Column name for start position in BED files.
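
The following sketch shows how a BED-like table relates to the chr:start site IDs that seeds must use with BED input. The table contents are toy values, and the conversion to a matrix is only for illustration, not a step CMEnt requires.

```r
# Toy BED-like table: coordinate columns first, then one column per sample.
bed_toy <- data.frame(
    chrom = c("chr5", "chr5", "chr11"),
    start = c(1295000L, 1295250L, 2019000L),
    S1 = c(0.10, 0.15, 0.80),
    S2 = c(0.55, 0.60, 0.78)
)
bed_chrom_col <- "chrom"
bed_start_col <- "start"

# Site IDs in chr:start form, matching what seeds must use with BED input.
site_ids <- paste0(bed_toy[[bed_chrom_col]], ":", bed_toy[[bed_start_col]])

# The remaining columns hold the per-sample beta values.
sample_cols <- setdiff(names(bed_toy), c(bed_chrom_col, bed_start_col))
beta_toy <- as.matrix(bed_toy[, sample_cols])
rownames(beta_toy) <- site_ids
```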

Annotation Parameters

annotate_with_genes

Type: Logical
Default: TRUE
Description: Whether to annotate DMRs with overlapping genes.

.score_dmrs

Type: Logical
Default: TRUE
Description: Whether to add complementary SVM-based discrimination scores to DMRs. When enabled, each DMR is evaluated for its ability to separate sample groups using stratified k-fold cross-validation with an RBF kernel SVM. The resulting score and cv_accuracy values summarize sample-level discriminative strength and should be read alongside DMR pval, qval, and effect-size columns, not as replacements for them.

Details:

  • Uses stratified k-fold cross-validation (default: 5-fold)
  • Number of folds can be controlled with options(CMEnt.scoring_nfold = 5)
  • Reproducible fold assignments can be obtained with set.seed(...) before calling scoreDMRs()
  • Higher score and cv_accuracy values indicate stronger discriminative power
  • Requires the e1071 package for SVM classification
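
To show what stratified folds and seed-based reproducibility mean in practice, here is a base R sketch of fold assignment. The helper stratified_folds() is hypothetical and leaves out the SVM step; CMEnt's own fold logic may differ.

```r
# Hypothetical helper: assign samples to k folds, balanced within groups.
stratified_folds <- function(groups, k = 5L) {
    folds <- integer(length(groups))
    for (g in unique(groups)) {
        idx <- which(groups == g)
        # Shuffle the group's samples, then deal them out across folds.
        idx <- idx[sample.int(length(idx))]
        folds[idx] <- rep_len(seq_len(k), length(idx))
    }
    folds
}

groups <- rep(c("case", "control"), each = 10)

# Setting the seed first makes the assignment reproducible.
set.seed(123)
folds_a <- stratified_folds(groups, k = 5L)
set.seed(123)
folds_b <- stratified_folds(groups, k = 5L)
identical(folds_a, folds_b)   # TRUE

# Each fold holds two samples from each group.
table(groups, folds_a)
```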

Advanced Parameters

.load_debug

Type: Logical
Default: FALSE
Description: Enable debug mode for loading intermediate files, through short-circuiting. For internal development use only.

Global Package Options

CMEnt uses several global options that can be set using the options() function. These persist across function calls in your R session.

Parallelism

Option: CMEnt.njobs

Type: Integer
Default: min(8, parallel::detectCores(logical = TRUE) - 1)
Description: Number of parallel jobs (defaults to the minimum of 8 and one less than the number of available CPU cores).

options("CMEnt.njobs" = 4)

Verbosity

Option: CMEnt.verbose

Type: Integer
Default: 1
Description: Default verbosity level.

options("CMEnt.verbose" = 2)

Memory Management

Option: CMEnt.beta_in_mem_threshold_mb

Type: Integer
Default: 500
Description: Maximum size (in Megabytes) of beta files to load into memory. Files larger than this will be processed using disk-based methods.

options("CMEnt.beta_in_mem_threshold_mb" = 200)

Caching

Option: CMEnt.use_annotation_cache

Type: Logical
Default: TRUE
Description: Enable caching of gene annotations.

options("CMEnt.use_annotation_cache" = TRUE)

Option: CMEnt.annotation_cache_dir

Type: Character
Default: USER_CACHE_DIR/R/CMEnt/annotation_cache
Description: Directory for annotation cache.

options("CMEnt.annotation_cache_dir" = "/path/to/cache")

Option: CMEnt.jaspar_cache_dir

Type: Character
Default: USER_CACHE_DIR/R/CMEnt/jaspar_cache
Description: Directory for JASPAR motif database cache.

options("CMEnt.jaspar_cache_dir" = "/path/to/cache")

Motif Analysis

Option: CMEnt.jaspar_version

Type: Integer
Default: 2024
Description: JASPAR database version to use for motif analysis.

options("CMEnt.jaspar_version" = 2024)

Option: CMEnt.jaspar_tax_group

Type: Character
Default: "vertebrates"
Description: Taxonomic group for JASPAR motif filtering.

options("CMEnt.jaspar_tax_group" = "vertebrates")

Option: CMEnt.min_motif_similarity

Type: Numeric
Default: 0.8
Description: Minimum motif similarity threshold for DMR interaction analysis.

options("CMEnt.min_motif_similarity" = 0.75)

Option: CMEnt.jaspar_corr_threshold

Type: Numeric
Default: 0.9
Description: Correlation threshold for JASPAR motif similarity.

options("CMEnt.jaspar_corr_threshold" = 0.85)

Debugging

Option: CMEnt.make_debug_dir

Type: Logical
Default: FALSE
Description: Create debug directory for troubleshooting.

options("CMEnt.make_debug_dir" = TRUE)

DMR scoring

Option: CMEnt.scoring_nfold

Type: Integer
Default: 5
Description: Number of folds for cross-validation when scoring DMRs.

options("CMEnt.scoring_nfold" = 3)
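
Several options can be set in one call and restored afterwards, which keeps a session clean when different analyses need different settings. This sketch uses only base R's options() and getOption(); the values are arbitrary.

```r
# Set several CMEnt options at once; options() returns the old values.
old_opts <- options(
    CMEnt.njobs = 2,
    CMEnt.verbose = 0,
    CMEnt.beta_in_mem_threshold_mb = 200
)

# Read an option back, with a fallback in case it is unset.
getOption("CMEnt.njobs", default = 1)

# ... run the analysis here ...

# Restore whatever was set before.
options(old_opts)
```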

Configuration Examples

Example 1: High-Confidence DMRs with Strict Filtering

dmrs <- buildDMRs(
    beta = beta,
    seeds = seeds,
    pheno = pheno,
    sample_group_col = "Sample_Group",
    array = "EPIC",
    genome = "hg38",
    ext_site_delta_beta = 0.2,
    min_seeds = 3,
    min_sites = 5,
    max_lookup_dist = 5000,
    max_pval = 0.01,
    njobs = 4
)

Example 2: Broad Region Detection with Relaxed Parameters

dmrs <- buildDMRs(
    beta = beta,
    seeds = seeds,
    pheno = pheno,
    sample_group_col = "Sample_Group",
    min_seeds = 2,
    min_sites = 3,
    max_lookup_dist = 20000,
    max_pval = 0.05,
    njobs = 8
)

Example 3: Empirical P-values for Small Sample Sizes

dmrs <- buildDMRs(
    beta = beta,
    seeds = seeds,
    pheno = pheno,
    sample_group_col = "Sample_Group",
    testing_mode = "empirical",
    empirical_strategy = "montecarlo",
    ntries = 5000,
    mid_p = TRUE,
    njobs = 4
)

Best Practices

  1. Start with default parameters and adjust based on your specific needs.

  2. For array data (450K, EPIC), use lower min_sites values (3-5) since site coverage is sparse.

  3. For WGBS data, keep min_sites higher (50+) to ensure robust regions.

  4. Avoid heavy pre-filtering of seeds based on effect size. Let CMEnt handle filtering internally.

  5. Use empirical p-values for small sample sizes (<10 per group) or when normality assumptions are questionable.

  6. Use parallel processing (njobs > 1) for faster computation, but be mindful of memory requirements.

  7. Save intermediate results using output_prefix for large analyses.

  8. Document your configuration by saving parameter settings for reproducibility.
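
One possible way to document a configuration is to keep the non-data arguments in a named list and save it next to the results. The sketch below reuses the values from Example 1; the file location is a temporary path chosen for illustration.

```r
# Collect the non-data arguments in a named list (values from Example 1).
params <- list(
    sample_group_col = "Sample_Group",
    array = "EPIC",
    genome = "hg38",
    min_seeds = 3,
    min_sites = 5,
    max_lookup_dist = 5000,
    max_pval = 0.01
)

# Persist the settings alongside the results.
params_file <- tempfile(fileext = ".rds")
saveRDS(params, params_file)

# Later: reload and confirm nothing changed.
restored <- readRDS(params_file)
stopifnot(identical(params, restored))

# The list can then be spliced into the call, for example:
# dmrs <- do.call(
#     buildDMRs,
#     c(list(beta = beta, seeds = seeds, pheno = pheno), restored)
# )
```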

Troubleshooting

Issue: Out of Memory Errors

Solution:

  • Decrease njobs
  • Lower the CMEnt.beta_in_mem_threshold_mb option (default 500) to enable disk-based processing
  • Use tabix-indexed files for very large datasets
  • Enable caching options

Issue: DMRs Too Small

Solution:

  • Increase max_lookup_dist. This will allow seeds that are farther apart to be connected, leading to larger DMRs.
  • Increase max_pval. This will make connectivity testing less stringent, allowing more sites to be connected and thus larger DMRs.
  • Decrease ext_site_delta_beta. This will allow more sites to be included in DMRs during the second stage of extension, leading to larger DMRs.

Issue: Too Many DMRs

Solution:

  • Increase min_seeds. This will require more seeds to be connected to form a DMR, leading to fewer total DMRs.
  • Increase min_sites. This will require more sites to be included in a DMR, leading to fewer total DMRs.
  • Decrease max_pval. This will make connectivity testing more stringent, leading to fewer connected sites and thus fewer DMRs.
  • Increase max_lookup_dist. This will join more seeds into the same DMRs, reducing the total number of DMRs.
  • Decrease ext_site_delta_beta. This will allow more sites to be included in DMRs during the second stage of extension, leading to more merging of nearby DMRs and thus fewer total DMRs.

Issue: Slow Performance

Solution:

  • Increase njobs for parallel processing
  • Use testing_mode = "parametric" instead of "empirical"
  • Enable caching options
  • Consider using tabix-indexed files

Note that CMEnt derives connectivity chunk sizes from available RAM automatically, so no manual chunk size tuning is needed.

Session Info

sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DMRsegaldata_1.1.0   ExperimentHub_3.3.0  AnnotationHub_4.3.0 
##  [4] BiocFileCache_3.3.0  dbplyr_2.5.2         ggplot2_4.0.3       
##  [7] GenomicRanges_1.65.0 Seqinfo_1.3.0        IRanges_2.47.2      
## [10] S4Vectors_0.51.3     BiocGenerics_0.59.7  generics_0.1.4      
## [13] CMEnt_0.99.4         BiocStyle_2.41.0    
## 
## loaded via a namespace (and not attached):
##   [1] BiocIO_1.23.3               bitops_1.0-9               
##   [3] filelock_1.0.3              tibble_3.3.1               
##   [5] R.oo_1.27.1                 XML_3.99-0.23              
##   [7] DirichletMultinomial_1.55.0 lifecycle_1.0.5            
##   [9] httr2_1.2.2                 pwalign_1.9.1              
##  [11] doParallel_1.0.17           lattice_0.22-9             
##  [13] backports_1.5.1             magrittr_2.0.5             
##  [15] limma_3.69.2                sass_0.4.10                
##  [17] rmarkdown_2.31              jquerylib_0.1.4            
##  [19] yaml_2.3.12                 otel_0.2.0                 
##  [21] DBI_1.3.0                   buildtools_1.0.0           
##  [23] RColorBrewer_1.1-3          abind_1.4-8                
##  [25] purrr_1.2.2                 R.utils_2.13.0             
##  [27] RCurl_1.98-1.19             rappdirs_0.3.4             
##  [29] circlize_0.4.18             maketools_1.3.2            
##  [31] seqLogo_1.79.0              testthat_3.3.2             
##  [33] permute_0.9-10              DelayedMatrixStats_1.35.0  
##  [35] codetools_0.2-20            DelayedArray_0.39.3        
##  [37] DT_0.34.0                   tidyselect_1.2.1           
##  [39] shape_1.4.6.1               futile.logger_1.4.9        
##  [41] ggseqlogo_0.2.2             UCSC.utils_1.9.0           
##  [43] farver_2.1.2                matrixStats_1.5.0          
##  [45] showtext_0.9-8              GenomicAlignments_1.49.0   
##  [47] jsonlite_2.0.0              GetoptLong_1.1.1           
##  [49] iterators_1.0.14            foreach_1.5.2              
##  [51] tools_4.6.0                 TFMPvalue_1.0.0            
##  [53] Rcpp_1.1.1-1.1              glue_1.8.1                 
##  [55] gridExtra_2.3               SparseArray_1.13.2         
##  [57] BiocBaseUtils_1.15.1        xfun_0.58                  
##  [59] MatrixGenerics_1.25.0       GenomeInfoDb_1.49.1        
##  [61] dplyr_1.2.1                 HDF5Array_1.41.0           
##  [63] withr_3.0.2                 formatR_1.14               
##  [65] BiocManager_1.30.27         fastmap_1.2.0              
##  [67] bedr_1.1.5                  rhdf5filters_1.25.0        
##  [69] caTools_1.18.3              digest_0.6.39              
##  [71] R6_2.6.1                    colorspace_2.1-2           
##  [73] gtools_3.9.5                dichromat_2.0-0.1          
##  [75] RSQLite_3.53.1              cigarillo_1.3.0            
##  [77] R.methodsS3_1.8.2           h5mread_1.5.0              
##  [79] data.table_1.18.4           rtracklayer_1.73.0         
##  [81] FNN_1.1.4.1                 httr_1.4.8                 
##  [83] htmlwidgets_1.6.4           S4Arrays_1.13.0            
##  [85] TFBSTools_1.51.0            pkgconfig_2.0.3            
##  [87] gtable_0.3.6                blob_1.3.0                 
##  [89] ComplexHeatmap_2.29.0       S7_0.2.2                   
##  [91] XVector_0.53.0              sys_3.4.3                  
##  [93] brio_1.1.5                  htmltools_0.5.9            
##  [95] sysfonts_0.8.9              strex_2.0.1                
##  [97] clue_0.3-68                 scales_1.4.0               
##  [99] Biobase_2.73.1              png_0.1-9                  
## [101] knitr_1.51                  lambda.r_1.2.4             
## [103] reshape2_1.4.5              rjson_0.2.23               
## [105] checkmate_2.3.4             curl_7.1.0                 
## [107] showtextdb_3.0              cachem_1.1.0               
## [109] rhdf5_2.57.1                GlobalOptions_0.1.4        
## [111] stringr_1.6.0               BiocVersion_3.24.0         
## [113] parallel_4.6.0              AnnotationDbi_1.75.0       
## [115] restfulr_0.0.17             pillar_1.11.1              
## [117] grid_4.6.0                  vctrs_0.7.3                
## [119] beachmat_2.29.0             cluster_2.1.8.2            
## [121] JASPAR2024_0.99.7           evaluate_1.0.5             
## [123] bsseq_1.49.0                VennDiagram_1.8.2          
## [125] cli_3.6.6                   locfit_1.5-9.12            
## [127] compiler_4.6.0              futile.options_1.0.1       
## [129] Rsamtools_2.29.0            rlang_1.2.0                
## [131] crayon_1.5.3                labeling_0.4.3             
## [133] plyr_1.8.9                  stringi_1.8.7              
## [135] gridBase_0.4-7              BiocParallel_1.47.0        
## [137] Biostrings_2.81.3           Matrix_1.7-5               
## [139] BSgenome_1.81.0             sparseMatrixStats_1.25.0   
## [141] bit64_4.8.2                 Rhdf5lib_2.1.0             
## [143] KEGGREST_1.53.0             statmod_1.5.2              
## [145] SummarizedExperiment_1.43.0 igraph_2.3.2               
## [147] memoise_2.0.1               bslib_0.11.0               
## [149] bit_4.6.0