This vignette is a guide to the configuration options available in CMEnt. Understanding these parameters will help you tune the package for your specific analysis needs.
beta
Type: Character, matrix, BetaHandler object, or BED file
Required: Yes
Description: Input methylation data. Can be:
- [...] getOption("CMEnt.beta_in_mem_threshold_mb") megabytes or [...] genomically sorted and converted to tabix for faster access, if samtools tabix is installed.
- A BetaHandler object (see ?BetaHandler)
- A BED file with the columns named by bed_chrom_col and bed_start_col, followed by sample columns whose names exist as row names in the provided pheno data
Example:
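A minimal sketch of an in-memory beta matrix, assuming sites are rows and samples are columns (the layout implied by the seeds and pheno requirements in this vignette). The site and sample names are made up for illustration.

```r
# Toy beta matrix: rows are sites, columns are samples (assumed layout).
set.seed(1)
beta <- matrix(
  runif(12),  # beta values lie between 0 and 1
  nrow = 4,
  dimnames = list(
    c("cg0001", "cg0002", "cg0003", "cg0004"),  # hypothetical site IDs
    c("S1", "S2", "S3")                          # hypothetical sample IDs
  )
)
str(beta)
```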
seeds
Type: Character or data.frame
Required: Yes
Description: Sites to use as seeds for DMR detection, given as a character value or a data.frame.
Format requirements:
- Row names, the first column, or the column given by seeds_id_col should contain site IDs
- The site IDs must match those in the beta data, either as Illumina IDs or genomic coordinates (chr:start). The latter is required when using BED files for beta.
Example:
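A small sketch of a seeds table and an ID check, using base R only. The site IDs and the pval column are hypothetical; the check simply mirrors the requirement that seed IDs match the IDs in the beta data.

```r
# Site IDs available in the beta data (hypothetical Illumina-style IDs).
beta_site_ids <- c("cg0001", "cg0002", "cg0003", "cg0004")

# Seeds as a data.frame whose row names hold the site IDs.
seeds <- data.frame(
  pval = c(0.001, 0.004),            # illustrative column only
  row.names = c("cg0001", "cg0004")
)

# Every seed ID should be present in the beta data.
unmatched <- setdiff(rownames(seeds), beta_site_ids)
stopifnot(length(unmatched) == 0)

# For BED input, IDs take the chr:start form instead.
bed_sites <- data.frame(chrom = c("chr1", "chr1"), start = c(10468L, 10470L))
coord_ids <- paste0(bed_sites$chrom, ":", bed_sites$start)
coord_ids
```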
pheno
Type: Data.frame
Required: Yes
Description: Sample phenotype information.
Format requirements:
- Row names should match column names in beta data
- Must contain a sample group column (specified by sample_group_col)
- May contain a case/control status column (specified by casecontrol_col)
Example:
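A minimal sketch of a pheno table that satisfies the requirements above. The sample IDs and the is_case column name are hypothetical; is_case would be passed through casecontrol_col.

```r
# Column names of the beta data (hypothetical sample IDs).
beta_samples <- c("S1", "S2", "S3", "S4")

pheno <- data.frame(
  Sample_Group = c("control", "control", "treated", "treated"),
  is_case = c(FALSE, FALSE, TRUE, TRUE),  # hypothetical case/control column
  row.names = beta_samples
)

# Row names of pheno should match the beta column names.
stopifnot(setequal(rownames(pheno), beta_samples))
pheno
```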
sample_group_col
Type: Character
Default: "Sample_Group"
Description: Column name in pheno that
specifies sample groups (e.g., “case” vs “control”, “treated” vs
“untreated”). More than two groups are supported.
casecontrol_col
Type: Character
Default: NULL
Description: Column name in pheno for case
(TRUE/1) vs control (FALSE/0) status, for delta beta computations. If
NULL, controls are assumed to be the first level found at
sample_group_col.
ignored_sample_groups
Type: Character vector
Default: NULL
Description: Sample groups to exclude while considering
connection and expansion. Can also be “case” or “control”.
Example:
array
Type: Character
Options: "450K", "27K", "EPIC", "EPICv2"
Default: "450K"
Description: Type of methylation array platform used.
Ignored when using mouse genomes or when beta is provided as a BED
file.
genome
Type: Character
Options: "hg19", "hg38", "hs1", "mm10", "mm39"
Description: Reference genome version. If not specified, it will be inferred from the input data; if input is 450K or EPIC array data, the genome will be set to “hg19” by default, otherwise to “hg38”.
ext_site_delta_beta
Type: Numeric
Default: 0.2
Range: 0 to 1, or NA to disable
Description: Absolute delta beta value for neighboring
sites to be included in DMRs, during the second stage
of DMR extension, without considering correlation.
Recommendation: Keep 0.2 for balanced
precision/recall. Use NA to disable the shortcut entirely.
Set to 0 only when you intentionally want any proximal site with a
non-missing case/control delta beta to be eligible for
force-connection.
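The threshold semantics described above can be sketched in base R. The helper is_force_eligible() is hypothetical and not part of CMEnt; the inclusive comparison is an assumption chosen so that a threshold of 0 admits every non-missing delta beta, as the text describes.

```r
# Hypothetical helper: is a neighboring site eligible for the delta-beta shortcut?
is_force_eligible <- function(delta_beta, threshold) {
  # NA disables the shortcut entirely.
  if (is.na(threshold)) {
    return(rep(FALSE, length(delta_beta)))
  }
  # Sites with a missing delta beta are never eligible.
  !is.na(delta_beta) & abs(delta_beta) >= threshold
}

delta_beta <- c(0.25, -0.30, 0.10, NA)
is_force_eligible(delta_beta, 0.2)  # TRUE TRUE FALSE FALSE
is_force_eligible(delta_beta, 0)    # TRUE TRUE TRUE FALSE
is_force_eligible(delta_beta, NA)   # FALSE FALSE FALSE FALSE
```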
min_seeds
Type: Integer
Default: 1
Description: Minimum number of connected seeds required
in a DMR.
Recommendation: Increase this value (e.g., to 3 or 4) for higher-confidence DMRs.
min_adj_seeds
Type: Integer
Default: 2
Description: Minimum number of seeds in a DMR after
adjusting for site density. Minimum 2. This accounts for regions with
different site coverage, and it is mostly applicable to array-based
data. Only used when min_seeds < min_adj_seeds.
min_sites
Type: Integer
Default: 50
Description: Minimum number of sites required in a DMR
after region expansion. Minimum 2.
Recommendation: Lower this value (e.g., 3-5) for array-based data, keep higher (50+) for WGBS data.
max_lookup_dist
Type: Integer
Default: 10000
Unit: Base pairs
Description: Maximum genomic distance between adjacent seeds to be considered part of the same DMR.
Recommendation:
- 1,000-5,000 bp for tightly connected regions
- 10,000 bp (default) for moderate spacing
- 20,000+ bp for broader regions
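To get a feel for the distance rule alone, the sketch below groups sorted seed positions on one chromosome whenever the gap to the previous seed is within max_lookup_dist. It is only an illustration: it ignores the correlation testing that CMEnt also applies, and the positions are invented.

```r
# Sorted seed positions on a single chromosome (hypothetical values, in bp).
seed_pos <- c(1000L, 4000L, 9000L, 30000L, 32000L)
max_lookup_dist <- 10000L

# Start a new candidate group whenever the gap exceeds max_lookup_dist.
group_id <- cumsum(c(TRUE, diff(seed_pos) > max_lookup_dist))

split(seed_pos, group_id)
```

With these positions the gaps are 3000, 5000, 21000 and 2000 bp, so the first three seeds form one candidate group and the last two form another.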
expansion_window
Type: Numeric
Default: 1e6
Description: Stage 2 connectivity is computed only
around seed-derived Stage 1 neighborhoods, using this total window width
in base pairs.
Set to <= 0 to compute connectivity genome-wide.
max_bridge_seeds_gaps
Type: Integer
Default: 1
Description: In Stage 1 seed connectivity, allows
bridging up to this many consecutive p-value-driven failed edges when
both flanking edges are connected.
max_bridge_extension_gaps
Type: Integer
Default: 1
Description: In Stage 2 DMR extension, allows bridging up to this many consecutive p-value-driven failed edges when both flanking edges are connected.
max_pval
Type: Numeric
Default: 0.05
Range: 0 to 1
Description: Maximum p-value for considering the correlation between seeds significant during the first stage of connectivity testing, and between proximal sites during the second stage of DMR extension. Under strong entanglement, a Bonferroni correction is applied based on the number of sample groups (the number of tests per site).
entanglement
Type: Character
Options: "strong", "weak"
Default: "strong"
Description: Strategy for determining connectivity between sites across sample groups:
- "strong": Requires all sample groups to show significant correlation for two sites to be considered connected. This is more conservative and ensures consistent methylation patterns across all groups.
- "weak": Requires at least one sample group to show significant correlation. This is more permissive and may identify DMRs that are specific to certain groups.
Recommendation: Use "strong" (default) for most cases to ensure robust, reproducible DMRs. Use "weak" when you want to capture group-specific methylation patterns or when working with heterogeneous sample groups.
Example:
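The difference between the two strategies can be sketched with per-group p-values for a single pair of sites. The helper is_connected() is hypothetical and not CMEnt's implementation; it assumes the Bonferroni correction divides max_pval by the number of sample groups and that the comparison is inclusive.

```r
# Hypothetical helper: decide whether two sites are connected, given one
# correlation p-value per sample group.
is_connected <- function(group_pvals, max_pval = 0.05,
                         entanglement = c("strong", "weak")) {
  entanglement <- match.arg(entanglement)
  if (entanglement == "strong") {
    # Bonferroni-style threshold: one test per sample group (assumption).
    threshold <- max_pval / length(group_pvals)
    all(group_pvals <= threshold)
  } else {
    any(group_pvals <= max_pval)
  }
}

group_pvals <- c(control = 0.001, treated = 0.04)
is_connected(group_pvals, entanglement = "strong")  # FALSE: 0.04 > 0.05 / 2
is_connected(group_pvals, entanglement = "weak")    # TRUE: 0.001 <= 0.05
```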
testing_mode
Type: Character
Options: "parametric", "empirical", "auto"
Default: "auto"
Description: Method for calculating p-values during connectivity testing:
- "parametric": Uses t-based correlation p-values (faster, assumes normal distribution)
- "empirical": Uses permutation-based p-values (slower, no distribution assumptions)
- "auto": Evaluates correlation test assumptions per sample group and chooses "parametric" only when diagnostics are acceptable; otherwise switches to "empirical"
Recommendation: Use "auto" when you want robust defaults across heterogeneous datasets. Use "parametric" when assumptions are known to hold and runtime is critical, and "empirical" when assumptions are clearly questionable.
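For intuition, the two kinds of p-value can be compared on simulated data with base R. This is a generic sketch, not CMEnt's code: the data are random, the permutation scheme is the simplest possible one, and the add-one correction is an assumption.

```r
set.seed(42)

# Simulated methylation values at two sites across 12 samples of one group.
x <- rnorm(12)
y <- 0.8 * x + rnorm(12, sd = 0.5)

# Parametric: t-based p-value for the Pearson correlation.
p_param <- cor.test(x, y)$p.value

# Empirical: compare the observed |r| with |r| under shuffled sample labels.
obs <- abs(cor(x, y))
null_r <- replicate(200, abs(cor(x, sample(y))))
p_emp <- (sum(null_r >= obs) + 1) / (length(null_r) + 1)

c(parametric = p_param, empirical = p_emp)
```

With 200 shuffles the empirical p-value cannot go below 1/201, which is one reason the number of tries matters for precision.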
empirical_strategy
Type: Character
Options: "auto", "montecarlo", "permutations"
Default: "auto"
Description: Strategy for empirical p-value calculation (only applies when testing_mode = "empirical"):
- "auto": Uses Monte Carlo for groups <6 samples, permutations for groups ≥6 samples
- "montecarlo": Always uses Monte Carlo simulation
- "permutations": Always uses exact permutations
ntries
Type: Integer
Default: 200
Description: Number of permutations/simulations when testing_mode = "empirical". The number has an upper bound of factorial(n), where n is the size of the smallest sample group. If ntries exceeds this bound, it is reduced to factorial(n).
Recommendation:
- 100-500: Faster, less precise
- 1,000-10,000: Slower, more precise
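The factorial(n) bound is easy to check by hand before choosing ntries. The helper cap_ntries() below is hypothetical and only restates the rule given in the description.

```r
# Hypothetical helper: cap ntries at factorial(n), with n the size of the
# smallest sample group.
cap_ntries <- function(ntries, group_sizes) {
  min(ntries, factorial(min(group_sizes)))
}

cap_ntries(200, group_sizes = c(4, 7))  # 24, since factorial(4) = 24
cap_ntries(200, group_sizes = c(6, 8))  # 200, since factorial(6) = 720
```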
mid_p
Type: Logical
Default: FALSE
Description: Whether to use mid-p correction in
empirical p-value calculation.
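As an illustration of what a mid-p correction does, the sketch below uses the textbook definition, in which null statistics equal to the observed value count for one half instead of one. Whether CMEnt uses exactly this formula is not stated here, so treat empirical_pval() as a hypothetical helper.

```r
# Hypothetical helper: empirical p-value with an optional mid-p correction.
empirical_pval <- function(obs, null_stats, mid_p = FALSE) {
  n_greater <- sum(null_stats > obs)
  n_equal <- sum(null_stats == obs)
  tie_weight <- if (mid_p) 0.5 else 1  # ties count half under mid-p
  (n_greater + tie_weight * n_equal) / length(null_stats)
}

null_stats <- c(0.1, 0.3, 0.5, 0.5, 0.7)
empirical_pval(0.5, null_stats)                # (1 + 2) / 5 = 0.6
empirical_pval(0.5, null_stats, mid_p = TRUE)  # (1 + 1) / 5 = 0.4
```

The correction only matters when ties occur, which is more likely with small groups and few distinct permutations.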
aggfun
Type: Character or function
Options: "median", "mean", or
a custom function
Default: "median"
Description: Aggregation function for calculating
DMR-level statistics (delta beta, p-values).
Recommendation: Median is more robust to outliers; mean may be more sensitive.
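The robustness difference shows up as soon as one site in a region is an outlier. The values are invented, and the custom function assumes that aggfun receives a numeric vector and returns a single number, which this vignette does not spell out.

```r
# Per-site delta beta values in one hypothetical DMR, with one outlier.
site_delta_beta <- c(0.21, 0.24, 0.22, 0.23, 0.90)

median(site_delta_beta)  # 0.23, barely affected by the outlier
mean(site_delta_beta)    # 0.36, pulled up by the outlier

# A possible custom aggregation function (assumed signature: numeric -> number).
trimmed_mean <- function(x) mean(x, trim = 0.2, na.rm = TRUE)
trimmed_mean(site_delta_beta)
```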
njobs
Type: Integer
Default: getOption("CMEnt.njobs", .defaultNJobs())
Description: Number of parallel jobs to use for computation.
Recommendation:
- Use -1 to automatically use all available cores minus 1
- Limit to avoid overwhelming system resources
- Consider memory requirements when increasing parallelization
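A short sketch of the -1 convention and of setting a session-wide default through the CMEnt.njobs option. The helper resolve_njobs() is hypothetical and only mirrors the "all cores minus 1" rule; it is not how CMEnt resolves the value internally.

```r
# Hypothetical helper: translate njobs = -1 into "all available cores minus 1".
resolve_njobs <- function(njobs) {
  if (njobs == -1L) {
    cores <- parallel::detectCores()
    if (is.na(cores)) cores <- 1L  # detectCores() can return NA
    return(max(1L, cores - 1L))
  }
  as.integer(njobs)
}

resolve_njobs(-1L)  # depends on the machine
resolve_njobs(4L)

# Set a default for the session; it is picked up wherever njobs is not given.
options(CMEnt.njobs = 2L)
getOption("CMEnt.njobs")
```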
verbose
Type: Integer
Range: 0 to 5
Default: getOption("CMEnt.verbose", 1)
Description: Level of verbosity for logging messages:
- 0: No messages
- 1: Essential messages only
- 2: Standard progress information
- 3: Detailed progress information
- 4-5: Very detailed debugging information
seeds_id_col
Type: Character or integer
Default: NULL
Description: Column name or index for seed identifiers in the seeds file. If NULL, uses row names if present, otherwise the first column.
output_prefix
Type: Character
Default: NULL
Description: Prefix for output files. If provided,
results will be saved to files with this prefix. If NULL,
no files are saved.
beta_row_names_file
Type: Character
Default: NULL
Description: Path to a file containing row names for
beta values. Useful for large beta files where reading row names
separately is more efficient.
bed_provided
Type: Logical
Default: FALSE
Description: Whether the beta file is provided as a BED
file. Automatically set to TRUE if the input file has a
.bed extension.
bed_chrom_col
Type: Character
Default: "chrom"
Description: Column name for chromosome in BED
files.
bed_start_col
Type: Character
Default: "start"
Description: Column name for start position in BED
files.
annotate_with_genes
Type: Logical
Default: TRUE
Description: Whether to annotate DMRs with overlapping
genes.
.score_dmrs
Type: Logical
Default: TRUE
Description: Whether to add complementary SVM-based discrimination scores to DMRs. When enabled, each DMR is evaluated for its ability to separate sample groups using stratified k-fold cross-validation with an RBF kernel SVM. The resulting score and cv_accuracy values summarize sample-level discriminative strength and should be read alongside DMR pval, qval, and effect-size columns, not as replacements for them.
Details:
- Uses stratified k-fold cross-validation (default: 5-fold)
- Number of folds can be controlled with options(CMEnt.scoring_nfold = 5)
- Reproducible fold assignments can be obtained with set.seed(...) before calling scoreDMRs()
- Higher score and cv_accuracy values indicate stronger discriminative power
- Requires the e1071 package for SVM classification
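To show what "stratified" and "reproducible with set.seed()" mean in practice, the sketch below assigns samples to folds so that each fold keeps the group proportions. It is a generic base R illustration with a hypothetical helper, stratified_folds(); CMEnt's own fold assignment and the SVM step are not reproduced here.

```r
# Hypothetical helper: assign each sample to one of nfold folds, group by group.
stratified_folds <- function(groups, nfold = 5) {
  folds <- integer(length(groups))
  for (g in unique(groups)) {
    idx <- which(groups == g)
    idx <- idx[sample.int(length(idx))]  # shuffle within the group
    folds[idx] <- rep_len(seq_len(nfold), length(idx))
  }
  folds
}

options(CMEnt.scoring_nfold = 5)
groups <- rep(c("control", "treated"), each = 10)

set.seed(123)  # same seed, same fold assignment
folds <- stratified_folds(groups, nfold = getOption("CMEnt.scoring_nfold", 5))
table(groups, folds)
```

With 10 samples per group and 5 folds, every fold contains two samples from each group.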
.load_debug
Type: Logical
Default: FALSE
Description: Enable debug mode for loading intermediate
files, through short-circuiting. For internal development use only.
CMEnt uses several global options that can be set using the
options() function. These persist across function calls in
your R session.
CMEnt.use_annotation_cache
Type: Logical
Default: TRUE
Description: Enable caching of gene annotations.
CMEnt.annotation_cache_dir
Type: Character
Default: USER_CACHE_DIR/R/CMEnt/annotation_cache
Description: Directory for annotation cache.
CMEnt.jaspar_version
Type: Integer
Default: 2024
Description: JASPAR database version to use for motif
analysis.
CMEnt.jaspar_tax_group
Type: Character
Default: "vertebrates"
Description: Taxonomic group for JASPAR motif
filtering.
CMEnt.min_motif_similarity
Type: Numeric
Default: 0.8
Description: Minimum motif similarity threshold for DMR
interaction analysis.
CMEnt.jaspar_corr_threshold
Type: Numeric
Default: 0.9
Description: Correlation threshold for JASPAR motif
similarity.
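The global options listed above can be set together with base R's options() and read back with getOption(). The sketch uses the documented defaults, except for the cache directory, which points to a temporary folder purely for illustration.

```r
# Set CMEnt options for this session; options() returns the previous values.
old <- options(
  CMEnt.use_annotation_cache = TRUE,
  CMEnt.annotation_cache_dir = file.path(tempdir(), "CMEnt_annotation_cache"),
  CMEnt.jaspar_version = 2024L,
  CMEnt.jaspar_tax_group = "vertebrates",
  CMEnt.min_motif_similarity = 0.8,
  CMEnt.jaspar_corr_threshold = 0.9
)

# Read a value back.
current_similarity <- getOption("CMEnt.min_motif_similarity")
current_similarity

# Restore the previous state when done.
options(old)
```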
Start with default parameters and adjust based on your specific needs.
For array data (450K, EPIC), use lower min_sites values (3-5) since site coverage is sparse.
For WGBS data, keep min_sites higher (50+) to ensure robust regions.
Avoid heavy pre-filtering of seeds based on effect size. Let CMEnt handle filtering internally.
Use empirical p-values for small sample sizes (<10 per group) or when normality assumptions are questionable.
Use parallel processing (njobs > 1) for faster computation, but be mindful of memory requirements.
Save intermediate results using output_prefix for large analyses.
Document your configuration by saving parameter settings for reproducibility.
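One simple way to document a configuration is to keep the parameter values in a named list and save it next to the results. The values and the file name below are only an example; the parameter names are the ones described in this vignette.

```r
# Parameter settings used for an analysis (example values).
config <- list(
  array = "EPIC",
  genome = "hg19",
  min_sites = 3L,
  max_lookup_dist = 10000L,
  max_pval = 0.05,
  entanglement = "strong",
  testing_mode = "auto"
)

# Save the settings and read them back to confirm they round-trip.
config_file <- file.path(tempdir(), "cment_config.rds")
saveRDS(config, config_file)
restored <- readRDS(config_file)
stopifnot(identical(config, restored))
```

Saving the output of sessionInfo() alongside the settings records the package versions as well.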
Solution:
- Decrease njobs
- Decrease the CMEnt.beta_in_mem_threshold_mb option (default 500) to enable disk-based processing
- Use tabix-indexed files for very large datasets
- Enable caching options
Solution:
- Increase max_lookup_dist. This will allow seeds that are farther apart to be connected, leading to larger DMRs.
- Increase max_pval. This will make connectivity testing less stringent, allowing more sites to be connected and thus larger DMRs.
- Decrease ext_site_delta_beta. This will allow more sites to be included in DMRs during the second stage of extension, leading to larger DMRs.
Solution:
- Increase min_seeds. This will require more seeds to be connected to form a DMR, leading to fewer total DMRs.
- Increase min_sites. This will require more sites to be included in a DMR, leading to fewer total DMRs.
- Decrease max_pval. This will make connectivity testing more stringent, leading to fewer connected sites and thus fewer DMRs.
- Increase max_lookup_dist. This will join more seeds into the same DMRs, reducing the total number of DMRs.
- Decrease ext_site_delta_beta. This will allow more sites to be included in DMRs during the second stage of extension, leading to more merging of nearby DMRs and thus fewer total DMRs.
Solution:
- Increase njobs for parallel processing
- CMEnt derives connectivity chunk sizes from available RAM automatically.
- Use testing_mode = "parametric" instead of "empirical"
- Enable caching options
- Consider using tabix-indexed files
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DMRsegaldata_1.1.0 ExperimentHub_3.3.0 AnnotationHub_4.3.0
## [4] BiocFileCache_3.3.0 dbplyr_2.5.2 ggplot2_4.0.3
## [7] GenomicRanges_1.65.0 Seqinfo_1.3.0 IRanges_2.47.2
## [10] S4Vectors_0.51.3 BiocGenerics_0.59.7 generics_0.1.4
## [13] CMEnt_0.99.4 BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] BiocIO_1.23.3 bitops_1.0-9
## [3] filelock_1.0.3 tibble_3.3.1
## [5] R.oo_1.27.1 XML_3.99-0.23
## [7] DirichletMultinomial_1.55.0 lifecycle_1.0.5
## [9] httr2_1.2.2 pwalign_1.9.1
## [11] doParallel_1.0.17 lattice_0.22-9
## [13] backports_1.5.1 magrittr_2.0.5
## [15] limma_3.69.2 sass_0.4.10
## [17] rmarkdown_2.31 jquerylib_0.1.4
## [19] yaml_2.3.12 otel_0.2.0
## [21] DBI_1.3.0 buildtools_1.0.0
## [23] RColorBrewer_1.1-3 abind_1.4-8
## [25] purrr_1.2.2 R.utils_2.13.0
## [27] RCurl_1.98-1.19 rappdirs_0.3.4
## [29] circlize_0.4.18 maketools_1.3.2
## [31] seqLogo_1.79.0 testthat_3.3.2
## [33] permute_0.9-10 DelayedMatrixStats_1.35.0
## [35] codetools_0.2-20 DelayedArray_0.39.3
## [37] DT_0.34.0 tidyselect_1.2.1
## [39] shape_1.4.6.1 futile.logger_1.4.9
## [41] ggseqlogo_0.2.2 UCSC.utils_1.9.0
## [43] farver_2.1.2 matrixStats_1.5.0
## [45] showtext_0.9-8 GenomicAlignments_1.49.0
## [47] jsonlite_2.0.0 GetoptLong_1.1.1
## [49] iterators_1.0.14 foreach_1.5.2
## [51] tools_4.6.0 TFMPvalue_1.0.0
## [53] Rcpp_1.1.1-1.1 glue_1.8.1
## [55] gridExtra_2.3 SparseArray_1.13.2
## [57] BiocBaseUtils_1.15.1 xfun_0.58
## [59] MatrixGenerics_1.25.0 GenomeInfoDb_1.49.1
## [61] dplyr_1.2.1 HDF5Array_1.41.0
## [63] withr_3.0.2 formatR_1.14
## [65] BiocManager_1.30.27 fastmap_1.2.0
## [67] bedr_1.1.5 rhdf5filters_1.25.0
## [69] caTools_1.18.3 digest_0.6.39
## [71] R6_2.6.1 colorspace_2.1-2
## [73] gtools_3.9.5 dichromat_2.0-0.1
## [75] RSQLite_3.53.1 cigarillo_1.3.0
## [77] R.methodsS3_1.8.2 h5mread_1.5.0
## [79] data.table_1.18.4 rtracklayer_1.73.0
## [81] FNN_1.1.4.1 httr_1.4.8
## [83] htmlwidgets_1.6.4 S4Arrays_1.13.0
## [85] TFBSTools_1.51.0 pkgconfig_2.0.3
## [87] gtable_0.3.6 blob_1.3.0
## [89] ComplexHeatmap_2.29.0 S7_0.2.2
## [91] XVector_0.53.0 sys_3.4.3
## [93] brio_1.1.5 htmltools_0.5.9
## [95] sysfonts_0.8.9 strex_2.0.1
## [97] clue_0.3-68 scales_1.4.0
## [99] Biobase_2.73.1 png_0.1-9
## [101] knitr_1.51 lambda.r_1.2.4
## [103] reshape2_1.4.5 rjson_0.2.23
## [105] checkmate_2.3.4 curl_7.1.0
## [107] showtextdb_3.0 cachem_1.1.0
## [109] rhdf5_2.57.1 GlobalOptions_0.1.4
## [111] stringr_1.6.0 BiocVersion_3.24.0
## [113] parallel_4.6.0 AnnotationDbi_1.75.0
## [115] restfulr_0.0.17 pillar_1.11.1
## [117] grid_4.6.0 vctrs_0.7.3
## [119] beachmat_2.29.0 cluster_2.1.8.2
## [121] JASPAR2024_0.99.7 evaluate_1.0.5
## [123] bsseq_1.49.0 VennDiagram_1.8.2
## [125] cli_3.6.6 locfit_1.5-9.12
## [127] compiler_4.6.0 futile.options_1.0.1
## [129] Rsamtools_2.29.0 rlang_1.2.0
## [131] crayon_1.5.3 labeling_0.4.3
## [133] plyr_1.8.9 stringi_1.8.7
## [135] gridBase_0.4-7 BiocParallel_1.47.0
## [137] Biostrings_2.81.3 Matrix_1.7-5
## [139] BSgenome_1.81.0 sparseMatrixStats_1.25.0
## [141] bit64_4.8.2 Rhdf5lib_2.1.0
## [143] KEGGREST_1.53.0 statmod_1.5.2
## [145] SummarizedExperiment_1.43.0 igraph_2.3.2
## [147] memoise_2.0.1 bslib_0.11.0
## [149] bit_4.6.0