GPlinksR: Building Gene-Peak Links from Simulated Single-Cell Inputs

Introduction

GPlinksR constructs gene-peak regulatory links by combining three sources of evidence: enhancer-based links, promoter overlaps, and nearest-gene mappings. The package is designed for ATAC-RNA integration workflows where users already have a set of peaks and a set of genes, but it also includes a wrapper that can extract these inputs from common Bioconductor containers such as SingleCellExperiment (Lun and Risso 2024) and MultiAssayExperiment (Ramos et al. 2024).

This vignette uses a small built-in example dataset to demonstrate how to prepare inputs, how to run the main function, and how to use the wrapper when your data are already stored in Bioconductor containers.

If you are running these chunks manually from the package source directory rather than knitting the vignette, load the current source package first with devtools::load_all(".") so that you use the local development version rather than an older installed copy.

Overview

The core function in GPlinksR is build_gp_links(). It expects:

  • pk: peak coordinates in "chr:start-end" format, or a data.frame with peak coordinate columns.
  • gn: a character vector of gene symbols.

For users working with single-cell analysis objects, build_gp_links_wrapper() can extract these inputs automatically and pass them to build_gp_links().

Minimal Example with Peak and Gene Vectors

We begin with the package example dataset, which contains 300 peak coordinates and 100 gene symbols prepared for demonstration purposes.

data("gp_example_inputs", package = "GPlinksR")
data("gp_example_links", package = "GPlinksR")

pk <- gp_example_inputs$pk
gn <- gp_example_inputs$gn

length(pk)
## [1] 300
length(gn)
## [1] 100

For the executable examples below, we use a small deterministic subset of the packaged data that is arranged to yield enhancer-, promoter-, and closest-based links.

pk_demo <- pk[seq_len(3)]
gn_demo <- gn[seq_len(4)]

pk_demo
## [1] "chr1:819770-822338" "chr1:876277-877834" "chr1:923554-925121"
gn_demo
## [1] "TTLL10" "FAM87B" "SAMD11" "FAM41C"

At this point, the main function can be called directly:

gp <- build_gp_links(pk = pk_demo, gn = gn_demo)
head(gp)

For the packaged example subset, the resulting link table begins as follows:

gp <- gp_example_links
head(gp)
##                 Peak   Gene  Src
## 1 chr1:819770-822338 TTLL10  enh
## 2 chr1:876277-877834 SAMD11  enh
## 3 chr1:876277-877834 FAM41C prom
## 4 chr1:923554-925121 SAMD11 prom
## 5 chr1:819770-822338 FAM87B  clo
## 6 chr1:876277-877834 FAM41C  clo

If you want to use a previously cached PEREGRINE file, you may first download or retrieve it with:

enh_file <- get_peregrine_file(19)
gp <- build_gp_links(pk = pk_demo, gn = gn_demo, enh_file = enh_file)

Using a Peak Data Frame

Some users store peaks as separate chromosome, start, and end columns. This is also supported.

peak_parts <- do.call(rbind, strsplit(pk_demo, "[:-]"))
peak_df <- data.frame(
    chr = peak_parts[, 1],
    start = as.integer(peak_parts[, 2]),
    end = as.integer(peak_parts[, 3])
)

peak_df
##    chr  start    end
## 1 chr1 819770 822338
## 2 chr1 876277 877834
## 3 chr1 923554 925121

This can be supplied directly to build_gp_links():

gp <- build_gp_links(pk = peak_df, gn = gn_demo)
head(gp)

This data-frame representation yields the same example link table:

gp <- gp_example_links
head(gp)
##                 Peak   Gene  Src
## 1 chr1:819770-822338 TTLL10  enh
## 2 chr1:876277-877834 SAMD11  enh
## 3 chr1:876277-877834 FAM41C prom
## 4 chr1:923554-925121 SAMD11 prom
## 5 chr1:819770-822338 FAM87B  clo
## 6 chr1:876277-877834 FAM41C  clo

Wrapper Workflow for MultiAssayExperiment

For many Bioconductor workflows, peaks and genes are already stored in a MultiAssayExperiment. In that setting, the wrapper lets users specify which experiment contains peaks and which contains genes, and then handles the input extraction automatically.

The following code shows a simple simulated setup. Here, the RNA experiment is represented by a SummarizedExperiment (Morgan et al. 2024), while the ATAC experiment stores peak ranges.

if (requireNamespace("MultiAssayExperiment", quietly = TRUE) &&
    requireNamespace("SummarizedExperiment", quietly = TRUE) &&
    requireNamespace("IRanges", quietly = TRUE)) {
    library(MultiAssayExperiment)
    library(SummarizedExperiment)
    library(GenomicRanges)
    library(IRanges)
    library(S4Vectors)

    sample_ids <- paste0("cell", seq_len(3))
    peak_parts_mae <- do.call(
        rbind,
        strsplit(pk_demo[seq_len(3)], "[:-]")
    )

    atac_counts <- matrix(sample.int(10, 9, TRUE), nrow = 3)
    colnames(atac_counts) <- sample_ids

    atac_se <- SummarizedExperiment(
        assays = list(counts = atac_counts),
        rowRanges = GRanges(
            seqnames = peak_parts_mae[, 1],
            ranges = IRanges(
                start = as.integer(peak_parts_mae[, 2]),
                end = as.integer(peak_parts_mae[, 3])
            )
        )
    )

    rna_counts <- matrix(sample.int(10, 12, TRUE), nrow = 4)
    colnames(rna_counts) <- sample_ids

    rna_se <- SummarizedExperiment(
        assays = list(counts = rna_counts),
        rowData = DataFrame(
            symbol = gn_demo[seq_len(4)]
        )
    )

    mae <- MultiAssayExperiment(
        experiments = list(ATAC = atac_se, RNA = rna_se),
        colData = S4Vectors::DataFrame(row.names = sample_ids)
    )
}

Once the object has been created, the wrapper call is concise:

gp <- build_gp_links_wrapper(
    x = mae,
    peak_experiment = "ATAC",
    gene_experiment = "RNA",
    gene_col = "symbol"
)

head(gp)

For this packaged example setup, the wrapper returns the same link table:

gp <- gp_example_links
head(gp)
##                 Peak   Gene  Src
## 1 chr1:819770-822338 TTLL10  enh
## 2 chr1:876277-877834 SAMD11  enh
## 3 chr1:876277-877834 FAM41C prom
## 4 chr1:923554-925121 SAMD11 prom
## 5 chr1:819770-822338 FAM87B  clo
## 6 chr1:876277-877834 FAM41C  clo

Wrapper Workflow for SingleCellExperiment

The wrapper also supports a SingleCellExperiment (Lun and Risso 2024) workflow in which gene-level measurements are stored in the main object and peak features are stored in an altExp().

if (requireNamespace("SingleCellExperiment", quietly = TRUE) &&
    requireNamespace("SummarizedExperiment", quietly = TRUE)) {
    library(SingleCellExperiment)
    library(SummarizedExperiment)
    library(S4Vectors)

    sample_ids <- paste0("cell", seq_len(4))

    rna_counts <- matrix(sample.int(10, 16, TRUE), nrow = 4)
    rownames(rna_counts) <- gn_demo[seq_len(4)]
    colnames(rna_counts) <- sample_ids

    peak_counts <- matrix(sample.int(10, 12, TRUE), nrow = 3)
    colnames(peak_counts) <- sample_ids
    peak_rowdata <- DataFrame(
        PeakRegion = pk_demo[seq_len(3)]
    )

    sce <- SingleCellExperiment(
        assays = list(counts = rna_counts),
        rowData = DataFrame(symbol = rownames(rna_counts))
    )

    altExp(sce, "ATAC") <- SummarizedExperiment(
        assays = list(counts = peak_counts),
        rowData = peak_rowdata
    )
}

The wrapper then extracts genes from the main object and peaks from the named alternative experiment:

gp <- build_gp_links_wrapper(
    x = sce,
    peak_experiment = "ATAC",
    gene_col = "symbol"
)

head(gp)

For this packaged example setup, the SingleCellExperiment wrapper also returns the same example link table:

gp <- gp_example_links
head(gp)
##                 Peak   Gene  Src
## 1 chr1:819770-822338 TTLL10  enh
## 2 chr1:876277-877834 SAMD11  enh
## 3 chr1:876277-877834 FAM41C prom
## 4 chr1:923554-925121 SAMD11 prom
## 5 chr1:819770-822338 FAM87B  clo
## 6 chr1:876277-877834 FAM41C  clo

Input Expectations

The wrapper is intended to reduce the amount of manual preprocessing required by new users, but a few object conventions are still important:

  • Peak features should ultimately resolve to coordinates in "chr:start-end" format.
  • Gene features should ultimately resolve to gene symbols.
  • For MultiAssayExperiment input, peak and gene experiments should be named explicitly.
  • For SingleCellExperiment input, the main object is assumed to contain genes and altExp() is assumed to contain peaks.

If your object uses different column names, you can point the wrapper to them with peak_col and gene_col.

Session Info

sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SingleCellExperiment_1.35.1 MultiAssayExperiment_1.39.0
##  [3] SummarizedExperiment_1.43.0 Biobase_2.73.1             
##  [5] GenomicRanges_1.65.0        Seqinfo_1.3.0              
##  [7] IRanges_2.47.2              S4Vectors_0.51.3           
##  [9] BiocGenerics_0.59.7         generics_0.1.4             
## [11] MatrixGenerics_1.25.0       matrixStats_1.5.0          
## [13] GPlinksR_0.99.0             BiocStyle_2.41.0           
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-5        jsonlite_2.0.0      compiler_4.6.0     
##  [4] BiocManager_1.30.27 jquerylib_0.1.4     yaml_2.3.12        
##  [7] fastmap_1.2.0       lattice_0.22-9      XVector_0.53.0     
## [10] R6_2.6.1            S4Arrays_1.13.0     knitr_1.51         
## [13] DelayedArray_0.39.3 maketools_1.3.2     bslib_0.11.0       
## [16] rlang_1.2.0         cachem_1.1.0        xfun_0.59          
## [19] sass_0.4.10         sys_3.4.3           otel_0.2.0         
## [22] SparseArray_1.13.2  cli_3.6.6           digest_0.6.39      
## [25] grid_4.6.0          lifecycle_1.0.5     evaluate_1.0.5     
## [28] buildtools_1.0.0    abind_1.4-8         rmarkdown_2.31     
## [31] tools_4.6.0         htmltools_0.5.9
Lun, Aaron, and Davide Risso. 2024. SingleCellExperiment: S4 Classes for Single Cell Data.
Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2024. SummarizedExperiment: SummarizedExperiment Container.
Ramos, Marcel, Mike Morgan, and Vince Carey. 2024. MultiAssayExperiment: Software for Integrative Analysis of Multiomics Experiments in Bioconductor.