---
title: "Visualizing signals across many regions"
author:
- name: Pierre-Luc Germain
affiliation:
- "Lab of Statistical Bioinformatics, University of Zürich; "
- "D-HEST Institute for Neuroscience, ETH Zürich, Switzerland"
package: epiwraps
output:
BiocStyle::html_document
abstract: |
This vignette covers the functions necessary for plotting signal across
multiple regions. This involves acquiring positional information in/around
given regions across tracks, functions to manipulate and aggregate these
matrices, as well as functions to plot signal heatmaps from them.
vignette: |
%\VignetteIndexEntry{multiRegionPlot}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r prelim, include=FALSE}
library(BiocStyle)
```
# Introduction
Since reading data from many regions is typically longer than plotting it, we
split plotting and acquiring the data. The latter is done through function
specific to this package, while the former wraps around the
`r BiocStyle::Biocpkg("EnrichedHeatmap")` package. The interface here has
been simplified, but for full functionality and customization it is
recommended to have a look at the `r BiocStyle::Biocpkg("EnrichedHeatmap")`
documentation.
# Reading signal in/around a set of regions
The `signal2Matrix` function reads genomic signals around the centers of a set
of regions. It can read from bam and BigWig files, although reading from bam
files is considerably slower and we strongly recommend using bigwig files. For
generating bigwig files that show the kind of signal you want to visualize, see
the vignette to this effect.
```{r signal2Matrix}
suppressPackageStartupMessages(library(epiwraps))
# we fetch the path to the example bigwig file:
bwf <- system.file("extdata/example_atac.bw", package="epiwraps")
# we load example regions (could be a GRanges or a path to a bed-like file):
regions <- system.file("extdata/example_peaks.bed", package="epiwraps")
# we obtain the matrix of the signal around the regions. For the purpose of this
# example, we'll read twice from the same:
ese <- signal2Matrix(c(atac1=bwf, atac2=bwf), regions, extend=1000L)
ese
```
The result is an object of class `EnrichmentSE`, which inherits from a
`r BiocStyle::Biocpkg("SummarizedExperiment","RangedSummarizedExperiment")`, and
therefore affords all the manipulations that the latter offers. Each region is
stored as a row, and each sample or signal track as a column of the object. So
we can see that we have signal for `r nrow(ese)` rows/regions from two tracks.
We could subset to the first 50 regions as follows:
```{r subsetESE}
ese[1:50,]
```
or obtain the coordinates of the queried regions :
```{r ESEranges}
rowRanges(ese)
```
One can further obtain more detailed information about the bins saved in the object:
```{r showTrackInfo}
showTrackInfo(ese)
```
This means that each signal track is a matrix of 200 columns, because we
asked to extend 1000bp on either side, and the default bin size is 10bp, making
100 bins/windows on each side.
## Extracting and manipulating signal matrices
It is possible to extract the list of signal matrix for manipulations, e.g. for
transformation:
```{r getSignalMatrices}
# square-root transform
m2 <- lapply(getSignalMatrices(ese), sqrt)
```
See `?addAssayToESE` for adding a list of signal matrices (such as `m2` here) to
an existing `EnrichmentSE` object. In addition, signal matrices can be combined,
either manually or using `?mergeSignalMatrices`.
## Normalization
By default, bigwig files generated by `epiwraps` are normalized for library
size, but this is not always sufficient. `EnrichmentSE` can further be
normalized using various methods. For an overview of these methods, see the
normalization vignette.
# Plotting heatmaps
Once the signal has been read and the object prepared, (and eventually
normalized, see the section below), we can plot heatmaps based on them as follows:
```{r heatmap1, fig.height=3, fig.width=4}
plotEnrichedHeatmaps(ese)
```
We can use most arguments that are supported by
`r BiocStyle::Biocpkg("EnrichedHeatmap")` (and thus, by extension, by
`r BiocStyle::Biocpkg("ComplexHeatmap")`), for example:
```{r heatmap2, fig.height=3, fig.width=4}
plotEnrichedHeatmaps(ese, colors=c("white","darkred"), cluster_rows=TRUE,
show_row_dend=TRUE, top_annotation=FALSE,
row_title="My list of cool regions")
```
Note however that the default clustering function is typically not ideal for
such data -- see for instance `?clusterSignalMatrices` instead.
It is often useful to subset to regions with a high enrichment, which we can do
with the `score` function:
```{r heatmap2b, fig.height=3, fig.width=4}
plotEnrichedHeatmaps(ese[head(order(-rowMeans(score(ese))),50),])
```
## Color-scale trimming
By default, the colorscale is trimmed to prevent most of it being driven by rare
extreme values. This can be controlled via the `trim` argument (which indicates
up to which quantile of data points to keep to establish the colorscale).
Compare for instance the following two heatmaps:
```{r heatmapTrim, fig.height=4, fig.width=5}
plotEnrichedHeatmaps(ese[,1], trim=1, scale_title="trim=1", column_title="trim=1 (no trim)",
top_annotation=FALSE) +
plotEnrichedHeatmaps(ese[,1], trim=0.99, scale_title="trim=0.99",
column_title="trim=0.99", top_annotation=FALSE) +
plotEnrichedHeatmaps(ese[,1], trim=0.9, column_title="trim=0.9",
scale_title="trim=0.9", top_annotation=FALSE)
```
The underlying data is exactly the same, only the color-mapping changes. In the
left one, which has no trimming, a single very high value at the top forces the
colorscale to extend to high values, even though most of the data is in the
very low range, resulting in a very dark heatmap. In the one on the right, it's
the opposite: so much is trimmed that many points reach the top of the
colorscale, resulting in a an 'over-exposed' heatmap. In practice, it is
advisable to use minimal trimming (e.g. the default is `c(0.02,0.98)`).
## Different colorscales for different tracks
It is also possible to have different colorscales for different tracks, which is
especially useful when comparing very different signals. To illustrate this,
let's load an example with different tracks:
```{r exampleESE}
data(exampleESE)
exampleESE
```
We can put each of the three tracks on its own color scale:
```{r heatmap1ms, fig.height=4, fig.width=4.5}
plotEnrichedHeatmaps(exampleESE, multiScale=TRUE)
```
One could also specify colors separately by providing them as a list:
```{r heatmap1ms2, fig.height=4, fig.width=4.5}
plotEnrichedHeatmaps(exampleESE,
colors=list(c("white","darkblue"), "darkgreen", "darkred"))
```
This information can also be stored in the object, rather than specified
everytime:
```{r hmcolors, fig.height=4, fig.width=4.5}
exampleESE$hmcolors <- list(viridisLite::inferno(100), "darkgreen", "darkred")
plotEnrichedHeatmaps(exampleESE)
```
## Scaled regions
By default, `signal2Matrix` looks at a pre-defined interval (defined by the
`extend` argument) around the center of the provided regions. This means that
the width of the input regions is ignored. In some circumstances, however, it
can be useful to scale regions to the same width, which can be done using the
`type="scaled"` argument. Consider the following example:
```{r scaling}
ese <- cbind(
signal2Matrix(c(center=bwf), regions, extend=1000L),
signal2Matrix(c(scaled=bwf), regions, extend=1000L, type="scaled")
)
plotEnrichedHeatmaps(ese)
```
For some purposes, such as when plotting signal over transcripts, it can be
useful for the target region to consist of multiple regions (e.g. exons)
stitched together. This can be done by supplying a `GRangesList` as regions:
```{r GRangesList}
# we make a dummy GRangesList:
a <- sort(rtracklayer::import(regions))
dummy.grl <- GRangesList(split(a, rep(LETTERS[1:15],each=10)))
sm <- signal2Matrix(c(scaled=bwf), dummy.grl, extend=1000L, type="scaled")
plotEnrichedHeatmaps(sm)
```
## Heatmap rasterization
When plotting more regions that there are pixels available, several regions
have to be summarized in one pixel, and doing this before generating the heatmap
makes the plot much less heavy.
By default, `EnrichedHeatmap` performs rasterization using the `magick` package
when it is installed, and falls back to a very suboptimal method when not. It is
therefore recommended to install the `magick` package.
Depending on your settings, if the heatmap is _very big_ you might hit the preset
limits of 'magick' base rasterization, which could result in an error such as
'Image must have at least 1 frame to write a bitmap'. In such cases, you might
have to degrade to a lower-quality rasterization using the additional arguments
`raster_by_magick=FALSE, raster_device="CairoJPEG"`.
Finally, on some systems, the rasterization sometimes encounters an
'UnableToOpenBlob' error. At the moment, the only workaround this has been to
disable rasterization using `use_raster=FALSE`.
# Sorting and clustering
The traditional ranking by decreasing overall enrichment can easily hide
patterns in the data, which are instead revealed by clustering. One approach is
to use hierarchical clustering of the rows:
```{r clusterRows1}
plotEnrichedHeatmaps(exampleESE, cluster_rows=TRUE)
```
In this example, this already reveals important patterns in the data, namely the
fact that p300 binding is associated with H3K27ac and tends to be mutually
exclusive with the promoter-associated H3K4me3 mark.
The hierarchical clustering is based on the whole enrichment profile, can easily
be led astray by patterns in individual signals, and seldom provides good
results in practice. An alternative is to use enrichment score weighted by
distance to the center, eventually row-normalized, to cluster the regions. We
provide a function to this end:
```{r clusterRows2}
# we cluster the regions using 2 clusters, and store the cluster labels in the
# rowData of the object:
rowData(exampleESE)$cluster <- clusterSignalMatrices(exampleESE, k=2, scaleRows=TRUE)
# we additionally label the clusters with colors:
plotEnrichedHeatmaps(exampleESE, row_split="cluster",
mean_color=c("1"="red", "2"="blue"))
```
Note that here we are splitting into 3 clusters, you can also provide a range
of values (e.g. `k=2:8`) and the function will also return cluster quality
metrics for each.
Users can of course use their own algorithms to cluster regions. To this end, it
can be useful to summarize each region in each sample using a single score which
weights the signal according to the distance to the center of the target (see
`?EnrichedHeatmap::enriched_score`). This can be done for the default assay with
the `score` method:
```{r score}
head(score(exampleESE))
```
# Plotting aggregated signals
It is also possible to plot only the average signals across regions. To do this,
we first melt the signal matrices and then use `r CRANpkg("ggplot2")`. The
`meltSignals` function will return a data.frame showing the mean, standard
deviation, standard error and median at each position relative to the center,
for each sample/matrix:
```{r meltSignals}
d <- meltSignals(exampleESE)
head(d)
```
This can then be used for plotting, simply with `ggplot`:
```{r aggPlot, fig.width=4, fig.height=3}
library(ggplot2)
ggplot(d, aes(position, mean, colour=sample)) +
geom_vline(xintercept=0, linetype="dashed") +
geom_ribbon(aes(position, ymin=mean-SE, ymax=mean+SE, fill=sample), alpha=0.4, colour=NA) +
geom_line(linewidth=1.2) +
theme_bw() + labs(x="relative position", y="mean RPKM")
```
We could also include cluster information:
```{r aggPlot2, fig.width=6, fig.height=3}
d <- meltSignals(exampleESE, splitBy = "cluster")
ggplot(d, aes(position, mean, colour=sample)) +
geom_vline(xintercept=0, linetype="dashed") +
geom_ribbon(aes(position, ymin=mean-SE, ymax=mean+SE, fill=sample), alpha=0.4, colour=NA) +
geom_line(linewidth=1.2) + facet_wrap(~split) +
theme_bw() + labs(x="relative position", y="mean RPKM")
```
# Visualizing DNAme and sparse signals
Nucleotide-resolution DNA methylation (as obtained from bisulfite sequencing)
signal differs from the signals used throughout this vignette in that it is
not continuous across the genome, but specifically at C or CpG nucleotides which
have a variable density throughout the genome. As a consequence, it is likely
that some of the plotting bins do not contain a CpG, in which case they get
assigned a value of 0, even though they could be in a completely methylated
region. For this reason, it is advisable to smooth DNA methylation signals for
the purpose of visualization.
As an example, let's look at the gene bodies of some active genes from chr8 of
the A549 cell lines:
```{r exampleDNAme}
data("exampleDNAme")
head(exampleDNAme)
```
As is typical of DNAme data, the object is a GRanges object (or more
specifically a GPos object, since all ranges have a width of 1 nucleotide) with,
in the score column, the percentage of DNA methylation. Let's see what happens
if we plot a heatmap of this signal, with and without smoothing.
```{r plotDNAme}
o1 <- signal2Matrix(list(noSmooth=exampleDNAme), geneBodies, type="scaled")
o2 <- signal2Matrix(list(smoothed=exampleDNAme), geneBodies, type="scaled",
smooth=TRUE, limit=c(0,100)) # recommended for DNAme
# we add a third one to visualize the CpGs:
o3 <- signal2Matrix(list("noSmooth only CpGs"=exampleDNAme), geneBodies, type="scaled", background=NA)
o <- cbind(o1,o2,o3)
plotEnrichedHeatmaps(o, scale_title="%\nmethylation", axis_name=c("TSS","TES"))
```
Here we use `type="scaled"` to scale the gene bodies to the same size, since
these can have very different sizes. The first heatmap is the default (no
smoothing), while the center one uses smoothing (note that we cap values from 0
to 100 to avoid smoothing artefacts). The third doesn't use smoothing, but only
display values for bins that contain a CpG.
All heatmaps show a very clear absence of DNA methylation at the promoter of
these genes (upstream of the TSS) and predominantly methylated gene bodies.
However they disagree substantially on the methylation levels upstream the
promoter and downstream the transcription end sites (TES). This is because of
the density of these regions in (covered) CpG nucleotides. Since most genes are
rather long, most of the bins in the gene body heatmap contain a CpG, leading to
an actual methylation signal. In the flanking regions, however, this is not
necessarily the case, and the left heatmap does not distinguish bins that are
unmethylated form bins for which there is no information. On the right heatmap,
we can see that regions downstream the TES are CpG-depleted. Instead, the
smoothed heatmap (center) uses neighboring bins to estimate the methylation
status of each bin, effectively filling out the gaps. In doing so it provides
the truthful representation, i.e. that the regions downstream of the genes and
upstream of the promoters are, most of the time, as methylated as the gene
bodies.
Smoothing is performed by `r BiocStyle::Biocpkg("EnrichedHeatmap")`; see
`?EnrichedHeatmap::normalizeToMatrix` for more information/customization.
# Session information {-}
```{r sessionInfo}
sessionInfo()
```