---
title: "Scaling reglScatterplot to millions of points"
author: "George Muñoz"
date: "`r Sys.Date()`"
package: reglScatterplotR
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Scaling reglScatterplot to millions of points}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  screenshot.force = FALSE
)
```

# What works at what scale

`reglScatterplot()` was designed for the size of typical single-cell and
spatial datasets, but it can push well past that.

| Point count  | Status                              | Notes                                                         |
|-------------:|:------------------------------------|:--------------------------------------------------------------|
| 1 - 500 000  | Flawless                            | Below the auto performance-mode threshold; full interactivity |
| 500 k - 5 M  | Smooth                              | `performanceMode` kicks in automatically                      |
| 5 M - 20 M   | Usable                              | Use `pointSize = 1`, `opacity = 1`, drop `pointLabels`        |
| 20 M - 100 M | Standalone HTML reaches RAM ceiling | Tile-based architectures (e.g. deepscatter) start to win      |
| > 100 M      | Out of reach in-browser             | Server-side rendering / WebGPU territory                      |

# How the wire format works

To keep large datasets shippable inside a standalone `htmlwidget`, every
numeric channel is binary-encoded and base64-wrapped before transit:

| Channel             | Encoder                | Precision        | Bytes / point |
|---------------------|------------------------|------------------|--------------:|
| X / Y (normalised)  | `.toBase64U16()`       | 1 / 32 767       |             2 |
| Continuous color z  | `.toBase64U16Unit()`   | 1 / 65 535       |             2 |
| Categorical color z | `.toBase64U16Int()`    | exact (< 65 536) |             2 |
| Filter ranges       | `toBase64()` (Float32) | full f32         |             4 |

At 10 M points the resulting HTML file is around 80 - 90 MB - large but
finite. The same data with Float32 everywhere would be ~150 MB.

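
The 16-bit idea behind the table can be illustrated with a minimal sketch in base R. The helpers `quantise_u16()`, `dequantise_u16()` and `pack_u16()` are hypothetical and are not the package's internal encoders; they assume the coordinates are already normalised to [-1, 1], which gives a step of 2 / 65 535 (roughly the 1 / 32 767 listed for X / Y, though that correspondence is itself an assumption).

```r
# Hypothetical helpers, not part of reglScatterplotR. They assume the input is
# already normalised to [-1, 1]; the package's own encoder may differ.

# Map [-1, 1] onto the unsigned 16-bit range 0..65535.
quantise_u16 <- function(x) {
  stopifnot(is.numeric(x), all(x >= -1 & x <= 1))
  as.integer(round((x + 1) / 2 * 65535))
}

# Inverse mapping, as a decoder would apply it before drawing.
dequantise_u16 <- function(q) {
  q / 65535 * 2 - 1
}

# Pack each value into two bytes, low byte first (little-endian).
pack_u16 <- function(q) {
  as.raw(as.vector(rbind(q %% 256L, q %/% 256L)))
}

set.seed(1)
x <- runif(1000, min = -1, max = 1)
q <- quantise_u16(x)
bytes <- pack_u16(q)

length(bytes) / length(x)        # 2 bytes per point
max(abs(dequantise_u16(q) - x))  # at most half a step, i.e. 1 / 65535
```

The round-trip error is bounded by half a quantisation step, which is far below one pixel on any realistic canvas; that is the trade that lets each channel cost 2 bytes instead of 4.
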
# A benchmark you can run yourself

```{r bench}
library(reglScatterplotR)

bench_sizes <- c(1e4, 1e5, 1e6, 5e6)

for (n in bench_sizes) {
  df <- data.frame(x = rnorm(n), y = rnorm(n), v = runif(n))

  # The timer covers widget construction plus JSON serialisation.
  t0 <- Sys.time()
  w <- reglScatterplot(df,
    x = "x", y = "y", colorBy = "v",
    height = 600
  )
  payload <- htmlwidgets:::toJSON(w$x)

  cat(sprintf(
    "n = %s : build = %.2fs, payload = %.1f MB\n",
    # scientific = FALSE keeps 1e5 from printing as "1e+05".
    format(n, big.mark = ",", scientific = FALSE),
    as.numeric(Sys.time() - t0, units = "secs"),
    nchar(payload) / 1024 / 1024
  ))

  # Free memory before the next, larger iteration.
  rm(df, w, payload)
  gc(verbose = FALSE)
}
```

On a 2020-era laptop with an RTX 2060, 5 M points take about 1.5 s on the R
side and another ~2 s for the browser to parse the payload and upload it to
the GPU; pan/zoom then runs at 60 fps.

# Sizing inside the host viewport

The widget honours an explicit pixel `height` verbatim. If the value exceeds
the height of the host window (small browser tab, RStudio Viewer in a tiling
WM, narrow Jupyter notebook column, etc.), the bottom of the canvas is clipped
by the host - not by `reglScatterplot`.

```{r sizing, eval = FALSE}
# Bad in small viewports: a 500 px tall widget overflows a 450 px window.
reglScatterplot(df, x = "x", y = "y", height = 500)

# Good: fill whatever vertical space is available.
reglScatterplot(df, x = "x", y = "y", height = "100%")

# Also good: omit `height` entirely - the sizingPolicy fills the viewer pane.
reglScatterplot(df, x = "x", y = "y")
```

Knitting to HTML produces a full-page document where the widget can take as
much height as you give it, so the same code that clips in the Viewer pane
prints cleanly in a knit report. This is purely a viewport effect.

# Memory levers for very large data

When you really want to push past 5 M, every per-point byte counts.

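
To see how quickly per-point bytes add up, the back-of-envelope sketch below multiplies the per-channel byte counts from the wire-format table by the point count and applies the standard 4/3 base64 overhead. `estimate_payload_mb()` is a hypothetical helper, not a package function, and it ignores JSON framing, labels and the widget's JavaScript, so treat the result as a lower bound.

```r
# Hypothetical helper (not part of reglScatterplotR): rough size of the
# base64-encoded numeric channels, in megabytes (1 MB = 1e6 bytes).
estimate_payload_mb <- function(n_points, bytes_per_point) {
  raw_bytes <- n_points * bytes_per_point
  # base64 emits 4 output characters for every 3 input bytes (padded).
  b64_bytes <- 4 * ceiling(raw_bytes / 3)
  b64_bytes / 1e6
}

n <- 10e6

# x, y and one colour channel at 2 bytes each (16-bit encoding).
u16_mb <- estimate_payload_mb(n, bytes_per_point = 3 * 2)

# The same three channels stored as Float32 (4 bytes each).
f32_mb <- estimate_payload_mb(n, bytes_per_point = 3 * 4)

cat(sprintf("16-bit: %.0f MB, Float32: %.0f MB\n", u16_mb, f32_mb))
# prints "16-bit: 80 MB, Float32: 160 MB"
```

These estimates are in the same ballpark as the 80 - 90 MB and ~150 MB figures quoted for 10 M points, and they show that each extra 2-byte channel costs roughly 27 MB at that scale.
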
Suggested defaults for huge inputs:

```{r huge, eval = FALSE}
reglScatterplot(huge_df,
  x = "x", y = "y",
  pointSize      = 1,      # one pixel per point
  opacity        = 1,      # no blending math
  showAxes       = FALSE,  # drops the D3 axis layer
  showTooltip    = FALSE,  # frees per-point hit-test work
  enableDownload = FALSE,  # no html2canvas / jsPDF download
  pointLabels    = NULL    # don't ship gene names
)
```

Things you might think help but don't:

* Reducing `vmin` / `vmax` clip range - colour scale only, not memory.
* Setting `legendPosition = "bottom-left"` - cosmetic, no perf impact.

# Comparison with other R packages

`reglScatterplot` is one of three credible options for high-volume scatter in
R; `ggplot2` is listed as a small-data baseline. They aren't doing the same
thing:

| Package                  | Interactive? | Best at                            | Limit                        |
|--------------------------|:------------:|------------------------------------|------------------------------|
| `reglScatterplot`        | Yes          | 1 - 20 M points in HTML / Shiny    | Browser RAM / VRAM           |
| `plotly` (+ `toWebGL()`) | Yes          | < 500 k points, broad feature set  | JSON payload bloats past 1 M |
| `scattermore`            | No (static)  | Quickly rasterising 10 M+ to a PNG | No pan / zoom interactivity  |
| `ggplot2`                | No (static)  | Publication graphics, small data   | Practical ceiling ~50 k pts  |

The right choice depends on what you need:

* Want a printable figure? `ggplot2` or `scattermore`.
* Want to embed an interactive plot in an HTML report? `reglScatterplot`.
* Need brush, click, and faceted layouts more than scale? `plotly`.

# Where the next jump comes from

For genuinely huge data (multi-modal CosMx slides, whole-atlas integrations
beyond ~50 M cells), no in-browser library is the right answer today. The
viable paths are:

1. **Tile-based architectures** - precompute spatial tiles on disk, only load
   what the viewport needs. See `deepscatter` (Apache Arrow + Parquet tiles).
   Requires a server or a static tile directory.
2. **Server-side rendering** - send camera state to a Python / Julia backend
   that renders frames; stream them as images.
   Lower fidelity but independent of the client.
3. **WebGPU** - browser support is maturing; offers compute shaders that would
   let us do GPU-side filtering and density binning. Currently a two-year
   horizon.

For now, `reglScatterplot` covers the typical single-cell, spatial and
fold-change use cases comfortably. If you find yourself loading the same
10 M+ dataset repeatedly, the right next step is to switch to a tile server,
not a faster scatterplot.

# Session info

```{r sessionInfo, eval = TRUE}
sessionInfo()
```