--- title: "spatialdataR" date: "`r format(Sys.Date(), '%B %d, %Y')`" package: "`r BiocStyle::pkg_ver('spatialdataR')`" output: BiocStyle::html_document: toc: true toc_depth: 2 toc_float: true vignette: | %\VignetteIndexEntry{spatialdataR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: "refs.bib" --- ```{r setup, include=FALSE} knitr::opts_chunk$set(cache=FALSE, message=FALSE, warning=FALSE) ``` # Preamble ## Introduction The `r BiocStyle::Biocpkg("spatialdataR")` package provides an R interface to Python's [spatialdata](https://spatialdata.scverse.org) framework for unified handling of spatial omics data, including tabular annotations, vector- and raster-based components. Developed as part of the [scverse](https://scverse.org) project [@Virshup2023-scverse], `spatialdata` aims to solve the challenges of integrating diverse spatial datasets -- including -- by employing the [OME-NGFF (Next Generation File Format)](https://ngff.openmicroscopy.org) standard [@Marconato2025-SpatialData]. The Python implementation and core specifications are found at the [official `spatialdata` website](https://spatialdata.scverse.org). ## Representation The core data structure is the `SpatialData` class, which organizes data into 5 coordinated **layers: images, labels, points, shapes, and tables**. Each layer is stored as a list of layer-specific objects that carry associated `SpatialDataAttr` (`@meta` slot), which encode `spatialdata`-specific Zarr attributes (*.zattr* for Zarr v2, and *zarr.json* for Zarr v3). Together, these layers provide a unified representation of spatial omics data, combining raster, vector, and tabular data within a single coherent framework. **Images/labels** store raster-based data as multi-scale, multi-channel arrays (e.g., immunofluorescent images or segmentation masks). They are represented as `SpatialDataImage/Label` objects that, in turn, inherit from `SpatialDataArray`. These are backed by a list of `ZarrArray`s via `r BiocStyle::Biocpkg("Rarr")` and `r BiocStyle::Biocpkg("ZarrArray")`, enabling chunked, on-disk access. **Points/shapes** represent spatial coordinates and geometric regions (e.g., transcript locations or segmentation boundaries). They are represented as `SpatialDataPoint/Shape` objects that, in turn, inherit from `SpatialDataFrame.` These are DuckDB-backed by a `duckspatial_df`, enabling efficient lazy handling. **Tables** store functional annotations or information that has been aggregated across layers (e.g., gene $\times$ cell data). They are currently represented as in-memory `r BiocStyle::Biocpkg("SingleCellExperiment")` objects; delayed, Zarr-backed handling of assay data is under active development. ```{r schematic, echo=FALSE, fig.wide=TRUE} knitr::include_graphics("schematic.png") ``` # Handling `SpatialData` are represented on-disk as Zarr stores. The package provides the `readSpatialData()` function to ingest an entire store, although arguments to control which layers and elements to read or not to read are also available. For this demonstration, we use a toy dataset included in the package: ```{r load} # dependencies library(spatialdataR) library(SingleCellExperiment) # path to 'spatialdata' Zarr store zs <- file.path("extdata", "blobs.zarr") zs <- system.file(zs, package="spatialdataR") # read as 'SpatialData' R object (sd <- readSpatialData(zs)) ``` The output above summarizes the `SpatialData` object, showing the dimension of elements in each layer (images, labels, points, shapes, tables), as well as the defined coordinate systems and which elements align to each of these. ## Accession `SpatialData` objects behave like a nested list: the first level corresponds to layers, and the second level corresponds to elements within each layers. For convenience, frequently needed accessor functions are provided as well. We here demonstrate various equivalent ways of accessing a layer or element: ```{r get-one, results="hide"} # preferred way # (using accessor) image(sd, 1) image(sd, "blobs_image") # alternative ways # (using list-style) images(sd)[[1]] images(sd)[["blobs_image"]] sd$images[[1]] sd$images$blobs_image sd$images[["blobs_image"]] ``` Similarly, the following are equivalent ways of retrieving element names: ```{r get-nms, results="hide"} # preferred way # (using accessor) imageNames(sd) # alternative ways # (using list-style) names(sd[[1]]) names(sd$images) names(sd[["images"]]) ``` ## Subsetting Object-wide subsetting of `SpatialData` is supported via `[`, which can be used to drop layers, or elements within layers. Note that the following operations are not in-place, but return a `SpatialData` object with fewer layers/elements: ```{r sub, results="hide"} # keep image-layer only sd[1] sd["images"] # keep 1st image & label element sd[c(1, 2), list(1, 1)] sd[c("images", "labels"), list(1, 1)] ``` ## Internals Every spatial element (tables excluded) is composed of two key slots: - `data`: a list of `ZarrArray`s for images/labels, or a `duckspatial_df` for shapes/points. - `meta`: a `SpatialDataAttrs` object containing the OME-NGFF metadata retrieved from the zarr attributes present in the original Zarr store. We here demonstrate how to access these slots for a given element `image` elements represent a special case, as they are stored as a list of `ZarrArray`s (one per multi-scale resolution). For them, `data()` provides an additional argument `k` that specifies which resolution to retrieve: - `k=1` retrieves the highest resolution (default). - `k=Inf` retrieves the lowest resolution available. - `k=NULL` retrieves the full list of available resolutions. ```{r data} # get multi-scale image i <- image(sd, 2) a <- data(i, 1) # highest b <- data(i, Inf) # lowest # available resolutions l <- data(i, NULL) # dimensions of each d <- vapply(l, dim, integer(3)) rownames(d) <- c("c", "y", "x") colnames(d) <- seq_along(l) show(d) ``` # Annotations For single-cell and spatial omics datasets, functional annotations are commonly stored as [AnnData](https://anndata.readthedocs.io) objects in Python. In R, we use `r BiocStyle::Biocpkg("anndataR")` [@Deconinck2025-anndataR] to read these Zarr-backed `AnnData` as `r BiocStyle::Biocpkg("SingleCellExperiment")`(s). A `table` can link to one or more `label` or `shape` (but not other layers), whereby internal metadata (`spatialdata_attrs`) are used to keep track of the element(s) and observations being annotated. This is handled internally so the user needn't worry about it; however, we show it here for didactic purposes: ```{r tables} # access annotation (se <- table(sd)) # annotated region(s) region(se) # annotated instance(s) instances(se) ``` # Transformations A key feature of the `SpatialData` framework is its handling of different coordinate systems. Each element can exist in multiple coordinate spaces simultaneously, defined by transformations in its on-disk Zarr attributes. The relationships between different elements and their respective coordinate spaces can be complex. `SpatialData` provides the `CTgraph()` and `CTplot()` functions to construct and visualize a directed graph of these relationships: - **source nodes** (prefixed with `_`) represent individual elements. - **target nodes** represent the coordinate systems (e.g., `global`). - **edges** represent the transformations required to align an element. ```{r ct-graph, fig.width=6, fig.height=3} g <- CTgraph(sd) CTplot(g) ``` The `transform()` function resolves the necessary steps to project an element into a target coordinate system by traversing this graph, and applying the respective transformation(s). Under the hood, this involves: 1. Retrieving the relevant transformation data from the element's `SpatialDataAttr` (e.g., scale factors for x- and y-coordinates); and, 2. Applying the appropriate transformation function(s) in the correct order (e.g., `scale()` then `translation()`). ```{r transform} # get element a <- label(sd) # project into 'global' b <- spatialdataR::transform(a, "scale") # compare XY extents do.call(rbind, c(a=extent(a), b=extent(b))) ``` # Utilities ## Cropping `crop()` may be used to subset elements -- across all layers -- according to a *spatial* bounding box or polygon. This region may be supplied in different ways, including as a `SpatialDataShape`. In addition, the following are okay: - For bounding box cropping, an `sf::st_bbox()` object, or a list of `xmin/xmax/ymin/ymax` values (order irrelevant). - For polygon cropping, an `sf::st_polygon()` or `sf::st_sfc()` object, or a two-column matrix of XY coordinates (at least 3 rows = triangle). ```{r crop, echo=-1, results="hold", out.width="75%", fig.align="center"} par(mfrow=c(1,2), mar=c(0,1,2,1)) # bounding box xy <- list(xmin=-Inf, xmax=Inf, ymin=20, ymax=40) sp <- crop(sd, xy) plot( point(sd)$geometry, col="blue", main="crop() with\nbounding box") points(point(sp)$geometry, col="red") # polygon xy <- rbind(c(0, 0), c(64, 0), c(32, 64)) sq <- crop(sd, xy) plot( point(sd)$geometry, col="blue", main="crop() with\npolygon") points(point(sq)$geometry, col="red") ``` ## Masking `mask()` aggregates data between elements and across layers, with support for masking of points by images by labels, points by shapes, and shapes by shapes: - **point by shape** masking counts the number of points that fall within each shape (e.g., counting transcripts in cell membrane or nucleus segmentation boundaries in order to obtain a gene $\times$ cell-level data). - **image by label** masking aggregates channel-wise pixel values in an image according to the regions defined by a label (e.g., obtaining mean fluorescence intensities per cell). - **shape by shape** masking aggregates the data in a table of one shape by another shape (e.g., summarizing cell-level data into regions of interest). A couple considerations are also worth mentioning: - The identifier of the resulting `table` may be specified via `name`, which will default to `i_by_j` when masking element `i` by element `j`. - Instances of `i` that do not map to any instances of `j` (e.g., unassigned transcripts) will be assigned to a special "0" column in the resulting table. ```{r mask-image-label} # average channel-wise pixel values by labels sp <- mask(sd, i="blobs_image", j="blobs_labels") se <- table(sp, "blobs_image_by_blobs_labels") assay(se) ``` ```{r mask-point-shape} # count different point species in polygons sp <- mask(sd, i="blobs_points", j="blobs_polygons") se <- table(sp, "blobs_points_by_blobs_polygons") assay(se) ``` ```{r mask-shape-shape, eval=FALSE} # average shape-level data by other shapes sp <- mask(sd, i="blobs_polygons", j="blobs_circles") ``` ## Querying `query()` filters elements across all layers based on `table` metadata in `dplyr`-style syntax, where queries may be passed via the ellipsis (`...`): TODO ## Combining `combine()` can be used to merge two `SpatialData` objects into one (or many, via `do.call(list(...), combine)`). Here, elements names will be made unique across objects via `make.names()`, appending a suffix to the element names of subsequent objects. Alternatively, names could be customize before combining. ```{r combine} sp <- combine(list(foo=sd, bar=sd)) cbind( original=lengths(colnames(sd)), combined=lengths(colnames(sp))) imageNames(sp) ``` ## Coordinates `centroids()` may be used to extract spatial coordinates for every instance in a given element. This applies all layers except images and tables. Notably, for labels and shapes, the centroids of each region are returned (center of mass). ```{r centroids} head(centroids(point(sd))) ``` `extent()` will obtain the range of an element's spatial coordinates in a target coordinate space. This can be done for one element, or object-wide in order to obtain the largest extent across all elements in an object. ```{r} # object-wide xy <- extent(sd) unlist(xy) # one element xy <- extent(point(sd)) unlist(xy) # with prior alignment to target coordinate space xy <- extent(label(sd)) yx <- extent(label(sd), "scale") rbind(native=unlist(xy), scaled=unlist(yx)) ``` # Appendix ## Resources - [SpatialData.plot](https://github.com/HelenaLC/SpatialData.plot): companion package with `ggplot`-based visualization capabilities, including layered plotting of different elements with control over each's aesthetics, channel-mixing and auto-contrasting for images, transformations handling, etc. - [SpatialData.data](https://github.com/HelenaLC/SpatialData.data): companion package with example datasets from different platforms, including use of `r BiocStyle::Biocpkg("BiocFileCache")` for efficient data management. - [SpatialData.demo](https://github.com/HelenaLC/SpatialData.demo): companion repository with vignettes on analyzing datasets from different platforms, including transcriptomics and proteomics. ## Session info {- .smaller} ```{r sessionInfo} sessionInfo() ``` ## References