---
title: "Cloud Storage Access for GDS Files"
author: "Xiuwen Zheng"
date: "2026-05-01"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Cloud Storage Access for GDS Files}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```

# Introduction

GDS (Genomic Data Structure) is a high-performance file format for storing and accessing large-scale genomic data, implemented by the [gdsfmt](https://bioconductor.org/packages/gdsfmt) package. It supports hierarchical data organization with efficient random access and data compression.

The **gdscloud** package extends gdsfmt to provide transparent read-only access to GDS files stored on cloud storage services. It uses HTTP Range requests via libcurl with an efficient LRU block cache to minimize network overhead.

Supported backends:

- **HTTP/HTTPS**: any `http://` or `https://` URL (public or authenticated)
- **Amazon S3**: URLs of the form `s3://bucket/key`
- **Google Cloud Storage**: URLs of the form `gs://bucket/key`
- **Azure Blob Storage**: URLs of the form `az://container/blob`

# Installation

```R
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("gdscloud")
```

# Supported URL Schemes

gdscloud recognizes the following URL schemes:

```{r schemes}
library(gdscloud)
gdsCloudSchemes()
```

# Quick Start

Once gdscloud is loaded, `openfn.gds()` from the gdsfmt package automatically recognizes cloud URLs and opens them transparently:

```{r quickstart}
library(gdscloud)

# Open a GDS file from S3 (transparent via openfn.gds())
gds <- openfn.gds("s3://gds-stat/download/hapmap/hapmap_r23a.gds")
gds

table(read.gdsn(index.gdsn(gds, "chromosome")))
summary(read.gdsn(index.gdsn(gds, "position")))

closefn.gds(gds)
```

Alternatively, use the explicit function:

```{r explicit, eval=FALSE}
gds <- gdsCloudOpen("s3://my-bucket/data/example.gds")
closefn.gds(gds)
```

# Authentication

Access to private resources requires credentials for the corresponding backend. Credentials can be set either via environment variables (recommended for non-interactive use) or via R functions (convenient for interactive sessions). R-configured credentials take priority over environment variables.

## HTTP/HTTPS

For public URLs (e.g., files hosted on a web server with Range request support), no credentials are needed:

```{r open-http-public, eval=FALSE}
gds <- gdsCloudOpen("https://example.com/path/to/file.gds")
closefn.gds(gds)
```

For authenticated HTTP endpoints, set a Bearer token:

**Environment variable:**

```bash
export GDSCLOUD_HTTP_TOKEN=your_bearer_token
```

**R configuration:**

```{r auth-http}
gdsCloudConfigHTTP(bearer_token = "your_bearer_token")

# URL-specific token for a private server
gdsCloudConfigHTTP(
    bearer_token = "token_for_private_server",
    url = "https://private.example.com/"
)
```

Then open:

```{r open-http, eval=FALSE}
gds <- gdsCloudOpen("https://private.example.com/data/file.gds")
closefn.gds(gds)
```

## Amazon S3

**Environment variables:**

```bash
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
# Optional for temporary credentials:
export AWS_SESSION_TOKEN=your_token
```

**R configuration:**

```{r auth-s3}
gdsCloudConfigS3(
    aws_access_key_id = "your_key",
    aws_secret_access_key = "your_secret",
    region = "us-east-1"
)
```

Then open:

```{r open-s3, eval=FALSE}
gds <- gdsCloudOpen("s3://my-bucket/path/to/file.gds")
# ... work with the file ...
closefn.gds(gds)
```

## Google Cloud Storage

**Environment variable:**

```bash
export GCS_ACCESS_TOKEN=your_oauth2_token
```

**R configuration:**

```{r auth-gcs}
gdsCloudConfigGCS(access_token = "your_oauth2_token")
```

Then open:

```{r open-gcs, eval=FALSE}
gds <- gdsCloudOpen("gs://my-bucket/path/to/file.gds")
closefn.gds(gds)
```

## Azure Blob Storage

**Environment variables:**

```bash
export AZURE_STORAGE_ACCOUNT=your_account
export AZURE_STORAGE_KEY=your_key
# Or use a SAS token instead:
export AZURE_STORAGE_SAS_TOKEN=your_sas_token
```

**R configuration:**

```{r auth-azure}
gdsCloudConfigAzure(
    account_name = "mystorageaccount",
    account_key = "base64encodedkey=="
)

# Or with SAS token:
gdsCloudConfigAzure(
    account_name = "mystorageaccount",
    sas_token = "sv=2021-06-08&ss=b&srt=co&sp=r..."
)
```

Then open:

```{r open-azure, eval=FALSE}
gds <- gdsCloudOpen("az://my-container/path/to/file.gds")
closefn.gds(gds)
```

## URL-specific credentials

Sometimes different URLs need different credentials, for example two S3 buckets owned by different accounts, or a mix of public and private resources. Each `gdsCloudConfig*()` function accepts an optional `url` argument that associates the supplied credentials with a URL prefix rather than the global defaults:

```{r auth-url}
# Different keys for two S3 buckets
gdsCloudConfigS3(
    aws_access_key_id = "KEY_A",
    aws_secret_access_key = "SECRET_A",
    url = "s3://bucket-a/"
)
gdsCloudConfigS3(
    aws_access_key_id = "KEY_B",
    aws_secret_access_key = "SECRET_B",
    url = "s3://bucket-b/"
)

# You can also scope credentials to a sub-prefix within a bucket
gdsCloudConfigS3(
    aws_access_key_id = "KEY_SHARED",
    aws_secret_access_key = "SECRET_SHARED",
    url = "s3://bucket-a/shared/"
)

# Remove a previously registered URL-specific entry
gdsCloudConfigS3(url = "s3://bucket-a/")
```

When opening a URL, credentials are resolved in the following order (the first non-empty value wins for each field):

1. the URL-specific entry whose registered prefix is the longest prefix of the opened URL (within the same scheme);
2. the global values set by `gdsCloudConfig*()` with `url=NULL`;
3. the corresponding environment variable.

The URL scheme passed via `url=` must match the function: `http://` or `https://` for `gdsCloudConfigHTTP()`, `s3://` for `gdsCloudConfigS3()`, `gs://` for `gdsCloudConfigGCS()`, and `az://` for `gdsCloudConfigAzure()`. Registered prefixes are normalized by appending a trailing `/` if missing, so `"s3://bucket"` and `"s3://bucket/"` behave identically.

# Cache Control

Cloud access uses an LRU block cache (1 MB blocks) to minimize HTTP requests. The default cache size is 64 MB per stream.

```{r cache}
# Set the default cache size for new streams (in MB)
gdsCloudCacheSize(128)

# Clear all cached data
gdsCloudCacheClear()

# Display cache statistics
gdsCloudCacheInfo()

# List all open cloud streams
gdsCloudList()
```

Increasing cache size is beneficial when working with large files that require many random seeks (e.g., subsetting genotype matrices by sample and variant).

# Integration with SeqArray

Since gdscloud works transparently through `openfn.gds()`, packages built on gdsfmt, such as [SeqArray](https://bioconductor.org/packages/SeqArray), can open cloud-hosted files directly.

**Note:** SeqArray >= v1.53.1 is recommended for full cloud support. This version allows `seqParallel()` to automatically load cloud-related packages on worker processes, so parallel operations on cloud-hosted GDS files work seamlessly.

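The parallel behaviour described in the note can be sketched as follows. This is a minimal, unevaluated sketch rather than a tested recipe: it assumes SeqArray >= v1.53.1, network access to the public bucket used in this vignette, and two local worker processes. Restricting the work to `chr22` and computing allele frequencies are illustrative choices, not something gdscloud requires.

```{r seqarray-parallel, eval=FALSE}
library(SeqArray)
library(gdscloud)

# Public 1000 Genomes file hosted on S3
gds <- seqOpen("s3://gds-stat/download/1000g/2022/1kGP_high_coverage_Illumina.allchr.filtered.SNV_INDEL_SV_phased_panel.gds")

# Restrict the work to one chromosome to limit the amount of data fetched
seqSetFilterChrom(gds, include = "chr22")

# Compute reference allele frequencies on 2 worker processes;
# split = "by.variant" divides the selected variants among the workers
afreq <- seqParallel(2, gds, FUN = function(f) {
    seqAlleleFreq(f, ref.allele = 0L, parallel = FALSE)
}, split = "by.variant")

summary(afreq)

seqClose(gds)
```

With the default `.combine = "unlist"`, the per-worker results are concatenated into a single numeric vector with one value per selected variant. The basic, single-process workflow looks like this:
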
```{r seqarray, eval=FALSE}
library(SeqArray)
library(gdscloud)

# Open a SeqArray GDS file from S3
gds <- seqOpen("s3://gds-stat/download/1000g/2022/1kGP_high_coverage_Illumina.allchr.filtered.SNV_INDEL_SV_phased_panel.gds")
gds
## File: s3://gds-stat/download/1000g/2022/1kGP_high_coverage_Illumina.allchr.filtered.SNV_INDEL_SV_phased_panel.gds (2.4G)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { Str8 3202 LZMA_ra(6.61%), 1.7K } *
## |--+ variant.id   { Int32 73554796 LZMA_ra(2.36%), 6.6M } *
## |--+ position   { Int32 73554796 LZMA_ra(27.9%), 78.4M } *
## |--+ chromosome   { Str8 73554796 LZMA_ra(0.01%), 25.3K } *
## |--+ allele   { Str8 73554796 LZMA_ra(15.8%), 51.4M } *
## |--+ genotype   [  ] *
## |  |--+ data   { Bit2 2x3202x73554796 LZMA_ra(1.77%), 1.9G } *
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
## |  \--+ extra   { Int16 0 LZMA_ra, 18B }
## |--+ phase   [  ]
## |  |--+ data   { Bit1 3202x73554796 LZMA_ra(0.01%), 4.1M } *
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
## |  \--+ extra   { Bit1 0 LZMA_ra, 18B }
## |--+ annotation   [  ]
## |  |--+ id   { Str8 73554796 LZMA_ra(17.2%), 186.3M } *
## |  |--+ qual   { Float32 73554796 LZMA_ra(0.01%), 42.0K } *
## |  |--+ filter   { Int32,factor 73554796 LZMA_ra(0.01%), 42.0K } *
## |  |--+ info   [  ]
## |  |  |--+ AF   { Float32 73554796 LZMA_ra(23.6%), 66.3M } *
## |  |  |--+ AC   { Int32 73554796 LZMA_ra(22.3%), 62.5M } *
## |  |  |--+ CM   { Float32 73554796 LZMA_ra(6.08%), 17.1M } *
## |  |  |--+ AN   { Int32 73554796 LZMA_ra(0.01%), 42.0K } *
## |  |  \--+ SVTYPE   { Str8 73554796 LZMA_ra(0.33%), 240.0K } *
## |  \--+ format   [  ]
## \--+ sample.annotation   [  ]

seqSummary(gds)
## File: s3://gds-stat/download/1000g/2022/1kGP_high_coverage_Illumina.allchr.filtered.SNV_INDEL_SV_phased_panel.gds
## Format Version: v1.0
## Reference: unknown
## Ploidy: 2
## Number of samples: 3,202
## Number of variants: 73,554,796
## Chromosomes:
##     chr1 : 5759060, chr2 : 6088598, chr3 : 4983185, chr4 : 4875465, chr5 : 4536819, chr6 : 4315217
##     chr7 : 4137254, chr8 : 3886222, chr9 : 3165513, chr10: 3495473, chr11: 3423341, chr12: 3332788
##     chr13: 2509179, chr14: 2290400, chr15: 2109285, chr16: 2362361, chr17: 2073624, chr18: 1963845
##     chr19: 1670692, chr20: 1644384, chr21: 1002753, chr22: 1066557, chrX : 2862781
## ...

seqClose(gds)
```

No code changes are needed in downstream packages; loading gdscloud is sufficient to enable cloud URL support.

# Session Information

```{r pkg, echo=FALSE}
library(gdscloud, quietly=TRUE)
```

```{r session}
sessionInfo()
```