Cloud Storage Access for GDS Files

Introduction

GDS (Genomic Data Structure) is a high-performance file format for storing and accessing large-scale genomic data, implemented by the gdsfmt package. It supports hierarchical data organization with efficient random access and data compression.

The gdscloud package extends gdsfmt to provide transparent read-only access to GDS files stored on cloud storage services. It fetches data with HTTP Range requests via libcurl and keeps the fetched blocks in an LRU block cache to reduce the number of network requests.
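
To make the idea of block-wise Range requests concrete, the following base-R sketch maps a byte offset to a block and builds the matching Range header value. It assumes the 1 MB block size mentioned under Cache Control; the helpers block_index() and range_header() are hypothetical, are not part of gdscloud, and the package's internal implementation may differ.

```r
# Illustrative only: these helpers are NOT part of the gdscloud API.
block_size <- 1024 * 1024  # 1 MB blocks, as described under Cache Control

# Zero-based index of the block that contains a zero-based byte offset
block_index <- function(offset, size = block_size) {
    offset %/% size
}

# HTTP Range header value covering one whole block (inclusive byte range),
# clipped to the file size so the last block is not over-requested
range_header <- function(index, file_size, size = block_size) {
    first <- index * size
    last  <- min(first + size, file_size) - 1
    sprintf("bytes=%.0f-%.0f", first, last)
}

idx <- block_index(2500000)               # block holding byte 2,500,000
range_header(idx, file_size = 90700000)   # header value for that block
```

Reading any byte inside the same block would produce the same header, which is why caching whole blocks can save repeated requests.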

Supported backends:

  • HTTP/HTTPS — any http:// or https:// URL (public or authenticated)
  • Amazon S3 — URLs of the form s3://bucket/key
  • Google Cloud Storage — URLs of the form gs://bucket/key
  • Azure Blob Storage — URLs of the form az://container/blob

Installation

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("gdscloud")

Supported URL Schemes

gdscloud recognizes the following URL schemes:

library(gdscloud)
## Loading required package: gdsfmt
gdsCloudSchemes()
##                   http                  https                     s3 
##                 "HTTP"                "HTTPS"            "Amazon S3" 
##                     gs                     az 
## "Google Cloud Storage"   "Azure Blob Storage"
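
As a small illustration of how a path could be classified before opening it, the sketch below extracts the scheme from a URL and checks it against the scheme names shown in the output above. The helpers url_scheme() and is_cloud_url() are hypothetical and written in base R; they are not gdscloud functions.

```r
# Illustrative helpers (NOT part of gdscloud): classify a path by URL scheme.

# Scheme names copied from the gdsCloudSchemes() output above; with gdscloud
# loaded, names(gdsCloudSchemes()) should give the same values.
cloud_schemes <- c("http", "https", "s3", "gs", "az")

# Return the lower-cased scheme of a single URL, or NA if there is none
url_scheme <- function(url) {
    m <- regmatches(url, regexpr("^[A-Za-z][A-Za-z0-9+.-]*(?=://)", url, perl = TRUE))
    if (length(m) == 0) NA_character_ else tolower(m)
}

# TRUE if the URL uses one of the recognized cloud schemes
is_cloud_url <- function(url) {
    url_scheme(url) %in% cloud_schemes
}

is_cloud_url("s3://my-bucket/data/example.gds")
is_cloud_url("/local/path/example.gds")
```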

Quick Start

Once gdscloud is loaded, openfn.gds() from the gdsfmt package automatically recognizes cloud URLs and opens them transparently:

library(gdscloud)

# Open a GDS file from S3 — transparent via openfn.gds()
gds <- openfn.gds("s3://gds-stat/download/hapmap/hapmap_r23a.gds")
gds
## File: s3://gds-stat/download/hapmap/hapmap_r23a.gds (86.5M)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { Str8 270 LZMA_ra(17.3%), 381B } *
## |--+ variant.id   { Int32 4098136 LZMA_ra(3.19%), 511.4K } *
## |--+ position   { Int32 4098136 LZMA_ra(48.2%), 7.5M } *
## |--+ chromosome   { Str8 4098136 LZMA_ra(0.02%), 1.7K } *
## |--+ allele   { Str8 4098136 LZMA_ra(12.8%), 2.0M } *
## |--+ genotype   [  ] *
## |  |--+ data   { Bit2 2x270x4098136 LZMA_ra(12.4%), 65.4M } *
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
## |  \--+ extra   { Int16 0 LZMA_ra, 18B }
## |--+ phase   [  ]
## |  |--+ data   { Bit1 270x4098136 LZMA_ra(0.01%), 19.8K } *
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
## |  \--+ extra   { Bit1 0 LZMA_ra, 18B }
## |--+ annotation   [  ]
## |  |--+ id   { Str8 4098136 LZMA_ra(27.7%), 11.1M } *
## |  |--+ qual   { Float32 4098136 LZMA_ra(0.02%), 2.5K } *
## |  |--+ filter   { Int32,factor 4098136 LZMA_ra(0.02%), 2.5K } *
## |  |--+ info   [  ]
## |  \--+ format   [  ]
## \--+ sample.annotation   [  ]
##    |--+ family   { Str8 270 LZMA_ra(17.3%), 381B } *
##    |--+ father   { Str8 270 LZMA_ra(28.2%), 261B } *
##    |--+ mother   { Str8 270 LZMA_ra(28.2%), 261B } *
##    |--+ sex   { Str8 270 LZMA_ra(27.0%), 153B } *
##    \--+ phenotype   { Int32 270 LZMA_ra(8.70%), 101B } *

table(read.gdsn(index.gdsn(gds, "chromosome")))
## 
##      1     10     11     12     13     14     15     16     17     18     19 
## 318558 216535 209679 201179 161696 126523 109664 112428  91821 123089  58432 
##      2     20     21     22      3      4      5      6      7      8      9 
## 333056 122926  53454  58111 263547 252385 254297 278119 220384 222010 188661 
##     MT      X     XY      Y 
##    218 120679    362    323

summary(read.gdsn(index.gdsn(gds, "position")))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       410  32182751  68995421  78255683 114256766 247195690

closefn.gds(gds)

Alternatively, call gdsCloudOpen() explicitly:

gds <- gdsCloudOpen("s3://my-bucket/data/example.gds")
closefn.gds(gds)
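
Since every open file must eventually be closed, it can be convenient to wrap the open/close pair so that the handle is released even when an error occurs. The sketch below shows one way to do this with on.exit(); with_handle() is a hypothetical helper, not part of gdscloud or gdsfmt, and the stand-in open and close functions exist only so the pattern can be tried without network access.

```r
# Hypothetical helper (NOT part of gdscloud or gdsfmt): run `fun` on an
# opened handle and always close it afterwards, even if `fun` fails.
with_handle <- function(path, fun, open_fn, close_fn) {
    handle <- open_fn(path)
    on.exit(close_fn(handle), add = TRUE)
    fun(handle)
}

# Intended use with the functions shown above (requires gdscloud and
# network access, so it is left commented out here):
# ids <- with_handle(
#     "s3://my-bucket/data/example.gds",
#     function(gds) read.gdsn(index.gdsn(gds, "sample.id")),
#     open_fn = gdsCloudOpen, close_fn = closefn.gds
# )

# Stand-in open/close functions so the pattern can be tried offline
state <- new.env()
fake_open  <- function(path) { state$open <- TRUE; path }
fake_close <- function(handle) state$open <- FALSE

result <- with_handle("s3://my-bucket/data/example.gds", nchar,
                      fake_open, fake_close)
state$open   # FALSE: the handle was closed when with_handle() returned
```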

Authentication

Access to private resources requires credentials for the corresponding backend. Credentials can be set either via environment variables (recommended for non-interactive use) or via R functions (convenient for interactive sessions). When both are present, R-configured credentials take priority over environment variables.

HTTP/HTTPS

For public URLs (e.g., files hosted on a web server with Range request support), no credentials are needed:

gds <- gdsCloudOpen("https://example.com/path/to/file.gds")
closefn.gds(gds)

For authenticated HTTP endpoints, set a Bearer token:

Environment variable:

export GDSCLOUD_HTTP_TOKEN=your_bearer_token

R configuration:

gdsCloudConfigHTTP(bearer_token = "your_bearer_token")

# URL-specific token for a private server
gdsCloudConfigHTTP(
    bearer_token = "token_for_private_server",
    url = "https://private.example.com/"
)

Then open:

gds <- gdsCloudOpen("https://private.example.com/data/file.gds")
closefn.gds(gds)

Amazon S3

Environment variables:

export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
# Optional for temporary credentials:
export AWS_SESSION_TOKEN=your_token

R configuration:

gdsCloudConfigS3(
    aws_access_key_id = "your_key",
    aws_secret_access_key = "your_secret",
    region = "us-east-1"
)

Then open:

gds <- gdsCloudOpen("s3://my-bucket/path/to/file.gds")
# ... work with the file ...
closefn.gds(gds)

Google Cloud Storage

Environment variable:

export GCS_ACCESS_TOKEN=your_oauth2_token

R configuration:

gdsCloudConfigGCS(access_token = "your_oauth2_token")

Then open:

gds <- gdsCloudOpen("gs://my-bucket/path/to/file.gds")
closefn.gds(gds)

Azure Blob Storage

Environment variables:

export AZURE_STORAGE_ACCOUNT=your_account
export AZURE_STORAGE_KEY=your_key
# Or use a SAS token instead:
export AZURE_STORAGE_SAS_TOKEN=your_sas_token

R configuration:

gdsCloudConfigAzure(
    account_name = "mystorageaccount",
    account_key = "base64encodedkey=="
)
# Or with SAS token:
gdsCloudConfigAzure(
    account_name = "mystorageaccount",
    sas_token = "sv=2021-06-08&ss=b&srt=co&sp=r..."
)

Then open:

gds <- gdsCloudOpen("az://my-container/path/to/file.gds")
closefn.gds(gds)
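
When credentials come from environment variables, it can help to confirm from within R which of them are visible to the session without printing any secret values. The sketch below does this in base R using the variable names listed in the backend subsections above; credential_vars and credentials_set() are hypothetical names, not part of gdscloud.

```r
# Illustrative check (NOT part of gdscloud): which credential variables are
# set in the current R session? Values are never printed, only TRUE/FALSE.

# Environment variable names taken from the backend subsections above
credential_vars <- list(
    http = "GDSCLOUD_HTTP_TOKEN",
    s3   = c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
             "AWS_DEFAULT_REGION", "AWS_SESSION_TOKEN"),
    gs   = "GCS_ACCESS_TOKEN",
    az   = c("AZURE_STORAGE_ACCOUNT", "AZURE_STORAGE_KEY",
             "AZURE_STORAGE_SAS_TOKEN")
)

# Named logical vector: TRUE where the variable is set and non-empty
credentials_set <- function(backend) {
    vars <- credential_vars[[backend]]
    vapply(vars, function(v) nzchar(Sys.getenv(v)), logical(1))
}

credentials_set("s3")
```

Variables can also be set for the current session with Sys.setenv(), for example Sys.setenv(GCS_ACCESS_TOKEN = "your_oauth2_token").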

URL-specific credentials

Sometimes different URLs need different credentials, for example two S3 buckets owned by different accounts, or a mix of public and private resources. Each gdsCloudConfig*() function accepts an optional url argument that associates the supplied credentials with a URL prefix instead of setting the global defaults:

# Different keys for two S3 buckets
gdsCloudConfigS3(
    aws_access_key_id = "KEY_A",
    aws_secret_access_key = "SECRET_A",
    url = "s3://bucket-a/"
)
gdsCloudConfigS3(
    aws_access_key_id = "KEY_B",
    aws_secret_access_key = "SECRET_B",
    url = "s3://bucket-b/"
)

# You can also scope credentials to a sub-prefix within a bucket
gdsCloudConfigS3(
    aws_access_key_id = "KEY_SHARED",
    aws_secret_access_key = "SECRET_SHARED",
    url = "s3://bucket-a/shared/"
)

# Remove a previously registered URL-specific entry
gdsCloudConfigS3(url = "s3://bucket-a/")

When opening a URL, credentials are resolved in the following order (the first non-empty value wins for each field):

  1. the URL-specific entry whose registered prefix is the longest prefix of the opened URL (within the same scheme);
  2. the global values set by gdsCloudConfig*() with url=NULL;
  3. the corresponding environment variable.
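
The resolution order above can be modelled in a few lines of base R. This is only a sketch of the documented behavior, not gdscloud's implementation: resolve_field(), first_non_empty(), and the registry and global lists are hypothetical stand-ins for whatever the package stores internally.

```r
# Illustrative model of the lookup order; NOT gdscloud's implementation.

# Return the first argument that is neither NULL nor an empty string
first_non_empty <- function(...) {
    for (value in list(...)) {
        if (!is.null(value) && nzchar(value)) return(value)
    }
    ""
}

# `registry` maps normalized URL prefixes to lists of credential fields,
# `global` holds the defaults, `env_var` names the fallback variable.
resolve_field <- function(url, field, registry, global, env_var) {
    prefixes <- as.character(names(registry))
    hits <- prefixes[startsWith(url, prefixes)]
    # 1. URL-specific entry with the longest matching prefix
    specific <- if (length(hits) > 0) {
        registry[[hits[which.max(nchar(hits))]]][[field]]
    } else NULL
    # 2. global value, then 3. environment variable
    first_non_empty(specific, global[[field]], Sys.getenv(env_var))
}

registry <- list(
    "s3://bucket-a/"        = list(aws_access_key_id = "KEY_A"),
    "s3://bucket-a/shared/" = list(aws_access_key_id = "KEY_SHARED")
)
global <- list(aws_access_key_id = "KEY_GLOBAL", region = "us-east-1")

# The longest matching prefix wins for the key; the region falls back to
# the global value because the URL-specific entry does not define it.
resolve_field("s3://bucket-a/shared/x.gds", "aws_access_key_id",
              registry, global, "AWS_ACCESS_KEY_ID")
resolve_field("s3://bucket-a/shared/x.gds", "region",
              registry, global, "AWS_DEFAULT_REGION")
```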

The URL scheme passed via url= must match the function: http:// or https:// for gdsCloudConfigHTTP(), s3:// for gdsCloudConfigS3(), gs:// for gdsCloudConfigGCS(), and az:// for gdsCloudConfigAzure(). Registered prefixes are normalized by appending a trailing / if missing, so "s3://bucket" and "s3://bucket/" behave identically.
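
The scheme check and the trailing-slash normalization described above could look roughly like this in base R. normalize_prefix() is a hypothetical helper written for illustration; gdscloud's own validation and error messages may differ.

```r
# Illustrative only (NOT part of gdscloud): validate the scheme of a URL
# prefix and append a trailing "/" if it is missing.
normalize_prefix <- function(url, allowed_schemes) {
    scheme <- sub("://.*$", "", url)
    if (!grepl("://", url, fixed = TRUE) || !(scheme %in% allowed_schemes)) {
        stop("URL scheme must be one of: ",
             paste(allowed_schemes, collapse = ", "))
    }
    if (endsWith(url, "/")) url else paste0(url, "/")
}

normalize_prefix("s3://bucket", "s3")    # gains a trailing "/"
normalize_prefix("s3://bucket/", "s3")   # already normalized, unchanged
```

One plausible benefit of the trailing slash is that a prefix registered as "s3://bucket" cannot accidentally match URLs in a different bucket such as "s3://bucket-b/".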

Cache Control

Cloud access uses an LRU block cache (1 MB blocks) to minimize HTTP requests. The default cache size is 64 MB per stream.

# Set the default cache size for new streams (in MB)
gdsCloudCacheSize(128)

# Clear all cached data
gdsCloudCacheClear()

# Display cache statistics
gdsCloudCacheInfo()
## gdscloud cache settings:
##   Default cache size: 128 MB
##   Block size: 1 MB
##   Open cloud streams: 0 
##   Global cache hits: 0 
##   Global cache misses: 0

# List all open cloud streams
gdsCloudList()
## [1] url          file_size    cache_blocks cache_hits   cache_misses
## <0 rows> (or 0-length row.names)

Increasing the cache size is beneficial when working with large files that require many random seeks (e.g., subsetting genotype matrices by sample and variant).
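
To see why the cache size matters, the toy simulation below counts hits and misses of an LRU cache for a given sequence of block indices. It is a simplified base-R model, not gdscloud's cache: simulate_lru() is a hypothetical function, and the access pattern (two passes over 100 blocks) is made up for illustration.

```r
# Toy LRU block cache (NOT gdscloud's implementation): count cache hits and
# misses for a sequence of block indices and a capacity given in blocks.
simulate_lru <- function(blocks, capacity) {
    cache <- integer(0)   # least recently used first, most recent last
    hits <- 0L
    for (b in blocks) {
        pos <- match(b, cache)
        if (!is.na(pos)) {
            hits <- hits + 1L
            cache <- c(cache[-pos], b)   # move block to most-recent slot
        } else {
            if (length(cache) >= capacity) cache <- cache[-1]  # evict LRU
            cache <- c(cache, b)
        }
    }
    c(hits = hits, misses = length(blocks) - hits)
}

# Two passes over 100 blocks (about 100 MB of data with 1 MB blocks)
blocks <- rep(0:99, times = 2)

simulate_lru(blocks, capacity = 64)    # cache smaller than the working set
simulate_lru(blocks, capacity = 128)   # cache larger than the working set
```

In this model the 64-block cache never hits, because each block is evicted before it is requested again, while the 128-block cache serves the whole second pass from memory. Real access patterns are less regular, so the effect in practice depends on the workload.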

Integration with SeqArray

Since gdscloud works transparently through openfn.gds(), packages built on gdsfmt, such as SeqArray, can open cloud-hosted files directly.

Note: SeqArray >= v1.53.1 is recommended for full cloud support. This version allows seqParallel() to automatically load cloud-related packages on worker processes, so parallel operations on cloud-hosted GDS files work seamlessly.

library(SeqArray)
library(gdscloud)

# Open a SeqArray GDS file from S3
gds <- seqOpen("s3://gds-stat/download/1000g/2022/1kGP_high_coverage_Illumina.allchr.filtered.SNV_INDEL_SV_phased_panel.gds")
gds
## File: s3://gds-stat/download/1000g/2022/1kGP_high_coverage_Illumina.allchr.filtered.SNV_INDEL_SV_phased_panel.gds (2.4G)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { Str8 3202 LZMA_ra(6.61%), 1.7K } *
## |--+ variant.id   { Int32 73554796 LZMA_ra(2.36%), 6.6M } *
## |--+ position   { Int32 73554796 LZMA_ra(27.9%), 78.4M } *
## |--+ chromosome   { Str8 73554796 LZMA_ra(0.01%), 25.3K } *
## |--+ allele   { Str8 73554796 LZMA_ra(15.8%), 51.4M } *
## |--+ genotype   [  ] *
## |  |--+ data   { Bit2 2x3202x73554796 LZMA_ra(1.77%), 1.9G } *
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
## |  \--+ extra   { Int16 0 LZMA_ra, 18B }
## |--+ phase   [  ]
## |  |--+ data   { Bit1 3202x73554796 LZMA_ra(0.01%), 4.1M } *
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
## |  \--+ extra   { Bit1 0 LZMA_ra, 18B }
## |--+ annotation   [  ]
## |  |--+ id   { Str8 73554796 LZMA_ra(17.2%), 186.3M } *
## |  |--+ qual   { Float32 73554796 LZMA_ra(0.01%), 42.0K } *
## |  |--+ filter   { Int32,factor 73554796 LZMA_ra(0.01%), 42.0K } *
## |  |--+ info   [  ]
## |  |  |--+ AF   { Float32 73554796 LZMA_ra(23.6%), 66.3M } *
## |  |  |--+ AC   { Int32 73554796 LZMA_ra(22.3%), 62.5M } *
## |  |  |--+ CM   { Float32 73554796 LZMA_ra(6.08%), 17.1M } *
## |  |  |--+ AN   { Int32 73554796 LZMA_ra(0.01%), 42.0K } *
## |  |  \--+ SVTYPE   { Str8 73554796 LZMA_ra(0.33%), 240.0K } *
## |  \--+ format   [  ]
## \--+ sample.annotation   [  ]

seqSummary(gds)
## File: s3://gds-stat/download/1000g/2022/1kGP_high_coverage_Illumina.allchr.filtered.SNV_INDEL_SV_phased_panel.gds
## Format Version: v1.0
## Reference: unknown
## Ploidy: 2
## Number of samples: 3,202
## Number of variants: 73,554,796
## Chromosomes:
##     chr1 : 5759060, chr2 : 6088598, chr3 : 4983185, chr4 : 4875465, chr5 : 4536819, chr6 : 4315217
##     chr7 : 4137254, chr8 : 3886222, chr9 : 3165513, chr10: 3495473, chr11: 3423341, chr12: 3332788
##     chr13: 2509179, chr14: 2290400, chr15: 2109285, chr16: 2362361, chr17: 2073624, chr18: 1963845
##     chr19: 1670692, chr20: 1644384, chr21: 1002753, chr22: 1066557, chrX : 2862781
## ...

seqClose(gds)

No code changes are needed in downstream packages; loading gdscloud is sufficient to enable cloud URL support.

Session Information

sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gdscloud_0.99.2  gdsfmt_1.49.3    BiocStyle_2.41.0
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.39       R6_2.6.1            fastmap_1.2.0      
##  [4] xfun_0.58           maketools_1.3.2     cachem_1.1.0       
##  [7] knitr_1.51          htmltools_0.5.9     rmarkdown_2.31     
## [10] buildtools_1.0.0    lifecycle_1.0.5     cli_3.6.6          
## [13] sass_0.4.10         jquerylib_0.1.4     compiler_4.6.0     
## [16] sys_3.4.3           tools_4.6.0         bslib_0.11.0       
## [19] evaluate_1.0.5      yaml_2.3.12         otel_0.2.0         
## [22] BiocManager_1.30.27 crayon_1.5.3        jsonlite_2.0.0     
## [25] rlang_1.2.0