GDS (Genomic Data Structure) is a high-performance file format for storing and accessing large-scale genomic data, implemented by the gdsfmt package. It supports hierarchical data organization with efficient random access and data compression.
The gdscloud package extends gdsfmt to provide transparent read-only access to GDS files stored on cloud storage services. It uses HTTP Range requests via libcurl with an efficient LRU block cache to minimize network overhead.
Supported backends and the URL schemes gdscloud recognizes:
http:// or https:// URL (public or authenticated)
s3://bucket/key
gs://bucket/key
az://container/blob
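The scheme at the start of a URL is what identifies the backend. As a minimal sketch in base R, the helper below maps a URL to a backend label by its scheme. The function name `cloud_backend` and the labels it returns are hypothetical and only illustrate the idea; this is not how gdscloud itself dispatches.

```r
# Hypothetical helper (not part of gdscloud): classify URLs by scheme.
cloud_backend <- function(url) {
  pattern <- "^([A-Za-z][A-Za-z0-9+.-]*)://.*$"
  # Extract the lower-cased scheme, or NA when the string has no scheme
  scheme <- ifelse(grepl(pattern, url),
                   tolower(sub(pattern, "\\1", url)),
                   NA_character_)
  # Illustrative labels for the schemes listed in this document
  backends <- c(http = "HTTP", https = "HTTP",
                s3 = "S3", gs = "GCS", az = "Azure")
  # Unknown schemes and plain file paths map to NA
  unname(backends[scheme])
}

cloud_backend(c("s3://bucket/key", "https://example.com/a.gds", "/tmp/local.gds"))
```

Strings without a recognized scheme, such as local paths, come back as NA in this sketch.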
Once gdscloud is loaded, openfn.gds() from the gdsfmt
package automatically recognizes cloud URLs and opens them
transparently:
library(gdscloud)
# Open a GDS file from S3 — transparent via openfn.gds()
gds <- openfn.gds("s3://gds-stat/download/hapmap/hapmap_r23a.gds")
gds
## File: s3://gds-stat/download/hapmap/hapmap_r23a.gds (86.5M)
## + [ ] *
## |--+ description [ ] *
## |--+ sample.id { Str8 270 LZMA_ra(17.3%), 381B } *
## |--+ variant.id { Int32 4098136 LZMA_ra(3.19%), 511.4K } *
## |--+ position { Int32 4098136 LZMA_ra(48.2%), 7.5M } *
## |--+ chromosome { Str8 4098136 LZMA_ra(0.02%), 1.7K } *
## |--+ allele { Str8 4098136 LZMA_ra(12.8%), 2.0M } *
## |--+ genotype [ ] *
## | |--+ data { Bit2 2x270x4098136 LZMA_ra(12.4%), 65.4M } *
## | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
## | \--+ extra { Int16 0 LZMA_ra, 18B }
## |--+ phase [ ]
## | |--+ data { Bit1 270x4098136 LZMA_ra(0.01%), 19.8K } *
## | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
## | \--+ extra { Bit1 0 LZMA_ra, 18B }
## |--+ annotation [ ]
## | |--+ id { Str8 4098136 LZMA_ra(27.7%), 11.1M } *
## | |--+ qual { Float32 4098136 LZMA_ra(0.02%), 2.5K } *
## | |--+ filter { Int32,factor 4098136 LZMA_ra(0.02%), 2.5K } *
## | |--+ info [ ]
## | \--+ format [ ]
## \--+ sample.annotation [ ]
## |--+ family { Str8 270 LZMA_ra(17.3%), 381B } *
## |--+ father { Str8 270 LZMA_ra(28.2%), 261B } *
## |--+ mother { Str8 270 LZMA_ra(28.2%), 261B } *
## |--+ sex { Str8 270 LZMA_ra(27.0%), 153B } *
## \--+ phenotype { Int32 270 LZMA_ra(8.70%), 101B } *
table(read.gdsn(index.gdsn(gds, "chromosome")))
##
## 1 10 11 12 13 14 15 16 17 18 19
## 318558 216535 209679 201179 161696 126523 109664 112428 91821 123089 58432
## 2 20 21 22 3 4 5 6 7 8 9
## 333056 122926 53454 58111 263547 252385 254297 278119 220384 222010 188661
## MT X XY Y
## 218 120679 362 323
summary(read.gdsn(index.gdsn(gds, "position")))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 410 32182751 68995421 78255683 114256766 247195690
closefn.gds(gds)
Alternatively, use the explicit function:
Access to private resources requires backend-specific credentials. Credentials can be set either via environment variables (recommended for non-interactive use) or via R functions (convenient for interactive sessions). R-configured credentials take priority over environment variables.
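When relying on environment variables, it can help to confirm from within R that they are visible to the session before opening a file. The sketch below uses only base R; the helper name `env_is_set` is hypothetical, and the variable names are the AWS ones listed later in this document.

```r
# Hypothetical helper: report which environment variables are set (non-empty).
env_is_set <- function(vars) {
  vals <- Sys.getenv(vars, unset = "")
  setNames(nzchar(vals), vars)
}

# Check the S3-related variables; the result depends on the current session
env_is_set(c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"))
```

The helper returns a named logical vector and never prints the values themselves, so it is safe to run in logged sessions.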
For public URLs (e.g., files hosted on a web server with Range request support), no credentials are needed:
For authenticated HTTP endpoints, set a Bearer token:
Environment variable:
R configuration:
gdsCloudConfigHTTP(bearer_token = "your_bearer_token")
# URL-specific token for a private server
gdsCloudConfigHTTP(
bearer_token = "token_for_private_server",
url = "https://private.example.com/"
)
Then open:
Environment variables:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
# Optional for temporary credentials:
export AWS_SESSION_TOKEN=your_token
R configuration:
gdsCloudConfigS3(
aws_access_key_id = "your_key",
aws_secret_access_key = "your_secret",
region = "us-east-1"
)
Then open:
Environment variable:
R configuration:
Then open:
Environment variables:
export AZURE_STORAGE_ACCOUNT=your_account
export AZURE_STORAGE_KEY=your_key
# Or use a SAS token instead:
export AZURE_STORAGE_SAS_TOKEN=your_sas_token
R configuration:
gdsCloudConfigAzure(
account_name = "mystorageaccount",
account_key = "base64encodedkey=="
)
# Or with SAS token:
gdsCloudConfigAzure(
account_name = "mystorageaccount",
sas_token = "sv=2021-06-08&ss=b&srt=co&sp=r..."
)
Then open:
Sometimes different URLs need different credentials — for example,
two S3 buckets owned by different accounts, or a mix of public and
private resources. Each gdsCloudConfig*() function accepts
an optional url argument that associates the supplied
credentials with a URL prefix rather than the global defaults:
# Different keys for two S3 buckets
gdsCloudConfigS3(
aws_access_key_id = "KEY_A",
aws_secret_access_key = "SECRET_A",
url = "s3://bucket-a/"
)
gdsCloudConfigS3(
aws_access_key_id = "KEY_B",
aws_secret_access_key = "SECRET_B",
url = "s3://bucket-b/"
)
# You can also scope credentials to a sub-prefix within a bucket
gdsCloudConfigS3(
aws_access_key_id = "KEY_SHARED",
aws_secret_access_key = "SECRET_SHARED",
url = "s3://bucket-a/shared/"
)
# Remove a previously registered URL-specific entry
gdsCloudConfigS3(url = "s3://bucket-a/")
When opening a URL, credentials are resolved in the following order (the first non-empty value wins for each field):
1. URL-specific credentials registered via gdsCloudConfig*() with a matching url prefix;
2. global credentials set via gdsCloudConfig*() with url=NULL;
3. environment variables.
The URL scheme passed via url= must match the function: http:// or https:// for gdsCloudConfigHTTP(), s3:// for gdsCloudConfigS3(), gs:// for gdsCloudConfigGCS(), and az:// for gdsCloudConfigAzure(). Registered prefixes are normalized by appending a trailing / if missing, so "s3://bucket" and "s3://bucket/" behave identically.
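The prefix normalization and lookup described above can be sketched in base R. Everything here is illustrative: the names `normalize_prefix`, `register_creds` and `find_creds` are hypothetical, the credentials are placeholders, and the rule that the longest matching prefix wins is an assumption of this sketch rather than documented gdscloud behavior.

```r
# Append a trailing "/" if missing, as described for registered prefixes
normalize_prefix <- function(prefix) {
  if (endsWith(prefix, "/")) prefix else paste0(prefix, "/")
}

# Hypothetical registry: a named list keyed by normalized URL prefix
register_creds <- function(registry, prefix, creds) {
  registry[[normalize_prefix(prefix)]] <- creds
  registry
}

# Return the credentials whose prefix matches the URL, or NULL if none does.
# ASSUMPTION: when several prefixes match, the longest one wins.
find_creds <- function(registry, url) {
  prefixes <- names(registry)
  if (is.null(prefixes)) return(NULL)
  hits <- prefixes[startsWith(url, prefixes)]
  if (length(hits) == 0) return(NULL)
  registry[[hits[which.max(nchar(hits))]]]
}

registry <- list()
registry <- register_creds(registry, "s3://bucket-a", list(key = "KEY_A"))
registry <- register_creds(registry, "s3://bucket-a/shared/", list(key = "KEY_SHARED"))

find_creds(registry, "s3://bucket-a/shared/x.gds")$key  # sub-prefix entry
find_creds(registry, "s3://bucket-a/other/x.gds")$key   # bucket-level entry
find_creds(registry, "s3://bucket-b/x.gds")             # no match
```

In this sketch the trailing slash also keeps a prefix registered for `s3://bucket-a` from matching an unrelated bucket such as `s3://bucket-ab/`.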
Cloud access uses an LRU block cache (1 MB blocks) to minimize HTTP requests. The default cache size is 64 MB per stream.
# Set the default cache size for new streams (in MB)
gdsCloudCacheSize(128)
# Clear all cached data
gdsCloudCacheClear()
# Display cache statistics
gdsCloudCacheInfo()
## gdscloud cache settings:
## Default cache size: 128 MB
## Block size: 1 MB
## Open cloud streams: 0
## Global cache hits: 0
## Global cache misses: 0
# List all open cloud streams
gdsCloudList()
## [1] url file_size cache_blocks cache_hits cache_misses
## <0 rows> (or 0-length row.names)
Increasing cache size is beneficial when working with large files that require many random seeks (e.g., subsetting genotype matrices by sample and variant).
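To see why cache size matters for random access, the following toy simulation counts hits and misses of an LRU block cache over a sequence of byte reads. It is a base R sketch only: the function `simulate_lru` is hypothetical, the defaults mirror the 1 MB block size and 64 MB cache mentioned above, and treating each miss as one network request is an assumption, not a description of gdscloud internals.

```r
# Toy LRU block cache: count hits and misses for a sequence of byte reads.
simulate_lru <- function(offsets, lengths,
                         block_size = 1024^2, capacity_blocks = 64) {
  cache <- numeric(0)   # cached block indices, least recently used first
  hits <- 0L
  misses <- 0L
  for (i in seq_along(offsets)) {
    # Blocks touched by the byte range [offset, offset + length - 1]
    first <- offsets[i] %/% block_size
    last  <- (offsets[i] + lengths[i] - 1) %/% block_size
    for (b in seq(first, last)) {
      pos <- match(b, cache)
      if (!is.na(pos)) {
        hits <- hits + 1L
        cache <- cache[-pos]      # removed here, re-appended as most recent
      } else {
        misses <- misses + 1L     # assumed to cost one network request
        if (length(cache) >= capacity_blocks) cache <- cache[-1]  # evict LRU
      }
      cache <- c(cache, b)
    }
  }
  c(hits = hits, misses = misses)
}

# Four small reads: block 0, block 0 again, block 1, then back to block 0
offsets <- c(0, 100, 1024^2, 0)
lengths <- rep(10, 4)
simulate_lru(offsets, lengths, capacity_blocks = 1)  # 1 hit, 3 misses
simulate_lru(offsets, lengths, capacity_blocks = 2)  # 2 hits, 2 misses
```

With room for only one block, returning to block 0 forces a refetch; with two blocks the same access pattern is served from the cache.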
Since gdscloud works transparently through openfn.gds(),
packages built on gdsfmt, such as SeqArray, can open
cloud-hosted files directly.
Note: SeqArray >= v1.53.1 is recommended for full
cloud support. This version allows seqParallel() to
automatically load cloud-related packages on worker processes, so
parallel operations on cloud-hosted GDS files work seamlessly.
library(SeqArray)
library(gdscloud)
# Open a SeqArray GDS file from S3
gds <- seqOpen("s3://gds-stat/download/1000g/2022/1kGP_high_coverage_Illumina.allchr.filtered.SNV_INDEL_SV_phased_panel.gds")
gds
## File: s3://gds-stat/download/1000g/2022/1kGP_high_coverage_Illumina.allchr.filtered.SNV_INDEL_SV_phased_panel.gds (2.4G)
## + [ ] *
## |--+ description [ ] *
## |--+ sample.id { Str8 3202 LZMA_ra(6.61%), 1.7K } *
## |--+ variant.id { Int32 73554796 LZMA_ra(2.36%), 6.6M } *
## |--+ position { Int32 73554796 LZMA_ra(27.9%), 78.4M } *
## |--+ chromosome { Str8 73554796 LZMA_ra(0.01%), 25.3K } *
## |--+ allele { Str8 73554796 LZMA_ra(15.8%), 51.4M } *
## |--+ genotype [ ] *
## | |--+ data { Bit2 2x3202x73554796 LZMA_ra(1.77%), 1.9G } *
## | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
## | \--+ extra { Int16 0 LZMA_ra, 18B }
## |--+ phase [ ]
## | |--+ data { Bit1 3202x73554796 LZMA_ra(0.01%), 4.1M } *
## | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
## | \--+ extra { Bit1 0 LZMA_ra, 18B }
## |--+ annotation [ ]
## | |--+ id { Str8 73554796 LZMA_ra(17.2%), 186.3M } *
## | |--+ qual { Float32 73554796 LZMA_ra(0.01%), 42.0K } *
## | |--+ filter { Int32,factor 73554796 LZMA_ra(0.01%), 42.0K } *
## | |--+ info [ ]
## | | |--+ AF { Float32 73554796 LZMA_ra(23.6%), 66.3M } *
## | | |--+ AC { Int32 73554796 LZMA_ra(22.3%), 62.5M } *
## | | |--+ CM { Float32 73554796 LZMA_ra(6.08%), 17.1M } *
## | | |--+ AN { Int32 73554796 LZMA_ra(0.01%), 42.0K } *
## | | \--+ SVTYPE { Str8 73554796 LZMA_ra(0.33%), 240.0K } *
## | \--+ format [ ]
## \--+ sample.annotation [ ]
seqSummary(gds)
## File: s3://gds-stat/download/1000g/2022/1kGP_high_coverage_Illumina.allchr.filtered.SNV_INDEL_SV_phased_panel.gds
## Format Version: v1.0
## Reference: unknown
## Ploidy: 2
## Number of samples: 3,202
## Number of variants: 73,554,796
## Chromosomes:
## chr1 : 5759060, chr2 : 6088598, chr3 : 4983185, chr4 : 4875465, chr5 : 4536819, chr6 : 4315217
## chr7 : 4137254, chr8 : 3886222, chr9 : 3165513, chr10: 3495473, chr11: 3423341, chr12: 3332788
## chr13: 2509179, chr14: 2290400, chr15: 2109285, chr16: 2362361, chr17: 2073624, chr18: 1963845
## chr19: 1670692, chr20: 1644384, chr21: 1002753, chr22: 1066557, chrX : 2862781
## ...
seqClose(gds)
No code changes are needed in downstream packages; loading gdscloud is sufficient to enable cloud URL support.
sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gdscloud_0.99.2 gdsfmt_1.49.3 BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.39 R6_2.6.1 fastmap_1.2.0
## [4] xfun_0.58 maketools_1.3.2 cachem_1.1.0
## [7] knitr_1.51 htmltools_0.5.9 rmarkdown_2.31
## [10] buildtools_1.0.0 lifecycle_1.0.5 cli_3.6.6
## [13] sass_0.4.10 jquerylib_0.1.4 compiler_4.6.0
## [16] sys_3.4.3 tools_4.6.0 bslib_0.11.0
## [19] evaluate_1.0.5 yaml_2.3.12 otel_0.2.0
## [22] BiocManager_1.30.27 crayon_1.5.3 jsonlite_2.0.0
## [25] rlang_1.2.0