PLINK is one of
the most widely used toolsets in statistical genetics, providing fast
and memory-efficient methods for quality control, association testing,
and population-stratification analysis of large-scale genotype data. Its
binary file format — comprising .pgen (genotypes),
.pvar (variant metadata), and .psam (sample
metadata) files — offers compact storage and rapid access for datasets
with millions of variants and hundreds of thousands of samples.
Genomic Data Structure (GDS) is a hierarchical, array-oriented container format built on the CoreArray C++ library. The SeqArray package extends GDS to store sequence and genotyping data following a schema that mirrors VCF fields (genotype, annotation/info, annotation/format, etc.), while providing efficient random access, built-in compression, and tight integration with Bioconductor workflows for downstream analyses such as GWAS, PCA, and relatedness estimation.
The pgen2gds package bridges these two ecosystems by converting PLINK2 PGEN files into SeqArray GDS files, enabling users to leverage the rich Bioconductor infrastructure for analysis while starting from data produced by PLINK2.
The package ships with small example PLINK2 files:
pgen_fn <- system.file("extdata", "plink2_gen.pgen", package = "pgen2gds")
pvar_fn <- system.file("extdata", "plink2_gen.pvar", package = "pgen2gds")
psam_fn <- system.file("extdata", "plink2_gen.psam", package = "pgen2gds")
pgen_fn
#> [1] "/tmp/RtmpfNXw4X/Rinstb1d2d33dd5/pgen2gds/extdata/plink2_gen.pgen"seqReadPVAR() reads a .pvar file and
returns a data frame of variant metadata:
pvar <- seqReadPVAR(pvar_fn)
head(pvar)
#> chrom pos allele rsid
#> 1 21 9411239 G,A rs559462325
#> 2 21 9411245 C,A rs181691356
#> 3 21 9411264 A,C rs548263598
#> 4 21 9411267 G,T rs561987868
#> 5 21 9411302 G,T rs531010746
#> 6 21 9411313 G,A rs550852792
dim(pvar)
#> [1] 482 4You can also subset variants using a logical or numeric index:
The main function, seqPGEN2GDS(), converts a set of
PLINK2 files into a SeqArray GDS file. When only the .pgen
path is provided, the .pvar and .psam paths
are derived automatically:
gds_fn <- tempfile(fileext = ".gds")
seqPGEN2GDS(pgen_fn, out.gdsfn = gds_fn)
#> ##< 2026-06-06 10:06:45
#> PLINK2 PGEN to SeqArray GDS:
#> pgen file (14.3K):
#> /tmp/RtmpfNXw4X/Rinstb1d2d33dd5/pgen2gds/extdata/plink2_gen.pgen
#> pvar file (12.7K):
#> /tmp/RtmpfNXw4X/Rinstb1d2d33dd5/pgen2gds/extdata/plink2_gen.pvar
#> genome reference: <unset>
#> psam file (34.3K):
#> /tmp/RtmpfNXw4X/Rinstb1d2d33dd5/pgen2gds/extdata/plink2_gen.psam
#> reading ...
#> # of samples: 2504
#> # of variants: 482
#> Output:
#> /tmp/RtmpxyW6QW/filebf85a4af509.gds
#> sample.id [md5: 6bfa4747f3a74ee6ea95fb4f9bb190f0]
#> variant.id [md5: f5bbec87f78a68699c0c77bc7fd3a7c9]
#> chromosome [md5: ba1feab84dd97a835df1860d25b36f96]
#> position [md5: 0706cfca437412eddd9289a72e3ab380]
#> allele [md5: 229495f4d0a625c25e3aa65380d434bc]
#> annotation/id [md5: 8dedb3bb6e563284abe0a00f8a5d78ca]
#> genotype [2026-06-06 10:06:45] ...
#> [md5: 8b7f3cc73c6aa631eb33e7b8b5c8f09c]
#> annotation/qual [md5: f6dc4919603f79a64259ffa632bc7c4e]
#> annotation/filter [md5: 1981ad309f31ef5366e5e62bebc02bf5]
#> Done. # 2026-06-06 10:06:45
#> Optimize the access efficiency ...
#> Clean up the fragments of GDS file:
#> open the file '/tmp/RtmpxyW6QW/filebf85a4af509.gds' (24.4K)
#> # of fragments: 88
#> save to '/tmp/RtmpxyW6QW/filebf85a4af509.gds.tmp'
#> rename '/tmp/RtmpxyW6QW/filebf85a4af509.gds.tmp' (23.8K, reduced: 564B)
#> # of fragments: 41
#> ##> 2026-06-06 10:06:45Open the resulting file with SeqArray and inspect its contents:
gds <- seqOpen(gds_fn)
gds
#> Object of class "SeqVarGDSClass"
#> File: /tmp/RtmpxyW6QW/filebf85a4af509.gds (23.8K)
#> + [ ] *
#> |--+ description [ ] *
#> |--+ sample.id { Str8 2504 LZMA_ra(7.66%), 1.5K } *
#> |--+ variant.id { Int32 482 LZMA_ra(23.3%), 457B } *
#> |--+ chromosome { Str8 482 LZMA_ra(6.78%), 105B } *
#> |--+ position { Int32 482 LZMA_ra(32.1%), 625B } *
#> |--+ allele { Str8 482 LZMA_ra(27.8%), 549B } *
#> |--+ genotype [ ] *
#> | |--+ data { Bit2 2x2504x482 LZMA_ra(2.23%), 13.1K } *
#> | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
#> | \--+ extra { Int16 0 LZMA_ra, 18B }
#> |--+ phase [ ]
#> | |--+ data { Bit1 2504x482 LZMA_ra(0.11%), 177B } *
#> | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
#> | \--+ extra { Bit1 0 LZMA_ra, 18B }
#> \--+ annotation [ ]
#> |--+ id { Str8 482 LZMA_ra(38.5%), 2.2K } *
#> |--+ qual { Float32 482 LZMA_ra(5.08%), 105B } *
#> |--+ filter { Int32,factor 482 LZMA_ra(5.08%), 105B } *
#> |--+ info [ ]
#> \--+ format [ ]# Read genotypes for the first 5 variants
seqSetFilter(gds, variant.sel = 1:5)
#> # of selected variants: 5
geno <- seqGetData(gds, "genotype")
dim(geno) # ploidy x samples x variants
#> [1] 2 2504 5
geno[, 1:6, ]
#> , , 1
#>
#> sample
#> allele [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0
#>
#> , , 2
#>
#> sample
#> allele [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0
#>
#> , , 3
#>
#> sample
#> allele [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0
#>
#> , , 4
#>
#> sample
#> allele [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0
#>
#> , , 5
#>
#> sample
#> allele [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0Use variant.sel to convert only specific variants:
For very large files you can import a contiguous range of variants:
gds_range <- tempfile(fileext = ".gds")
seqPGEN2GDS(pgen_fn, out.gdsfn = gds_range, start=100, count=50, verbose=FALSE)
f <- seqOpen(gds_range)
vid <- seqGetData(f, "variant.id")
cat("Variant IDs:", head(vid), "...", tail(vid), "\n")
#> Variant IDs: 100 101 102 103 104 105 ... 144 145 146 147 148 149
cat("Total:", length(vid), "variants\n")
#> Total: 50 variants
seqClose(f)
unlink(gds_range, force = TRUE)sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pgen2gds_0.99.2 SeqArray_1.53.1 gdsfmt_1.49.2 BiocStyle_2.41.0
#>
#> loaded via a namespace (and not attached):
#> [1] crayon_1.5.3 cli_3.6.6 knitr_1.51
#> [4] rlang_1.2.0 xfun_0.58 otel_0.2.0
#> [7] generics_0.1.4 jsonlite_2.0.0 S4Vectors_0.51.3
#> [10] buildtools_1.0.0 Biostrings_2.81.3 htmltools_0.5.9
#> [13] maketools_1.3.2 sys_3.4.3 sass_0.4.10
#> [16] stats4_4.6.0 rmarkdown_2.31 Seqinfo_1.3.0
#> [19] evaluate_1.0.5 jquerylib_0.1.4 fastmap_1.2.0
#> [22] IRanges_2.47.2 yaml_2.3.12 lifecycle_1.0.5
#> [25] BiocManager_1.30.27 compiler_4.6.0 Rcpp_1.1.1-1.1
#> [28] pgenlibr_0.6.2 XVector_0.53.0 digest_0.6.39
#> [31] R6_2.6.1 parallel_4.6.0 GenomicRanges_1.65.0
#> [34] bslib_0.11.0 tools_4.6.0 BiocGenerics_0.59.6
#> [37] cachem_1.1.0