Format Conversion from PLINK2 PGEN to SeqArray GDS

Introduction

PLINK is one of the most widely used toolsets in statistical genetics, providing fast and memory-efficient methods for quality control, association testing, and population-stratification analysis of large-scale genotype data. Its binary file format — comprising .pgen (genotypes), .pvar (variant metadata), and .psam (sample metadata) files — offers compact storage and rapid access for datasets with millions of variants and hundreds of thousands of samples.

Genomic Data Structure (GDS) is a hierarchical, array-oriented container format built on the CoreArray C++ library. The SeqArray package extends GDS to store sequence and genotyping data following a schema that mirrors VCF fields (genotype, annotation/info, annotation/format, etc.), while providing efficient random access, built-in compression, and tight integration with Bioconductor workflows for downstream analyses such as GWAS, PCA, and relatedness estimation.

The pgen2gds package bridges these two ecosystems by converting PLINK2 PGEN files into SeqArray GDS files, enabling users to leverage the rich Bioconductor infrastructure for analysis while starting from data produced by PLINK2.

Installation

pgen2gds requires gdsfmt, SeqArray, and pgenlibr.

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("pgen2gds")

Quick Start

library(SeqArray)
#> Loading required package: gdsfmt
library(pgen2gds)

Locate example files

The package ships with small example PLINK2 files:

pgen_fn <- system.file("extdata", "plink2_gen.pgen", package = "pgen2gds")
pvar_fn <- system.file("extdata", "plink2_gen.pvar", package = "pgen2gds")
psam_fn <- system.file("extdata", "plink2_gen.psam", package = "pgen2gds")

pgen_fn
#> [1] "/tmp/RtmpfNXw4X/Rinstb1d2d33dd5/pgen2gds/extdata/plink2_gen.pgen"

Read variant information

seqReadPVAR() reads a .pvar file and returns a data frame of variant metadata:

pvar <- seqReadPVAR(pvar_fn)
head(pvar)
#>   chrom     pos allele        rsid
#> 1    21 9411239    G,A rs559462325
#> 2    21 9411245    C,A rs181691356
#> 3    21 9411264    A,C rs548263598
#> 4    21 9411267    G,T rs561987868
#> 5    21 9411302    G,T rs531010746
#> 6    21 9411313    G,A rs550852792
dim(pvar)
#> [1] 482   4

You can also subset variants using a logical or numeric index:

# Select the first 5 variants
head(seqReadPVAR(pvar_fn, sel = 1:5))
#>   chrom     pos allele        rsid
#> 1    21 9411239    G,A rs559462325
#> 2    21 9411245    C,A rs181691356
#> 3    21 9411264    A,C rs548263598
#> 4    21 9411267    G,T rs561987868
#> 5    21 9411302    G,T rs531010746

Convert PGEN to GDS

The main function, seqPGEN2GDS(), converts a set of PLINK2 files into a SeqArray GDS file. When only the .pgen path is provided, the .pvar and .psam paths are derived automatically:

gds_fn <- tempfile(fileext = ".gds")

seqPGEN2GDS(pgen_fn, out.gdsfn = gds_fn)
#> ##< 2026-06-06 10:06:45
#> PLINK2 PGEN to SeqArray GDS:
#>     pgen file (14.3K):
#>         /tmp/RtmpfNXw4X/Rinstb1d2d33dd5/pgen2gds/extdata/plink2_gen.pgen
#>     pvar file (12.7K):
#>         /tmp/RtmpfNXw4X/Rinstb1d2d33dd5/pgen2gds/extdata/plink2_gen.pvar
#>     genome reference: <unset>
#>     psam file (34.3K):
#>         /tmp/RtmpfNXw4X/Rinstb1d2d33dd5/pgen2gds/extdata/plink2_gen.psam
#>         reading ...
#>     # of samples: 2504
#>     # of variants: 482
#>     Output:
#>         /tmp/RtmpxyW6QW/filebf85a4af509.gds
#>     sample.id  [md5: 6bfa4747f3a74ee6ea95fb4f9bb190f0]
#>     variant.id  [md5: f5bbec87f78a68699c0c77bc7fd3a7c9]
#>     chromosome  [md5: ba1feab84dd97a835df1860d25b36f96]
#>     position  [md5: 0706cfca437412eddd9289a72e3ab380]
#>     allele  [md5: 229495f4d0a625c25e3aa65380d434bc]
#>     annotation/id  [md5: 8dedb3bb6e563284abe0a00f8a5d78ca]
#>     genotype [2026-06-06 10:06:45] ...
#>         [md5: 8b7f3cc73c6aa631eb33e7b8b5c8f09c]
#>     annotation/qual  [md5: f6dc4919603f79a64259ffa632bc7c4e]
#>     annotation/filter  [md5: 1981ad309f31ef5366e5e62bebc02bf5]
#> Done.  # 2026-06-06 10:06:45
#> Optimize the access efficiency ...
#> Clean up the fragments of GDS file:
#>     open the file '/tmp/RtmpxyW6QW/filebf85a4af509.gds' (24.4K)
#>     # of fragments: 88
#>     save to '/tmp/RtmpxyW6QW/filebf85a4af509.gds.tmp'
#>     rename '/tmp/RtmpxyW6QW/filebf85a4af509.gds.tmp' (23.8K, reduced: 564B)
#>     # of fragments: 41
#> ##> 2026-06-06 10:06:45

Explore the GDS file

Open the resulting file with SeqArray and inspect its contents:

gds <- seqOpen(gds_fn)
gds
#> Object of class "SeqVarGDSClass"
#> File: /tmp/RtmpxyW6QW/filebf85a4af509.gds (23.8K)
#> +    [  ] *
#> |--+ description   [  ] *
#> |--+ sample.id   { Str8 2504 LZMA_ra(7.66%), 1.5K } *
#> |--+ variant.id   { Int32 482 LZMA_ra(23.3%), 457B } *
#> |--+ chromosome   { Str8 482 LZMA_ra(6.78%), 105B } *
#> |--+ position   { Int32 482 LZMA_ra(32.1%), 625B } *
#> |--+ allele   { Str8 482 LZMA_ra(27.8%), 549B } *
#> |--+ genotype   [  ] *
#> |  |--+ data   { Bit2 2x2504x482 LZMA_ra(2.23%), 13.1K } *
#> |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
#> |  \--+ extra   { Int16 0 LZMA_ra, 18B }
#> |--+ phase   [  ]
#> |  |--+ data   { Bit1 2504x482 LZMA_ra(0.11%), 177B } *
#> |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
#> |  \--+ extra   { Bit1 0 LZMA_ra, 18B }
#> \--+ annotation   [  ]
#>    |--+ id   { Str8 482 LZMA_ra(38.5%), 2.2K } *
#>    |--+ qual   { Float32 482 LZMA_ra(5.08%), 105B } *
#>    |--+ filter   { Int32,factor 482 LZMA_ra(5.08%), 105B } *
#>    |--+ info   [  ]
#>    \--+ format   [  ]
# Sample and variant counts
cat("Samples: ", length(seqGetData(gds, "sample.id")), "\n")
#> Samples:  2504
cat("Variants:", length(seqGetData(gds, "variant.id")), "\n")
#> Variants: 482
# Chromosome distribution
table(seqGetData(gds, "chromosome"))
#> 
#>  21 
#> 482

Access genotype data

# Read genotypes for the first 5 variants
seqSetFilter(gds, variant.sel = 1:5)
#> # of selected variants: 5
geno <- seqGetData(gds, "genotype")
dim(geno)  # ploidy x samples x variants
#> [1]    2 2504    5
geno[, 1:6, ]
#> , , 1
#> 
#>       sample
#> allele [,1] [,2] [,3] [,4] [,5] [,6]
#>   [1,]    0    0    0    0    0    0
#>   [2,]    0    0    0    0    0    0
#> 
#> , , 2
#> 
#>       sample
#> allele [,1] [,2] [,3] [,4] [,5] [,6]
#>   [1,]    0    0    0    0    0    0
#>   [2,]    0    0    0    0    0    0
#> 
#> , , 3
#> 
#>       sample
#> allele [,1] [,2] [,3] [,4] [,5] [,6]
#>   [1,]    0    0    0    0    0    0
#>   [2,]    0    0    0    0    0    0
#> 
#> , , 4
#> 
#>       sample
#> allele [,1] [,2] [,3] [,4] [,5] [,6]
#>   [1,]    0    0    0    0    0    0
#>   [2,]    0    0    0    0    0    0
#> 
#> , , 5
#> 
#>       sample
#> allele [,1] [,2] [,3] [,4] [,5] [,6]
#>   [1,]    0    0    0    0    0    0
#>   [2,]    0    0    0    0    0    0
# close the file 
seqClose(gds)

Advanced Usage

Selecting a subset of variants

Use variant.sel to convert only specific variants:

gds_sub <- tempfile(fileext = ".gds")

# Convert only variants 10 through 20
seqPGEN2GDS(pgen_fn, out.gdsfn=gds_sub, variant.sel=10:20, verbose=FALSE)

f <- seqOpen(gds_sub)
cat("Variants:", length(seqGetData(f, "variant.id")), "\n")
#> Variants: 11
seqClose(f)

unlink(gds_sub, force = TRUE)

Importing a range with start/count

For very large files you can import a contiguous range of variants:

gds_range <- tempfile(fileext = ".gds")

seqPGEN2GDS(pgen_fn, out.gdsfn = gds_range, start=100, count=50, verbose=FALSE)

f <- seqOpen(gds_range)
vid <- seqGetData(f, "variant.id")
cat("Variant IDs:", head(vid), "...", tail(vid), "\n")
#> Variant IDs: 100 101 102 103 104 105 ... 144 145 146 147 148 149
cat("Total:", length(vid), "variants\n")
#> Total: 50 variants
seqClose(f)

unlink(gds_range, force = TRUE)

Parallel conversion

For large datasets, parallel processing can speed up the conversion:

# Use 2 cores
seqPGEN2GDS(pgen_fn, out.gdsfn = "output.gds", parallel=2)

Cleanup

# Remove the generated GDS file
unlink(gds_fn, force = TRUE)

Session Info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pgen2gds_0.99.2  SeqArray_1.53.1  gdsfmt_1.49.2    BiocStyle_2.41.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] crayon_1.5.3         cli_3.6.6            knitr_1.51          
#>  [4] rlang_1.2.0          xfun_0.58            otel_0.2.0          
#>  [7] generics_0.1.4       jsonlite_2.0.0       S4Vectors_0.51.3    
#> [10] buildtools_1.0.0     Biostrings_2.81.3    htmltools_0.5.9     
#> [13] maketools_1.3.2      sys_3.4.3            sass_0.4.10         
#> [16] stats4_4.6.0         rmarkdown_2.31       Seqinfo_1.3.0       
#> [19] evaluate_1.0.5       jquerylib_0.1.4      fastmap_1.2.0       
#> [22] IRanges_2.47.2       yaml_2.3.12          lifecycle_1.0.5     
#> [25] BiocManager_1.30.27  compiler_4.6.0       Rcpp_1.1.1-1.1      
#> [28] pgenlibr_0.6.2       XVector_0.53.0       digest_0.6.39       
#> [31] R6_2.6.1             parallel_4.6.0       GenomicRanges_1.65.0
#> [34] bslib_0.11.0         tools_4.6.0          BiocGenerics_0.59.6 
#> [37] cachem_1.1.0