---
title: "Format Conversion from PLINK2 PGEN to SeqArray GDS"
author: "Xiuwen Zheng"
date: "Apr 2026"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Format Conversion from PLINK2 PGEN to SeqArray GDS}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

# Introduction

[PLINK](https://www.cog-genomics.org/plink/2.0/) is one of the most widely used
toolsets in statistical genetics, providing fast and memory-efficient methods
for quality control, association testing, and population-stratification
analysis of large-scale genotype data. Its binary file format, comprising
`.pgen` (genotypes), `.pvar` (variant metadata), and `.psam` (sample metadata)
files, offers compact storage and rapid access for datasets with millions of
variants and hundreds of thousands of samples.

[Genomic Data Structure (GDS)](https://bioconductor.org/packages/gdsfmt) is a
hierarchical, array-oriented container format built on the CoreArray C++
library. The [SeqArray](https://bioconductor.org/packages/SeqArray) package
extends GDS to store sequence and genotyping data following a schema that
mirrors VCF fields (genotype, annotation/info, annotation/format, etc.), while
providing efficient random access, built-in compression, and tight integration
with Bioconductor workflows for downstream analyses such as GWAS, PCA, and
relatedness estimation.

The **pgen2gds** package bridges these two ecosystems by converting PLINK2 PGEN
files into SeqArray GDS files, enabling users to leverage the rich Bioconductor
infrastructure for analysis while starting from data produced by PLINK2.

# Installation

**pgen2gds** requires [gdsfmt](https://bioconductor.org/packages/gdsfmt),
[SeqArray](https://bioconductor.org/packages/SeqArray), and
[pgenlibr](https://cran.r-project.org/package=pgenlibr).

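
To see which of the packages listed above are already available in the current
library, something like the following can be used. This is a minimal sketch in
base R only; the variable names `deps`, `installed`, and `missing_deps` are
illustrative and not part of **pgen2gds**.

```r
# Package names taken from the requirements listed above
deps <- c("gdsfmt", "SeqArray", "pgenlibr")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that could not be loaded
missing_deps <- names(installed)[!installed]

if (length(missing_deps) > 0) {
    message("Not yet installed: ", paste(missing_deps, collapse = ", "))
} else {
    message("All listed dependencies are available.")
}
```

The check only reports what is missing and does not install anything.
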

```r
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("pgen2gds")
```

# Quick Start

```{r library}
library(SeqArray)
library(pgen2gds)
```

## Locate example files

The package ships with small example PLINK2 files:

```{r example-files}
pgen_fn <- system.file("extdata", "plink2_gen.pgen", package = "pgen2gds")
pvar_fn <- system.file("extdata", "plink2_gen.pvar", package = "pgen2gds")
psam_fn <- system.file("extdata", "plink2_gen.psam", package = "pgen2gds")
pgen_fn
```

## Read variant information

`seqReadPVAR()` reads a `.pvar` file and returns a data frame of variant
metadata:

```{r read-pvar}
pvar <- seqReadPVAR(pvar_fn)
head(pvar)
dim(pvar)
```

You can also subset variants using a logical or numeric index:

```{r read-pvar-subset}
# Select the first 5 variants
head(seqReadPVAR(pvar_fn, sel = 1:5))
```

## Convert PGEN to GDS

The main function, `seqPGEN2GDS()`, converts a set of PLINK2 files into a
SeqArray GDS file. When only the `.pgen` path is provided, the `.pvar` and
`.psam` paths are derived automatically:

```{r convert}
gds_fn <- tempfile(fileext = ".gds")
seqPGEN2GDS(pgen_fn, out.gdsfn = gds_fn)
```

## Explore the GDS file

Open the resulting file with SeqArray and inspect its contents:

```{r open-gds}
gds <- seqOpen(gds_fn)
gds
```

```{r basic-info}
# Sample and variant counts
cat("Samples: ", length(seqGetData(gds, "sample.id")), "\n")
cat("Variants:", length(seqGetData(gds, "variant.id")), "\n")
```

```{r chrom-table}
# Chromosome distribution
table(seqGetData(gds, "chromosome"))
```

## Access genotype data

```{r genotype}
# Read genotypes for the first 5 variants
seqSetFilter(gds, variant.sel = 1:5)
geno <- seqGetData(gds, "genotype")
dim(geno)  # ploidy x samples x variants
geno[, 1:6, ]
```

```{r close-gds}
# Close the file
seqClose(gds)
```

# Advanced Usage

## Selecting a subset of variants

Use `variant.sel` to convert only specific variants:

```{r variant-sel}
gds_sub <- tempfile(fileext = ".gds")
# Convert only variants 10 through 20
seqPGEN2GDS(pgen_fn, out.gdsfn=gds_sub, variant.sel=10:20, verbose=FALSE)

f <- seqOpen(gds_sub)
cat("Variants:", length(seqGetData(f, "variant.id")), "\n")
seqClose(f)
unlink(gds_sub, force = TRUE)
```

## Importing a range with start/count

For very large files you can import a contiguous range of variants with the
`start` and `count` arguments:

```{r start-count}
gds_range <- tempfile(fileext = ".gds")
seqPGEN2GDS(pgen_fn, out.gdsfn = gds_range, start=100, count=50, verbose=FALSE)

f <- seqOpen(gds_range)
vid <- seqGetData(f, "variant.id")
cat("Variant IDs:", head(vid), "...", tail(vid), "\n")
cat("Total:", length(vid), "variants\n")
seqClose(f)
unlink(gds_range, force = TRUE)
```

## Parallel conversion

For large datasets, parallel processing can speed up the conversion:

```r
# Use 2 cores
seqPGEN2GDS(pgen_fn, out.gdsfn = "output.gds", parallel=2)
```

# Cleanup

```{r cleanup}
# Remove the generated GDS file
unlink(gds_fn, force = TRUE)
```

# Session Info

```{r session-info}
sessionInfo()
```