---
title: "Local Analysis of Plant Genomes with PlantTxDbHub"
author: "Kabilan S"
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Local Analysis of Plant Genomes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(PlantTxDbHub)
library(GenomicFeatures)
library(GenomeInfoDb)

# Ensure databases are cached (downloaded only once)
db_dir <- PlantTxDbHub::downloadPlantTxDbs()
```

# Introduction

This vignette demonstrates how to use the **PlantTxDbHub** package to analyse transcript‑level annotations for three plant species:

- *Arabidopsis thaliana* (TAIR10)
- *Glycine max* (Wm82 v2.1)
- *Oryza sativa* (IRGSP‑1.0)

The annotations were generated from Ensembl Plants release 62 GTF files and stored as standalone TxDb SQLite databases on Zenodo. The helper function `downloadPlantTxDbs()` caches the databases in your user data directory; subsequent analyses use standard `TxDb` methods.

# Load a TxDb (Arabidopsis example)

```{r load_ath}
ath_file <- file.path(db_dir, "TxDb.Athaliana.TAIR10.v62.sqlite")
txdb_ath <- loadDb(ath_file)
txdb_ath
```

# Available columns and keys

TxDb objects work with the `select()` interface.

```{r cols}
columns(txdb_ath)
keytypes(txdb_ath)
```

The **column `TXTYPE`** indicates the transcript biotype (e.g. `protein_coding`).

# Retrieve all genes

`genes()` returns a `GRanges` with gene‑level information. Here the database provides the `gene_id` as the only metadata column.

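
To make the shape of such an object concrete, the following is a minimal sketch that builds a small `GRanges` by hand. The coordinates and the gene IDs `geneA` and `geneB` are hypothetical and are not taken from the TAIR10 database; the sketch only assumes that the GenomicRanges package is installed.

```{r genes_toy}
library(GenomicRanges)

# Hypothetical toy genes on chromosome "1"; values are illustrative only
toy_genes <- GRanges(
  seqnames = c("1", "1"),
  ranges = IRanges(start = c(100, 500), end = c(300, 900)),
  strand = c("+", "-"),
  gene_id = c("geneA", "geneB")
)
names(toy_genes) <- toy_genes$gene_id

# Metadata columns are read with mcols(); range widths with width()
mcols(toy_genes)
width(toy_genes)
```

The same accessors apply to the object returned by `genes()` on the Arabidopsis database:
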
```{r genes_ath}
gene_gr <- genes(txdb_ath)
head(gene_gr)
```

# Retrieve all transcripts

```{r transcripts_ath}
tx_gr <- transcripts(txdb_ath)
head(tx_gr)
```

# Retrieve all exons

```{r exons_ath}
ex_gr <- exons(txdb_ath, columns = "exon_id")
head(ex_gr)
```

# Filter by gene ID

You can retrieve gene ranges using `filter`:

```{r filter_geneid}
my_genes <- c("AT1G01010", "AT1G01020")
genes(txdb_ath, filter = list(gene_id = my_genes))
```

If `filter` is not available (older `GenomicFeatures`), fall back to `select()`:

```{r filter_geneid_select}
sel <- select(txdb_ath,
  keys = my_genes,
  columns = c("GENEID", "TXID", "TXTYPE", "EXONID"),
  keytype = "GENEID"
)
head(sel)
```

This returns a 1:many mapping between genes and their transcripts/exons.

# Retrieving transcript types

The column `TXTYPE` holds the transcript biotype (e.g. `protein_coding`). You can extract it for all transcripts with `select()`:

```{r txtype}
tx_info <- select(txdb_ath,
  keys = keys(txdb_ath, "TXID"),
  columns = c("TXID", "TXTYPE"),
  keytype = "TXID"
)
table(tx_info$TXTYPE)
```

# Working with chromosome names

The database stores chromosome names as bare numbers (`1`, `2`, …) and organelle names as `Mt`, `Pt`. To add the standard *Arabidopsis* `Chr` prefix and restrict to nuclear chromosomes:

```{r chrom_ath}
gene_gr_nuc <- keepSeqlevels(gene_gr,
  value = c("1", "2", "3", "4", "5"),
  pruning.mode = "coarse"
)
seqlevels(gene_gr_nuc) <- paste0("Chr", seqlevels(gene_gr_nuc))
seqlevels(gene_gr_nuc)
```

# Soybean (*Glycine max*) example

```{r load_gmx}
gmx_file <- file.path(db_dir, "TxDb.Gmax.Wm82.v62.sqlite")
txdb_gmx <- loadDb(gmx_file)
head(genes(txdb_gmx))
```

# Rice (*Oryza sativa*) example

```{r load_osa}
osa_file <- file.path(db_dir, "TxDb.Osativa.IRGSP.v62.sqlite")
txdb_osa <- loadDb(osa_file)
head(genes(txdb_osa))
```

# Session information

```{r sessionInfo}
sessionInfo()
```