---
title: "Local Analysis of Plant Genomes with PlantTxDbHub"
author: "Kabilan S"
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Local Analysis of Plant Genomes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(PlantTxDbHub)
library(GenomicFeatures)
library(GenomeInfoDb)

# Ensure databases are cached (downloaded only once)
db_dir <- PlantTxDbHub::downloadPlantTxDbs()
```

# Introduction

This vignette demonstrates how to use the **PlantTxDbHub** package
to analyse transcript-level annotations for three plant species:

- *Arabidopsis thaliana* (TAIR10)
- *Glycine max* (Wm82 v2.1)
- *Oryza sativa* (IRGSP-1.0)

The annotations were generated from Ensembl Plants release 62 GTF files
and stored as standalone TxDb SQLite databases on Zenodo.
The helper function `downloadPlantTxDbs()` downloads the databases once,
caches them in your user data directory, and returns the path to that
directory. This vignette stores the path in `db_dir` (in a hidden setup
chunk); all subsequent analyses use standard `TxDb` methods.
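
As a quick check that the cache is populated, the sketch below lists the
SQLite files in the cache directory. It assumes that `db_dir` holds the
directory path returned by `downloadPlantTxDbs()`, as in this vignette's
setup chunk; the variable name `sqlite_files` is only illustrative.

```{r cache_files}
# List the cached TxDb files (assumes db_dir is the cache directory path)
sqlite_files <- list.files(db_dir, pattern = "\\.sqlite$")
sqlite_files

# Confirm that the Arabidopsis database used below is present
file.exists(file.path(db_dir, "TxDb.Athaliana.TAIR10.v62.sqlite"))
```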

# Load a TxDb (Arabidopsis example)

```{r load_ath}
ath_file <- file.path(db_dir, "TxDb.Athaliana.TAIR10.v62.sqlite")
txdb_ath <- loadDb(ath_file)
txdb_ath
```
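
Printing the object gives a summary. For a closer look, a sketch like the
following pulls out the stored metadata table and the sequence information
using the standard `metadata()` and `seqinfo()` accessors. The exact fields
shown depend on how the database was built.

```{r txdb_metadata}
# Name/value table describing the data source and build
md <- metadata(txdb_ath)
md

# Sequence names, lengths and circularity flags recorded in the database
seqinfo(txdb_ath)
```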

# Available columns and keys

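`TxDb` objects support the AnnotationDbi `select()` interface.
`columns()` lists the fields that can be retrieved, and `keytypes()` lists
the fields that can be used as lookup keys.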

```{r cols}
columns(txdb_ath)
keytypes(txdb_ath)
```

Among these, the `TXTYPE` column records the transcript biotype
(e.g. `protein_coding`).

# Retrieve all genes

`genes()` returns a `GRanges` object with one range per gene.
Here the database provides `gene_id` as the only metadata column.

```{r genes_ath}
gene_gr <- genes(txdb_ath)
head(gene_gr)
```

# Retrieve all transcripts

```{r transcripts_ath}
tx_gr <- transcripts(txdb_ath)
head(tx_gr)
```
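
To relate transcripts back to genes, one option is to group them with
`transcriptsBy()` and count the group sizes. This is a minimal sketch; the
names `tx_by_gene` and `tx_counts` are illustrative.

```{r tx_per_gene}
# Group transcripts by their parent gene (returns a GRangesList)
tx_by_gene <- transcriptsBy(txdb_ath, by = "gene")

# Number of transcripts annotated for each gene
tx_counts <- lengths(tx_by_gene)
summary(tx_counts)
```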

# Retrieve all exons

```{r exons_ath}
ex_gr <- exons(txdb_ath, columns = "exon_id")
head(ex_gr)
```
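
Exons can also be grouped by transcript. As a sketch, summing the exon
widths within each transcript gives the spliced transcript length; the
names `ex_by_tx` and `tx_len` are illustrative.

```{r tx_lengths}
# Group exons by transcript (list names are internal transcript IDs)
ex_by_tx <- exonsBy(txdb_ath, by = "tx")

# Spliced length of each transcript = sum of its exon widths
tx_len <- sum(width(ex_by_tx))
summary(tx_len)
```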

# Filter by gene ID

To restrict the result to specific genes, pass a named list to the
`filter` argument of `genes()`:

```{r filter_geneid}
my_genes <- c("AT1G01010", "AT1G01020")
genes(txdb_ath, filter = list(gene_id = my_genes))
```

If `filter` is not available (older `GenomicFeatures`), fall back to
`select()`, which returns a `data.frame` rather than a `GRanges`:

```{r filter_geneid_select}
sel <- select(txdb_ath,
  keys = my_genes,
  columns = c("GENEID", "TXID", "TXTYPE", "EXONID"),
  keytype = "GENEID"
)
head(sel)
```

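This returns a 1:many mapping between genes and their transcripts/exons:
each row is one gene, transcript and exon combination, so a gene ID is
repeated once for every exon of every transcript.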
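
Because gene IDs repeat across rows, it can help to collapse the table to
one row per transcript before counting. A minimal sketch using the `sel`
table from above (the name `tx_level` is illustrative):

```{r select_tx_level}
# Drop the exon column and remove duplicated gene/transcript rows
tx_level <- unique(sel[, c("GENEID", "TXID", "TXTYPE")])

# Transcripts per gene among the selected genes
table(tx_level$GENEID)
```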

# Retrieve transcript types

The `TXTYPE` column holds the transcript biotype (e.g. `protein_coding`).
To tabulate the biotypes across the whole annotation, extract the column
for all transcripts with `select()`:

```{r txtype}
tx_info <- select(txdb_ath,
  keys = keys(txdb_ath, "TXID"),
  columns = c("TXID", "TXTYPE"),
  keytype = "TXID"
)
table(tx_info$TXTYPE)
```
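
The biotype can also be attached directly to the transcript ranges by
requesting the `tx_type` column from `transcripts()`, which makes it easy
to subset by type. The sketch below keeps protein-coding transcripts; it
assumes `protein_coding` is one of the biotype labels present in this
database.

```{r tx_protein_coding}
# Transcript ranges with ID, name and biotype as metadata columns
tx_typed <- transcripts(txdb_ath, columns = c("tx_id", "tx_name", "tx_type"))

# %in% treats missing biotypes as non-matches instead of returning NA
tx_pc <- tx_typed[tx_typed$tx_type %in% "protein_coding"]
length(tx_pc)
```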

# Working with chromosome names

The database stores nuclear chromosome names as bare numbers (`1`, `2`, …)
and organelle names as `Mt` and `Pt`. To restrict the ranges to the five
nuclear chromosomes and then add the conventional *Arabidopsis* `Chr`
prefix:

```{r chrom_ath}
gene_gr_nuc <- keepSeqlevels(gene_gr,
  value = c("1", "2", "3", "4", "5"),
  pruning.mode = "coarse"
)
seqlevels(gene_gr_nuc) <- paste0("Chr", seqlevels(gene_gr_nuc))
seqlevels(gene_gr_nuc)
```
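
The same two steps apply to any `GRanges` derived from this database. As a
sketch, they can be wrapped in a small helper; `to_nuclear_chr()` is a
hypothetical function defined here for illustration, not part of
**PlantTxDbHub**.

```{r nuclear_helper}
# Hypothetical helper: keep the nuclear chromosomes, then prefix their names
to_nuclear_chr <- function(gr, nuclear = as.character(1:5), prefix = "Chr") {
  gr <- keepSeqlevels(gr, value = nuclear, pruning.mode = "coarse")
  seqlevels(gr) <- paste0(prefix, seqlevels(gr))
  gr
}

# Apply it to the transcript ranges retrieved earlier
tx_gr_nuc <- to_nuclear_chr(tx_gr)
seqlevels(tx_gr_nuc)
```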

# Soybean (*Glycine max*) example

```{r load_gmx}
gmx_file <- file.path(db_dir, "TxDb.Gmax.Wm82.v62.sqlite")
txdb_gmx <- loadDb(gmx_file)
head(genes(txdb_gmx))
```

# Rice (*Oryza sativa*) example

```{r load_osa}
osa_file <- file.path(db_dir, "TxDb.Osativa.IRGSP.v62.sqlite")
txdb_osa <- loadDb(osa_file)
head(genes(txdb_osa))
```
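
With all three databases loaded, the same accessor calls can be applied
across species. A minimal sketch comparing gene counts (the list name
`txdbs` and the result name `gene_counts` are illustrative):

```{r species_gene_counts}
# Collect the three TxDb objects loaded above in a named list
txdbs <- list(
  Athaliana = txdb_ath,
  Gmax      = txdb_gmx,
  Osativa   = txdb_osa
)

# Number of gene ranges returned by genes() for each species
gene_counts <- vapply(txdbs, function(x) length(genes(x)), integer(1))
gene_counts
```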

# Session information

```{r sessionInfo}
sessionInfo()
```