This vignette describes how to go from variants (and optionally gene fusions) to the Minigene Library report.
We will usually work with standard processing pipelines for variant calling, RNA-seq quantification, and RNA fusion calling. The best-known collection of these workflows is NF-core.
It provides the following pipelines for these tasks:
sarek for variant calling, usually using HaplotypeCaller and Mutect2
rnaseq for RNA-seq quantification, usually using STAR and Salmon
rnafusion for RNA fusion quantification using multiple algorithms
Running these will usually require access to high performance computing facilities, so you will often already receive the processed files from your core facility or sequencing provider.
Here, we are using example files provided with the package:
library(pepitope)
# SNPs and small indels, e.g. from 'sarek' nf-core pipeline
variant_vcf_file = system.file("my_variants.vcf", package="pepitope")
# Combined fusion VCF file, e.g. from 'rnafusion' nf-core pipeline
fusion_vcf_file = system.file("my_fusions.vcf", package="pepitope")
The variant calls and RNA-seq counts are based on reads mapped to a reference genome and gene annotations. It is important to keep these consistent between the NF-core processing pipelines and the Minigene Library annotation.
We usually work with BSgenome (for the reference
genome) and EnsDb (for the gene annotations) objects. For
human data, the most widely used reference genome is
GRCh38, usually paired with a recent Ensembl annotation
release.
With our test data, we know that GRCh38 is the correct
reference genome and Ensembl 106 is the correct version of gene
annotations. We can get both objects from the BSgenome and
AnnotationHub Bioconductor packages, respectively.
ens106 = AnnotationHub::AnnotationHub()[["AH100643"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> require("ensembldb")
asm = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
A caveat here is that the chromosome prefixes need to be consistent between the variants in the VCF file and the genome/gene annotations. There are two “styles”, either UCSC (includes “chr” prefix) or NCBI/Ensembl (without “chr” prefix).
The sarek pipeline uses UCSC prefixes on the GRCh38
genome, so we need to switch the gene annotation style to UCSC to match.
The correct styles will depend on how your VCF files were generated.
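A quick way to see whether two sets of chromosome names agree is to compare them directly. The following is a minimal base-R sketch, not part of pepitope: the chromosome names are made up, and `guess_style` and `to_ucsc` are hypothetical helpers written only for this illustration. On real Bioconductor objects, the conversion itself is normally done with `seqlevelsStyle`.

```r
# Hypothetical chromosome names, as they might appear in a VCF file
# and in a gene annotation (assumed values for illustration only)
vcf_chrs = c("chr1", "chr2", "chrX")
ann_chrs = c("1", "2", "X", "MT")

# Guess the naming style from the presence of a "chr" prefix
guess_style = function(chrs) {
    has_prefix = startsWith(chrs, "chr")
    if (all(has_prefix)) {
        "UCSC"
    } else if (!any(has_prefix)) {
        "NCBI/Ensembl"
    } else {
        "mixed"
    }
}

# Naive conversion to UCSC-style names; the mitochondrial genome
# is called "MT" in NCBI/Ensembl and "chrM" in UCSC
to_ucsc = function(chrs) {
    chrs = sub("^chr", "", chrs)
    paste0("chr", ifelse(chrs == "MT", "M", chrs))
}

guess_style(vcf_chrs)
guess_style(ann_chrs)

# Chromosomes shared by both sources once the styles agree
shared = intersect(vcf_chrs, to_ucsc(ann_chrs))
```

If `shared` is empty or much shorter than expected, the VCF file and the annotation most likely use different styles or different genome builds.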
The NF-core
rnaseq workflow will provide two gene expression files,
one for raw read counts and one for transcripts per million (TPM). These
contain all samples in a run, so we need to subset them to the current
sample we are interested in. These files are usually called:
salmon.merged.gene_counts.tsv
salmon.merged.gene_tpm.tsv
We can combine and subset them the following way:
# note that this is not run in this example because we don't have RNA-seq data
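# SAMPLE stands for the column name of the sample of interest
counts = readr::read_tsv("salmon.merged.gene_counts.tsv") |>
    dplyr::select(gene_id, gene_name, count=SAMPLE)
tpm = readr::read_tsv("salmon.merged.gene_tpm.tsv") |>
    dplyr::select(gene_id, gene_name, tpm=SAMPLE)
# join explicitly on the shared gene identifier columns
rna_sample = dplyr::inner_join(counts, tpm, by=c("gene_id", "gene_name"))
Here, we can use the following filters: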
min_cov = 2 – a variant needs to be covered by at least 2 reads
min_af = 0.05 – a variant needs to occur in at least 5% of reads
pass = TRUE – a variant needs to pass the standard QC filters
sample = SAMPLE – only annotate the sample SAMPLE
chrs = "default" – only use the common chromosomes for this organism
The resulting vr1 object contains the variants that pass these filters.
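To make the coverage, allele frequency and QC thresholds concrete, here is a minimal sketch that applies the same cutoffs to a toy table in base R. The data frame, its column names and the reading of "coverage" as reference plus alternative reads are assumptions made for this illustration; they do not reflect how pepitope stores or filters variants internally.

```r
# Hypothetical variant calls; column names are illustrative only
calls = data.frame(
    variant = c("v1", "v2", "v3", "v4"),
    ref_reads = c(40, 1, 95, 10),
    alt_reads = c(10, 0, 2, 10),
    filter = c("PASS", "PASS", "PASS", "weak_evidence")
)

min_cov = 2     # minimum number of reads covering the variant
min_af = 0.05   # minimum fraction of reads carrying the variant

# Assumption: coverage is the sum of reference and alternative reads
calls$cov = calls$ref_reads + calls$alt_reads
calls$af = calls$alt_reads / calls$cov

# Keep variants that satisfy all three criteria
keep = calls$cov >= min_cov & calls$af >= min_af & calls$filter == "PASS"
kept = calls[keep, ]
```

In this toy table only `v1` is kept: `v2` has too little coverage, `v3` has an allele frequency of about 2%, and `v4` did not pass the QC filter.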
ann = annotate_coding(vr1, ens106, asm)
subs = ann |>
#    filter_expressed(rna_sample, min_reads=1, min_tpm=0) |>
    subset_context(15)
Here, we are using the following filters:
min_reads = 1 – the gene needs to have at least one RNA read
min_tpm = 0 – we do not apply an additional TPM filter
In addition, we set the region of interest (context) to 15 codons up- and downstream of the variant. Hence, a SNP will have a total length of 93 nucleotides (2 * 15 * 3 for the flanking codons, plus 3 for the SNP codon itself). An insertion will have the inserted sequence and 15 codons on both sides, a deletion only the 15 codons on both sides. A frameshift will have 15 codons upstream and the entire sequence downstream until a STOP codon is reached. The latter may extend into the 3’ UTR.
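The context lengths described above follow from simple arithmetic. This is a minimal sketch using a hypothetical helper, `context_length`, which is not a pepitope function; it covers only variants that keep the reading frame, since the length of a frameshift context depends on where the next STOP codon occurs.

```r
# Nucleotide length of the context for a variant that keeps the reading frame
# (hypothetical helper for illustration; not part of pepitope)
context_length = function(n_codons, variant_nt) {
    flank_nt = 2 * n_codons * 3   # codons on both sides, 3 nucleotides each
    flank_nt + variant_nt
}

context_length(15, 3)   # SNP: flanks plus the affected codon        -> 93
context_length(15, 6)   # in-frame insertion of two codons            -> 96
context_length(15, 0)   # in-frame deletion: only the flanks remain   -> 90
```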
The subs dataframe looks like the following:
First, we want to read the fusion genes from a combined
VCF file like the one produced by the rnafusion NF-core
pipeline:
vr2 = readVcfAsVRanges(fusion_vcf_file) |>
    filter_fusions(min_reads=2, min_split_reads=1, min_tools=1)
seqlevelsStyle(vr2) = "UCSC"
Here, we are using the following filters:
min_reads = 2 – the fusion needs to be supported by 2 split or pair distance reads
min_split_reads = 1 – the fusion needs to be supported by at least one split read
min_tools = 1 – the fusion needs to be reported by at least one tool
The tiled peptide table looks like the following:
We can then combine our generated tables into a report and save it with
e.g. the writexl package. We can include all tables listed
above:
This .xlsx file will contain the different tables as
sheets. We will use it as an annotation file in the quality control and
screen steps.
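As a minimal sketch of how such a multi-sheet file can be written, `writexl::write_xlsx` accepts a named list of data frames and creates one sheet per list element. The sheet names, the toy tables and the output path below are placeholders chosen for this illustration, not the tables or names that the package itself produces.

```r
# Hypothetical stand-ins for the tables generated in the steps above
sheets = list(
    variants = data.frame(var_id = c("v1", "v2"), gene_name = c("GENE1", "GENE2")),
    fusions = data.frame(fusion = "GENE1--GENE2", split_reads = 3),
    peptides = data.frame(pep_id = c("p1", "p2"), tiled = c("ACGT", "TGCA"))
)

# Assumed output location; replace with your own report path
report_file = file.path(tempdir(), "minigene_report.xlsx")

# Each list element becomes one sheet, named after the list element
if (requireNamespace("writexl", quietly=TRUE)) {
    writexl::write_xlsx(sheets, report_file)
}
```

Using a named list keeps the sheet names explicit, which makes it easier to refer to individual tables when the file is read back in later steps.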