Detecting Homologous Recombination in Family Pedigrees

Introduction

This package provides tools for analyzing meiotic recombination, runs of homozygosity, and long-distance haplotype phase in family-based genetic data.

Installation

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("inferRecom")

Detecting recombination

inferRecom enables rapid, accurate detection of meiotic recombination in nuclear families. The following vignette detects recombination in a simulated set of pedigrees of varying sizes based on the CEU cohort from the 1000 Genomes Project using the package sim1000G and formatted in PLINK binary format. The sex-specific recombination rate maps in Europeans used in this example were generated by Bhérer, et al., and are available via Github.

This vignette also showcases several auxiliary functions that facilitate visualization, analysis of regions of homozygosity, and chromosome-scale phasing of informative markers.

Detecting putative recombination intervals

The first section demonstrates the core function xoDetect(), which localizes and individualizes recombination in families with both parents and three or more children, and localizes recombination in families with both parents and exactly two children. It filters artifacts arising from genotyping error and other experimental error using the default SNP-based criterion of five consecutive informative SNPs to confirm a recombination event, as well as a map-based criterion of one centiMorgan to confirm a recombination event.

# Detect recombinations in families with two parents and three or more children
#    and families with two parents and exactly two children
# Filter for recombination with SNP-based and cM-based filters at default values
# Option to select for only SNPs that have an rs number is disabled by default

exampleDir <- system.file("extdata", package = "inferRecom")
simCEU <- file.path(exampleDir, "simCEU")
mapFemale <- file.path(exampleDir, "female_chr4.txt")
mapMale <- file.path(exampleDir, "male_chr4.txt")

pat3 <- xoDetect(simCEU, mapMale, familySize = 3, parent = "father", caseControl = FALSE)
mat3 <- xoDetect(simCEU, mapFemale, familySize = 3, parent = "mother", caseControl = FALSE)
pat2 <- xoDetect(simCEU, mapMale, familySize = 2, parent = "father")
mat2 <- xoDetect(simCEU, mapFemale, familySize = 2, parent = "mother")

pat3
#> GRanges object with 3 ranges and 6 metadata columns:
#>       seqnames            ranges strand |     childId    familyId    startSnp
#>          <Rle>         <IRanges>  <Rle> | <character> <character> <character>
#>   [1]     chr4 78432058-78432285      * |        F6C2          F6  rs10026702
#>   [2]     chr4 78393878-78424195      * |        F8C3          F8   rs4450960
#>   [3]     chr4 78084453-78100731      * |        F8C5          F8  rs17327923
#>         finishSnp   startCm  finishCm
#>       <character> <numeric> <numeric>
#>   [1]  rs10026918   57.3462   57.3462
#>   [2]  rs17400095   57.3424   57.3462
#>   [3]   rs2645687   57.3172   57.3178
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The output of the functions is a GRanges object or list showing the individual/family IDs, SNP IDs, locations, and recombination rate map centiMorgan distances for each recombination. Given that families with two children do not have sufficient information to infer the recombinant child, the ‘CHILD’ column is omitted in the output. If the caseControl option is specified, a list of GRanges separated by case/control status is returned. The results with the default filtering parameters are shown below.

Filtering for double recombination

Filtering for closely-spaced events can remove genotyping error and other artefacts in the data that are not consistent with true homologous recombination. The default value for the SNP filter is 5, and for the genetic-distance filter is 1cM. Setting the genetic distance-based filter to 0 removes the genetic distance condition altogether. For comparison, the output below shows paternal recombinations in families with three or more children without the genetic distance filter.

pat3_0cM <- xoDetect(simCEU,
    mapMale,
    familySize = 3,
    parent = "father",
    caseControl = FALSE,
    cmFilter = 0
)

pat3_0cM
#> GRanges object with 5 ranges and 6 metadata columns:
#>       seqnames            ranges strand |     childId    familyId    startSnp
#>          <Rle>         <IRanges>  <Rle> | <character> <character> <character>
#>   [1]     chr4 78059331-78077129      * |        F6C1          F6  rs12502028
#>   [2]     chr4 78828799-78833246      * |        F6C1          F6  rs10518181
#>   [3]     chr4 78432058-78432285      * |        F6C2          F6  rs10026702
#>   [4]     chr4 78393878-78424195      * |        F8C3          F8   rs4450960
#>   [5]     chr4 78084453-78100731      * |        F8C5          F8  rs17327923
#>         finishSnp   startCm  finishCm
#>       <character> <numeric> <numeric>
#>   [1]   rs1874564   57.2841   57.3170
#>   [2]   rs6836305   57.6484   57.6484
#>   [3]  rs10026918   57.3462   57.3462
#>   [4]  rs17400095   57.3424   57.3462
#>   [5]   rs2645687   57.3172   57.3178
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Removing the genetic distance filter results in the detection of two recombination intervals in child 1 in pedigree 6 that are separated by only 0.33cM (~750kb), which is unlikely to represent a true meiotic recombination due to interference in crossing over.

Visualizing GenomicRanges output from xoDetect() using trackViewer and rtracklayer

It is straightforward to visualize recombination intervals using the Bioconductor packages trackViewer and rtracklayer. Here, case and control intervals are plotted for basic comparison using trackViewer.

mat3CC <- xoDetect(simCEU,
    mapFemale,
    familySize = 3,
    parent = "mother",
    caseControl = TRUE
)

# Add feature column for compatibility with "gene" type in trackViewer
mat3CC$case$feature <- "interval"
mat3CC$control$feature <- "interval"

caseTrack <- new(
    "track",
    dat = mat3CC$case,
    type = "gene",
    format = "BED"
)

setTrackStyleParam(caseTrack, "color", "black")

controlTrack <- new(
    "track",
    dat = mat3CC$case,
    type = "gene",
    format = "BED"
)

setTrackStyleParam(controlTrack, "color", "blue")

tl <- trackList(caseTrack, controlTrack) #
viewRegion <- GRanges("chr4", IRanges(78000000, 79500000))
viewTracks(tl, gr = viewRegion, autoOptimizeStyle = TRUE)

While trackViewer provides convenient functionality for visualization entirely in an R environment, some intervals are not visible at this resolution. This and many other applications benefit from more interactive visualization directly in the UCSC Genome Browser. Example tracks as exported in the code below are shown in the browser alongside recombination maps and nearby genes.

export(mat3CC$case, "~/path/to/caseOutput.bed", format = "BED")

export(mat3CC$control, "~/path/to/controlOutput.bed", format = "BED")
#|echo: false
knitr::include_graphics("figures/example-browser.png")

Runs of homozygosity

One of the factors that affects the density of informative SNPs and therefore the resolution of recombination intervals is the degree of homozygosity in the sample. The function hzRun() is a tool to detect runs of homozygosity of a user-specified minimum length in Mb (default 1Mb). To account for variation in SNP density, it also includes a criterion for the minimum number of SNPs per run, with a default value of 5. Users can also provide a genetic map and specify a minimum centiMorgan length for runs. This example applies a sex-averaged map to all samples.

As the output is a GenomicRanges object or list, as in xoDetect(), the procedure to produce genome browser visualizations is identical to that of the recombination visualizations corresponding to xoDetect().

sexavg_chr4 <- file.path(exampleDir, "sexavg_chr4.txt")
hzRun(simCEU, sexavg_chr4)
#> GRanges object with 1 range and 6 metadata columns:
#>       seqnames            ranges strand |    sampleId    startSnp   finishSnp
#>          <Rle>         <IRanges>  <Rle> | <character> <character> <character>
#>   [1]        4 78002267-79180440      * |        F4C3  rs17261150   rs7340831
#>         numSnps   startCm  finishCm
#>       <integer> <numeric> <numeric>
#>   [1]      1028   82.7108   83.7803
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Phasing at Informative SNPs

Inferring recombination implies identity by descent in between the informative SNPs, allowing for the phasing of this subset of SNPs in the pedigrees with three or more offspring. Many population-based phasing programs are highly accurate over short distances, but phase accuracy deteriorates rapidly at distances greater than a few centiMorgans. The function xoPhase() implements a novel phasing algorithm to produce long-range haplotype phase information at informative SNPs. Using the information in the nuclear family structures, it is possible to obtain chromosome-scale haplotypes at the informative subset of SNPs. In these data, 36% of total SNPs are informative. Homozygous SNPs are also reported, and non-informative SNPs are shown as NA.

The function xoPhase() uses the sets of maternal and paternal recombinations to infer haplotype phase at all informative SNPs for all members of each the nuclear families with three or more children. There is also an option to specify a subset of family IDs for phasing.

Several options are available for output, including a list of DataFrames, a SummarizedExperiment object, or VCF. The paternal and maternal chromosomes transmitted in each child are labeled with the suffix ‘Pat’ and ‘Mat’ respectively, while each chromosome in the parents is labeled with a 1 or 2.

phased <- xoPhase(simCEU,
    pat3,
    mat3,
    output = "list"
)
phased$F6[1158:1162, ]
#> DataFrame with 5 rows and 12 columns
#>          rsID  location     F6C1Pat     F6C1Mat     F6C2Pat     F6C2Mat
#>   <character> <numeric> <character> <character> <character> <character>
#> 1  rs17002978  79292724           C           C           C           C
#> 2  rs10032397  79294804           A           A           A           A
#> 3  rs17420118  79294988           C           C           C           C
#> 4   rs6838661  79296422           G           G           G           G
#> 5  rs17002988  79298781           T           T           T           T
#>       F6C3Pat     F6C3Mat      F6P1_1      F6P1_2      F6P2_1      F6P2_2
#>   <character> <character> <character> <character> <character> <character>
#> 1           C           C           C           C           C           C
#> 2           A           A           A           A           A           A
#> 3           C           C           C           C           C           C
#> 4           C           G           G           C           G           G
#> 5           T           T           T           T           T           T

Session Information

The session information records the versions of all packages used in this vignette.

sessionInfo()