I wrote two helper functions, explained below
geneRanges <-
function(db, column="ENTREZID")
{
g <- genes(db, columns=column)
col <- mcols(g)[[column]]
genes <- granges(g)[rep(seq_along(g), elementLengths(col))]
mcols(genes)[[column]] <- as.character(unlist(col))
genes
}
splitColumnByOverlap <-
function(query, subject, column="ENTREZID", ...)
{
olaps <- findOverlaps(query, subject, ...)
f1 <- factor(subjectHits(olaps),
levels=seq_len(subjectLength(olaps)))
splitAsList(mcols(query)[[column]][queryHits(olaps)], f1)
}
Load the GenomicRanges package
library(GenomicRanges)
Get the CNV regions into a GRanges instance. For instance, if the data is in a BED file, perhaps using the rtracklayer package
cnv = rtracklayer::import("my.bed")
or if in a tab-delimited file
df = read.delim("my.tsv")
cnv = makeGRangesFromDataFrame(df)
(if you're a fan of dplyr / magrittr style pipes then cnv = read.delim("my.tsv") %>% makeGRangesFromDataFrame()
works, too). Here I make a couple of genomic ranges 'by hand'
> cnv = GRanges(c("chr1", "chr2"), IRanges(94312388,244006886))
> cnv
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [94312388, 244006886] *
[2] chr2 [94312388, 244006886] *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Load the Homo.sapiens package
library(Homo.sapiens)
Find the genomic coordinates for each (entrez) gene, as well as the gene symbol
> gns = geneRanges(Homo.sapiens, column="SYMBOL")
> gns
GRanges object with 23056 ranges and 1 metadata column:
seqnames ranges strand | SYMBOL
<Rle> <IRanges> <Rle> | <character>
1 chr19 [ 58858172, 58874214] - | A1BG
10 chr8 [ 18248755, 18258723] + | NAT2
100 chr20 [ 43248163, 43280376] - | ADA
1000 chr18 [ 25530930, 25757445] - | CDH2
10000 chr1 [243651535, 244006886] - | AKT3
... ... ... ... ... ...
9991 chr9 [114979995, 115095944] - | PTBP3
9992 chr21 [ 35736323, 35743440] + | KCNE2
9993 chr22 [ 19023795, 19109967] - | DGCR2
9994 chr6 [ 90539619, 90584155] + | CASP8AP2
9997 chr22 [ 50961997, 50964905] - | SCO2
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
Find which genes overlap which copy number regions
> symInCnv = splitByOverlap(gns, cnv, "SYMBOL")
> symInCnv
CharacterList of length 2
[["1"]] AKT3 MIR942 MIR190B MIR760 MIR921 ... FAM20B LPGAT1 KIF14 RBM8A NR1I3
[["2"]] TANK SNORD11B MIR933 LOC100129175 CCNT2-AS1 ... ZEB2 FARP2 TLK1 CD302
The result is a CharacterList (list of character vectors) where each element contains the genes overlapping the corresponding CNV region.
A CharacterList is very convenient to work with, but use as.vector(symInCnv)
to work as a plain list-of-characters, or unstrsplit(symInCnv, sep=", ")
to paste the symbols in each CNV together.
Here's the first helper function, disected
geneRanges <-
function(db, column="ENTREZID")
'db' is a so-called OrganismDb, containing information about genes (e.g., ENTREZ or SYMBOL gene ids) as well as gene models (e.g., TXSTART, TXEND). The Homo.sapiens OrangismDb is based on the org.Hs.eg.db gene annotation package, and the TxDb.Hsapiens.UCSC.hg19.knownGene package; it is relatively easy to make a custom package if you have different annotations.
'column' is the type of gene identifier you are interested in mapping to. See columns(Homo.sapiens)
for some ideas.
{
g <- genes(db, columns=column)
This line extracts the coordinates (min and max) of each gene, as well as the column (e.g., SYMBOL associated with that gene.
The TxDb is organized around a primary key (the ENTREZID), and there may be several gene SYMBOLS per ENTREZID. The next few lines makes each SYMBOL map to a single genomic location.
col <- mcols(g)[[column]]
genes <- granges(g)[rep(seq_along(g), elementLengths(col))]
mcols(genes)[[column]] <- as.character(unlist(col))
genes
}
The end result is a GRanges instance, with a metadata column corresponding to the type of gene identifier of interest.
The basic idea behind splitByOverlap is that we can find which gene coordinates overlap which copy number variant coordinates, and then split the column of gene identifiers into lists corresponding to the regions of overlap
splitByOverlap <-
function(x, f, column="ENTREZID", ...)
'query' is the gene coordinates, 'subject' the copy number coordinates. 'column' needs to match 'column' in the geneRanges() function.
{
olaps <- findOverlaps(query, subject, ...)
findOverlaps() is a very powerful function. It returns a 'Hits' object that has two parallel vectors. The vectors can be extracted with queryHits() and subjectHits(). queryHits() are the indexes of the queries (the gene coordinates) that overlap the corresponding subjectHits(), i.e., the indexes of the subjects, the copy number regions.
The next lines line up the query column identifier (e.g., gene SYMBOL) that overlaps each subject.
f1 <- factor(subjectHits(olaps), levels=seq_len(subjectLength(olaps)))
splitAsList(mcols(x)[[column]][queryHits(olaps)], f1)
the use of factor() with exactly as many levels as there are subjects ensures that the splitAsList() command returns a 1:1 mapping between the subjects (copy number regions) and the genes in the corresponding CharacterList.
}