I wrote two helper functions, explained below
geneRanges <-
function(db, column="ENTREZID")
{
g <- genes(db, columns=column)
col <- mcols(g)[[column]]
genes <- granges(g)[rep(seq_along(g), elementLengths(col))]
mcols(genes)[[column]] <- as.character(unlist(col))
genes
}
splitColumnByOverlap <-
function(query, subject, column="ENTREZID", ...)
{
olaps <- findOverlaps(query, subject, ...)
f1 <- factor(subjectHits(olaps),
levels=seq_len(subjectLength(olaps)))
splitAsList(mcols(query)[[column]][queryHits(olaps)], f1)
}
Load the GenomicRanges package
library(GenomicRanges)
Get the CNV regions into a GRanges instance. For instance, if the data is in a BED file, perhaps using the rtracklayer package
cnv = rtracklayer::import("my.bed")
or if in a tab-delimited file
df = read.delim("my.tsv")
cnv = makeGRangesFromDataFrame(df)
(if you're a fan of dplyr / magrittr style pipes then cnv = read.delim("my.tsv") %>% makeGRangesFromDataFrame()
works, too). Here I make a couple of genomic ranges 'by hand'
> cnv = GRanges(c("chr1", "chr2"), IRanges(94312388,244006886))
> cnv
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [94312388, 244006886] *
[2] chr2 [94312388, 244006886] *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Load the Homo.sapiens package
library(Homo.sapiens)
Find the genomic coordinates for each (entrez) gene, as well as the gene symbol
> gns = geneRanges(Homo.sapiens, column="SYMBOL")
> gns
GRanges object with 23056 ranges and 1 metadata column:
seqnames ranges strand | SYMBOL
<Rle> <IRanges> <Rle> | <character>
1 chr19 [ 58858172, 58874214] - | A1BG
10 chr8 [ 18248755, 18258723] + | NAT2
100 chr20 [ 43248163, 43280376] - | ADA
1000 chr18 [ 25530930, 25757445] - | CDH2
10000 chr1 [243651535, 244006886] - | AKT3
... ... ... ... ... ...
9991 chr9 [114979995, 115095944] - | PTBP3
9992 chr21 [ 35736323, 35743440] + | KCNE2
9993 chr22 [ 19023795, 19109967] - | DGCR2
9994 chr6 [ 90539619, 90584155] + | CASP8AP2
9997 chr22 [ 50961997, 50964905] - | SCO2
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
Find which genes overlap which copy number regions
> symInCnv = splitByOverlap(gns, cnv, "SYMBOL")
> symInCnv
CharacterList of length 2
[["1"]] AKT3 MIR942 MIR190B MIR760 MIR921 ... FAM20B LPGAT1 KIF14 RBM8A NR1I3
[["2"]] TANK SNORD11B MIR933 LOC100129175 CCNT2-AS1 ... ZEB2 FARP2 TLK1 CD302
The result is a CharacterList (list of character vectors) where each element contains the genes overlapping the corresponding CNV region.
A CharacterList is very convenient to work with, but use as.vector(symInCnv)
to work as a plain list-of-characters, or unstrsplit(symInCnv, sep=", ")
to paste the symbols in each CNV together.
Here's the first helper function, disected
geneRanges <-
function(db, column="ENTREZID")
'db' is a so-called OrganismDb, containing information about genes (e.g., ENTREZ or SYMBOL gene ids) as well as gene models (e.g., TXSTART, TXEND). The Homo.sapiens OrangismDb is based on the org.Hs.eg.db gene annotation package, and the TxDb.Hsapiens.UCSC.hg19.knownGene package; it is relatively easy to make a custom package if you have different annotations.
'column' is the type of gene identifier you are interested in mapping to. See columns(Homo.sapiens)
for some ideas.
{
g <- genes(db, columns=column)
This line extracts the coordinates (min and max) of each gene, as well as the column (e.g., SYMBOL associated with that gene.
The TxDb is organized around a primary key (the ENTREZID), and there may be several gene SYMBOLS per ENTREZID. The next few lines makes each SYMBOL map to a single genomic location.
col <- mcols(g)[[column]]
genes <- granges(g)[rep(seq_along(g), elementLengths(col))]
mcols(genes)[[column]] <- as.character(unlist(col))
genes
}
The end result is a GRanges instance, with a metadata column corresponding to the type of gene identifier of interest.
The basic idea behind splitByOverlap is that we can find which gene coordinates overlap which copy number variant coordinates, and then split the column of gene identifiers into lists corresponding to the regions of overlap
splitByOverlap <-
function(x, f, column="ENTREZID", ...)
'query' is the gene coordinates, 'subject' the copy number coordinates. 'column' needs to match 'column' in the geneRanges() function.
{
olaps <- findOverlaps(query, subject, ...)
findOverlaps() is a very powerful function. It returns a 'Hits' object that has two parallel vectors. The vectors can be extracted with queryHits() and subjectHits(). queryHits() are the indexes of the queries (the gene coordinates) that overlap the corresponding subjectHits(), i.e., the indexes of the subjects, the copy number regions.
The next lines line up the query column identifier (e.g., gene SYMBOL) that overlaps each subject.
f1 <- factor(subjectHits(olaps), levels=seq_len(subjectLength(olaps)))
splitAsList(mcols(x)[[column]][queryHits(olaps)], f1)
the use of factor() with exactly as many levels as there are subjects ensures that the splitAsList() command returns a 1:1 mapping between the subjects (copy number regions) and the genes in the corresponding CharacterList.
}
function
was deprecated in S4Vectors version update 0.16.0 (in 2016?). \ So I used:
https://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/S4Vectors/NEWS
@Martin Thank you so much.
It was really helpful, I learned many thing!
Thanks a lot for this post. This is really helpful. I was trying to annotate cnv regions using the code snippets you provided.
For some of the regions the annotation appears to be incorrect or I am doing something wrong.
Example:
The genes returned are not in chr 10 . Could you please guide me to understand what I am doing wrong?
What I have:
>cnv
>gns
symInCnv = splitColumnByOverlap(gns, cnv, "SYMBOL")
By way of reproducible example, I loaded the packages and created the range that you indicate is problematic
I then copy and pasted the two functions above (updated here) into my R session, and calculated the gene ranges and overlaps of my test range
I get
The versions of the various packages in use are
What do you get, what is your sessionInfo(), and which gene(s) are not on chr10?
It worked this time around. Not very sure what mistake I was doing last time. LRP1B was listed as gene on chr 10, which is not correct. It worked perfectly fine this time. Thanks a lot for your help.
It works fine @Martin, thank you very much. But I need to export the GRanges object with the SYMBOL column to a data frame, or to print it in a text file, but I can`t, it is possible? thank you very much in advance.
Convert a GRanges gr to data.frame with
as.data.frame(gr)
. Paste the elements of a CharacterList x into a character vector withpaste(x, collapse=", ")
. Think twice about whether either of these is something you really want to do.Thank you very much @MartinMorgan, but if I do that, the name of the genes is substituted by numbers, why? How can I solve that?
Thank you very much and sorry the annoyances
Maybe the relevant column is a factor() not a character() vector.
Yes, is a factor... I`ve already solve the problem. Thanks!!
Something to consider is that this code uses the Homo.sapiens package which uses hg19 as its reference genome. This method makes it unusable if your data uses hg38.
Adding comments to ancient posts is not particularly useful, given the changes that may have happened in the intervening period. In addition, while what you state is technically correct, it is not correct in a larger sense, as you can swap in any
TxDb
version you might like.@Martin thank you so much
I need to trasnform
because I need the genes for each cnv to joint as a column to other data frame (where I have my original cnvs with their chromosomal ranges)
Thank you very much
It would make more sense to have the CNV regions as GRanges objects, and simply add the copy number columns
ok, thanks. It works perfect. Thank you very much!!
Just one thing, you should put the update version of the helper functions (updated here) in your first post, because is confusing... ;)
thank you very much!!
regards
@Martin I'm trying to map CNV to genes, I had the following error when I applied your code
what should I do also I have 24 chromosomes , I would like to apply the code to the whole CNV file.
Use the function definitions at the top of the response --
splitColumnsByOverlap()
-- the code and explanation seem to have drifted a bit.@Martin Thanks for your answer, I got the following error
xx = splitColumnByOverlap(gns, cnv, "SYMBOL")
Warning message:
In .Seqinfo.mergexy(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)
my cnv is the copy number variation file for tumor tissue only
GRanges object with 284458 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 3218610, 95674710] *
[2] 1 [ 95676511, 95676518] *
[3] 1 [ 95680124, 167057183] *
[4] 1 [167057495, 167059336] *
[5] 1 [167059760, 181602002] *
... ... ... ...
[284454] 19 [ 284018, 58878226] *
[284455] 20 [ 455764, 62219837] *
[284456] 21 [15347621, 47678774] *
[284457] 22 [17423930, 49331012] *
[284458] 23 [ 3157107, 154905589] *
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
my data file look like the following(part of it)
compare seqlevels(gns) and seqlevels(cnv). see ?"seqlevelsStyle<-" to coerce between 'styles' (e.g,. NCBI, UCSC), but make sure it is sensible to do so (e.g., gns and cnv based on the same annotation). You're really asking a different (i.e., new) question, so start a new question rather than extending this comment.