Entering edit mode
Howdy,
On Tue, Sep 28, 2010 at 12:28 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote:
> On 09/28/2010 08:09 AM, Paul Geeleher wrote:
>> I guess my next question is if there is a convenient way of
filtering
>> out the part of the gene
>> that overlaps another one using GenomicRanges?
I've been meaning to do something that basically performs this core
functionality for a while. I just finished putting together a first
cut of it and added it to my GenomicFeatures extensions package:
http://github.com/lianos/GenomicFeaturesX
I'm also not an IRanges guru, so I'm sure there are
better/more-magical ways to do this. I'll need to test this more
thoroughly, but it currently looks to do the right thing for the
example provided below.
The context I'm using it for is to create an "annotated" GRanges
object (1 per chromosome) that contains annotated intervals that (i)
span the length of the chromosome; and (ii) are annotated with exon
information for parts (or all) of exons that exclusively belong to a
given gene.
I think some example code + results will speak better about what I
mean than words can.
Note that if you have GenomicRanges installed and the "testthat"
library, you can easily play along by sourcing urls straight from
github, so let's set up the R environment
R> library(testthat)
R> library(GenomicRanges)
I have a list of GRanges objects (you can use a GRangesList if you
like) with exon annotations for genes. `lchr1` is an example object
for a pseudo "chr1" that only has a few (4) gene annotations on it and
is 10,000 bp long.
GeneA and GeneC have some exons that overlap, and will become disjoint
later.
GeneD will cause further "fractionation" if we toss out strand info:
R> source("http://github.com/lianos/GenomicFeaturesX/raw/master/inst/t
ests/test-annotateChromosome.R")
R> lchr1
GRanges with 4 ranges and 1 elementMetadata value
seqnames ranges strand | exon.anno
<rle> <iranges> <rle> | <character>
[1] chr1 [ 10, 30] + | utr5
[2] chr1 [ 40, 60] + | cds
[3] chr1 [ 80, 100] + | cds
[4] chr1 [120, 140] + | utr3
...
I've written a function `annotateChromosome` that will calculate the
ranges along a chromosome that exclusive belong to a gene as explained
above, and assigns each interval the appropriate exon annotation.
R> source("http://github.com/lianos/GenomicFeaturesX/raw/master/R/anno
tateGenome.R")
R> annotateChromosome(lchr1, 'chr1', 1000, stranded=TRUE)
GRanges with 14 ranges and 2 elementMetadata values
seqnames ranges strand | exon.anno symbol
<rle> <iranges> <rle> | <character> <character>
[1] chr1 [ 10, 30] + | utr5 GeneA
[2] chr1 [ 15, 45] - | cds GeneD
[3] chr1 [ 40, 49] + | cds GeneA
[4] chr1 [ 50, 100] - | utr5 GeneD
[5] chr1 [ 50, 60] + | overlap NA
[6] chr1 [ 61, 79] + | cds GeneC
[7] chr1 [ 80, 85] + | overlap NA
[8] chr1 [ 86, 100] + | cds GeneA
[9] chr1 [105, 115] + | cds GeneC
[10] chr1 [120, 140] + | utr3 GeneA
[11] chr1 [200, 230] + | utr5 GeneB
[12] chr1 [250, 280] + | cds GeneB
[13] chr1 [300, 330] + | cds GeneB
[14] chr1 [400, 430] + | utr3 GeneB
You can try setting `stranded=FALSE` to see how that changes the
results.
If you look at the function here:
http://github.com/lianos/GenomicFeaturesX/blob/master/R/annotateGenome
.R
The lapply block that generates the "interval.annos" object (~ line
34) does the bookkeeping involved in splitting apart overlapping
regions and keeping only the parts of exons that are unique to each
gene (look around the call to tabulate(...)).
The lapply block that generates "clean.list" (~ line 51) goes back to
the original annotated gene list and picks out the appropriate
annotations for each now-disjoint exon.
That's about it.
As I said earlier, this is a first cut at doing this. I'm sure it can
be cleaner and done more efficiently, but I thought you'd be
interested in the code that does the disjoining stuff (in the context
of this function, or just by itself).
It's only about 20 lines of code, so ... you might not have to do a
rewrite in Python after all :-) [not that I have anything against
Python .. I rather like it]
-steve
ps - the content/validity of those github links will (obviously)
change over time. In case they change before you get a change to play
with the contents of this message, you can get the code @ the time of
this commit from these links:
annotateChromosome function:
http://github.com/lianos/GenomicFeaturesX/blob/fa69b3c2/R/annotateGeno
me.R
code to generate the example lchr1 object:
http://github.com/lianos/GenomicFeaturesX/blob/fa69b3c2/inst/tests
/test-annotateChromosome.R
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
?| Memorial Sloan-Kettering Cancer Center
?| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact