The editor has been updated to markdown! Please see more info at: Tutorial: Updated Support Site Editor

Question: Flagging overlapping gene exons (was Re: Question about CSAMA10 ...)
gravatar for Steve Lianoglou
8.4 years ago by
Steve Lianoglou12k wrote:
Howdy, On Tue, Sep 28, 2010 at 12:28 PM, Martin Morgan <mtmorgan at=""""> wrote: > On 09/28/2010 08:09 AM, Paul Geeleher wrote: >> I guess my next question is if there is a convenient way of filtering >> out the part of the gene >> that overlaps another one using GenomicRanges? I've been meaning to do something that basically performs this core functionality for a while. I just finished putting together a first cut of it and added it to my GenomicFeatures extensions package: I'm also not an IRanges guru, so I'm sure there are better/more-magical ways to do this. I'll need to test this more thoroughly, but it currently looks to do the right thing for the example provided below. The context I'm using it for is to create an "annotated" GRanges object (1 per chromosome) that contains annotated intervals that (i) span the length of the chromosome; and (ii) are annotated with exon information for parts (or all) of exons that exclusively belong to a given gene. I think some example code + results will speak better about what I mean than words can. Note that if you have GenomicRanges installed and the "testthat" library, you can easily play along by sourcing urls straight from github, so let's set up the R environment R> library(testthat) R> library(GenomicRanges) I have a list of GRanges objects (you can use a GRangesList if you like) with exon annotations for genes. `lchr1` is an example object for a pseudo "chr1" that only has a few (4) gene annotations on it and is 10,000 bp long. GeneA and GeneC have some exons that overlap, and will become disjoint later. GeneD will cause further "fractionation" if we toss out strand info: R> source(" ests/test-annotateChromosome.R") R> lchr1 GRanges with 4 ranges and 1 elementMetadata value seqnames ranges strand | exon.anno <rle> <iranges> <rle> | <character> [1] chr1 [ 10, 30] + | utr5 [2] chr1 [ 40, 60] + | cds [3] chr1 [ 80, 100] + | cds [4] chr1 [120, 140] + | utr3 ... I've written a function `annotateChromosome` that will calculate the ranges along a chromosome that exclusive belong to a gene as explained above, and assigns each interval the appropriate exon annotation. R> source(" tateGenome.R") R> annotateChromosome(lchr1, 'chr1', 1000, stranded=TRUE) GRanges with 14 ranges and 2 elementMetadata values seqnames ranges strand | exon.anno symbol <rle> <iranges> <rle> | <character> <character> [1] chr1 [ 10, 30] + | utr5 GeneA [2] chr1 [ 15, 45] - | cds GeneD [3] chr1 [ 40, 49] + | cds GeneA [4] chr1 [ 50, 100] - | utr5 GeneD [5] chr1 [ 50, 60] + | overlap NA [6] chr1 [ 61, 79] + | cds GeneC [7] chr1 [ 80, 85] + | overlap NA [8] chr1 [ 86, 100] + | cds GeneA [9] chr1 [105, 115] + | cds GeneC [10] chr1 [120, 140] + | utr3 GeneA [11] chr1 [200, 230] + | utr5 GeneB [12] chr1 [250, 280] + | cds GeneB [13] chr1 [300, 330] + | cds GeneB [14] chr1 [400, 430] + | utr3 GeneB You can try setting `stranded=FALSE` to see how that changes the results. If you look at the function here: .R The lapply block that generates the "interval.annos" object (~ line 34) does the bookkeeping involved in splitting apart overlapping regions and keeping only the parts of exons that are unique to each gene (look around the call to tabulate(...)). The lapply block that generates "clean.list" (~ line 51) goes back to the original annotated gene list and picks out the appropriate annotations for each now-disjoint exon. That's about it. As I said earlier, this is a first cut at doing this. I'm sure it can be cleaner and done more efficiently, but I thought you'd be interested in the code that does the disjoining stuff (in the context of this function, or just by itself). It's only about 20 lines of code, so ... you might not have to do a rewrite in Python after all :-) [not that I have anything against Python .. I rather like it] -steve ps - the content/validity of those github links will (obviously) change over time. In case they change before you get a change to play with the contents of this message, you can get the code @ the time of this commit from these links: annotateChromosome function: me.R code to generate the example lchr1 object: /test-annotateChromosome.R -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info:
ADD COMMENTlink written 8.4 years ago by Steve Lianoglou12k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 271 users visited in the last hour