Question: Most efficient way to compute width of overlap of multiple features
1
5.6 years ago by
Charles Berry290
United States
Charles Berry290 wrote:
Vince S. Buffalo <vsbuffalo at="" ...=""> writes: > > Hi Bioconductor folks, > > I'm trying to create some GRanges summaries, but I think I may be missing > an obvious solution. I have fixed-width windows as a GRanges object, and > for each window/row I'd like to add a metadata column that indicates how > many base pairs of this window overlap features in another GRange object. > I'll need to add a few columns for different features in different GRanges > objects. > > I've tried using the approach of findOverlaps, followed by ranges() to > extract range widths. This creates an error: "'query' must be a Ranges of > length equal to number of queries". I saw in the source > that pintersect(query[queryHits(x)], subject[subjectHits(x)]) works too > (and does without error). This returns the overlapping ranges, but it'd > take a load of data munging to get it into the format I'd like ? it seems > like I may be overlooking an easier solution. > > thanks, > Vince > > pintersect() seems like the right approach. Use coverage() to normalize the subject GRanges, then tapply to add the bases covered: > set.seed(12345) > gr <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE), + IRanges(start=sample(100000,100),width=1000)) > gr2 <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE), + IRanges(start=sample(100000,100),width=1000)) > cover2.gr <- subset( as( coverage(gr2)>0, "GRanges"), score) > seqinfocover2.gr) <- seqinfo(gr2) > hits <- findOverlaps(gr,cover2.gr) > gr.over <- pintersect(gr[queryHits(hits)],cover2.gr[subjectHits(hits)]) > gr.counts <- tapply(gr.over,queryHits(hits),FUN=function(x) sum(width(x))) > gr$overlap<- 0 > gr$overlap[as.numeric(names(gr.counts))]<- unname(gr.counts) > gr GRanges with 100 ranges and 1 metadata column: seqnames ranges strand | overlap <rle> <iranges> <rle> | <numeric> [1] b [29447, 30446] * | 0 [2] b [61725, 62724] * | 601 [3] b [97426, 98425] * | 0 [4] b [61820, 62819] * | 696 [5] a [52135, 53134] * | 441 ... ... ... ... ... ... [96] b [ 693, 1692] * | 0 [97] b [27406, 28405] * | 0 [98] b [79728, 80727] * | 755 [99] a [24344, 25343] * | 353 [100] a [45782, 46781] * | 0 --- seqlengths: a b NA NA > length(findOverlaps(subset(gr,overlap==0),gr2))==0 # check [1] TRUE > length(findOverlaps(subset(gr,overlap!=0),gr2))>0 # check [1] TRUE > HTH, Chuck
• 2.3k views
modified 5.6 years ago by Michael Lawrence11k • written 5.6 years ago by Charles Berry290
Answer: Most efficient way to compute width of overlap of multiple features
0
5.6 years ago by
United States
Michael Lawrence11k wrote:
On Tue, Jan 7, 2014 at 9:30 AM, Charles Berry <ccberry@ucsd.edu> wrote: > Vince S. Buffalo <vsbuffalo@...> writes: > > > > > Hi Bioconductor folks, > > > > I'm trying to create some GRanges summaries, but I think I may be missing > > an obvious solution. I have fixed-width windows as a GRanges object, and > > for each window/row I'd like to add a metadata column that indicates how > > many base pairs of this window overlap features in another GRange object. > > I'll need to add a few columns for different features in different > GRanges > > objects. > > > > I've tried using the approach of findOverlaps, followed by ranges() to > > extract range widths. This creates an error: "'query' must be a Ranges of > > length equal to number of queries". I saw in the source > > that pintersect(query[queryHits(x)], subject[subjectHits(x)]) works too > > (and does without error). This returns the overlapping ranges, but it'd > > take a load of data munging to get it into the format I'd like  it seems > > like I may be overlooking an easier solution. > > > > thanks, > > Vince > > > > > > pintersect() seems like the right approach. > > Use coverage() to normalize the subject GRanges, then tapply to add the > bases covered: > > > set.seed(12345) > > gr <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE), > + IRanges(start=sample(100000,100),width=1000)) > > gr2 <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE), > + IRanges(start=sample(100000,100),width=1000)) > > cover2.gr <- subset( as( coverage(gr2)>0, "GRanges"), score) > > seqinfocover2.gr) <- seqinfo(gr2) > I think you just want cover2.gr <- reduce(gr2) instead of the above two lines. > > hits <- findOverlaps(gr,cover2.gr) > > gr.over <- pintersect(gr[queryHits(hits)],cover2.gr[subjectHits(hits)]) > w <- width(ranges(hits, ranges(gr), rangescover2.gr))) *should* work as an alternative to the pintersect call. > > gr.counts <- tapply(gr.over,queryHits(hits),FUN=function(x) > sum(width(x))) > This tapply is going to kill performance. Use something like: gr$overlap <- sum(splitAsList(w, factor(queryHits(hits), seq_len(queryLength(hits))))) Or even: gr$overlap <- sum(relist(w, as(hits, "List"))) But I'm not sure what I was thinking the day I made the Hits=>List coercion. Seems somewhat arbitrary to group by query. But hey, it works. > > gr$overlap<- 0 > > gr$overlap[as.numeric(names(gr.counts))]<- unname(gr.counts) > > gr > GRanges with 100 ranges and 1 metadata column: > seqnames ranges strand | overlap > <rle> <iranges> <rle> | <numeric> > [1] b [29447, 30446] * | 0 > [2] b [61725, 62724] * | 601 > [3] b [97426, 98425] * | 0 > [4] b [61820, 62819] * | 696 > [5] a [52135, 53134] * | 441 > ... ... ... ... ... ... > [96] b [ 693, 1692] * | 0 > [97] b [27406, 28405] * | 0 > [98] b [79728, 80727] * | 755 > [99] a [24344, 25343] * | 353 > [100] a [45782, 46781] * | 0 > --- > seqlengths: > a b > NA NA > > length(findOverlaps(subset(gr,overlap==0),gr2))==0 # check > [1] TRUE > > length(findOverlaps(subset(gr,overlap!=0),gr2))>0 # check > [1] TRUE > > > > HTH, > > Chuck > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]