Question: Most efficient way to compute width of overlap of multiple features
1
gravatar for Charles Berry
5.6 years ago by
Charles Berry290
United States
Charles Berry290 wrote:
Vince S. Buffalo <vsbuffalo at="" ...=""> writes: > > Hi Bioconductor folks, > > I'm trying to create some GRanges summaries, but I think I may be missing > an obvious solution. I have fixed-width windows as a GRanges object, and > for each window/row I'd like to add a metadata column that indicates how > many base pairs of this window overlap features in another GRange object. > I'll need to add a few columns for different features in different GRanges > objects. > > I've tried using the approach of findOverlaps, followed by ranges() to > extract range widths. This creates an error: "'query' must be a Ranges of > length equal to number of queries". I saw in the source > that pintersect(query[queryHits(x)], subject[subjectHits(x)]) works too > (and does without error). This returns the overlapping ranges, but it'd > take a load of data munging to get it into the format I'd like ? it seems > like I may be overlooking an easier solution. > > thanks, > Vince > > pintersect() seems like the right approach. Use coverage() to normalize the subject GRanges, then tapply to add the bases covered: > set.seed(12345) > gr <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE), + IRanges(start=sample(100000,100),width=1000)) > gr2 <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE), + IRanges(start=sample(100000,100),width=1000)) > cover2.gr <- subset( as( coverage(gr2)>0, "GRanges"), score) > seqinfocover2.gr) <- seqinfo(gr2) > hits <- findOverlaps(gr,cover2.gr) > gr.over <- pintersect(gr[queryHits(hits)],cover2.gr[subjectHits(hits)]) > gr.counts <- tapply(gr.over,queryHits(hits),FUN=function(x) sum(width(x))) > gr$overlap<- 0 > gr$overlap[as.numeric(names(gr.counts))]<- unname(gr.counts) > gr GRanges with 100 ranges and 1 metadata column: seqnames ranges strand | overlap <rle> <iranges> <rle> | <numeric> [1] b [29447, 30446] * | 0 [2] b [61725, 62724] * | 601 [3] b [97426, 98425] * | 0 [4] b [61820, 62819] * | 696 [5] a [52135, 53134] * | 441 ... ... ... ... ... ... [96] b [ 693, 1692] * | 0 [97] b [27406, 28405] * | 0 [98] b [79728, 80727] * | 755 [99] a [24344, 25343] * | 353 [100] a [45782, 46781] * | 0 --- seqlengths: a b NA NA > length(findOverlaps(subset(gr,overlap==0),gr2))==0 # check [1] TRUE > length(findOverlaps(subset(gr,overlap!=0),gr2))>0 # check [1] TRUE > HTH, Chuck
• 2.3k views
ADD COMMENTlink modified 5.6 years ago by Michael Lawrence11k • written 5.6 years ago by Charles Berry290
Answer: Most efficient way to compute width of overlap of multiple features
0
gravatar for Michael Lawrence
5.6 years ago by
United States
Michael Lawrence11k wrote:
On Tue, Jan 7, 2014 at 9:30 AM, Charles Berry <ccberry@ucsd.edu> wrote: > Vince S. Buffalo <vsbuffalo@...> writes: > > > > > Hi Bioconductor folks, > > > > I'm trying to create some GRanges summaries, but I think I may be missing > > an obvious solution. I have fixed-width windows as a GRanges object, and > > for each window/row I'd like to add a metadata column that indicates how > > many base pairs of this window overlap features in another GRange object. > > I'll need to add a few columns for different features in different > GRanges > > objects. > > > > I've tried using the approach of findOverlaps, followed by ranges() to > > extract range widths. This creates an error: "'query' must be a Ranges of > > length equal to number of queries". I saw in the source > > that pintersect(query[queryHits(x)], subject[subjectHits(x)]) works too > > (and does without error). This returns the overlapping ranges, but it'd > > take a load of data munging to get it into the format I'd like — it seems > > like I may be overlooking an easier solution. > > > > thanks, > > Vince > > > > > > pintersect() seems like the right approach. > > Use coverage() to normalize the subject GRanges, then tapply to add the > bases covered: > > > set.seed(12345) > > gr <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE), > + IRanges(start=sample(100000,100),width=1000)) > > gr2 <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE), > + IRanges(start=sample(100000,100),width=1000)) > > cover2.gr <- subset( as( coverage(gr2)>0, "GRanges"), score) > > seqinfocover2.gr) <- seqinfo(gr2) > I think you just want cover2.gr <- reduce(gr2) instead of the above two lines. > > hits <- findOverlaps(gr,cover2.gr) > > gr.over <- pintersect(gr[queryHits(hits)],cover2.gr[subjectHits(hits)]) > w <- width(ranges(hits, ranges(gr), rangescover2.gr))) *should* work as an alternative to the pintersect call. > > gr.counts <- tapply(gr.over,queryHits(hits),FUN=function(x) > sum(width(x))) > This tapply is going to kill performance. Use something like: gr$overlap <- sum(splitAsList(w, factor(queryHits(hits), seq_len(queryLength(hits))))) Or even: gr$overlap <- sum(relist(w, as(hits, "List"))) But I'm not sure what I was thinking the day I made the Hits=>List coercion. Seems somewhat arbitrary to group by query. But hey, it works. > > gr$overlap<- 0 > > gr$overlap[as.numeric(names(gr.counts))]<- unname(gr.counts) > > gr > GRanges with 100 ranges and 1 metadata column: > seqnames ranges strand | overlap > <rle> <iranges> <rle> | <numeric> > [1] b [29447, 30446] * | 0 > [2] b [61725, 62724] * | 601 > [3] b [97426, 98425] * | 0 > [4] b [61820, 62819] * | 696 > [5] a [52135, 53134] * | 441 > ... ... ... ... ... ... > [96] b [ 693, 1692] * | 0 > [97] b [27406, 28405] * | 0 > [98] b [79728, 80727] * | 755 > [99] a [24344, 25343] * | 353 > [100] a [45782, 46781] * | 0 > --- > seqlengths: > a b > NA NA > > length(findOverlaps(subset(gr,overlap==0),gr2))==0 # check > [1] TRUE > > length(findOverlaps(subset(gr,overlap!=0),gr2))>0 # check > [1] TRUE > > > > HTH, > > Chuck > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD COMMENTlink written 5.6 years ago by Michael Lawrence11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 162 users visited in the last hour