Most efficient way to compute width of overlap of multiple features
1
1
Entering edit mode
Charles Berry ▴ 290
@charles-berry-5754
Last seen 5.1 years ago
United States
Vince S. Buffalo <vsbuffalo at="" ...=""> writes: > > Hi Bioconductor folks, > > I'm trying to create some GRanges summaries, but I think I may be missing > an obvious solution. I have fixed-width windows as a GRanges object, and > for each window/row I'd like to add a metadata column that indicates how > many base pairs of this window overlap features in another GRange object. > I'll need to add a few columns for different features in different GRanges > objects. > > I've tried using the approach of findOverlaps, followed by ranges() to > extract range widths. This creates an error: "'query' must be a Ranges of > length equal to number of queries". I saw in the source > that pintersect(query[queryHits(x)], subject[subjectHits(x)]) works too > (and does without error). This returns the overlapping ranges, but it'd > take a load of data munging to get it into the format I'd like ? it seems > like I may be overlooking an easier solution. > > thanks, > Vince > > pintersect() seems like the right approach. Use coverage() to normalize the subject GRanges, then tapply to add the bases covered: > set.seed(12345) > gr <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE), + IRanges(start=sample(100000,100),width=1000)) > gr2 <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE), + IRanges(start=sample(100000,100),width=1000)) > cover2.gr <- subset( as( coverage(gr2)>0, "GRanges"), score) > seqinfocover2.gr) <- seqinfo(gr2) > hits <- findOverlaps(gr,cover2.gr) > gr.over <- pintersect(gr[queryHits(hits)],cover2.gr[subjectHits(hits)]) > gr.counts <- tapply(gr.over,queryHits(hits),FUN=function(x) sum(width(x))) > gr$overlap<- 0 > gr$overlap[as.numeric(names(gr.counts))]<- unname(gr.counts) > gr GRanges with 100 ranges and 1 metadata column: seqnames ranges strand | overlap <rle> <iranges> <rle> | <numeric> [1] b [29447, 30446] * | 0 [2] b [61725, 62724] * | 601 [3] b [97426, 98425] * | 0 [4] b [61820, 62819] * | 696 [5] a [52135, 53134] * | 441 ... ... ... ... ... ... [96] b [ 693, 1692] * | 0 [97] b [27406, 28405] * | 0 [98] b [79728, 80727] * | 755 [99] a [24344, 25343] * | 353 [100] a [45782, 46781] * | 0 --- seqlengths: a b NA NA > length(findOverlaps(subset(gr,overlap==0),gr2))==0 # check [1] TRUE > length(findOverlaps(subset(gr,overlap!=0),gr2))>0 # check [1] TRUE > HTH, Chuck
• 3.7k views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 2.4 years ago
United States
On Tue, Jan 7, 2014 at 9:30 AM, Charles Berry <ccberry@ucsd.edu> wrote: > Vince S. Buffalo <vsbuffalo@...> writes: > > > > > Hi Bioconductor folks, > > > > I'm trying to create some GRanges summaries, but I think I may be missing > > an obvious solution. I have fixed-width windows as a GRanges object, and > > for each window/row I'd like to add a metadata column that indicates how > > many base pairs of this window overlap features in another GRange object. > > I'll need to add a few columns for different features in different > GRanges > > objects. > > > > I've tried using the approach of findOverlaps, followed by ranges() to > > extract range widths. This creates an error: "'query' must be a Ranges of > > length equal to number of queries". I saw in the source > > that pintersect(query[queryHits(x)], subject[subjectHits(x)]) works too > > (and does without error). This returns the overlapping ranges, but it'd > > take a load of data munging to get it into the format I'd like — it seems > > like I may be overlooking an easier solution. > > > > thanks, > > Vince > > > > > > pintersect() seems like the right approach. > > Use coverage() to normalize the subject GRanges, then tapply to add the > bases covered: > > > set.seed(12345) > > gr <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE), > + IRanges(start=sample(100000,100),width=1000)) > > gr2 <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE), > + IRanges(start=sample(100000,100),width=1000)) > > cover2.gr <- subset( as( coverage(gr2)>0, "GRanges"), score) > > seqinfocover2.gr) <- seqinfo(gr2) > I think you just want cover2.gr <- reduce(gr2) instead of the above two lines. > > hits <- findOverlaps(gr,cover2.gr) > > gr.over <- pintersect(gr[queryHits(hits)],cover2.gr[subjectHits(hits)]) > w <- width(ranges(hits, ranges(gr), rangescover2.gr))) *should* work as an alternative to the pintersect call. > > gr.counts <- tapply(gr.over,queryHits(hits),FUN=function(x) > sum(width(x))) > This tapply is going to kill performance. Use something like: gr$overlap <- sum(splitAsList(w, factor(queryHits(hits), seq_len(queryLength(hits))))) Or even: gr$overlap <- sum(relist(w, as(hits, "List"))) But I'm not sure what I was thinking the day I made the Hits=>List coercion. Seems somewhat arbitrary to group by query. But hey, it works. > > gr$overlap<- 0 > > gr$overlap[as.numeric(names(gr.counts))]<- unname(gr.counts) > > gr > GRanges with 100 ranges and 1 metadata column: > seqnames ranges strand | overlap > <rle> <iranges> <rle> | <numeric> > [1] b [29447, 30446] * | 0 > [2] b [61725, 62724] * | 601 > [3] b [97426, 98425] * | 0 > [4] b [61820, 62819] * | 696 > [5] a [52135, 53134] * | 441 > ... ... ... ... ... ... > [96] b [ 693, 1692] * | 0 > [97] b [27406, 28405] * | 0 > [98] b [79728, 80727] * | 755 > [99] a [24344, 25343] * | 353 > [100] a [45782, 46781] * | 0 > --- > seqlengths: > a b > NA NA > > length(findOverlaps(subset(gr,overlap==0),gr2))==0 # check > [1] TRUE > > length(findOverlaps(subset(gr,overlap!=0),gr2))>0 # check > [1] TRUE > > > > HTH, > > Chuck > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD COMMENT

Login before adding your answer.

Traffic: 679 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6