How to reduce two GRanges simutaneously
2
0
Entering edit mode
@niu-liang-nihniehs-e-6477
Last seen 7.2 years ago
Dear Herve, I have a question: suppose that I have two GRanges of the same length, say x and y, where x[i] and y[i] are two genomic ranges that are paired. What I want to do is to reduce x and y simultaneously, i.e., combine (x[i],y[i]) and (x[j],y[j]) when x[i] overlaps with x[j] and y[i] overlaps with y[j]. Ideally, I would like an output in the following format: range_x range_y count chr?:???-??? chr?:???-??? ? where the range_x and range_y are the reduced ranges, and count is the number of records associated with the chr?:???-??? and chr?:???-???. How can I do this? Thanks! Liang
• 609 views
0
Entering edit mode
@michael-lawrence-3846
Last seen 7 weeks ago
United States
This has got to be one of the most complex problems posed here. Here is one approach, but it may be completely wrong: hits_x <- findOverlaps(x) hits_y <- findOverlaps(y) hits_both <- intersect(hits_x, hits_y) I think the constraint of mutual overlap in x and y is now satisfied. In order to reduce, we have to think of the Hits object as a graph, where the connected components are those ranges that need to be reduced. g <- graph::graphNEL(as.character(seq_len(length(x))), split(subjectHits(hits_both), queryHits(hits_both))) comp <- unname(RBGL::connectedComp(g)) # unname just for efficiency That graphNEL constructor might kill your performance? There are other ways to build the graph. Now we can form GRangesLists from the connected components, and since we know they all overlap, just get the range, and unlist them back to GRanges: ids <- as.integer(unlist(comp)) reduced_x <- unlist(range(relist(x[ids], comp))) reduced_y <- unlist(range(relist(y[ids], comp))) And combine the result into a DataFrame with the range counts: answer <- DataFrame(x=reduced_x, y=reduced_y, count=elementLengths(comp)) Hopefully that gets you closer. This hasn't been tested at all. Michael On Mon, Apr 7, 2014 at 11:40 AM, Niu, Liang (NIH/NIEHS) [E] < liang.niu@nih.gov> wrote: > Dear Herve, > > I have a question: suppose that I have two GRanges of the same length, say > x and y, where x[i] and y[i] are two genomic ranges that are paired. What I > want to do is to reduce x and y simultaneously, i.e., combine (x[i],y[i]) > and (x[j],y[j]) when x[i] overlaps with x[j] and y[i] overlaps with y[j]. > Ideally, I would like an output in the following format: > > range_x range_y count > chr?:???-??? chr?:???-??? ? > > where the range_x and range_y are the reduced ranges, and count is the > number of records associated with the chr?:???-??? and chr?:???-???. How > can I do this? > > Thanks! > > Liang > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
0
Entering edit mode
Hi Liang, Michael, Yeah, an interesting problem indeed. Here is my take on it: # Reduce 'x' and 'y' simultaneously. Returns a list of 2 parallel # GRanges, the 1st one is reduced 'x' and the 2nd one reduced 'y'. reduceTwoGRanges <- function(x, y) { if (length(x) != length(y)) stop("'x' and 'y' must have the same length") rx_revmap <- mcols(reduce(x, with.revmap=TRUE))$revmap ry_revmap <- mcols(reduce(y, with.revmap=TRUE))$revmap rx_map <- integer(length(x)) rx_map[unlist(rx_revmap, use.names=FALSE)] <- togroup(rx_revmap) ry_map <- integer(length(y)) ry_map[unlist(ry_revmap, use.names=FALSE)] <- togroup(ry_revmap) sm <- IRanges:::selfmatchIntegerPairs(rx_map, ry_map) ans1 <- unlist(range(split(x, sm)), use.names=FALSE) ans2 <- unlist(range(split(y, sm)), use.names=FALSE) list(ans1, ans2) } Cheers, H. On 04/08/2014 05:37 AM, Michael Lawrence wrote: > This has got to be one of the most complex problems posed here. > > Here is one approach, but it may be completely wrong: > > hits_x <- findOverlaps(x) > hits_y <- findOverlaps(y) > hits_both <- intersect(hits_x, hits_y) > > I think the constraint of mutual overlap in x and y is now satisfied. In > order to reduce, we have to think of the Hits object as a graph, where > the connected components are those ranges that need to be reduced. > > g <- graph::graphNEL(as.character(seq_len(length(x))), > split(subjectHits(hits_both), queryHits(hits_both))) > comp <- unname(RBGL::connectedComp(g)) # unname just for efficiency > > That graphNEL constructor might kill your performance? There are other > ways to build the graph. Now we can form GRangesLists from the connected > components, and since we know they all overlap, just get the range, and > unlist them back to GRanges: > > ids <- as.integer(unlist(comp)) > reduced_x <- unlist(range(relist(x[ids], comp))) > reduced_y <- unlist(range(relist(y[ids], comp))) > > And combine the result into a DataFrame with the range counts: > > answer <- DataFrame(x=reduced_x, y=reduced_y, count=elementLengths(comp)) > > Hopefully that gets you closer. This hasn't been tested at all. > > Michael > > > > On Mon, Apr 7, 2014 at 11:40 AM, Niu, Liang (NIH/NIEHS) [E] > <liang.niu at="" nih.gov="" <mailto:liang.niu="" at="" nih.gov="">> wrote: > > Dear Herve, > > I have a question: suppose that I have two GRanges of the same > length, say x and y, where x[i] and y[i] are two genomic ranges that > are paired. What I want to do is to reduce x and y simultaneously, > i.e., combine (x[i],y[i]) and (x[j],y[j]) when x[i] overlaps with > x[j] and y[i] overlaps with y[j]. Ideally, I would like an output in > the following format: > > range_x range_y count > chr?:???-??? chr?:???-??? ? > > where the range_x and range_y are the reduced ranges, and count is > the number of records associated with the chr?:???-??? and > chr?:???-???. How can I do this? > > Thanks! > > Liang > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
0
Entering edit mode
@herve-pages-1542
Last seen 3 days ago
Seattle, WA, United States
mmmh... fast but incorrect :-/ > x <- GRanges("chr1", IRanges(1:3, width=1)) > y <- GRanges("chr1", IRanges(c(10,20,10), width=1)) > reduceTwoGRanges(x, y) [] GRanges with 2 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle>  chr1 [1, 3] *  chr1 [2, 2] * --- seqlengths: chr1 NA [] GRanges with 2 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle>  chr1 [10, 10] *  chr1 [20, 20] * --- seqlengths: chr1 NA Michael's approach based on detection of the connected components of the graph of overlapping pairs is the way to go. Note that if you want to stay close to the semantic of reduce(), you need to call findOverlaps() with 'maxgap=1' though, so adjacent ranges are considered to overlap. (The fact that we have to use 'maxgap=1' for this is confusing because when ranges are adjacent the gap between them is 0 not 1, but that's another story.) Cheers, H. On 04/08/2014 12:55 PM, Niu, Liang (NIH/NIEHS) [E] wrote: > Thanks! > ________________________________________ > From: Hervé Pagès [hpages at fhcrc.org] > Sent: Tuesday, April 08, 2014 3:42 PM > To: Michael Lawrence; Niu, Liang (NIH/NIEHS) [E] > Cc: bioconductor at r-project.org > Subject: Re: [BioC] How to reduce two GRanges simutaneously > > Hi Liang, Michael, > > Yeah, an interesting problem indeed. > > Here is my take on it: > > # Reduce 'x' and 'y' simultaneously. Returns a list of 2 parallel > # GRanges, the 1st one is reduced 'x' and the 2nd one reduced 'y'. > reduceTwoGRanges <- function(x, y) > { > if (length(x) != length(y)) > stop("'x' and 'y' must have the same length") > rx_revmap <- mcols(reduce(x, with.revmap=TRUE))$revmap > ry_revmap <- mcols(reduce(y, with.revmap=TRUE))$revmap > rx_map <- integer(length(x)) > rx_map[unlist(rx_revmap, use.names=FALSE)] <- togroup(rx_revmap) > ry_map <- integer(length(y)) > ry_map[unlist(ry_revmap, use.names=FALSE)] <- togroup(ry_revmap) > sm <- IRanges:::selfmatchIntegerPairs(rx_map, ry_map) > ans1 <- unlist(range(split(x, sm)), use.names=FALSE) > ans2 <- unlist(range(split(y, sm)), use.names=FALSE) > list(ans1, ans2) > } > > Cheers, > H. > > On 04/08/2014 05:37 AM, Michael Lawrence wrote: >> This has got to be one of the most complex problems posed here. >> >> Here is one approach, but it may be completely wrong: >> >> hits_x <- findOverlaps(x) >> hits_y <- findOverlaps(y) >> hits_both <- intersect(hits_x, hits_y) >> >> I think the constraint of mutual overlap in x and y is now satisfied. In >> order to reduce, we have to think of the Hits object as a graph, where >> the connected components are those ranges that need to be reduced. >> >> g <- graph::graphNEL(as.character(seq_len(length(x))), >> split(subjectHits(hits_both), queryHits(hits_both))) >> comp <- unname(RBGL::connectedComp(g)) # unname just for efficiency >> >> That graphNEL constructor might kill your performance? There are other >> ways to build the graph. Now we can form GRangesLists from the connected >> components, and since we know they all overlap, just get the range, and >> unlist them back to GRanges: >> >> ids <- as.integer(unlist(comp)) >> reduced_x <- unlist(range(relist(x[ids], comp))) >> reduced_y <- unlist(range(relist(y[ids], comp))) >> >> And combine the result into a DataFrame with the range counts: >> >> answer <- DataFrame(x=reduced_x, y=reduced_y, count=elementLengths(comp)) >> >> Hopefully that gets you closer. This hasn't been tested at all. >> >> Michael >> >> >> >> On Mon, Apr 7, 2014 at 11:40 AM, Niu, Liang (NIH/NIEHS) [E] >> <liang.niu at="" nih.gov="" <mailto:liang.niu="" at="" nih.gov="">> wrote: >> >> Dear Herve, >> >> I have a question: suppose that I have two GRanges of the same >> length, say x and y, where x[i] and y[i] are two genomic ranges that >> are paired. What I want to do is to reduce x and y simultaneously, >> i.e., combine (x[i],y[i]) and (x[j],y[j]) when x[i] overlaps with >> x[j] and y[i] overlaps with y[j]. Ideally, I would like an output in >> the following format: >> >> range_x range_y count >> chr?:???-??? chr?:???-??? ? >> >> where the range_x and range_y are the reduced ranges, and count is >> the number of records associated with the chr?:???-??? and >> chr?:???-???. How can I do this? >> >> Thanks! >> >> Liang >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319