stranded findOverlaps
Robert Castelo ★ 3.4k
@rcastelo
Last seen 4 weeks ago
Barcelona/Universitat Pompeu Fabra
dear list, and particularly the IRanges developers,

i'm using the function findOverlaps from the IRanges package because i need to find which stranded genomic intervals from one set (a RangedData object) overlap with which stranded genomic intervals from another set (another RangedData object). the problem is that i don't want to consider overlaps between genomic intervals on different strands.

i've been looking at the help page of findOverlaps (devel version, see my sessionInfo() below) and searched through the BioC mailing list, and my preliminary conclusion is that such an operation is not yet supported.

i've been thinking of using rdapply to break down the RangedData objects into spaces and then again by the two strands, but the problem is that the query and subject indexes resulting from findOverlaps will not match the dimensions of the original RangedData objects.

so, i'd like to suggest that an option be added to this useful function to restrict the overlap search by strand. of course, if this is somehow already implemented and i just missed it, i'll be very grateful if you let me know what function/parameter i should be using.

thanks a lot!!
robert.

sessionInfo()
R version 2.11.0 Under development (unstable) (2009-10-06 r49948)
x86_64-unknown-linux-gnu

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] IRanges_1.5.16
IRanges IRanges • 1.5k views
@michael-dondrup-3849
Last seen 10.4 years ago
Hi Robert,

just a quick guess, maybe somebody who knows IRanges better can correct me. I believe it's not directly possible to access the strand from IRanges objects, because an IRanges object always stores start <= end for any non-empty range. Thus, the direction of the interval has to be taken care of when the IRanges for the RangedData are constructed; that's maybe the reason why there is no parameter for in-strand overlap.

The following approach might be simple enough though (sorry, no code example):

- sort the data set of ranges (alignments, genes, sequencing reads) into two groups by their strand (I assume you have this info somewhere)
- construct two IRanges objects per set (i.e. query, reference), one for plus, one for minus
- make one IRangesList per set, add the corresponding IRanges objects, and name them "plus" and "minus" in the list
- compute the overlap of the IRangesLists (i.e. overlap(set1, set2)) -> you'll get the in-strand overlaps

If you have chromosomes, you construct two IRanges per chromosome in each set.

Does this make sense?

Michael
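The split-by-strand steps above, together with the index bookkeeping raised in the original question, can be sketched in base R without any Bioconductor classes. This is only an illustration under stated assumptions: the data frames `query` and `subject`, their columns `start`, `end` and `strand`, and the helper `overlaps_within_strand()` are hypothetical names, and the all-against-all `outer()` comparison is a simple stand-in for whatever search IRanges performs internally.

```r
# Hypothetical toy intervals on a single chromosome (closed, 1-based).
query <- data.frame(start = c(1, 10, 20), end = c(5, 15, 30),
                    strand = c("+", "-", "+"))
subject <- data.frame(start = c(3, 12, 25), end = c(8, 18, 40),
                      strand = c("+", "+", "-"))

# Find overlaps separately within each strand, but report row numbers of
# the ORIGINAL data frames so the hits can be mapped back.
overlaps_within_strand <- function(query, subject) {
  hits <- lapply(c("+", "-"), function(s) {
    qi <- which(query$strand == s)    # original query rows on this strand
    si <- which(subject$strand == s)  # original subject rows on this strand
    if (length(qi) == 0L || length(si) == 0L) return(NULL)
    # Two closed intervals overlap when each starts before the other ends.
    ov <- outer(query$start[qi], subject$end[si], "<=") &
          outer(query$end[qi], subject$start[si], ">=")
    idx <- which(ov, arr.ind = TRUE)
    # Translate within-strand positions back to original row numbers.
    cbind(query = qi[idx[, 1L]], subject = si[idx[, 2L]])
  })
  do.call(rbind, hits)
}

hits <- overlaps_within_strand(query, subject)
print(hits)
```

Keeping the `which()` vectors `qi` and `si` around is the design choice that matters here: the search runs on the per-strand subsets, while the reported indexes still refer to rows of the unsplit inputs.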
@julien-gagneur-2045
Last seen 10.4 years ago
Dear Robert, for dealing with genomic intervals, you can also consider the genomeIntervals package. A class for stranded genomic intervals is available together with an interval overlap function that behaves in a strand-specific manner. Hope this helps, Julien
@michael-lawrence-3846
Last seen 3.1 years ago
United States
On Fri, Jan 22, 2010 at 11:41 AM, Robert Castelo <robert.castelo@upf.edu> wrote:

> so, i'd like to suggest that some option is added to this useful
> function to restrict the overlapping search by strand.

Well, IRanges knows nothing about biology, so a 'strand' option would be out of place, in my opinion. That said, I can think of at least two approaches.

1) Simply filter the results for matches that are on the same strand. This is something as simple as:

    result <- findOverlaps(a, b)
    mat <- as.matrix(result)
    ## keep only the query/subject pairs whose strands agree;
    ## drop = FALSE keeps a matrix even when a single pair is left
    mat <- mat[a$strand[mat[,1L]] == b$strand[mat[,2L]], , drop = FALSE]

2) Out of recognition that we are really treating the two strands as separate spaces, break down the RangedData into chrom*strand spaces, as in:

    rd <- RangedData(...)
    rd <- do.call(c, split(rd, rd$strand))
    result <- findOverlaps(rd, ...)
    ## then maybe eventually go back to chromosome spaces
    rds <- split(rd, rd$strand)
    names(rds[[1]]) <- chromNames
    names(rds[[2]]) <- chromNames
    rd <- do.call(rbind, rds)

The second approach would be very convenient if you always want to treat the strands separately. The separation could be specified at construction time, e.g.:

    RangedData(ranges, strand, space = interaction(chrom, strand))

But in general neither of these is awfully convenient, and I've always had the suspicion that we'd eventually need multiple space variables. Yes, we could add some argument to the findOverlaps method for RangedData that takes a vector of variable names for splitting into subspaces, but I think we would want a more general solution, where the RangedData itself has the notion of subspaces. This would be a non-trivial change. Would it behave like a nested list in some ways?

Hopefully others have better ideas...

Michael
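The filtering idea in approach 1, combined with the `interaction(chrom, strand)` key from approach 2, can be tried out in base R on toy data. This is a sketch under assumptions rather than the IRanges implementation: the data frames `a` and `b`, their column names, and the helper `all_pairs()` are hypothetical, and the overlap search is a plain all-against-all comparison.

```r
# Hypothetical toy intervals with chromosome and strand columns.
a <- data.frame(chrom = c("chr1", "chr1", "chr2"),
                start = c(1, 10, 5), end = c(8, 20, 9),
                strand = c("+", "-", "+"))
b <- data.frame(chrom = c("chr1", "chr1", "chr2"),
                start = c(5, 15, 1), end = c(12, 25, 6),
                strand = c("+", "+", "+"))

# All (query, subject) pairs whose closed intervals overlap, ignoring
# chromosome and strand for now.
all_pairs <- function(a, b) {
  ov <- outer(a$start, b$end, "<=") & outer(a$end, b$start, ">=")
  idx <- which(ov, arr.ind = TRUE)
  cbind(query = idx[, 1L], subject = idx[, 2L])
}

mat <- all_pairs(a, b)

# One composite key per interval: every chrom*strand combination acts as
# its own space. as.character() avoids comparing factors whose level
# sets might differ between the two data sets.
space_a <- as.character(interaction(a$chrom, a$strand))
space_b <- as.character(interaction(b$chrom, b$strand))

# Keep only the pairs that live in the same space; drop = FALSE keeps a
# matrix even if zero or one pair survives.
keep <- space_a[mat[, "query"]] == space_b[mat[, "subject"]]
filtered <- mat[keep, , drop = FALSE]
print(filtered)
```

Filtering after the fact is simple and keeps the original indexes intact, at the cost of computing overlaps that are thrown away; splitting into spaces first avoids that work but needs the index translation.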
Hi,

On Mon, Jan 25, 2010 at 12:56 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:

> Hopefully others have better ideas...

How about defining findOverlaps on "AlignedRead" objects (from the ShortRead package), and having "easy" ways to create an AlignedRead object out of IRanges/RangesList objects (with appropriate additional metadata)?

I reckon you'd need a juiced-up findOverlaps function with added params specifying how to deal (or not) with the metadata in the AlignedRead objects, though, among other things.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
On Jan 25, 2010, at 12:56 PM, Michael Lawrence wrote:

> Hopefully others have better ideas...

We will need good support for stranded genomic intervals. This is a very important case to handle, and it will be even more important in the future, when a number of assays will be stranded. We need support for doing operations on such objects, both ignoring strand and not ignoring strand.

An example could be that we take a (stranded) genome annotation and want to perform a per-chromosome reduce(). Users might want to do a reduce respecting strand information, where we would get one IRanges per chromosome * strand, or we might want to do a reduce(anno, ignoreStrand = TRUE), which yields one IRanges per chromosome.

I agree that the general design might be to allow for any number of nested subspaces, but we do have a very important special case where we know that the second level of nestedness has only two components. I believe a lot of value would be gained from being able to operate easily on such objects.

Kasper
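The two flavours of reduce() described above can be illustrated in base R. Everything here is an assumption made for the sake of the example: `anno`, `merge_intervals()` and `reduce_stranded()` are hypothetical names, the `ignore_strand` argument only mirrors the `ignoreStrand` argument proposed in the post, and the sketch joins directly adjacent intervals as well as overlapping ones.

```r
# Hypothetical stranded annotation on one chromosome (closed, 1-based).
anno <- data.frame(start  = c(1, 4, 10, 12),
                   end    = c(5, 8, 15, 20),
                   strand = c("+", "-", "+", "+"))

# Merge overlapping or directly adjacent intervals of one group.
merge_intervals <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  # A new merged block begins when an interval starts beyond the furthest
  # end seen so far (plus one, so that adjacent intervals are joined).
  reach <- cummax(end)
  new_block <- c(TRUE, start[-1L] > head(reach, -1L) + 1)
  block <- cumsum(new_block)
  data.frame(start = as.vector(tapply(start, block, min)),
             end   = as.vector(tapply(end, block, max)))
}

# Reduce per strand, or across both strands when ignore_strand = TRUE.
reduce_stranded <- function(x, ignore_strand = FALSE) {
  if (ignore_strand) {
    return(cbind(merge_intervals(x$start, x$end), strand = "*"))
  }
  parts <- lapply(split(x, x$strand, drop = TRUE), function(g) {
    cbind(merge_intervals(g$start, g$end), strand = g$strand[1L])
  })
  out <- do.call(rbind, parts)
  rownames(out) <- NULL
  out
}

stranded   <- reduce_stranded(anno)
unstranded <- reduce_stranded(anno, ignore_strand = TRUE)
print(stranded)
print(unstranded)
```

On this toy input the minus-strand interval stays separate in the stranded result, while the unstranded result absorbs it into the first merged block, which is the difference between one set of ranges per chromosome * strand and one per chromosome.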
It seems that RangedData (with a strand variable) has fallen into this role within the IRanges framework. At one point, there was a GenomicData subclass that did allow for special strand options. Unfortunately, the GenomicData() convenience constructor of RangedData is all that is left of that. So we could bring that back. Whatever happened to the proposed GenomeRanges package?

Michael

On Mon, Jan 25, 2010 at 11:02 AM, Kasper Daniel Hansen <khansen@stat.berkeley.edu> wrote:

> We will need good support for stranded genomic intervals.
Michael,

Before creating a new class to capture the strandedness of RangedData objects, it would be useful to have a list of methods in which strandedness can be used:

Inter-interval ops: disjoin, gaps, reduce, range, coverage
Between interval set ops: intersect, setdiff, union, findOverlaps, %in%, match

For each of these operations, is strandedness a separate and unique categorization of the data, or are there other categories by which users would like to group intervals during these operations? For example, I added a "by" argument to the reduce method for RangedData because I was in correspondence with someone who wanted to use both the "strand" and "score" columns to match during the reduction exercise. My instinct is that all the inter-interval operations could use grouping capabilities, and that these groupings are fluid: in one pass you might want to use a "score" column and in another you would like to ignore that information.

So in short, I would argue against adding any more classes and instead for adding "by" arguments to all the operations listed above that could support it. The between-interval-set operations that couldn't support it would use all the columns in the RangedData objects and not just the strand, so if you wanted to find the intersection of two RangedData objects, the entire row of data would have to match and not just the intervals or the stranded intervals.

Patrick

Michael Lawrence wrote:

> It seems that RangedData (with a strand variable) has fallen into this role
> within the IRanges framework.
On Tue, Jan 26, 2010 at 10:06 AM, Patrick Aboyoun <paboyoun@fhcrc.org> wrote:

> So in short, I would argue to not add any more classes and instead add
> "by" arguments to all the operations listed above that could support it
> and for those between interval set operation that couldn't support it
> use all the columns in the RangedData objects and not just the strand
> so if you wanted to find the intersection of two RangedData objects,
> the entire row of data would have to match and not just the intervals
> or the stranded intervals.

Well, this idea has certainly crossed my mind, but it sounds like a big headache. Every element that operates over the ranges will need this argument. Isn't this what the by() function is for? I guess we have implicit iteration, so it is not such a big jump to implicit iteration over transient ragged arrays? I realize the GenomicRanges would need to override every method, which is a pain, but at least it is optimized (at least at the user level) for the special use case.

Anyway, I guess this idea has my vote. Lot of work though.

Michael