Question: countMatches() (was: table for GenomicRanges)
4.9 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:
Hi,

I added findMatches() and countMatches() to the latest IRanges / GenomicRanges packages (in BioC devel only).

  findMatches(x, table): An enhanced version of 'match' that returns all the matches in a Hits object.

  countMatches(x, table): Returns an integer vector of the length of 'x', containing the number of matches in 'table' for each element in 'x'.

countMatches() is what you can use to tally/count/tabulate (choose your preferred term) the unique elements in a GRanges object:

  library(GenomicRanges)
  set.seed(33)
  gr <- GRanges("chr1", IRanges(sample(15,20,replace=TRUE), width=5))

Then:

  > gr_levels <- sort(unique(gr))
  > countMatches(gr_levels, gr)
   [1] 1 1 1 2 4 2 2 1 2 2 2

Note that findMatches() and countMatches() also work on IRanges and DNAStringSet objects, as well as on ordinary atomic vectors:

  library(hgu95av2probe)
  library(Biostrings)
  probes <- DNAStringSet(hgu95av2probe)
  unique_probes <- unique(probes)
  count <- countMatches(unique_probes, probes)
  max(count)  # 7

I made other changes in IRanges/GenomicRanges so that the notion of "match" between elements of a vector-like object now consistently means "equality" instead of "overlap", even for range-based objects like IRanges or GRanges objects. This notion of "equality" is the same that is used by ==. The most visible consequence of those changes is that using %in% between 2 IRanges or GRanges objects 'query' and 'subject' in order to do overlaps was replaced by overlapsAny(query, subject).

  overlapsAny(query, subject): Finds the ranges in 'query' that overlap any of the ranges in 'subject'.

There are warnings and deprecation messages in place to help smooth the transition.

Cheers,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
written 4.9 years ago by Hervé Pagès • modified 4.9 years ago by Michael Lawrence
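As an illustration of the post above, here is a minimal sketch that stores the tally next to the unique ranges and contrasts equality-based matching with overlap-based matching. It assumes a GenomicRanges installation in which countMatches() and overlapsAny() are available as described in the post; the `count` metadata column and the `query` object are names made up for this example. The exact counts depend on the random number generator of the R version in use, so none are shown.

```r
library(GenomicRanges)  # also attaches IRanges and S4Vectors

set.seed(33)
gr <- GRanges("chr1", IRanges(sample(15, 20, replace = TRUE), width = 5))

# Unique ranges in sorted order, with the number of times each occurs in gr
# stored in a metadata column (the column name "count" is an assumption).
gr_levels <- sort(unique(gr))
mcols(gr_levels)$count <- countMatches(gr_levels, gr)

# Each element of gr is equal to exactly one unique range, so the counts
# add up to the length of gr.
stopifnot(sum(mcols(gr_levels)$count) == length(gr))

# A query range shifted by one position relative to the first range of gr:
# it necessarily overlaps gr[1], but it is only "in" gr under the equality
# semantics if gr happens to contain an identical range.
query <- shift(gr[1], 1L)
is_equal_to_any <- query %in% gr           # equality semantics
overlaps_any    <- overlapsAny(query, gr)  # overlap semantics
```

Keeping the counts as a metadata column means they stay attached to their ranges through later subsetting or sorting.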
4.9 years ago by
United States
Michael Lawrence9.8k wrote:
The change to the behavior of %in% is a pretty big one. Are you thinking that all set-based operations should behave this way? For example, setdiff and intersect? I really liked the syntax of "peaks %in% genes". In my experience, it's way more common to ask questions about overlap than about equality, so I'd rather optimize the API for that use case. But again, that's just my personal bias.

Michael
written 4.9 years ago by Michael Lawrence
For what it is worth, I share Michael's personal bias here.

Sean
written 4.9 years ago by Sean Davis
To address Sean and Michael's points, I wonder if

  queryGR %in% subjectGR

could just mean, quite literally, the comparison

  findOverlaps(queryGR, subjectGR, type='within')

and then to make things explicit, perhaps the operators

  queryGR %within% subjectGR
  queryGR %overlaps% subjectGR
  queryGR %equals% subjectGR

could be introduced for readability? This would be good programming hygiene anyways, as it removes some ambiguity for new users.

I routinely use %d%, %i%, %u% as shorthand for the binary operations setdiff(x, y), intersect(x, y), and union(x, y), at least when doing such operations in base R. Wouldn't break my heart to add operators for explicitly doing comparisons of GRs either.

--
A model is a lie that helps you see the truth.
Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
written 4.9 years ago by Tim Triche
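The post mentions %d%, %i% and %u% as shorthand but does not show how they are defined. A plausible base R sketch follows; the definitions are an assumption for illustration, not the poster's actual code.

```r
# Hypothetical definitions of the shorthand operators named in the post.
`%d%` <- function(x, y) setdiff(x, y)
`%i%` <- function(x, y) intersect(x, y)
`%u%` <- function(x, y) union(x, y)

a <- c(1, 2, 3, 4)
b <- c(3, 4, 5)

only_in_a <- a %d% b  # elements of a that are not in b
in_both   <- a %i% b  # elements present in both a and b
in_either <- a %u% b  # distinct elements of a followed by those only in b
```

Any function whose name is wrapped in percent signs can be used as a binary infix operator in R, which is what makes this kind of shorthand cheap to add.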
More explicitly, I note that:

  R> selectMethod('%in%', c('GenomicRanges','GenomicRanges'))
  Method Definition:

  function (x, table)
  {
      warning(IRanges:::`%in%.warning.msg`("GenomicRanges"))
      !is.na(match(x, table, match.if.overlap = FALSE))
  }
  <environment: namespace:GenomicRanges>

is certainly explicit... that said, what I am thinking of, in MM parlance, is

  identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
  identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
  identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE

Perhaps the latter would be better written as

  x %identical% table

or

  x %isElementOf% table

or some such?

Anyways. Just some thoughts. It can be a bit nebulous what, precisely, is being tabulated when one first starts using Ranges for comparisons IMO.
written 4.9 years ago by Tim Triche
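The three identities proposed above can be turned into working operators with very little code. The sketch below is hypothetical: the operator names come from the post, while the definitions and the toy `genes` and `peaks` objects are assumptions made for illustration. Note that later IRanges releases, as far as I know, export their own %over%, %within% and %outside% operators, so defining %within% at top level masks the packaged one.

```r
library(GenomicRanges)

# Hypothetical operators, each defined directly by the identity in the post.
`%within%`   <- function(x, table) countOverlaps(x, table, type = "within") > 0
`%overlaps%` <- function(x, table) countOverlaps(x, table, type = "any") > 0
`%equals%`   <- function(x, table) countOverlaps(x, table, type = "equal") > 0

genes <- GRanges("chr1", IRanges(start = c(100, 500), end = c(200, 600)))
peaks <- GRanges("chr1", IRanges(start = c(100, 120, 190, 700),
                                 end   = c(200, 150, 250, 750)))

# Peak 1 coincides with the first gene, peak 2 lies inside it, peak 3
# straddles its right edge, and peak 4 touches no gene at all.
is_equal   <- peaks %equals% genes
is_within  <- peaks %within% genes
is_overlap <- peaks %overlaps% genes
```

On this toy data the three operators are nested: every peak that is equal to a gene is also within one, and every peak within a gene also overlaps one.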
Hiya, For what it is worth... I think the change to %in% is warranted. If I understand correctly, this change restores the relationship between the semantics of `%in%` and the semantics of `match`.
written 4.9 years ago by Malcolm Cook
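For ordinary vectors, the relationship referred to here is the one given in the base R documentation, where %in% is described in terms of match(). A short base R check, using made-up example vectors:

```r
x     <- c("a", "b", "z")
table <- c("b", "a", "a", "c")

# match() gives the position of the first match of each element of x in
# table, or NA when there is no match.
first_pos <- match(x, table)

# %in% as documented for base R: a match position greater than zero.
in_via_match <- match(x, table, nomatch = 0) > 0

# Both formulations agree element by element.
stopifnot(identical(in_via_match, x %in% table))
```

Under the change discussed in this thread, the same relationship is meant to hold for range objects, with "match" meaning equality of ranges.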
On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec@stowers.org> wrote:

> Hiya,
>
> For what it is worth...
>
> I think the change to %in% is warranted.
>
> If I understand correctly, this change restores the relationship between
> the semantics of `%in%` and the semantics of `match`.
>
> From the docs:
>
>   '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>
> Herve's change restores this relationship.

match and %in% were initially consistent (both considering any overlap); Herve has changed both of them together. The whole idea behind IRanges is that ranges are special data types with special semantics. We have reimplemented much of the existing R vector API using those semantics; this extends beyond match/%in%. I am hesitant about making such sweeping changes to the API so late in the life-cycle of the package. There was a feature request for a way to count identical ranges in a set of ranges. Let's please not get carried away and start redesigning the API for this one, albeit useful, request. There are all sorts of inconsistencies in the API, and many of them were conscious decisions that considered practical use cases.

Michael

> Herve, I suspect you were, as a result, able to completely drop all the
> `%in%,BiocClass1,BiocClass2` definitions and depend upon base::%in%
>
> Am I right?
>
> If so, may I suggest that Herve stay the course, with the addition of
>
>   '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L, minoverlap=1L,
>   type='any', select='all') > 0'
>
> This would provide a perspicacious idiom, thereby optimizing the API for
> Michael's observed common use case.
>
> Just sayin'
>
> ~Malcolm
written 4.9 years ago by Michael Lawrence
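The "peaks %in% genes" use case at stake in this exchange is most often a subsetting step. The sketch below shows one way to keep that one-line idiom with an explicit overlap operator. It is an assumption-laden illustration: the %ol% name is borrowed from the quoted suggestion, but the body uses countOverlaps() with its default arguments rather than the quoted definition, because a per-element count compares naturally with 0. The `genes` and `peaks` objects are made up.

```r
library(GenomicRanges)

# Hypothetical overlap operator: TRUE for each element of a that overlaps
# at least one element of b.
`%ol%` <- function(a, b) countOverlaps(a, b, type = "any") > 0

genes <- GRanges("chr1", IRanges(start = 1, end = 100))
peaks <- GRanges("chr1", IRanges(start = c(10, 300), width = 20))

# Keep only the peaks that overlap a gene.
peaks_in_genes <- peaks[peaks %ol% genes]

# The operator should agree with overlapsAny(), the replacement for the old
# overlap-based %in% that is named in the original post.
stopifnot(all((peaks %ol% genes) == overlapsAny(peaks, genes)))
```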
So why not leave %in% the same as it was, revert to

  setMethod('%in%', c('GenomicRanges','GenomicRanges'),
            function (x, table) {
              warning(IRanges:::`%in%.warning.msg`("GenomicRanges"))
              !is.na(match(x, table, match.if.overlap = TRUE))
            })

and introduce the explicit %within%, %overlaps%, %equals% generic operators for clarity?

Should avoid the massive churn to the API while still allowing people to tabulate things cleanly, no?
written 4.9 years ago by Tim Triche
Hiya again,

I am definitely a latecomer to BioC, so I readily defer to the tide of history.

But I do think you miss my point, Michael, about the proposed change making the relationship between %in% and match for {G,I}Ranges{List} mimic that between other vectors, and I do think that changing the API would help other latecomers take to BioC more easily and quickly.

That said, I NEVER use %in%, so I really have no stake in the matter, and I DEFINITELY appreciate the argument against changing the API just for semantic sweetness.

That that said, Herve is _so good_ about deprecations and warnings that such changes are fairly easy to digest.

That that that.... enough.... I bow out of this one....!!!!

Always learning and Happy New Year to all lurkers,

~Malcolm

From: Michael Lawrence [mailto:lawrence.michael@gene.com]
Sent: Friday, January 04, 2013 5:11 PM
To: Cook, Malcolm
Cc: Sean Davis; Michael Lawrence; Hervé Pagès (hpages@fhcrc.org); Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)

On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec@stowers.org> wrote:

> Hiya,
>
> For what it is worth...
>
> I think the change to %in% is warranted.
>
> If I understand correctly, this change restores the relationship between the semantics of `%in%` and the semantics of `match`.
>
> From the docs:
>
>   '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>
> Herve's change restores this relationship.

match and %in% were initially consistent (both considering any overlap); Herve has changed both of them together. The whole idea behind IRanges is that ranges are special data types with special semantics. We have reimplemented much of the existing R vector API using those semantics; this extends beyond match/%in%. I am hesitant about making such sweeping changes to the API so late in the life-cycle of the package. There was a feature request for a way to count identical ranges in a set of ranges. Let's please not get carried away and start redesigning the API for this one, albeit useful, request. There are all sorts of inconsistencies in the API, and many of them were conscious decisions that considered practical use cases.

Michael

> Herve, I suspect that as a result you were able to completely drop all the `%in%,BiocClass1,BiocClass2` definitions and depend upon base::%in%
>
> Am I right?
>
> If so, may I suggest that Herve stay the course, with the addition of
>
>   '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L, minoverlap=1L, type='any', select='all') > 0'
>
> This would provide a perspicuous idiom, thereby optimizing the API for Michael's observed common use case.
>
> Just sayin'
>
> ~Malcolm
>
> -----Original Message-----
> From: bioconductor-bounces@r-project.org [mailto:bioconductor-bounces@r-project.org] On Behalf Of Sean Davis
> Sent: Friday, January 04, 2013 3:37 PM
> To: Michael Lawrence
> Cc: Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
> Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>
> On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence <lawrence.michael@gene.com> wrote:
> > The change to the behavior of %in% is a pretty big one. Are you thinking that all set-based operations should behave this way? For example, setdiff and intersect? I really liked the syntax of "peaks %in% genes". In my experience, it's way more common to ask questions about overlap than about equality, so I'd rather optimize the API for that use case. But again, that's just my personal bias.
>
> For what it is worth, I share Michael's personal bias here.
>
> Sean
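Editorial sketch of the `%ol%` operator proposed above. As written in the proposal, `findOverlaps(..., select='all')` returns a Hits object rather than one value per query element, so comparing it with `> 0` would not yield the intended logical vector. The version below is an assumption about what was intended, built on `countOverlaps()`; the operator name `%ol%` and the toy `peaks`/`genes` objects are illustrative only and not part of any package.

```r
library(GenomicRanges)

# Hypothetical operator (name taken from the proposal in this thread).
# countOverlaps() returns one integer per element of 'a', so the
# comparison produces a logical vector parallel to 'a'.
`%ol%` <- function(a, b) {
  countOverlaps(a, b, type = "any") > 0L
}

# Toy data, made up for illustration.
peaks <- GRanges("chr1", IRanges(start = c(1, 20, 50), width = 5))
genes <- GRanges("chr1", IRanges(start = c(3, 100), width = 10))

in_gene <- peaks %ol% genes

# Should agree with the overlapsAny() function announced in this thread.
same_as_overlapsAny <- identical(in_gene, overlapsAny(peaks, genes))
```

Only the first peak (positions 1-5) touches a gene (positions 3-12), so `in_gene` flags that element alone.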
written 4.9 years ago by Malcolm Cook 1.4k
Yes, 'peaks %in% genes' is cute and was probably doing the right thing for most users (although not all). But 'exons %in% genes' is cute too and was probably doing the wrong thing for all users. Advanced users like you guys would have no problem switching to

  !is.na(findOverlaps(peaks, genes, type="within", select="any"))

or

  !is.na(findOverlaps(peaks, genes, type="equal", select="any"))

in case 'peaks %in% genes' was not doing exactly what you wanted, but most users would not find this particularly friendly. Even worse, some users probably didn't realize that 'peaks %in% genes' was not doing exactly what they thought it did, because "peaks in genes" in English suggests that the peaks are within the genes, but that's not what 'peaks %in% genes' does.

Having overlapsAny(), with exactly the same extra arguments as countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap', 'type', 'ignore.strand'), all of them documented (and with most users more or less familiar with them already), has the virtue of exposing the user to all the options from the very start, and of helping him/her make the right choice. Of course there will be users who don't want to, or don't have the time to, read/think about all the options. Not a big deal: they'll just do 'overlapsAny(query, subject)', which is not a lot more typing than 'query %in% subject', especially if they use tab completion.

It's true that it's more common to ask questions about overlap than about equality, but there are some use cases for the latter (as the original thread shows). Until now, when you had such a use case, you could not use match() or %in%, which would have been the natural things to use, because they got hijacked to do something else, and you were left with nothing. Not a satisfying situation. So at a minimum, we needed to restore the true/real/original semantics of match() to do "equality" instead of "overlap". But it's hard to do this for match() and not do it for %in% too. For more than 99% of R users, %in% is just a simple wrapper for 'match(x, table, nomatch = 0) > 0' (this is how it has been documented and implemented in base R for many years). Not maintaining this relationship between %in% and match() would only cause grief and frustration to newcomers to Bioconductor.

H.

On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
> Hiya again,
>
> I am definitely a latecomer to BioC, so I readily defer to the tide of history.
>
> But I do think you miss my point, Michael, about the proposed change making the relationship between %in% and match for {G,I}Ranges{List} mimic that between other vectors, and I do think that changing the API would help other latecomers take to BioC more easily and quickly.
>
> That said, I NEVER use %in%, so I really have no stake in the matter, and I DEFINITELY appreciate the argument against changing the API just for semantic sweetness.
>
> That that said, Herve is _so good_ about deprecations and warnings that such changes are fairly easy to digest.
>
> That that that.... enough.... I bow out of this one....!!!!
>
> Always learning and Happy New Year to all lurkers,
>
> ~Malcolm
>
> From: Michael Lawrence [mailto:lawrence.michael at gene.com]
> Sent: Friday, January 04, 2013 5:11 PM
> To: Cook, Malcolm
> Cc: Sean Davis; Michael Lawrence; Hervé Pagès (hpages at fhcrc.org); Tim Triche, Jr.; Vedran Franke; bioconductor at r-project.org
> Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>
> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec at stowers.org> wrote:
>
> > Hiya,
> >
> > For what it is worth...
> >
> > I think the change to %in% is warranted.
> >
> > If I understand correctly, this change restores the relationship between the semantics of `%in%` and the semantics of `match`.
> >
> > From the docs:
> >
> >   '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
> >
> > Herve's change restores this relationship.
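Editorial sketch of the distinctions drawn in the reply above: `overlapsAny()` with different `type` values versus `%in%`, which now means equality. The `exons` and `genes` objects are made-up toy data, and the calls assume a current GenomicRanges installation with default `maxgap`/`minoverlap` settings.

```r
library(GenomicRanges)

# Toy data, made up for illustration.
exons <- GRanges("chr1", IRanges(start = c(10, 45, 200), end = c(20, 60, 210)))
genes <- GRanges("chr1", IRanges(start = c(5, 200),      end = c(50, 210)))

# Any overlap at all: the old meaning of 'exons %in% genes'.
any_ov    <- overlapsAny(exons, genes)

# Exon lies entirely inside some gene: what "exons in genes" suggests in English.
within_ov <- overlapsAny(exons, genes, type = "within")

# Exon has exactly the same start and end as some gene.
equal_ov  <- overlapsAny(exons, genes, type = "equal")

# %in% now tests equality of whole elements, consistent with match().
is_in     <- exons %in% genes
via_match <- !is.na(match(exons, genes))
```

The second exon (45-60) overlaps the first gene (5-50) but sticks out past its end, so it is counted by the default `type = "any"` and rejected by `type = "within"`. Only the third exon is identical to a gene, so it is the only one flagged by `%in%`.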
written 4.9 years ago by Hervé Pagès ♦♦ 13k
I think having overlapsAny is a nice addition and helps make the API more complete and explicit. Are you sure we need to change the behavior of the match method for this relatively uncommon use case? I don't think "match" always has to mean "equality". It is a more general concept in my mind. The most common use case for matching ranges is overlap.

Michael

On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages@fhcrc.org> wrote:

> Yes, 'peaks %in% genes' is cute and was probably doing the right thing for most users (although not all). But 'exons %in% genes' is cute too and was probably doing the wrong thing for all users. Advanced users like you guys would have no problem switching to
>
>   !is.na(findOverlaps(peaks, genes, type="within", select="any"))
>
> or
>
>   !is.na(findOverlaps(peaks, genes, type="equal", select="any"))
>
> in case 'peaks %in% genes' was not doing exactly what you wanted, but most users would not find this particularly friendly. Even worse, some users probably didn't realize that 'peaks %in% genes' was not doing exactly what they thought it did, because "peaks in genes" in English suggests that the peaks are within the genes, but that's not what 'peaks %in% genes' does.
>
> Having overlapsAny(), with exactly the same extra arguments as countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap', 'type', 'ignore.strand'), all of them documented (and with most users more or less familiar with them already), has the virtue of exposing the user to all the options from the very start, and of helping him/her make the right choice. Of course there will be users who don't want to, or don't have the time to, read/think about all the options. Not a big deal: they'll just do 'overlapsAny(query, subject)', which is not a lot more typing than 'query %in% subject', especially if they use tab completion.
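Editorial sketch of the two meanings of "match" under discussion in this exchange: whole-element equality, as `match()` does after the change, versus "first overlapping range", which `findOverlaps(select = "first")` provides. The wrapper name `overlap_match` is hypothetical and the `query`/`subject` objects are toy data made up for illustration.

```r
library(GenomicRanges)

# Toy data, made up for illustration.
query   <- GRanges("chr1", IRanges(start = c(1, 8, 30), width = 5))
subject <- GRanges("chr1", IRanges(start = c(8, 3, 10), width = 5))

# Equality semantics: index of the first identical range in 'subject', or NA.
eq_idx <- match(query, subject)

# Hypothetical convenience wrapper for the overlap-based meaning:
# index of the first overlapping range in 'subject', or NA.
overlap_match <- function(query, subject, ...) {
  findOverlaps(query, subject, select = "first", ...)
}

ov_idx <- overlap_match(query, subject)
```

The first query range (1-5) equals no subject range but overlaps the second one (3-7), so the two approaches disagree on it. The second query range (8-12) is identical to the first subject range, and the third overlaps nothing.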
written 4.9 years ago by Michael Lawrence 9.8k
Hi Michael,

I don't think "match" (the word) always has to mean "equality" either. However, having match() (the function) do "whole exact matching" (aka "equality") for any kind of vector-like object has the advantage of:

(a) making it consistent with base::match() (?base::match is pretty explicit about what the contract of match() is)

(b) preserving its relationship with ==, duplicated(), unique(), etc...

(c) not frustrating the user who needs something to do exact matching on ranges (as I mentioned previously, if you take match() away from him/her, s/he'll be left with nothing).

IMO those advantages counterbalance *by far* the very little convenience you get from having 'match(query, subject)' do 'findOverlaps(query, subject, select="first")' on IRanges/GRanges objects. If you need to do that, just use the latter, or, if you think that's still too much typing, define a wrapper, e.g. 'ovmatch(query, subject)'.

There are plenty of specialized tools around for doing inexact/fuzzy/partial/overlap matching for many particular types of vector-like objects: grep() and family, pmatch(), charmatch(), agrep(), grepRaw(), matchPattern() and family, findOverlaps() and family, findInterval(), etc... For the reasons I mentioned above, none of them should hijack match() to make it do some particular type of inexact matching on some particular type of objects. Even if, for that particular type of objects, doing that particular type of inexact matching is more common than doing exact matching.

H.

On 01/06/2013 05:39 PM, Michael Lawrence wrote:
> I think having overlapsAny is a nice addition and helps make the API more complete and explicit. Are you sure we need to change the behavior of the match method for this relatively uncommon use case?

Yes because otherwise users with a use case of doing match() even if it's uncommon,

> I don't think "match" always has to mean "equality". It is a more general concept in my mind. The most common use case for matching ranges is overlap.

Of course "match" doesn't always have to mean equality. But of base

> Michael
>
> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> > Yes, 'peaks %in% genes' is cute and was probably doing the right thing for most users (although not all). But 'exons %in% genes' is cute too and was probably doing the wrong thing for all users. Advanced users like you guys would have no problem switching to
> >
> >   !is.na(findOverlaps(peaks, genes, type="within", select="any"))
> >
> > or
> >
> >   !is.na(findOverlaps(peaks, genes, type="equal", select="any"))
> >
> > in case 'peaks %in% genes' was not doing exactly what you wanted, but most users would not find this particularly friendly. Even worse, some users probably didn't realize that 'peaks %in% genes' was not doing exactly what they thought it did, because "peaks in genes" in English suggests that the peaks are within the genes, but that's not what 'peaks %in% genes' does.
> >
> > Having overlapsAny(), with exactly the same extra arguments as countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap', 'type', 'ignore.strand'), all of them documented (and with most users more or less familiar with them already), has the virtue of exposing the user to all the options from the very start, and of helping him/her make the right choice. Of course there will be users who don't want to, or don't have the time to, read/think about all the options. Not a big deal: they'll just do 'overlapsAny(query, subject)', which is not a lot more typing than 'query %in% subject', especially if they use tab completion.
> >
> > It's true that it's more common to ask questions about overlap than about equality, but there are some use cases for the latter (as the original thread shows).
The most visible consequence > of those > .>> changes is that using %in% between 2 IRanges or > GRanges objects > .>> 'query' and 'subject' in order to do overlaps was > replaced by > .>> overlapsAny(query, subject). > .>> > .>> overlapsAny(query, subject): Finds the ranges in > ?query? that > .>> overlap any of the ranges in ?subject?. > .>> > > .>> There are warnings and deprecation messages in place > to help > smooth > > .>> the transition. > .>> > .>> Cheers, > .>> H. > .>> > .>> -- > .>> Hervé Pagès > .>> > .>> Program in Computational Biology > .>> Division of Public Health Sciences > .>> Fred Hutchinson Cancer Research Center > .>> 1100 Fairview Ave. N, M1-B514 > .>> P.O. Box 19024 > .>> Seattle, WA 98109-1024 > .>> > .>> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > .>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > <tel:%28206%29%20667-5791> > .>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > <tel:%28206%29%20667-1319> > > .>> > .> > .> [[alternative HTML version deleted]] > .> > .> > .> _________________________________________________ > .> Bioconductor mailing list > .> Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > > .> https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > .> Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > . 
> ._________________________________________________ > .Bioconductor mailing list > .Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > > .https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > .Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
written 4.9 years ago by Hervé Pagès ♦♦ 13k
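For readers following the %in% / match() argument, here is a minimal base-R sketch of the relationship quoted above, plus a rough base-R analogue of what countMatches() tabulates. The vectors and the helper name count_equal are made up for illustration; this only uses base R on atomic vectors and says nothing about how the IRanges/GenomicRanges methods are implemented.

```r
# Illustrative data (not from the thread)
x     <- c("a", "b", "z")
table <- c("b", "a", "a", "c")

# match() returns the position of the first exact match in 'table', or NA
first_hit <- match(x, table)

# The documented base-R relationship: %in% is match() with nomatch = 0
in_via_match <- match(x, table, nomatch = 0L) > 0L
stopifnot(identical(in_via_match, x %in% table))

# Hypothetical helper: for each element of 'x', count how many elements
# of 'table' are equal to it (the "equality" notion of a match)
count_equal <- function(x, table) {
  vapply(seq_along(x), function(i) sum(table == x[i]), integer(1))
}
counts <- count_equal(x, table)
```

The point of the sketch is only that "match" here means exact equality, and that %in% and a per-element count both follow from that same notion.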
On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages@fhcrc.org> wrote:

> Hi Michael,
>
> I don't think "match" (the word) always has to mean "equality" either.
> However having match() (the function) do "whole exact matching" (aka
> "equality") for any kind of vector-like object has the advantage of:
>
>   (a) making it consistent with base::match() (?base::match is pretty
>       explicit about what the contract of match() is)

(a) alone is obviously not enough. We have many methods, like the set operations, that treat ranges specially. Are we going to start moving everything toward the base behavior? And have rangeIntersect, rangeSetdiff, etc?

>   (b) preserving its relationship with ==, duplicated(), unique(),
>       etc...

So it becomes consistent with duplicated/unique, but we lose consistency with the set operations.

>   (c) not frustrating the user who needs something to do exact
>       matching on ranges (as I mentioned previously, if you take
>       match() away from him/her, s/he'll be left with nothing).

No one has ever asked for match() to behave this way. There was a request for a way to tabulate identical ranges. It was a nice idea to extract the general "outer equal" findMatches function. But the changes seem to be snow-balling. These types of changes mean a lot of maintenance work for the users. A deprecation cycle does not circumvent that.

> IMO those advantages counterbalance *by far* the very little
> convenience you get from having 'match(query, subject)' do
> 'findOverlaps(query, subject, select="first")' on
> IRanges/GRanges objects. If you need to do that, just use the
> latter, or, if you think that's still too much typing, define
> a wrapper e.g. 'ovmatch(query, subject)'.
>
> There are plenty of specialized tools around for doing
> inexact/fuzzy/partial/overlap matching for many particular types
> of vector-like objects: grep() and family, pmatch(), charmatch(),
> agrep(), grepRaw(), matchPattern() and family, findOverlaps() and
> family, findIntervals(), etc... For the reasons I mentioned
> above, none of them should hijack match() to make it do some
> particular type of inexact matching on some particular type of
> objects. Even if, for that particular type of objects, doing that
> particular type of inexact matching is more common than doing
> exact matching.
>
> H.
>
>
> On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>
>> I think having overlapsAny is a nice addition and helps make the API
>> more complete and explicit. Are you sure we need to change the behavior
>> of the match method for this relatively uncommon use case?
>
> Yes because otherwise users with a use case of doing match()
>
> even if it's uncommon,
>
>> I don't think
>> "match" always has to mean "equality". It is a more general concept in
>> my mind. The most common use case for matching ranges is overlap.
>
> Of course "match" doesn't always have to mean equality. But of base
>
>> Michael
>>
>>
>> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
>>
>>     Yes 'peaks %in% genes' is cute and was probably doing the right thing
>>     for most users (although not all). But 'exons %in% genes' is cute too
>>     and was probably doing the wrong thing for all users. Advanced users
>>     like you guys would have no problem switching to
>>
>>         !is.na(findOverlaps(peaks, genes, type="within", select="any"))
>>
>>     or
>>
>>         !is.na(findOverlaps(peaks, genes, type="equal", select="any"))
>>
>>     in case 'peaks %in% genes' was not doing exactly what you wanted,
>>     but most users would not find this particularly friendly. Even
>>     worse, some users probably didn't realize that 'peaks %in% genes'
>>     was not doing exactly what they thought it did because "peaks in
>>     genes" in English suggests that the peaks are within the genes,
>>     but it's not what 'peaks %in% genes' does.
>>
>>     Having overlapsAny(), with exactly the same extra arguments as
>>     countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap',
>>     'type', 'ignore.strand'), all of them documented (and with most
>>     users more or less familiar with them already) has the virtue to
>>     expose the user to all the options from the very start, and to
>>     help him/her make the right choice. Of course there will be users
>>     that don't want or don't have the time to read/think about all the
>>     options. Not a big deal: they'll just do 'overlapsAny(query, subject)',
>>     which is not a lot more typing than 'query %in% subject', especially
>>     if they use tab completion.
>>
>>     It's true that it's more common to ask questions about overlap than
>>     about equality but there are some use cases for the latter (as the
>>     original thread shows). Until now, when you had such a use case, you
>>     could not use match() or %in%, which would have been the natural things
>>     to use, because they got hijacked to do something else, and you were
>>     left with nothing. Not a satisfying situation. So at a minimum, we
>>     needed to restore the true/real/original semantic of match() to do
>>     "equality" instead of "overlap". But it's hard to do this for match()
>>     and not do it for %in% too. For more than 99% of R users, %in% is
>>     just a simple wrapper for 'match(x, table, nomatch = 0) > 0' (this
>>     is how it has been documented and implemented in base R for many
>>     years). Not maintaining this relationship between %in% and match()
>>     would only cause grief and frustration to newcomers to Bioconductor.
>>
>>     H.
written 4.9 years ago by Michael Lawrence 9.8k
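To make the two notions being debated concrete, here is a small sketch contrasting overlap and equality on GRanges, assuming a Bioconductor release in which %in% on ranges already has the equality semantics described in this thread. The ranges are made up for illustration, and ovmatch is just the user-level wrapper floated above, not a function exported by any package.

```r
library(GenomicRanges)

# Illustrative ranges (not from the thread)
query   <- GRanges("chr1", IRanges(start = c(1, 10, 50), width = 5))
subject <- GRanges("chr1", IRanges(start = c(3, 50), width = 5))

# Overlap semantics, requested explicitly
ov <- overlapsAny(query, subject)

# Wrapper suggested in the thread: index of the first overlapping
# subject range for each query range, NA when there is none
ovmatch <- function(query, subject) {
  findOverlaps(query, subject, select = "first")
}
first_ov <- ovmatch(query, subject)

# Equality semantics: TRUE only where a query range is identical
# (same seqname, start, end, strand) to some subject range
eq <- query %in% subject
```

Under these assumptions the first query range overlaps a subject range without being equal to any, so it is the case where the old and new meanings of %in% disagree.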
So why not leave %in% as it was and transition everything forward to explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%` } such that

  identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
  identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
  identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE

and for the time being,

  identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE ## but with a noisy nastygram that will halt if options("warn"=2)

No breakage for %in% methods until such time as a full deprecation cycle has passed, and if the maintainers can't be arsed to do anything at all about the warnings by the second full release, then perhaps they don't really care that much after all.

Just a thought? From someone (me) who has their own issues with keeping everything up to date and should know better.

If you want to use %in% for peaks %in% genes (why on earth would you do this rather than peaks %in% promoters(genes), anyways?) then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more or less) happy.

On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence <lawrence.michael@gene.com> wrote:

> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
>
>> Hi Michael,
>>
>> I don't think "match" (the word) always has to mean "equality" either.
>> However having match() (the function) do "whole exact matching" (aka
>> "equality") for any kind of vector-like object has the advantage of:
>>
>>   (a) making it consistent with base::match() (?base::match is pretty
>>       explicit about what the contract of match() is)
>
> (a) alone is obviously not enough. We have many methods, like the set
> operations, that treat ranges specially. Are we going to start moving
> everything toward the base behavior? And have rangeIntersect, rangeSetdiff,
> etc?
>>> .>> >>> >>> .>> countMatches() is what you can use to >>> tally/count/tabulate >>> (choose your >>> >>> .>> preferred term) the unique elements in a GRanges >>> object: >>> .>> >>> .>> library(GenomicRanges) >>> .>> set.seed(33) >>> .>> gr <- GRanges("chr1", >>> IRanges(sample(15,20,replace=***__*TRUE), >>> >>> width=5)) >>> .>> >>> .>> Then: >>> .>> >>> .>> > gr_levels <- sort(unique(gr)) >>> .>> > countMatches(gr_levels, gr) >>> .>> [1] 1 1 1 2 4 2 2 1 2 2 2 >>> .>> >>> .>> Note that findMatches() and countMatches() also work >>> on >>> IRanges and >>> .>> DNAStringSet objects, as well as on ordinary atomic >>> vectors: >>> .>> >>> .>> library(hgu95av2probe) >>> .>> library(Biostrings) >>> .>> probes <- DNAStringSet(hgu95av2probe) >>> .>> unique_probes <- unique(probes) >>> .>> count <- countMatches(unique_probes, probes) >>> .>> max(count) # 7 >>> .>> >>> .>> I made other changes in IRanges/GenomicRanges so that >>> the notion >>> .>> of "match" between elements of a vector-like object >>> now >>> consistently >>> .>> means "equality" instead of "overlap", even for >>> range-based >>> objects >>> .>> like IRanges or GRanges objects. This notion of >>> "equality" is the >>> .>> same that is used by ==. The most visible consequence >>> of those >>> .>> changes is that using %in% between 2 IRanges or >>> GRanges objects >>> .>> 'query' and 'subject' in order to do overlaps was >>> replaced by >>> .>> overlapsAny(query, subject). >>> .>> >>> .>> overlapsAny(query, subject): Finds the ranges in >>> ‘query’ that >>> .>> overlap any of the ranges in ‘subject’. >>> .>> >>> >>> .>> There are warnings and deprecation messages in place >>> to help >>> smooth >>> >>> .>> the transition. >>> .>> >>> .>> Cheers, >>> .>> H. >>> .>> >>> .>> -- >>> .>> Hervé Pagès >>> .>> >>> .>> Program in Computational Biology >>> .>> Division of Public Health Sciences >>> .>> Fred Hutchinson Cancer Research Center >>> .>> 1100 Fairview Ave. N, M1-B514 >>> .>> P.O. 
Box 19024 >>> .>> Seattle, WA 98109-1024 >>> .>> >>> .>> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >>> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >>> >>> .>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >>> <tel:%28206%29%20667-5791> >>> .>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >>> <tel:%28206%29%20667-1319> >>> >>> .>> >>> .> >>> .> [[alternative HTML version deleted]] >>> .> >>> .> >>> .> ______________________________**___________________ >>> >>> .> Bioconductor mailing list >>> .> Bioconductor@r-project.org >>> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >>> > >>> <mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org> >>> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >>> >> >>> >>> .> https://stat.ethz.ch/mailman/_**_listinfo/biocon ductor<https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >>> >>> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<http="" s:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> > >>> .> Search the archives: >>> http://news.gmane.org/gmane.__**science.biology.informatics.__** >>> conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor=""> >>> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">>> conductor<http: news.gmane.org="" gmane.science.biology.informatics.="" conductor=""> >>> > >>> . 
>>> ._____________________________**____________________ >>> >>> .Bioconductor mailing list >>> .Bioconductor@r-project.org >>> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >>> > >>> <mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org> >>> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >>> >> >>> >>> .https://stat.ethz.ch/mailman/**__listinfo/biocondu ctor<https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >>> >>> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<http="" s:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> > >>> .Search the archives: >>> http://news.gmane.org/gmane.__**science.biology.informatics.__** >>> conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor=""> >>> >>> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">>> conductor<http: news.gmane.org="" gmane.science.biology.informatics.="" conductor=""> >>> > >>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >>> >>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >>> >>> >>> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> > > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
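The relationship between %in% and match() that is quoted in this thread can be checked in base R alone. This is a minimal sketch on ordinary character vectors (the vectors 'x' and 'table' are made up for illustration); it shows only the base R contract, not the IRanges/GenomicRanges methods.

```r
# Base R only: %in% is documented as a thin wrapper around match().
x     <- c("a", "b", "z")
table <- c("b", "a", "a", "c")

# match() returns the position of the first exact match in 'table',
# or the value of 'nomatch' when there is none.
pos <- match(x, table, nomatch = 0)

# Positive positions are exactly the elements %in% reports as TRUE.
in_via_match <- pos > 0

stopifnot(identical(in_via_match, x %in% table))
print(pos)           # [1] 2 1 0
print(in_via_match)  # [1]  TRUE  TRUE FALSE
```

Because match() is about exact equality here, keeping %in% tied to it means both functions answer the same question for any vector-like object.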
written 4.9 years ago by Tim Triche
*expletives*! I meant

  identical( x %overlaps% table, x %in% table ) == TRUE  ## but with a noisy nastygram that will halt if options("warn"=2)

rather than

  identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE  ## which should not have a nastygram at all!

Many eyes something something.

On Mon, Jan 7, 2013 at 11:45 AM, Tim Triche, Jr. <tim.triche@gmail.com> wrote:

> So why not leave %in% as it was and transition everything forward to
> explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%` }
> such that
>
>   identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>   identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE
>
> and for the time being,
>
>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE  ## but with a noisy nastygram that will halt if options("warn"=2)
>
> No breakage for %in% methods until such time as a full deprecation cycle
> has passed, and if the maintainers can't be arsed to do anything at all
> about the warnings by the second full release, then perhaps they don't
> really care that much after all. Just a thought?
>
> From someone (me) who has their own issues with keeping everything up to
> date and should know better. If you want to use %in% for
>
>   peaks %in% genes (why on earth would you do this rather than peaks %in%
>   promoters(genes), anyways?)
>
> then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS
> DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more or
> less) happy.
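For readers who want to try the three identities from the proposal, here is a minimal sketch. It assumes GenomicRanges is installed; the operators are defined locally from the proposal's own right-hand sides, so the sketch does not depend on which operators a given IRanges release exports. The 'peaks' and 'genes' objects are made-up toy data.

```r
library(GenomicRanges)

# Local, illustrative definitions taken from the identities in the proposal.
# Defining them here masks any same-named operator a package may export.
`%within%`   <- function(x, table) countOverlaps(x, table, type = "within") > 0
`%overlaps%` <- function(x, table) countOverlaps(x, table, type = "any") > 0
`%equals%`   <- function(x, table) countOverlaps(x, table, type = "equal") > 0

# Toy data: peak 1 straddles the end of gene 1, peak 2 is identical to
# gene 2, and peak 3 touches nothing.
peaks <- GRanges("chr1", IRanges(start = c(5, 20, 100), end = c(15, 25, 110)))
genes <- GRanges("chr1", IRanges(start = c(1, 20),      end = c(12, 25)))

ov <- peaks %overlaps% genes  # TRUE  TRUE FALSE
wi <- peaks %within%   genes  # FALSE TRUE FALSE
eq <- peaks %equals%   genes  # FALSE TRUE FALSE

print(ov); print(wi); print(eq)
```

The first peak is the interesting case: it overlaps a gene without lying within one, which is exactly the distinction the single %in% spelling used to hide.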
written 4.9 years ago by Tim Triche
Hi Tim,

I could add the %ov% operator as a replacement for the old %in%. So you would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just be a convenience wrapper for 'overlapsAny(peaks, genes)'.

Cheers,
H.

On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:
> So why not leave %in% as it was and transition everything forward to
> explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%` }
> such that
>
>   identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>   identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE
>>>>>
>>>>> On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>>>>>
>>>>>> Hiya again,
>>>>>>
>>>>>> I am definitely a late comer to BioC, so I definitely easily defer to
>>>>>> the tide of history.
>>>>>>
>>>>>> But I do think you miss my point Michael about the proposed change
>>>>>> making the relationship between %in% and match for {G,I}Ranges{List}
>>>>>> mimic that between other vectors, and I do think that changing the API
>>>>>> would make other late-comers take to BioC easier/faster.
>>>>>>
>>>>>> That said, I NEVER use %in% so I really have no stake in the matter, and
>>>>>> I DEFINITELY appreciate the argument to not changing the API just for
>>>>>> semantic sweetness.
>>>>>>
>>>>>> That that said, Herve is _/so good/_ about deprecations and warnings
>>>>>> that make such changes fairly easily digestible.
>>>>>>
>>>>>> That that that.... enough.... I bow out of this one....!!!!
>>>>>>
>>>>>> Always learning and Happy New Year to all lurkers,
>>>>>>
>>>>>> ~Malcolm
>>>>>>
>>>>>> From: Michael Lawrence [mailto:lawrence.michael at gene.com]
>>>>>> Sent: Friday, January 04, 2013 5:11 PM
>>>>>> To: Cook, Malcolm
>>>>>> Cc: Sean Davis; Michael Lawrence; Hervé Pagès (hpages at fhcrc.org);
>>>>>>     Tim Triche, Jr.; Vedran Franke; bioconductor at r-project.org
>>>>>> Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>>>>>>
>>>>>>> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm wrote:
>>>>>>>
>>>>>>>> Hiya,
>>>>>>>>
>>>>>>>> For what it is worth...
>>>>>>>>
>>>>>>>> I think the change to %in% is warranted.
>>>>>>>>
>>>>>>>> If I understand correctly, this change restores the relationship
>>>>>>>> between the semantics of `%in%` and the semantics of `match`.
>>>>>>>>
>>>>>>>> From the docs:
>>>>>>>>
>>>>>>>>   '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>>>>>>>>
>>>>>>>> Herve's change restores this relationship.
>>>>>>>
>>>>>>> match and %in% were initially consistent (both considering any
>>>>>>> overlap); Herve has changed both of them together. The whole idea
>>>>>>> behind IRanges is that ranges are special data types with special
>>>>>>> semantics. We have reimplemented much of the existing R vector API
>>>>>>> using those semantics; this extends beyond match/%in%. I am hesitant
>>>>>>> about making such sweeping changes to the API so late in the
>>>>>>> life-cycle of the package. There was a feature request for a way to
>>>>>>> count identical ranges in a set of ranges. Let's please not get
>>>>>>> carried away and start redesigning the API for this one, albeit
>>>>>>> useful, request. There are all sorts of inconsistencies in the API,
>>>>>>> and many of them were conscious decisions that considered practical
>>>>>>> use cases.
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>> Herve, I suspect you were, as a result, able to completely drop
>>>>>>>> all the `%in%,BiocClass1,BiocClass2` definitions and depend upon
>>>>>>>> base::%in%
>>>>>>>>
>>>>>>>> Am I right?
>>>>>>>>
>>>>>>>> If so, may I suggest that Herve stay the course, with the
>>>>>>>> addition of
>>>>>>>>
>>>>>>>>   '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L,
>>>>>>>>    minoverlap=1L, type='any', select='all') > 0'
>>>>>>>>
>>>>>>>> This would provide a perspicacious idiom, thereby optimizing the API
>>>>>>>> for Michael's observed common use case.
>>>>>>>>
>>>>>>>> Just sayin'
>>>>>>>>
>>>>>>>> ~Malcolm
>>>>>>>>
>>>>>>>> .-----Original Message-----
>>>>>>>> .From: bioconductor-bounces at r-project.org
>>>>>>>> .  [mailto:bioconductor-bounces at r-project.org] On Behalf Of Sean Davis
>>>>>>>> .Sent: Friday, January 04, 2013 3:37 PM
>>>>>>>> .To: Michael Lawrence
>>>>>>>> .Cc: Tim Triche, Jr.; Vedran Franke; bioconductor at r-project.org
>>>>>>>> .Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>>>>>>>> .
>>>>>>>> .On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence wrote:
>>>>>>>> .> The change to the behavior of %in% is a pretty big one. Are you
>>>>>>>> .> thinking that all set-based operations should behave this way? For
>>>>>>>> .> example, setdiff and intersect? I really liked the syntax of
>>>>>>>> .> "peaks %in% genes". In my experience, it's way more common to ask
>>>>>>>> .> questions about overlap than about equality, so I'd rather optimize
>>>>>>>> .> the API for that use case. But again, that's just my personal bias.
>>>>>>>> .
>>>>>>>> .For what it is worth, I share Michael's personal bias here.
>>>>>>>> .
>>>>>>>> .Sean
>>>>>>>> .
>>>>>>>> .> Michael
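To make the proposal in this reply concrete, here is a minimal sketch of what an `%ov%` convenience operator could look like. It is an illustration only: the thread does not confirm that `%ov%` was ever added under that name, so it is defined locally here, and the `peaks` and `genes` objects are made-up toy data. Only `overlapsAny()`, `GRanges()` and `IRanges()` are existing package functions.

```r
library(GenomicRanges)

# Hypothetical operator from the proposal above: a thin wrapper around
# overlapsAny(). Defined locally; not assumed to exist in any package.
`%ov%` <- function(query, subject) overlapsAny(query, subject)

# Toy data (assumption, for illustration only).
peaks <- GRanges("chr1", IRanges(start = c(1, 50, 200), width = 10))
genes <- GRanges("chr1", IRanges(start = c(5, 190), width = 20))

# Overlap semantics: does each peak overlap at least one gene?
peaks %ov% genes
# [1]  TRUE FALSE  TRUE

# Equality semantics: in releases that include the change discussed in
# this thread, %in% asks whether each peak is identical to some gene.
peaks %in% genes
```

Keeping the overlap test behind its own operator leaves `%in%` free to mirror `match()`, which is the consistency argument made in the quoted messages.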
written 4.9 years ago by Hervé Pagès ♦♦ 13k
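The explicit operators discussed in this thread can be sketched directly from the `countOverlaps()` identities quoted above. This is a hedged illustration, not code from any package: the operators are defined locally using the names proposed in the thread, and `x` and `subject` are made-up toy ranges. Defining them in the workspace will mask any same-named operator that an attached package might provide.

```r
library(GenomicRanges)

# Hypothetical operators, defined locally from the identities quoted in
# the thread; each one makes the overlap type explicit in its name.
`%within%`   <- function(x, table) countOverlaps(x, table, type = "within") > 0
`%overlaps%` <- function(x, table) countOverlaps(x, table, type = "any") > 0
`%equals%`   <- function(x, table) countOverlaps(x, table, type = "equal") > 0

# Toy data (assumption, for illustration only).
x       <- GRanges("chr1", IRanges(start = c(10, 5, 100), end = c(19, 24, 109)))
subject <- GRanges("chr1", IRanges(start = 5, end = 24))

x %within% subject    # range lies entirely inside a subject range
# [1]  TRUE  TRUE FALSE
x %overlaps% subject  # range shares at least one position with a subject range
# [1]  TRUE  TRUE FALSE
x %equals% subject    # range is identical to a subject range
# [1] FALSE  TRUE FALSE
```

The first range shows why the distinction matters: it is within and overlapping, but not equal, so a single `%in%` cannot express all three questions.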
hell, I'll add the operators if there's support for them. obviously
they're not a big deal and a patch would take 5 minutes flat.

my hope was to be very explicit about what each type of operation meant,
so that when a newcomer to the Ranges API sees

  peaks %overlapping% promoters(someGroupOfGenesWeCareAbout)

it cannot be confused with

  peaks %within% rangesThatCorrespondToSomeChromatinState

or

  peaks %equal% aBunchOfDNAseFootprints

or

  DMRs %in% genes ## what the hell does this really mean, anyways? it's so bad on so many levels

because whenever someone says "what is the advantage of Ranges-based
analyses?", these are the archetypal sorts of queries that come to mind.
Except that usually in my examples they are based on posterior
probabilities, but perhaps that could stand to change.

Anyways, that's just my bias, and you're doing the heavy lifting. But if
people agree with the motivations I will write the patch today.

Cheers,

--t
gmane.____science.biology.informatics.____conductor=""> >> <http: news.gmane.org="" gmane._**="">> _science.biology.informatics._**_conductor<http: news.gmane.org="" gm="" ane.__science.biology.informatics.__conductor=""> >> > >> >> >> <http: news.gmane.org="" gmane._**="">> _science.biology.informatics._**_conductor<http: news.gmane.org="" gm="" ane.__science.biology.informatics.__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >> >> . >> >> ._____________________________**______________________ >> >> >> .Bioconductor mailing list >> .Bioconductor@r-project.org >> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> <mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org> >> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> >> >> <mailto:bioconductor@r-____**project.org<bioco nductor@r-____project.org=""> >> <mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org> >> > >> <mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org> >> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> >>> >> >> >> .https://stat.ethz.ch/mailman/**____listinfo/biocondu ctor<https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> > >> >> >> >> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<h="" ttps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> >> >> .Search the archives: >> http://news.gmane.org/gmane.__** >> __science.biology.informatics.**____conductor<http: news.gmane.org="" gmane.____science.biology.informatics.____conductor=""> >> <http: 
news.gmane.org="" gmane._**="">> _science.biology.informatics._**_conductor<http: news.gmane.org="" gm="" ane.__science.biology.informatics.__conductor=""> >> > >> >> >> >> <http: news.gmane.org="" gmane._**="">> _science.biology.informatics._**_conductor<http: news.gmane.org="" gm="" ane.__science.biology.informatics.__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> <tel:%28206%29%20667-1319> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> >> >> >> >> >> -- >> /A model is a lie that helps you see the truth./ >> / >> / >> Howard Skipper >> <http: cancerres.**aacrjournals.org="" content="" 31="" 9="" **1173.full.pdf<h="" ttp:="" cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> > >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. 
Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
written 4.9 years ago by Tim Triche 4.2k
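The counting behaviour described in the quoted announcement (for each element of 'x', the number of equal elements in 'table') can be imitated on ordinary atomic vectors with base R alone. What follows is only a minimal sketch under that reading, not the IRanges implementation: count_matches_sketch is a made-up name, and "match" is assumed to mean equality, as the announcement states.

```r
# Base-R stand-in for countMatches(x, table) on atomic vectors.
# count_matches_sketch() is a hypothetical helper for illustration only.
count_matches_sketch <- function(x, table) {
  ux  <- unique(x)                       # count once per distinct value of x
  idx <- match(table, ux)                # position in ux of each table element
  idx <- idx[!is.na(idx)]                # drop table elements absent from x
  counts <- tabulate(idx, nbins = length(ux))
  counts[match(x, ux)]                   # expand back to one count per element of x
}

set.seed(33)
v <- sample(15, 20, replace = TRUE)
v_levels <- sort(unique(v))
v_counts <- count_matches_sketch(v_levels, v)

# Duplicated query elements each receive the same count.
dup_counts <- count_matches_sketch(c("a", "a", "b", "z"), c("a", "b", "a", "a"))
```

On the sorted unique values this gives the same tally as table(v), which is the use case the thread started from; the exact numbers depend on the R version's random number generator, so none are shown here.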
Thanks Tim, Malcolm for the feedback.

@Tim, I won't comment on the variants of %ov% you are proposing for
doing "within" or "equal" instead of "any" (but if people want them,
I'll add them too). For now I just want to focus on restoring the
convenience of the old %in%, whose removal is understandably causing
some frustration. And so we can move on.

Cheers,
H.

On 01/08/2013 09:50 AM, Tim Triche, Jr. wrote:
> hell, I'll add the operators if there's support for them. obviously
> they're not a big deal and a patch would take 5 minutes flat.
>
> my hope was to be very explicit about what each type of operation meant,
> so that when a newcomer to the Ranges API sees
>
>   peaks %overlapping% promoters(someGroupOfGenesWeCareAbout)
>
> it cannot be confused with
>
>   peaks %within% rangesThatCorrespondToSomeChromatinState
>
> or
>
>   peaks %equal% aBunchOfDNAseFootprints
>
> or
>
>   DMRs %in% genes ## what the hell does this really mean, anyways?
>                   ## it's so bad on so many levels
>
> because whenever someone says "what is the advantage of Ranges-based
> analyses?", these are the archetypal sorts of queries that come to mind.
> Except that usually in my examples they are based on posterior
> probabilities, but perhaps that could stand to change.
>
> Anyways, that's just my bias, and you're doing the heavy lifting. But
> if people agree with the motivations I will write the patch today.
>
> Cheers,
>
> --t
>
>
> On Tue, Jan 8, 2013 at 9:20 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Hi Tim,
>
> I could add the %ov% operator as a replacement for the old %in%. So you
> would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just
> be a convenience wrapper for 'overlapsAny(peaks, genes)'.
>
> Cheers,
> H.
>
>
> On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:
>
> So why not leave %in% as it was and transition everything forward to
> explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%` }
> such that
>
>   identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>   identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE
>
> and for the time being,
>
>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>   ## but with a noisy nastygram that will halt if options("warn"=2)
>
> No breakage for %in% methods until such time as a full deprecation cycle
> has passed, and if the maintainers can't be arsed to do anything at all
> about the warnings by the second full release, then perhaps they don't
> really care that much after all. Just a thought?
>
> From someone (me) who has their own issues with keeping everything up
> to date and should know better. If you want to use %in% for
>
>   peaks %in% genes (why on earth would you do this rather than
>   peaks %in% promoters(genes), anyways?)
>
> then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS
> DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more
> or less) happy.
>
>
> On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>
>
> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Hi Michael,
>
> I don't think "match" (the word) always has to mean "equality" either.
> However having match() (the function) do "whole exact matching" (aka
> "equality") for any kind of vector-like object has the advantage of:
>
>   (a) making it consistent with base::match() (?base::match is pretty
>       explicit about what the contract of match() is)
>
> (a) alone is obviously not enough. We have many methods, like the
> set operations, that treat ranges specially. Are we going to start
> moving everything toward the base behavior? And have rangeIntersect,
> rangeSetdiff, etc?
>
>   (b) preserving its relationship with ==, duplicated(), unique(),
>       etc...
>
> So it becomes consistent with duplicated/unique, but we lose
> consistency with the set operations.
>
>   (c) not frustrating the user who needs something to do exact
>       matching on ranges (as I mentioned previously, if you take
>       match() away from him/her, s/he'll be left with nothing).
>
> No one has ever asked for match() to behave this way. There was a
> request for a way to tabulate identical ranges. It was a nice idea
> to extract the general "outer equal" findMatches function. But the
> changes seem to be snow-balling. These types of changes mean a lot
> of maintenance work for the users. A deprecation cycle does not
> circumvent that.
>
> IMO those advantages counterbalance *by far* the very little
> convenience you get from having 'match(query, subject)' do
> 'findOverlaps(query, subject, select="first")' on
> IRanges/GRanges objects. If you need to do that, just use the
> latter, or, if you think that's still too much typing, define
> a wrapper e.g. 'ovmatch(query, subject)'.
>
> There are plenty of specialized tools around for doing
> inexact/fuzzy/partial/overlap matching for many particular types
> of vector-like objects: grep() and family, pmatch(), charmatch(),
> agrep(), grepRaw(), matchPattern() and family, findOverlaps() and
> family, findInterval(), etc... For the reasons I mentioned
> above, none of them should hijack match() to make it do some
> particular type of inexact matching on some particular type of
> objects. Even if, for that particular type of objects, doing that
> particular type of inexact matching is more common than doing
> exact matching.
>
> H.
>
>
> On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>
> I think having overlapsAny is a nice addition and helps make the API
> more complete and explicit. Are you sure we need to change the behavior
> of the match method for this relatively uncommon use case?
>
> Yes because otherwise users with a use case of doing match()
>
> even if it's uncommon,
>
> I don't think "match" always has to mean "equality". It is a more
> general concept in my mind. The most common use case for matching
> ranges is overlap.
>
> Of course "match" doesn't always have to mean equality. But of base
>
> Michael
>
>
> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Yes 'peaks %in% genes' is cute and was probably doing the right thing
> for most users (although not all). But 'exons %in% genes' is cute too
> and was probably doing the wrong thing for all users. Advanced users
> like you guys would have no problem switching to
>
>   !is.na(findOverlaps(peaks, genes, type="within", select="any"))
>
> or
>
>   !is.na(findOverlaps(peaks, genes, type="equal", select="any"))
>
> in case 'peaks %in% genes' was not doing exactly what you wanted,
> but most users would not find this particularly friendly. Even
> worse, some users probably didn't realize that 'peaks %in% genes'
> was not doing exactly what they thought it did because "peaks in
> genes" in English suggests that the peaks are within the genes,
> but it's not what 'peaks %in% genes' does.
>
> Having overlapsAny(), with exactly the same extra arguments as
> countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap',
> 'type', 'ignore.strand'), all of them documented (and with most
> users more or less familiar with them already) has the virtue of
> exposing the user to all the options from the very start, and of
> helping him/her make the right choice. Of course there will be users
> that don't want or don't have the time to read/think about all the
> options. Not a big deal: they'll just do 'overlapsAny(query, subject)',
> which is not a lot more typing than 'query %in% subject', especially
> if they use tab completion.
>
> It's true that it's more common to ask questions about overlap than
> about equality but there are some use cases for the latter (as the
> original thread shows). Until now, when you had such a use case, you
> could not use match() or %in%, which would have been the natural things
> to use, because they got hijacked to do something else, and you were
> left with nothing. Not a satisfying situation. So at a minimum, we
> needed to restore the true/real/original semantics of match() to do
> "equality" instead of "overlap". But it's hard to do this for match()
> and not do it for %in% too. For more than 99% of R users, %in% is
> just a simple wrapper for 'match(x, table, nomatch = 0) > 0' (this
> is how it has been documented and implemented in base R for many
> years). Not maintaining this relationship between %in% and match()
> would only cause grief and frustration to newcomers to Bioconductor.
>
> H.
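Two points from the message quoted just above lend themselves to a small base-R illustration: that %in% is documented as match(x, table, nomatch = 0) > 0, and that "peaks in genes" can be read as any overlap, as containment, or as equality. The sketch below uses plain integer start/end vectors rather than IRanges or GRanges objects. The helper any_hit and the toy coordinates are invented for this example, and the helper ignores everything the real functions handle beyond the 'type' idea (no maxgap, minoverlap, strand, or sequence names).

```r
# Part 1: base R documents %in% as match(x, table, nomatch = 0) > 0.
x   <- c("b", "z", "a")
tbl <- c("a", "b", "c")
in_via_match <- match(x, tbl, nomatch = 0) > 0
in_direct    <- x %in% tbl

# Part 2: three readings of "peaks in genes" on closed intervals [start, end].
# any_hit() is a hypothetical helper, not part of IRanges/GenomicRanges.
any_hit <- function(qs, qe, ss, se, type = c("any", "within", "equal")) {
  type <- match.arg(type)
  vapply(seq_along(qs), function(i) {
    hit <- switch(type,
      any    = qs[i] <= se & qe[i] >= ss,   # query overlaps a subject
      within = qs[i] >= ss & qe[i] <= se,   # query lies inside a subject
      equal  = qs[i] == ss & qe[i] == se)   # identical start and end
    any(hit)
  }, logical(1))
}

# Toy data (made up): four peaks, two genes.
peak_start <- c(5L, 18L, 40L, 60L); peak_end <- c(9L, 25L, 45L, 70L)
gene_start <- c(1L, 40L);           gene_end <- c(20L, 45L)

ov_any    <- any_hit(peak_start, peak_end, gene_start, gene_end, "any")
ov_within <- any_hit(peak_start, peak_end, gene_start, gene_end, "within")
ov_equal  <- any_hit(peak_start, peak_end, gene_start, gene_end, "equal")
```

With these toy coordinates the second peak straddles the end of the first gene, so it counts under "any" but not under "within", and only the third peak, which coincides exactly with the second gene, counts under "equal". That gap between the three answers is the ambiguity the thread is arguing about.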
>
>
> On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>
> Hiya again,
>
> I am definitely a late comer to BioC, so I definitely easily defer to
> the tide of history.
>
> But I do think you miss my point Michael about the proposed change
> making the relationship between %in% and match for {G,I}Ranges{List}
> mimic that between other vectors, and I do think that changing the API
> would make other late-comers take to BioC easier/faster.
>
> That said, I NEVER use %in% so I really have no stake in the matter, and
> I DEFINITELY appreciate the argument to not changing the API just for
> semantic sweetness.
>
> That that said, Herve is _/so good/_ about deprecations and warnings
> that make such changes fairly easily digestible.
>
> That that that.... enough.... I bow out of this one....!!!!
>
> Always learning and Happy New Year to all lurkers,
>
> ~Malcolm
>
> *From:* Michael Lawrence [mailto:lawrence.michael at gene.com]
> *Sent:* Friday, January 04, 2013 5:11 PM
> *To:* Cook, Malcolm
> *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès (hpages at fhcrc.org);
> Tim Triche, Jr.; Vedran Franke; bioconductor at r-project.org
> *Subject:* Re: [BioC] countMatches() (was: table for GenomicRanges)
>
>
> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec at stowers.org> wrote:
>
> Hiya,
>
> For what it is worth...
>
> I think the change to %in% is warranted.
>
> If I understand correctly, this change restores the relationship between
> the semantics of `%in%` and the semantics of `match`.
>
> From the docs:
>
>   '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>
> Herve's change restores this relationship.
>
>
> match and %in% were initially consistent (both considering any overlap);
> Herve has changed both of them together. The whole idea behind IRanges
> is that ranges are special data types with special semantics. We have
> reimplemented much of the existing R vector API using those semantics;
> this extends beyond match/%in%. I am hesitant about making such sweeping
> changes to the API so late in the life-cycle of the package. There was a
> feature request for a way to count identical ranges in a set of ranges.
> Let's please not get carried away and start redesigning the API for this
> one, albeit useful, request. There are all sorts of inconsistencies in
> the API, and many of them were conscious decisions that considered
> practical use cases.
>
> Michael
>
>
> Herve, I suspect you were as a result able to completely drop
> all the `%in%,BiocClass1,BiocClass2` definitions and depend upon
> base::%in%
>
> Am I right?
>
> If so, may I suggest that Herve stay the course, with the addition of
>
>   '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L,
>   minoverlap=1L, type='any', select='all') > 0'
>
> This would provide a perspicacious idiom, thereby optimizing the API
> for Michael's observed common use case.
>
> Just sayin'
>
> ~Malcolm
>
>
> .-----Original Message-----
> .From: bioconductor-bounces at r-project.org
> .[mailto:bioconductor-bounces at r-project.org] On Behalf Of Sean Davis
> .Sent: Friday, January 04, 2013 3:37 PM
> .To: Michael Lawrence
> .Cc: Tim Triche, Jr.; Vedran Franke; bioconductor at r-project.org
> .Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
> .
> .On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence
> .<lawrence.michael at gene.com> wrote:
> .> The change to the behavior of %in% is a pretty big one. Are you thinking
> .> that all set-based operations should behave this way? For example, setdiff
> .> and intersect? I really liked the syntax of "peaks %in% genes". In my
> .> experience, it's way more common to ask questions about overlap than about
> .> equality, so I'd rather optimize the API for that use case. But again,
> .> that's just my personal bias.
> .
> .For what it is worth, I share Michael's personal bias here.
> .
> .Sean
> .
> .
> .> Michael
> .>
> .>
> .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> .>
> .>> Hi,
> .>>
> .>> I added findMatches() and countMatches() to the latest IRanges /
> .>> GenomicRanges packages (in BioC devel only).
> .>>
> .>>   findMatches(x, table): An enhanced version of ‘match’ that
> .>>     returns all the matches in a Hits object.
> .>>
> .>>   countMatches(x, table): Returns an integer vector of the length
> .>>     of ‘x’, containing the number of matches in ‘table’ for
> .>>     each element in ‘x’.
> .>>
> .>> countMatches() is what you can use to tally/count/tabulate (choose your
> .>> preferred term) the unique elements in a GRanges object:
> .>>
> .>>   library(GenomicRanges)
> .>>   set.seed(33)
> .>>   gr <- GRanges("chr1", IRanges(sample(15,20,replace=TRUE), width=5))
> .>>
> .>> Then:
> .>>
> .>>   > gr_levels <- sort(unique(gr))
> .>>   > countMatches(gr_levels, gr)
> .>>    [1] 1 1 1 2 4 2 2 1 2 2 2
> .>>
> .>> Note that findMatches() and countMatches() also work on IRanges and
> .>> DNAStringSet objects, as well as on ordinary atomic vectors:
> .>>
> .>>   library(hgu95av2probe)
> .>>   library(Biostrings)
> .>>   probes <- DNAStringSet(hgu95av2probe)
> .>>   unique_probes <- unique(probes)
> .>>   count <- countMatches(unique_probes, probes)
> .>>   max(count)  # 7
> .>>
> .>> I made other changes in IRanges/GenomicRanges so that the notion
> .>> of "match" between elements of a vector-like object now consistently
> .>> means "equality" instead of "overlap", even for range-based objects
> .>> like IRanges or GRanges objects. This notion of "equality" is the
> .>> same that is used by ==. The most visible consequence of those
> .>> changes is that using %in% between 2 IRanges or GRanges objects
> .>> 'query' and 'subject' in order to do overlaps was replaced by
> .>> overlapsAny(query, subject).
> .>>
> .>>   overlapsAny(query, subject): Finds the ranges in ‘query’ that
> .>>     overlap any of the ranges in ‘subject’.
> .>>
> .>> There are warnings and deprecation messages in place to help smooth
> .>> the transition.
> .>>
> .>> Cheers,
> .>> H.
> .>>
> .>> --
> .>> Hervé Pagès
> .>>
> .>> Program in Computational Biology
> .>> Division of Public Health Sciences
> .>> Fred Hutchinson Cancer Research Center
> .>> 1100 Fairview Ave. N, M1-B514
> .>> P.O. Box 19024
> .>> Seattle, WA 98109-1024
> .>>
> .>> E-mail: hpages at fhcrc.org
> .>> Phone:  (206) 667-5791
> .>> Fax:    (206) 667-1319
> .>>
> .>
> .>         [[alternative HTML version deleted]]
> .>
> .> _______________________________________________
> .> Bioconductor mailing list
> .> Bioconductor at r-project.org
> .> https://stat.ethz.ch/mailman/listinfo/bioconductor
> .> Search the archives:
> .> http://news.gmane.org/gmane.science.biology.informatics.conductor
> .
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O.
Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > > > -- > /A model is a lie that helps you see the truth./ > / > / > Howard Skipper > <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
written 4.9 years ago by Hervé Pagès
I would vote for %over% instead of %ov%. Just 2 more characters but way clearer, at least to me. The hardest things to type are the %'s.

Michael

On Tue, Jan 8, 2013 at 11:09 AM, Hervé Pagès <hpages@fhcrc.org> wrote:

> Thanks Tim, Malcolm for the feedback.
>
> @Tim, I won't comment on the variants of %ov% you are proposing for doing "within" or "equal" instead of "any" (but if people want them, I'll add them too). For now I just want to focus on restoring the convenience of the old %in%, whose removal is understandably causing some frustration. And so we can move on.
>
> Cheers,
> H.
>
> On 01/08/2013 09:50 AM, Tim Triche, Jr. wrote:
>
>> hell, I'll add the operators if there's support for them. obviously they're not a big deal and a patch would take 5 minutes flat.
>>
>> my hope was to be very explicit about what each type of operation meant, so that when a newcomer to the Ranges API sees
>>
>>     peaks %overlapping% promoters(someGroupOfGenesWeCareAbout)
>>
>> it cannot be confused with
>>
>>     peaks %within% rangesThatCorrespondToSomeChromatinState
>>
>> or
>>
>>     peaks %equal% aBunchOfDNAseFootprints
>>
>> or
>>
>>     DMRs %in% genes ## what the hell does this really mean, anyways? it's so bad on so many levels
>>
>> because whenever someone says "what is the advantage of Ranges-based analyses?", these are the archetypal sorts of queries that come to mind. Except that usually in my examples they are based on posterior probabilities, but perhaps that could stand to change.
>>
>> Anyways, that's just my bias, and you're doing the heavy lifting. But if people agree with the motivations I will write the patch today.
>>
>> Cheers,
>>
>> --t
>>
>> On Tue, Jan 8, 2013 at 9:20 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
>>
>> Hi Tim,
>>
>> I could add the %ov% operator as a replacement for the old %in%. So you would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just be a convenience wrapper for 'overlapsAny(peaks, genes)'.
>>
>> Cheers,
>> H.
>>
>> On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:
>>
>> So why not leave %in% as it was and transition everything forward to explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%` } such that
>>
>>     identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
>>     identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>>     identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE
>>
>> and for the time being,
>>
>>     identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE ## but with a noisy nastygram that will halt if options("warn"=2)
>>
>> No breakage for %in% methods until such time as a full deprecation cycle has passed, and if the maintainers can't be arsed to do anything at all about the warnings by the second full release, then perhaps they don't really care that much after all. Just a thought?
>>
>> From someone (me) who has their own issues with keeping everything up to date and should know better. If you want to use %in% for
>>
>>     peaks %in% genes (why on earth would you do this rather than peaks %in% promoters(genes), anyways?)
>>
>> then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more or less) happy.
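For readers who want to try the operators discussed above, here is a minimal sketch. It assumes each operator is nothing more than a thin wrapper around overlapsAny() with the corresponding 'type' argument. The operator names are the ones floated in the thread, and the 'peaks' and 'genes' objects are made-up illustration data; this is not necessarily how any package implements them.

```r
library(GenomicRanges)

# Hypothetical convenience operators, named as proposed in the thread.
# Each one simply forwards to overlapsAny() with a fixed 'type'.
# Note: defining `%within%` here masks any operator of the same name
# that an attached package may already provide.
`%overlapping%` <- function(query, subject) overlapsAny(query, subject, type = "any")
`%within%`      <- function(query, subject) overlapsAny(query, subject, type = "within")
`%equal%`       <- function(query, subject) overlapsAny(query, subject, type = "equal")

# Made-up example ranges, all on chr1 and unstranded.
peaks <- GRanges("chr1", IRanges(start = c(1, 20, 50), end = c(10, 30, 60)))
genes <- GRanges("chr1", IRanges(start = c(5, 20), end = c(40, 30)))

peaks %overlapping% genes  # TRUE  TRUE FALSE: any overlap at all
peaks %within% genes       # FALSE TRUE FALSE: peak lies entirely inside a gene
peaks %equal% genes        # FALSE TRUE FALSE: identical start and end
```

The first peak shows why the distinction matters: it overlaps a gene without being contained in one, so the "overlapping" and "within" readings of 'peaks %in% genes' give different answers.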
>> On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence <lawrence.michael@gene.com> wrote:
>>
>> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
>>
>> Hi Michael,
>>
>> I don't think "match" (the word) always has to mean "equality" either. However having match() (the function) do "whole exact matching" (aka "equality") for any kind of vector-like object has the advantage of:
>>
>> (a) making it consistent with base::match() (?base::match is pretty explicit about what the contract of match() is)
>>
>> [Michael:] (a) alone is obviously not enough. We have many methods, like the set operations, that treat ranges specially. Are we going to start moving everything toward the base behavior? And have rangeIntersect, rangeSetdiff, etc?
>>
>> (b) preserving its relationship with ==, duplicated(), unique(), etc...
>>
>> [Michael:] So it becomes consistent with duplicated/unique, but we lose consistency with the set operations.
>>
>> (c) not frustrating the user who needs something to do exact matching on ranges (as I mentioned previously, if you take match() away from him/her, s/he'll be left with nothing).
>>
>> [Michael:] No one has ever asked for match() to behave this way. There was a request for a way to tabulate identical ranges. It was a nice idea to extract the general "outer equal" findMatches function. But the changes seem to be snow-balling. These types of changes mean a lot of maintenance work for the users. A deprecation cycle does not circumvent that.
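As an aside on points (a) and (b): the base R contract being referred to can be checked on ordinary vectors. The sketch below uses base R only; the vectors 'x' and 'tbl' are made-up values, and the last step is just a rough analogue of what countMatches() is described as doing, not its implementation.

```r
# Made-up illustration vectors.
x   <- c("a", "b", "z")
tbl <- c("b", "a", "b", "c")

# match() gives the position of the first exact match, NA when there is none.
m <- match(x, tbl)
m  # 2 1 NA

# %in% is documented as match(x, table, nomatch = 0) > 0.
identical(x %in% tbl, match(x, tbl, nomatch = 0) > 0)  # TRUE

# match() agrees with ==: every non-NA hit points at an element equal to x[i].
hit <- !is.na(m)
all(tbl[m[hit]] == x[hit])  # TRUE

# Counting exact matches per element of x, using the same notion of equality.
counts <- vapply(x, function(el) sum(tbl == el), integer(1), USE.NAMES = FALSE)
counts  # 1 2 0
```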
>> IMO those advantages counterbalance *by far* the very little convenience you get from having 'match(query, subject)' do 'findOverlaps(query, subject, select="first")' on IRanges/GRanges objects. If you need to do that, just use the latter, or, if you think that's still too much typing, define a wrapper e.g. 'ovmatch(query, subject)'.
>>
>> There are plenty of specialized tools around for doing inexact/fuzzy/partial/overlap matching for many particular types of vector-like objects: grep() and family, pmatch(), charmatch(), agrep(), grepRaw(), matchPattern() and family, findOverlaps() and family, findIntervals(), etc... For the reasons I mentioned above, none of them should hijack match() to make it do some particular type of inexact matching on some particular type of objects. Even if, for that particular type of objects, doing that particular type of inexact matching is more common than doing exact matching.
>>
>> H.
>>
>> On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>>
>> I think having overlapsAny is a nice addition and helps make the API more complete and explicit. Are you sure we need to change the behavior of the match method for this relatively uncommon use case?
>>
>> Yes because otherwise users with a use case of doing match()
>>
>> even if it's uncommon,
>>
>> I don't think "match" always has to mean "equality". It is a more general concept in my mind. The most common use case for matching ranges is overlap.
>>
>> Of course "match" doesn't always have to mean equality. But of base
>>
>> Michael
>>
>> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
>>
>> Yes 'peaks %in% genes' is cute and was probably doing the right thing for most users (although not all). But 'exons %in% genes' is cute too and was probably doing the wrong thing for all users. Advanced users like you guys would have no problem switching to
>>
>>     !is.na(findOverlaps(peaks, genes, type="within", select="any"))
>>
>> or
>>
>>     !is.na(findOverlaps(peaks, genes, type="equal", select="any"))
>>
>> in case 'peaks %in% genes' was not doing exactly what you wanted, but most users would not find this particularly friendly. Even worse, some users probably didn't realize that 'peaks %in% genes' was not doing exactly what they thought it did because "peaks in genes" in English suggests that the peaks are within the genes, but it's not what 'peaks %in% genes' does.
>>
>> Having overlapsAny(), with exactly the same extra arguments as countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap', 'type', 'ignore.strand'), all of them documented (and with most users more or less familiar with them already) has the virtue to expose the user to all the options from the very start, and to help him/her make the right choice. Of course there will be users that don't want or don't have the time to read/think about all the options. Not a big deal: they'll just do 'overlapsAny(query, subject)', which is not a lot more typing than 'query %in% subject', especially if they use tab completion.
>>
>> It's true that it's more common to ask questions about overlap than about equality but there are some use cases for the latter (as the original thread shows). Until now, when you had such a use case, you could not use match() or %in%, which would have been the natural things to use, because they got hijacked to do something else, and you were left with nothing. Not a satisfying situation. So at a minimum, we needed to restore the true/real/original semantic of match() to do "equality" instead of "overlap". But it's hard to do this for match() and not do it for %in% too. For more than 99% of R users, %in% is just a simple wrapper for 'match(x, table, nomatch = 0) > 0' (this is how it has been documented and implemented in base R for many years). Not maintaining this relationship between %in% and match() would only cause grief and frustration to newcomers to Bioconductor.
>>
>> H.
>>
>> On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>>
>> Hiya again,
>>
>> I am definitely a late comer to BioC, so I definitely easily defer to the tide of history.
>>
>> But I do think you miss my point Michael about the proposed change making the relationship between %in% and match for {G,I}Ranges{List} mimic that between other vectors, and I do think that changing the API would make other late-comers take to BioC easier/faster.
>>
>> That said, I NEVER use %in% so I really have no stake in the matter, and I DEFINITELY appreciate the argument to not changing the API just for semantic sweetness.
>>
>> That that said, Herve is _/so good/_ about deprecations and warnings that make such changes fairly easily digestible.
>> That that that.... enough.... I bow out of this one....!!!!
>>
>> Always learning and Happy New Year to all lurkers,
>>
>> ~Malcolm
>>
>> *From:* Michael Lawrence [mailto:lawrence.michael@gene.com]
>> *Sent:* Friday, January 04, 2013 5:11 PM
>> *To:* Cook, Malcolm
>> *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès (hpages@fhcrc.org); Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
>> *Subject:* Re: [BioC] countMatches() (was: table for GenomicRanges)
>>
>> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec@stowers.org> wrote:
>>
>> Hiya,
>>
>> For what it is worth...
>>
>> I think the change to %in% is warranted.
>>
>> If I understand correctly, this change restores the relationship between the semantics of `%in%` and the semantics of `match`.
>>
>> From the docs:
>>
>> '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>>
>> Herve's change restores this relationship.
>>
>> [Michael:] match and %in% were initially consistent (both considering any overlap); Herve has changed both of them together. The whole idea behind IRanges is that ranges are special data types with special semantics. We have reimplemented much of the existing R vector API using those semantics; this extends beyond match/%in%. I am hesitant about making such sweeping changes to the API so late in the life-cycle of the package. There was a feature request for a way to count identical ranges in a set of ranges. Let's please not get carried away and start redesigning the API for this one, albeit useful, request.
>> There are all sorts of inconsistencies in the API, and many of them were conscious decisions that considered practical use cases.
>>
>> Michael
>>
>> Herve, I suspect you were, as a result, able to completely drop all the `%in%,BiocClass1,BiocClass2` definitions and depend upon base::%in%
>>
>> Am I right?
>>
>> If so, may I suggest that Herve stay the course, with the addition of
>>
>> '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L, minoverlap=1L, type='any', select='all') > 0'
>>
>> This would provide a perspicacious idiom, thereby optimizing the API for Michael's observed common use case.
>>
>> Just sayin'
>>
>> ~Malcolm
>>
>> .-----Original Message-----
>> .From: bioconductor-bounces@r-project.org [mailto:bioconductor-bounces@r-project.org] On Behalf Of Sean Davis
>> .Sent: Friday, January 04, 2013 3:37 PM
>> .To: Michael Lawrence
>> .Cc: Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
>> .Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>> .
>> .On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence
>> .<lawrence.michael@gene.com> wrote:
>> .> The change to the behavior of %in% is a pretty big one. Are you thinking
>> .> that all set-based operations should behave this way? For example, setdiff
>> .> and intersect? I really liked the syntax of "peaks %in% genes". In my
>> .> experience, it's way more common to ask questions about overlap than about
>> .> equality, so I'd rather optimize the API for that use case. But again,
>> .> that's just my personal bias.
>> .
>> .For what it is worth, I share Michael's personal bias here.
>> .
>> .Sean
>> .
>> .> Michael
>> .>
>> .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
>> .>
>> .>> Hi,
>> .>>
>> .>> I added findMatches() and countMatches() to the latest IRanges /
>> .>> GenomicRanges packages (in BioC devel only).
>> .>>
>> .>> findMatches(x, table): An enhanced version of ‘match’ that
>> .>> returns all the matches in a Hits object.
>> .>>
>> .>> countMatches(x, table): Returns an integer vector of the length
>> .>> of ‘x’, containing the number of matches in ‘table’ for
>> .>> each element in ‘x’.
>> .>>
>> .>> countMatches() is what you can use to tally/count/tabulate (choose your
>> .>> preferred term) the unique elements in a GRanges object:
>> .>>
>> .>> library(GenomicRanges)
>> .>> set.seed(33)
>> .>> gr <- GRanges("chr1", IRanges(sample(15,20,replace=TRUE), width=5))
>> .>>
>> .>> Then:
>> .>>
>> .>> > gr_levels <- sort(unique(gr))
>> .>> > countMatches(gr_levels, gr)
>> .>> [1] 1 1 1 2 4 2 2 1 2 2 2
>> .>>
>> .>> Note that findMatches() and countMatches() also work on IRanges and
>> .>> DNAStringSet objects, as well as on ordinary atomic vectors:
>> .>>
>> .>> library(hgu95av2probe)
>> .>> library(Biostrings)
>> .>> probes <- DNAStringSet(hgu95av2probe)
>> .>> unique_probes <- unique(probes)
>> .>> count <- countMatches(unique_probes, probes)
>> .>> max(count) # 7
>> .>>
>> .>> I made other changes in IRanges/GenomicRanges so that the notion
>> .>> of "match" between elements of a vector-like object now consistently
>> .>> means "equality" instead of "overlap", even for range-based objects
>> .>> like IRanges or GRanges objects. This notion of "equality" is the
>> .>> same that is used by ==. The most visible consequence of those
>> .>> changes is that using %in% between 2 IRanges or GRanges objects
>> .>> 'query' and 'subject' in order to do overlaps was replaced by
>> .>> overlapsAny(query, subject).
>> .>>
>> .>> overlapsAny(query, subject): Finds the ranges in ‘query’ that
>> .>> overlap any of the ranges in ‘subject’.
>> .>>
>> .>> There are warnings and deprecation messages in place to help smooth
>> .>> the transition.
>> .>>
>> .>> Cheers,
>> .>> H.
news.gmane.org="" gmane._**_science.biology.informatics._**="">> _conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor=""> >> >> >> >> >> >> >> <http: news.gmane.org="" gmane._**___science.biology.**="">> informatics.____conductor<http: news.gmane.org="" gmane.____science.b="" iology.informatics.____conductor=""> >> <http: news.gmane.org="" gmane._**_science.biology.informatics._**="">> _conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor=""> >> > >> >> <http: news.gmane.org="" gmane._**_science.biology.informatics._**="">> _conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >>> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org >> <mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org>> <mailto:hpages@fhcrc.org>> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>> >> >> >> Phone: (206) 667-5791 >> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> >> <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> <tel:%28206%29%20667-1319> >> <tel:%28206%29%20667-1319> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. 
Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> <tel:%28206%29%20667-1319> >> >> >> >> >> >> -- >> /A model is a lie that helps you see the truth./ >> / >> / >> Howard Skipper >> <http: cancerres.__aacrjourna**ls.org="" content="" 31="" 9="" __1173.**="">> full.pdf <http: aacrjournals.org="" content="" 31="" 9="" __1173.full.pdf=""> < >> http://cancerres.**aacrjournals.org/content/31/9/**1173.full.pdf<ht tp:="" cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> > ... > > [Message clipped] [[alternative HTML version deleted]]
written 4.9 years ago by Michael Lawrence (9.8k)
Michael: your suggestion is both clearer and more concise than mine was. +1

(I prefer x %i% y %i% z to intersect(x, intersect(y, z)) for the same reason.)

On Tue, Jan 8, 2013 at 2:03 PM, Michael Lawrence <lawrence.michael@gene.com> wrote:

> I would vote for %over% instead of %ov%. Just 2 more characters but way clearer, at least to me. The hardest thing to type are the %'s.
>
> Michael

On Tue, Jan 8, 2013 at 11:09 AM, Hervé Pagès <hpages@fhcrc.org> wrote:

> Thanks Tim, Malcolm for the feedback.
>
> @Tim, I won't comment on the variants of %ov% you are proposing for doing "within" or "equal" instead of "any" (but if people want them, I'll add them too). For now I just want to focus on restoring the convenience of the old %in%, whose removal is understandably causing some frustration. And so we can move on.
>
> Cheers,
> H.

On 01/08/2013 09:50 AM, Tim Triche, Jr. wrote:

> hell, I'll add the operators if there's support for them. obviously they're not a big deal and a patch would take 5 minutes flat.
>
> my hope was to be very explicit about what each type of operation meant, so that when a newcomer to the Ranges API sees
>
>     peaks %overlapping% promoters(someGroupOfGenesWeCareAbout)
>
> it cannot be confused with
>
>     peaks %within% rangesThatCorrespondToSomeChromatinState
>
> or
>
>     peaks %equal% aBunchOfDNAseFootprints
>
> or
>
>     DMRs %in% genes  ## what the hell does this really mean, anyways? it's so bad on so many levels
>
> because whenever someone says "what is the advantage of Ranges-based analyses?", these are the archetypal sorts of queries that come to mind. Except that usually in my examples they are based on posterior probabilities, but perhaps that could stand to change.
>
> Anyways, that's just my bias, and you're doing the heavy lifting. But if people agree with the motivations I will write the patch today.
>
> Cheers,
>
> --t

On Tue, Jan 8, 2013 at 9:20 AM, Hervé Pagès <hpages@fhcrc.org> wrote:

> Hi Tim,
>
> I could add the %ov% operator as a replacement for the old %in%. So you would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just be a convenience wrapper for 'overlapsAny(peaks, genes)'.
>
> Cheers,
> H.

On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:

> So why not leave %in% as it was and transition everything forward to explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%` } such that
>
>     identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
>     identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>     identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE
>
> and for the time being,
>
>     identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE  ## but with a noisy nastygram that will halt if options("warn"=2)
>
> No breakage for %in% methods until such time as a full deprecation cycle has passed, and if the maintainers can't be arsed to do anything at all about the warnings by the second full release, then perhaps they don't really care that much after all. Just a thought?
>
> From someone (me) who has their own issues with keeping everything up to date and should know better. If you want to use %in% for
>
>     peaks %in% genes  (why on earth would you do this rather than peaks %in% promoters(genes), anyways?)
>
> then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more or less) happy.
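For readers following along, the three identities proposed in the quoted message can be sketched in a few lines of R. This is only an illustration under stated assumptions: the operator names `%overlaps%`, `%within%` and `%equals%` are the hypothetical names from the proposal and are defined locally here, the `peaks` and `genes` ranges are made-up toy data, and only `countOverlaps()` and `overlapsAny()` are taken from IRanges itself.

```r
library(IRanges)

# Hypothetical operators named after the proposal in the thread.
# Each one is a thin wrapper around countOverlaps() with a different 'type'.
`%overlaps%` <- function(x, table) countOverlaps(x, table, type = "any") > 0
`%within%`   <- function(x, table) countOverlaps(x, table, type = "within") > 0
`%equals%`   <- function(x, table) countOverlaps(x, table, type = "equal") > 0

# Toy data (made up for illustration).
peaks <- IRanges(start = c(1, 10, 31, 30), end = c(5, 25, 34, 35))
genes <- IRanges(start = c(8, 30), end = c(20, 35))

peaks %overlaps% genes  # FALSE  TRUE  TRUE  TRUE
peaks %within% genes    # FALSE FALSE  TRUE  TRUE
peaks %equals% genes    # FALSE FALSE FALSE  TRUE

# The "any" variant agrees with overlapsAny(), the replacement for the old %in%.
identical(peaks %overlaps% genes, overlapsAny(peaks, genes))
```

The second peak shows why the distinction matters: it overlaps a gene without lying within it, so a reader who takes "peaks in genes" literally would expect FALSE where the overlap-based operator returns TRUE.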
On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence <lawrence.michael@gene.com> wrote:

> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
>
>> Hi Michael,
>>
>> I don't think "match" (the word) always has to mean "equality" either. However having match() (the function) do "whole exact matching" (aka "equality") for any kind of vector-like object has the advantage of:
>>
>> (a) making it consistent with base::match() (?base::match is pretty explicit about what the contract of match() is)
>
> (a) alone is obviously not enough. We have many methods, like the set operations, that treat ranges specially. Are we going to start moving everything toward the base behavior? And have rangeIntersect, rangeSetdiff, etc?
>
>> (b) preserving its relationship with ==, duplicated(), unique(), etc...
>
> So it becomes consistent with duplicated/unique, but we lose consistency with the set operations.
>
>> (c) not frustrating the user who needs something to do exact matching on ranges (as I mentioned previously, if you take match() away from him/her, s/he'll be left with nothing).
>
> No one has ever asked for match() to behave this way. There was a request for a way to tabulate identical ranges. It was a nice idea to extract the general "outer equal" findMatches function. But the changes seem to be snow-balling. These types of changes mean a lot of maintenance work for the users. A deprecation cycle does not circumvent that.
>
>> IMO those advantages counterbalance *by far* the very little convenience you get from having 'match(query, subject)' do 'findOverlaps(query, subject, select="first")' on IRanges/GRanges objects. If you need to do that, just use the latter, or, if you think that's still too much typing, define a wrapper e.g. 'ovmatch(query, subject)'.
>>
>> There are plenty of specialized tools around for doing inexact/fuzzy/partial/overlap matching for many particular types of vector-like objects: grep() and family, pmatch(), charmatch(), agrep(), grepRaw(), matchPattern() and family, findOverlaps() and family, findInterval(), etc... For the reasons I mentioned above, none of them should hijack match() to make it do some particular type of inexact matching on some particular type of objects. Even if, for that particular type of objects, doing that particular type of inexact matching is more common than doing exact matching.
>>
>> H.
>>
>> On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>>
>>> I think having overlapsAny is a nice addition and helps make the API more complete and explicit. Are you sure we need to change the behavior of the match method for this relatively uncommon use case?
>>
>> Yes because otherwise users with a use case of doing match()
>>
>> even if it's uncommon,
>>
>>> I don't think "match" always has to mean "equality". It is a more general concept in my mind. The most common use case for matching ranges is overlap.
>>
>> Of course "match" doesn't always have to mean equality. But of base
>>
>>> Michael

On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages@fhcrc.org> wrote:

> Yes 'peaks %in% genes' is cute and was probably doing the right thing for most users (although not all). But 'exons %in% genes' is cute too and was probably doing the wrong thing for all users. Advanced users like you guys would have no problem switching to
>
>     !is.na(findOverlaps(peaks, genes, type="within", select="any"))
>
> or
>
>     !is.na(findOverlaps(peaks, genes, type="equal", select="any"))
>
> in case 'peaks %in% genes' was not doing exactly what you wanted, but most users would not find this particularly friendly. Even worse, some users probably didn't realize that 'peaks %in% genes' was not doing exactly what they thought it did because "peaks in genes" in English suggests that the peaks are within the genes, but it's not what 'peaks %in% genes' does.
>
> Having overlapsAny(), with exactly the same extra arguments as countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap', 'type', 'ignore.strand'), all of them documented (and with most users more or less familiar with them already) has the virtue to expose the user to all the options from the very start, and to help him/her make the right choice. Of course there will be users that don't want or don't have the time to read/think about all the options. Not a big deal: they'll just do 'overlapsAny(query, subject)', which is not a lot more typing than 'query %in% subject', especially if they use tab completion.
>
> It's true that it's more common to ask questions about overlap than about equality but there are some use cases for the latter (as the original thread shows). Until now, when you had such a use case, you could not use match() or %in%, which would have been the natural things to use, because they got hijacked to do something else, and you were left with nothing. Not a satisfying situation. So at a minimum, we needed to restore the true/real/original semantic of match() to do "equality" instead of "overlap". But it's hard to do this for match() and not do it for %in% too. For more than 99% of R users, %in% is just a simple wrapper for 'match(x, table, nomatch = 0) > 0' (this is how it has been documented and implemented in base R for many years). Not maintaining this relationship between %in% and match() would only cause grief and frustration to newcomers to Bioconductor.
>
> H.

On 01/04/2013 03:32 PM, Cook, Malcolm wrote:

> Hiya again,
>
> I am definitely a late comer to BioC, so I definitely easily defer to the tide of history.
>
> But I do think you miss my point Michael about the proposed change making the relationship between %in% and match for {G,I}Ranges{List} mimic that between other vectors, and I do think that changing the API would make other late-comers take to BioC easier/faster.
>
> That said, I NEVER use %in% so I really have no stake in the matter, and I DEFINITELY appreciate the argument to not changing the API just for semantic sweetness.
>
> That that said, Herve is _/so good/_ about deprecations and warnings that make such changes fairly easily digestible.
>
> That that that.... enough.... I bow out of this one....!!!!
>
> Always learning and Happy New Year to all lurkers,
>
> ~Malcolm
>
> From: Michael Lawrence
> Sent: Friday, January 04, 2013 5:11 PM
> To: Cook, Malcolm
> Cc: Sean Davis; Michael Lawrence; Hervé Pagès (hpages@fhcrc.org); Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
> Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>
> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec@stowers.org> wrote:
>
>> Hiya,
>>
>> For what it is worth...
>>
>> I think the change to %in% is warranted.
>>
>> If I understand correctly, this change restores the relationship between the semantics of `%in%` and the semantics of `match`.
>>
>> From the docs:
>>
>>     '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>>
>> Herve's change restores this relationship.
>
> match and %in% were initially consistent (both considering any overlap); Herve has changed both of them together. The whole idea behind IRanges is that ranges are special data types with special semantics. We have reimplemented much of the existing R vector API using those semantics; this extends beyond match/%in%. I am hesitant about making such sweeping changes to the API so late in the life-cycle of the package. There was a feature request for a way to count identical ranges in a set of ranges. Let's please not get carried away and start redesigning the API for this one, albeit useful, request. There are all sorts of inconsistencies in the API, and many of them were conscious decisions that considered practical use cases.
>
> Michael
>
>> Herve, I suspect you were as a result able to completely drop all the `%in%,BiocClass1,BiocClass2` definitions and depend upon base::%in%
>>
>> Am I right?
>>
>> If so, may I suggest that Herve stay the course, with the addition of
>>
>>     '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L, minoverlap=1L, type='any', select='all') > 0'
>>
>> This would provide a perspicacious idiom, thereby optimizing the API for Michael's observed common use case.
>>
>> Just sayin'
>>
>> ~Malcolm
>>
>> -----Original Message-----
>> From: bioconductor-bounces@r-project.org On Behalf Of Sean Davis
>> Sent: Friday, January 04, 2013 3:37 PM
>> To: Michael Lawrence
>> Cc: Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
>> Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>>
>> On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence <lawrence.michael@gene.com> wrote:
>>
>>> The change to the behavior of %in% is a pretty big one. Are you thinking that all set-based operations should behave this way? For example, setdiff and intersect? I really liked the syntax of "peaks %in% genes". In my experience, it's way more common to ask questions about overlap than about equality, so I'd rather optimize the API for that use case. But again, that's just my personal bias.
>>
>> For what it is worth, I share Michael's personal bias here.
>>
>> Sean
>>> Michael

On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès <hpages@fhcrc.org> wrote:

> Hi,
>
> I added findMatches() and countMatches() to the latest IRanges / GenomicRanges packages (in BioC devel only).
>
>   findMatches(x, table): An enhanced version of ‘match’ that returns all the matches in a Hits object.
>
>   countMatches(x, table): Returns an integer vector of the length of ‘x’, containing the number of matches in ‘table’ for each element in ‘x’.
>
> countMatches() is what you can use to tally/count/tabulate (choose your preferred term) the unique elements in a GRanges object:
>
>     library(GenomicRanges)
>     set.seed(33)
>     gr <- GRanges("chr1", IRanges(sample(15,20,replace=TRUE), width=5))
>
> Then:
>
>     > gr_levels <- sort(unique(gr))
>     > countMatches(gr_levels, gr)
>     [1] 1 1 1 2 4 2 2 1 2 2 2
>
> Note that findMatches() and countMatches() also work on IRanges and DNAStringSet objects, as well as on ordinary atomic vectors:
>
>     library(hgu95av2probe)
>     library(Biostrings)
>     probes <- DNAStringSet(hgu95av2probe)
>     unique_probes <- unique(probes)
>     count <- countMatches(unique_probes, probes)
>     max(count)  # 7
>
> I made other changes in IRanges/GenomicRanges so that the notion of "match" between elements of a vector-like object now consistently means "equality" instead of "overlap", even for range-based objects like IRanges or GRanges objects. This notion of "equality" is the same that is used by ==. The most visible consequence of those changes is that using %in% between 2 IRanges or GRanges objects 'query' and 'subject' in order to do overlaps was replaced by overlapsAny(query, subject).
>
>   overlapsAny(query, subject): Finds the ranges in ‘query’ that overlap any of the ranges in ‘subject’.
>
> There are warnings and deprecation messages in place to help smooth the transition.
>
> Cheers,
> H.
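As a small aside on the announcement above, the difference between "equality" and "overlap" matching can be seen on a toy example. This is a sketch, not part of the original post: the data are made up, and it assumes an IRanges release in which %in% on ranges already follows the equality semantics described in the announcement (releases in the middle of the transition may emit the warnings mentioned there).

```r
library(IRanges)

# Base R contract quoted in the thread: %in% is match() with nomatch = 0.
x   <- c("a", "b", "z")
tbl <- c("a", "a", "b", "c")
in_via_match <- match(x, tbl, nomatch = 0) > 0
identical(x %in% tbl, in_via_match)

# Tally identical ranges with countMatches(), as in the announcement.
# Toy ranges: start 3 occurs three times, start 7 twice, start 1 once.
ir <- IRanges(start = c(3, 7, 3, 1, 7, 3), width = 5)
ir_levels <- sort(unique(ir))          # unique ranges, sorted by start: 1, 3, 7
counts <- countMatches(ir_levels, ir)  # one count per unique range

# A query range that overlaps ranges in 'ir' but equals none of them.
query <- IRanges(start = 4, width = 5)
is_equal_to_any <- query %in% ir             # equality-based membership
overlaps_any    <- overlapsAny(query, ir)    # overlap-based membership
```

Here `counts` holds the tallies for the ranges starting at 1, 3 and 7, while `is_equal_to_any` and `overlaps_any` disagree for the same query, which is the distinction the whole thread is arguing about.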
bioconductor<="" https:="" stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> >>> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor<ht="" tps:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >>> > >>> >>> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor<ht="" tps:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >>> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<http="" s:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> >>> >>> .Search the archives: >>> http://news.gmane.org/gmane.__**____science.biology.** >>> informatics.______conductor<http: news.gmane.org="" gmane.______scie="" nce.biology.informatics.______conductor=""> >>> <http: news.gmane.org="" gmane._**___science.biology.**="">>> informatics.____conductor<http: news.gmane.org="" gmane.____science.="" biology.informatics.____conductor=""> >>> > >>> >>> <http: news.gmane.org="" gmane._**___science.biology.**="">>> informatics.____conductor<http: news.gmane.org="" gmane.____science.="" biology.informatics.____conductor=""> >>> <http: news.gmane.org="" gmane._**_science.biology.informatics._**="">>> _conductor<http: news.gmane.org="" gmane.__science.biology.informati="" cs.__conductor=""> >>> >> >>> >>> >>> >>> >>> <http: news.gmane.org="" gmane._**___science.biology.**="">>> informatics.____conductor<http: news.gmane.org="" gmane.____science.="" biology.informatics.____conductor=""> >>> <http: news.gmane.org="" gmane._**_science.biology.informatics._**="">>> _conductor<http: news.gmane.org="" gmane.__science.biology.informati="" cs.__conductor=""> >>> > >>> >>> <http: news.gmane.org="" gmane._**_science.biology.informatics._**="">>> _conductor<http: news.gmane.org="" gmane.__science.biology.informati="" cs.__conductor=""> >>> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">>> conductor<http: news.gmane.org="" gmane.science.biology.informatics.="" conductor=""> >>> >>> >>> >>> >>> -- >>> Hervé Pagès 
>>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages@fhcrc.org >>> <mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org>>> <mailto:hpages@fhcrc.org>> >>> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >>> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>> >>> >>> >>> Phone: (206) 667-5791 >>> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> >>> <tel:%28206%29%20667-5791> >>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >>> <tel:%28206%29%20667-1319> >>> <tel:%28206%29%20667-1319> >>> >>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >>> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >>> <tel:%28206%29%20667-5791> >>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >>> <tel:%28206%29%20667-1319> >>> >>> >>> >>> >>> >>> -- >>> /A model is a lie that helps you see the truth./ >>> / >>> / >>> Howard Skipper >>> <http: cancerres.__aacrjourna**ls.org="" content="" 31="" 9="" __1173.**="">>> full.pdf <http: aacrjournals.org="" content="" 31="" 9="" __1173.full.pdf=""> < >>> http://cancerres.**aacrjournals.org/content/31/9/**1173.full.pdf<h ttp:="" cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>> >> >>> >>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> >> ... 
[Message clipped]

--
*A model is a lie that helps you see the truth.*

Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
written 4.9 years ago by Tim Triche 4.2k
If we're voting/brainstorming, I'd go for one operator for each value that the 'type' arg of overlap can take on.

Thus:

    %olStart%
    %olEnd%
    %olWithin%
    %olAny%    (perhaps with an alias of just '%ol%')
    %olEqual%  (which should be the same as %in%, right?)

Doh, I can't stay away from this issue for some reason.....

Anyway, my 2 cents

~Malcolm

From: Tim Triche, Jr. [mailto:tim.triche@gmail.com]
Sent: Tuesday, January 08, 2013 4:12 PM
To: Michael Lawrence
Cc: Hervé Pagès; Cook, Malcolm; Sean Davis; Vedran Franke; bioconductor@r-project.org
Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)

Michael: your suggestion is both clearer and more concise than mine was. +1

(I prefer x %i% y %i% z rather than intersect(x, intersect(y, z)) for the same reason)

On Tue, Jan 8, 2013 at 2:03 PM, Michael Lawrence <lawrence.michael@gene.com> wrote:

I would vote for %over% instead of %ov%. Just 2 more characters but way clearer, at least to me. The hardest thing to type are the %'s.

Michael

On Tue, Jan 8, 2013 at 11:09 AM, Hervé Pagès <hpages@fhcrc.org> wrote:

Thanks Tim, Malcolm for the feedback.

@Tim, I won't comment on the variants of %ov% you are proposing for doing "within" or "equal" instead of "any" (but if people want them, I'll add them too). For now I just want to focus on restoring the convenience of the old %in%, whose removal is understandably causing some frustration. And so we can move on.

Cheers,
H.

On 01/08/2013 09:50 AM, Tim Triche, Jr. wrote:

hell, I'll add the operators if there's support for them. obviously they're not a big deal and a patch would take 5 minutes flat.
my hope was to be very explicit about what each type of operation meant, so that when a newcomer to the Ranges API sees

    peaks %overlapping% promoters(someGroupOfGenesWeCareAbout)

it cannot be confused with

    peaks %within% rangesThatCorrespondToSomeChromatinState

or

    peaks %equal% aBunchOfDNAseFootprints

or

    DMRs %in% genes  ## what the hell does this really mean, anyways?

it's so bad on so many levels because whenever someone says "what is the advantage of Ranges-based analyses?", these are the archetypal sorts of queries that come to mind. Except that usually in my examples they are based on posterior probabilities, but perhaps that could stand to change.

Anyways, that's just my bias, and you're doing the heavy lifting. But if people agree with the motivations I will write the patch today.

Cheers,

--t

On Tue, Jan 8, 2013 at 9:20 AM, Hervé Pagès <hpages@fhcrc.org> wrote:

Hi Tim,

I could add the %ov% operator as a replacement for the old %in%. So you would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just be a convenience wrapper for 'overlapsAny(peaks, genes)'.

Cheers,
H.

On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:

So why not leave %in% as it was and transition everything forward to explicitly using

    { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%` }

such that

    identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
    identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
    identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE

and for the time being,

    identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
    ## but with a noisy nastygram that will halt if options("warn"=2)

No breakage for %in% methods until such time as a full deprecation cycle has passed, and if the maintainers can't be arsed to do anything at all about the warnings by the second full release, then perhaps they don't really care that much after all.

Just a thought? From someone (me) who has their own issues with keeping everything up to date and should know better.

If you want to use %in% for peaks %in% genes (why on earth would you do this rather than peaks %in% promoters(genes), anyways?) then a nastygram could be emitted

    "WARNING: YOUR SHORTHAND NOTATION IS DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED"

and everyone is (more or less) happy.

On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence <lawrence.michael@gene.com> wrote:

> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
>
> Hi Michael,
>
> I don't think "match" (the word) always has to mean "equality" either.
> However having match() (the function) do "whole exact matching" (aka "equality") for any kind of vector-like object has the advantage of:
>
> (a) making it consistent with base::match() (?base::match is pretty explicit about what the contract of match() is)

(a) alone is obviously not enough. We have many methods, like the set operations, that treat ranges specially. Are we going to start moving everything toward the base behavior? And have rangeIntersect, rangeSetdiff, etc?

> (b) preserving its relationship with ==, duplicated(), unique(), etc...

So it becomes consistent with duplicated/unique, but we lose consistency with the set operations.

> (c) not frustrating the user who needs something to do exact matching on ranges (as I mentioned previously, if you take match() away from him/her, s/he'll be left with nothing).

No one has ever asked for match() to behave this way. There was a request for a way to tabulate identical ranges. It was a nice idea to extract the general "outer equal" findMatches function. But the changes seem to be snow-balling. These types of changes mean a lot of maintenance work for the users. A deprecation cycle does not circumvent that.

> IMO those advantages counterbalance *by far* the very little convenience you get from having 'match(query, subject)' do 'findOverlaps(query, subject, select="first")' on IRanges/GRanges objects. If you need to do that, just use the latter, or, if you think that's still too much typing, define a wrapper e.g. 'ovmatch(query, subject)'.
>
> There are plenty of specialized tools around for doing inexact/fuzzy/partial/overlap matching for many particular types of vector-like objects: grep() and family, pmatch(), charmatch(), agrep(), grepRaw(), matchPattern() and family, findOverlaps() and family, findInterval(), etc... For the reasons I mentioned above, none of them should hijack match() to make it do some particular type of inexact matching on some particular type of objects. Even if, for that particular type of objects, doing that particular type of inexact matching is more common than doing exact matching.
>
> H.
>
> On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>
>> I think having overlapsAny is a nice addition and helps make the API more complete and explicit. Are you sure we need to change the behavior of the match method for this relatively uncommon use case?
>
> Yes because otherwise users with a use case of doing match() even if it's uncommon,
>
>> I don't think "match" always has to mean "equality". It is a more general concept in my mind. The most common use case for matching ranges is overlap.
>
> Of course "match" doesn't always have to mean equality. But of base
>
>> Michael

On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages@fhcrc.org> wrote:

Yes 'peaks %in% genes' is cute and was probably doing the right thing for most users (although not all). But 'exons %in% genes' is cute too and was probably doing the wrong thing for all users. Advanced users like you guys would have no problem switching to

    !is.na(findOverlaps(peaks, genes, type="within", select="any"))

or

    !is.na(findOverlaps(peaks, genes, type="equal", select="any"))

in case 'peaks %in% genes' was not doing exactly what you wanted, but most users would not find this particularly friendly.
Even worse, some users probably didn't realize that 'peaks %in% genes' was not doing exactly what they thought it did because "peaks in genes" in English suggests that the peaks are within the genes, but it's not what 'peaks %in% genes' does.

Having overlapsAny(), with exactly the same extra arguments as countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap', 'type', 'ignore.strand'), all of them documented (and with most users more or less familiar with them already) has the virtue to expose the user to all the options from the very start, and to help him/her make the right choice. Of course there will be users that don't want or don't have the time to read/think about all the options. Not a big deal: they'll just do 'overlapsAny(query, subject)', which is not a lot more typing than 'query %in% subject', especially if they use tab completion.

It's true that it's more common to ask questions about overlap than about equality but there are some use cases for the latter (as the original thread shows). Until now, when you had such a use case, you could not use match() or %in%, which would have been the natural things to use, because they got hijacked to do something else, and you were left with nothing. Not a satisfying situation.

So at a minimum, we needed to restore the true/real/original semantic of match() to do "equality" instead of "overlap". But it's hard to do this for match() and not do it for %in% too. For more than 99% of R users, %in% is just a simple wrapper for 'match(x, table, nomatch = 0) > 0' (this is how it has been documented and implemented in base R for many years). Not maintaining this relationship between %in% and match() would only cause grief and frustration to newcomers to Bioconductor.

H.

On 01/04/2013 03:32 PM, Cook, Malcolm wrote:

Hiya again,

I am definitely a late comer to BioC, so I definitely easily defer to the tide of history.

But I do think you miss my point Michael about the proposed change making the relationship between %in% and match for {G,I}Ranges{List} mimic that between other vectors, and I do think that changing the API would make other late-comers take to BioC easier/faster.

That said, I NEVER use %in% so I really have no stake in the matter, and I DEFINITELY appreciate the argument for not changing the API just for semantic sweetness.

That that said, Herve is _so good_ about deprecations and warnings that make such changes fairly easily digestible.

That that that.... enough.... I bow out of this one....!!!!

Always learning and Happy New Year to all lurkers,

~Malcolm

From: Michael Lawrence [mailto:lawrence.michael@gene.com]
Sent: Friday, January 04, 2013 5:11 PM
To: Cook, Malcolm
Cc: Sean Davis; Michael Lawrence; Hervé Pagès (hpages@fhcrc.org); Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)

On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec@stowers.org> wrote:

> Hiya,
>
> For what it is worth...
>
> I think the change to %in% is warranted.
>
> If I understand correctly, this change restores the relationship between the semantics of `%in%` and the semantics of `match`.
>
> From the docs:
>
>     '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>
> Herve's change restores this relationship.

match and %in% were initially consistent (both considering any overlap); Herve has changed both of them together.

The whole idea behind IRanges is that ranges are special data types with special semantics. We have reimplemented much of the existing R vector API using those semantics; this extends beyond match/%in%. I am hesitant about making such sweeping changes to the API so late in the life-cycle of the package. There was a feature request for a way to count identical ranges in a set of ranges. Let's please not get carried away and start redesigning the API for this one, albeit useful, request. There are all sorts of inconsistencies in the API, and many of them were conscious decisions that considered practical use cases.

Michael

> Herve, I suspect you were, as a result, able to completely drop all the `%in%,BiocClass1,BiocClass2` definitions and depend upon base::%in%
>
> Am I right?
>
> If so, may I suggest that Herve stay the course, with the addition of
>
>     '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L, minoverlap=1L, type='any', select='all') > 0'
>
> This would provide a perspicacious idiom, thereby optimizing the API for Michael's observed common use case.
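
As a side illustration of the equality-versus-overlap distinction argued over in this thread, here is a minimal base R sketch. It deliberately does not use IRanges or GenomicRanges: the ranges are plain data frames, and the `key()` helper, the toy `%ol%` operator, and the sample coordinates are all assumptions made up for this example. It only mirrors the documented `%in%`/`match()` relationship quoted above and the idea of an "any overlap" test; within Bioconductor, the thread itself points to overlapsAny(query, subject) for the latter.

```r
# Toy closed integer ranges [start, end]; this data-frame layout is an
# assumption for illustration, not how IRanges/GRanges store ranges.
query   <- data.frame(start = c(1L, 10L, 40L), end = c(5L, 15L, 45L))
subject <- data.frame(start = c(1L, 12L),      end = c(5L, 30L))

# Equality semantics: two ranges match only if both endpoints agree.
# key() is a hypothetical helper that turns each range into a string.
key <- function(r) paste(r$start, r$end, sep = "-")
equal_hit <- match(key(query), key(subject), nomatch = 0L) > 0L

# The base R relationship quoted in the thread: %in% wraps match().
stopifnot(identical(equal_hit, key(query) %in% key(subject)))

# Overlap semantics: TRUE when a range in 'a' shares at least one
# position with any range in 'b' (a toy stand-in for an any-overlap test).
`%ol%` <- function(a, b) {
  vapply(seq_len(nrow(a)), function(i) {
    any(a$start[i] <= b$end & a$end[i] >= b$start)
  }, logical(1))
}
overlap_hit <- query %ol% subject

print(equal_hit)    # TRUE FALSE FALSE
print(overlap_hit)  # TRUE TRUE FALSE
```

The second range, [10, 15], is the interesting case: it is not equal to any subject range but does overlap [12, 30], which is exactly where the old and new meanings of %in% on ranges would disagree.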
> Just sayin'
>
> ~Malcolm
>
> -----Original Message-----
> From: bioconductor-bounces@r-project.org [mailto:bioconductor-bounces@r-project.org] On Behalf Of Sean Davis
> Sent: Friday, January 04, 2013 3:37 PM
> To: Michael Lawrence
> Cc: Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
> Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>
> On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence <lawrence.michael@gene.com> wrote:
>> The change to the behavior of %in% is a pretty big one. Are you thinking
>> that all set-based operations should behave this way? For example, setdiff
>> and intersect? I really liked the syntax of "peaks %in% genes". In my
>> experience, it's way more common to ask questions about overlap than about
>> equality, so I'd rather optimize the API for that use case. But again,
>> that's just my personal bias.
>
> For what it is worth, I share Michael's personal bias here.
>
> Sean
>
>> Michael
>>
>> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
>>
>>> Hi,
>>>
>>> I added findMatches() and countMatches() to the latest IRanges /
>>> GenomicRanges packages (in BioC devel only).
>>>
>>>   findMatches(x, table): An enhanced version of 'match' that
>>>     returns all the matches in a Hits object.
>>>
>>>   countMatches(x, table): Returns an integer vector of the length
>>>     of 'x', containing the number of matches in 'table' for
>>>     each element in 'x'.
>>>
>>> countMatches() is what you can use to tally/count/tabulate (choose your
>>> preferred term) the unique elements in a GRanges object:
>>>
>>>   library(GenomicRanges)
>>>   set.seed(33)
>>>   gr <- GRanges("chr1", IRanges(sample(15,20,replace=TRUE), width=5))
>>>
>>> Then:
>>>
>>>   > gr_levels <- sort(unique(gr))
>>>   > countMatches(gr_levels, gr)
>>>    [1] 1 1 1 2 4 2 2 1 2 2 2
>>>
>>> Note that findMatches() and countMatches() also work on IRanges and
>>> DNAStringSet objects, as well as on ordinary atomic vectors:
>>>
>>>   library(hgu95av2probe)
>>>   library(Biostrings)
>>>   probes <- DNAStringSet(hgu95av2probe)
>>>   unique_probes <- unique(probes)
>>>   count <- countMatches(unique_probes, probes)
>>>   max(count)  # 7
>>>
>>> I made other changes in IRanges/GenomicRanges so that the notion
>>> of "match" between elements of a vector-like object now consistently
>>> means "equality" instead of "overlap", even for range-based objects
>>> like IRanges or GRanges objects. This notion of "equality" is the
>>> same that is used by ==. The most visible consequence of those
>>> changes is that using %in% between 2 IRanges or GRanges objects
>>> 'query' and 'subject' in order to do overlaps was replaced by
>>> overlapsAny(query, subject).
>>>
>>>   overlapsAny(query, subject): Finds the ranges in 'query' that
>>>     overlap any of the ranges in 'subject'.
>>>
>>> There are warnings and deprecation messages in place to help smooth
>>> the transition.
>>>
>>> Cheers,
>>> H.
______conductor <http: news.gmane.org="" gmane.____science.biology.informatics._="" ___conductor=""> <http: news.gmane.org="" gmane.____science.biology.informatics._="" ___conductor="" <http:="" news.gmane.org="" gmane.__science.biology.informatics.__c="" onductor="">> <http: news.gmane.org="" gmane.____science.biology.informatics._="" ___conductor="" <http:="" news.gmane.org="" gmane.__science.biology.informatics.__c="" onductor=""> <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="" <http:="" news.gmane.org="" gmane.science.biology.informatics.conductor="">>> -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages@fhcrc.org<mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org<mailto:hpages@fhcrc.org>> <mailto:hpages@fhcrc.org<mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org<mailto:hpages@fhcrc.org>>> <mailto:hpages@fhcrc.org<mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org<mailto:hpages@fhcrc.org>> <mailto:hpages@fhcrc.org<mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org<mailto:hpages@fhcrc.org>>>> Phone: (206) 667-5791<tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> Fax: (206) 667-1319<tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319> -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. 
Box 19024 Seattle, WA 98109-1024 E-mail: hpages@fhcrc.org<mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org<mailto:hpages@fhcrc.org>> <mailto:hpages@fhcrc.org<mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org<mailto:hpages@fhcrc.org>>> Phone: (206) 667-5791<tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> Fax: (206) 667-1319<tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319> -- /A model is a lie that helps you see the truth./ / / Howard Skipper <http: cancerres.__aacrjournals.org="" content="" 31="" 9="" __1173.full.="" pdf<http:="" aacrjournals.org="" content="" 31="" 9="" __1173.full.pdf=""> <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">> -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 ... [Message clipped] -- A model is a lie that helps you see the truth. Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
written 4.9 years ago by Malcolm Cook (1.4k)
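For readers following along, here is a minimal sketch of the semantics described in the announcement: countMatches(), match() and %in% compare ranges by equality, while overlap questions go through overlapsAny(). The toy ranges below are made up for illustration, and the sketch assumes a Bioconductor installation in which GenomicRanges makes these functions available.

```r
# Sketch only; the ranges are hypothetical and not taken from the thread.
library(GenomicRanges)

# Six ranges on chr1: [1,5] twice, [5,9] once, [10,14] three times.
gr <- GRanges("chr1", IRanges(start = c(1, 1, 5, 10, 10, 10), width = 5))

# Tally identical ranges: one count per unique range.
gr_levels <- sort(unique(gr))
tally <- countMatches(gr_levels, gr)
tally        # 2 1 3

# Two query ranges: [1,5] is identical to a range in gr, [3,7] only overlaps.
query <- GRanges("chr1", IRanges(start = c(1, 3), width = 5))

# Equality semantics: TRUE only where an identical range exists in gr.
is_equal <- query %in% gr
is_equal     # TRUE FALSE

# Overlap semantics: TRUE where the query range overlaps any range in gr.
is_overlap <- overlapsAny(query, gr)
is_overlap   # TRUE TRUE
```

The second query range shows the difference: it overlaps two ranges in gr but equals none of them, so %in% and overlapsAny() disagree on it.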
I think %over% and maybe %within% are all that's needed. Could go to %start% and %end%.

Michael

On Tue, Jan 8, 2013 at 2:59 PM, Cook, Malcolm <mec@stowers.org> wrote:

> If we’re voting/brainstorming, I’d go for one operator for each value that
> the ‘type’ arg of overlap can take on.
>
> Thus:
>
>   %olStart%
>   %olEnd%
>   %olWithin%
>   %olAny% (perhaps with alias of just ‘%ol%’)
>   %olEqual% (which should be same as %in%, right)
>
> Doh, I can’t stay away from this issue for some reason..... Anyway, my 2
> cents
>
> ~Malcolm
>
> From: Tim Triche, Jr. [mailto:tim.triche@gmail.com]
> Sent: Tuesday, January 08, 2013 4:12 PM
> To: Michael Lawrence
> Cc: Hervé Pagès; Cook, Malcolm; Sean Davis; Vedran Franke;
>     bioconductor@r-project.org
> Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>
> Michael: your suggestion is both clearer and more concise than mine was.
> +1
>
> (I prefer x %i% y %i% z rather than intersect(x, intersect(y, z)) for the
> same reason)
>
> On Tue, Jan 8, 2013 at 2:03 PM, Michael Lawrence
> <lawrence.michael@gene.com> wrote:
>
> I would vote for %over% instead of %ov%. Just 2 more characters but way
> clearer, at least to me. The hardest thing to type are the %'s.
>
> Michael
>
> On Tue, Jan 8, 2013 at 11:09 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
>
> Thanks Tim, Malcolm for the feedback.
>
> @Tim, I won't comment on the variants of %ov% you are proposing for
> doing "within" or "equal" instead of "any" (but if people want them,
> I'll add them too). For now I just want to focus on restoring the
> convenience of the old %in%, whose removal is understandably causing
> some frustration. And so we can move on.
>
> Cheers,
> H.
>
> On 01/08/2013 09:50 AM, Tim Triche, Jr. wrote:
>
> hell, I'll add the operators if there's support for them. obviously
> they're not a big deal and a patch would take 5 minutes flat.
>
> my hope was to be very explicit about what each type of operation meant,
> so that when a newcomer to the Ranges API sees
>
>   peaks %overlapping% promoters(someGroupOfGenesWeCareAbout)
>
> it cannot be confused with
>
>   peaks %within% rangesThatCorrespondToSomeChromatinState
>
> or
>
>   peaks %equal% aBunchOfDNAseFootprints
>
> or
>
>   DMRs %in% genes  ## what the hell does this really mean, anyways?
>                    ## it's so bad on so many levels
>
> because whenever someone says "what is the advantage of Ranges-based
> analyses?", these are the archetypal sorts of queries that come to mind.
> Except that usually in my examples they are based on posterior
> probabilities, but perhaps that could stand to change.
>
> Anyways, that's just my bias, and you're doing the heavy lifting. But
> if people agree with the motivations I will write the patch today.
>
> Cheers,
>
> --t
>
> On Tue, Jan 8, 2013 at 9:20 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
>
> Hi Tim,
>
> I could add the %ov% operator as a replacement for the old %in%. So you
> would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just
> be a convenience wrapper for 'overlapsAny(peaks, genes)'.
>
> Cheers,
> H.
>
> On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:
>
> So why not leave %in% as it was and transition everything forward to
> explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%` }
> such that
>
>   identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>   identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE
>
> and for the time being,
>
>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>   ## but with a noisy nastygram that will halt if options("warn"=2)
>
> No breakage for %in% methods until such time as a full deprecation cycle
> has passed, and if the maintainers can't be arsed to do anything at all
> about the warnings by the second full release, then perhaps they don't
> really care that much after all. Just a thought?
>
> From someone (me) who has their own issues with keeping everything up
> to date and should know better. If you want to use %in% for
>
>   peaks %in% genes  (why on earth would you do this rather than
>                      peaks %in% promoters(genes), anyways?)
>
> then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS
> DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more
> or less) happy.
>
> On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence
> <lawrence.michael@gene.com> wrote:
>
> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
>
> Hi Michael,
>
> I don't think "match" (the word) always has to mean "equality" either.
> However having match() (the function) do "whole exact matching" (aka
> "equality") for any kind of vector-like object has the advantage of:
>
>   (a) making it consistent with base::match() (?base::match is pretty
>       explicit about what the contract of match() is)
>
> (a) alone is obviously not enough. We have many methods, like the
> set operations, that treat ranges specially. Are we going to start
> moving everything toward the base behavior? And have rangeIntersect,
> rangeSetdiff, etc?
>
>   (b) preserving its relationship with ==, duplicated(), unique(),
>       etc...
>
> So it becomes consistent with duplicated/unique, but we lose
> consistency with the set operations.
>
>   (c) not frustrating the user who needs something to do exact
>       matching on ranges (as I mentioned previously, if you take
>       match() away from him/her, s/he'll be left with nothing).
>
> No one has ever asked for match() to behave this way. There was a
> request for a way to tabulate identical ranges. It was a nice idea
> to extract the general "outer equal" findMatches function. But the
> changes seem to be snow-balling. These types of changes mean a lot
> of maintenance work for the users. A deprecation cycle does not
> circumvent that.
>
> IMO those advantages counterbalance *by far* the very little
> convenience you get from having 'match(query, subject)' do
> 'findOverlaps(query, subject, select="first")' on IRanges/GRanges
> objects. If you need to do that, just use the latter, or, if you
> think that's still too much typing, define a wrapper e.g.
> 'ovmatch(query, subject)'.
>
> There are plenty of specialized tools around for doing
> inexact/fuzzy/partial/overlap matching for many particular types
> of vector-like objects: grep() and family, pmatch(), charmatch(),
> agrep(), grepRaw(), matchPattern() and family, findOverlaps() and
> family, findInterval(), etc... For the reasons I mentioned above,
> none of them should hijack match() to make it do some particular
> type of inexact matching on some particular type of objects. Even
> if, for that particular type of objects, doing that particular type
> of inexact matching is more common than doing exact matching.
>
> H.
>
> On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>
> I think having overlapsAny is a nice addition and helps make the API
> more complete and explicit. Are you sure we need to change the behavior
> of the match method for this relatively uncommon use case?
>
> Yes because otherwise users with a use case of doing match()
> even if it's uncommon,
>
> I don't think "match" always has to mean "equality". It is a more
> general concept in my mind. The most common use case for matching
> ranges is overlap.
>
> Of course "match" doesn't always have to mean equality. But of base
>
> Michael
>
> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
>
> Yes 'peaks %in% genes' is cute and was probably doing the right thing
> for most users (although not all). But 'exons %in% genes' is cute too
> and was probably doing the wrong thing for all users. Advanced users
> like you guys would have no problem switching to
>
>   !is.na(findOverlaps(peaks, genes, type="within", select="any"))
>
> or
>
>   !is.na(findOverlaps(peaks, genes, type="equal", select="any"))
>
> in case 'peaks %in% genes' was not doing exactly what you wanted,
> but most users would not find this particularly friendly. Even
> worse, some users probably didn't realize that 'peaks %in% genes'
> was not doing exactly what they thought it did because "peaks in
> genes" in English suggests that the peaks are within the genes,
> but it's not what 'peaks %in% genes' does.
>
> Having overlapsAny(), with exactly the same extra arguments as
> countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap',
> 'type', 'ignore.strand'), all of them documented (and with most
> users more or less familiar with them already) has the virtue of
> exposing the user to all the options from the very start, and of
> helping him/her make the right choice. Of course there will be users
> that don't want or don't have the time to read/think about all the
> options. Not a big deal: they'll just do 'overlapsAny(query, subject)',
> which is not a lot more typing than 'query %in% subject', especially
> if they use tab completion.
>
> It's true that it's more common to ask questions about overlap than
> about equality but there are some use cases for the latter (as the
> original thread shows). Until now, when you had such a use case, you
> could not use match() or %in%, which would have been the natural things
> to use, because they got hijacked to do something else, and you were
> left with nothing. Not a satisfying situation. So at a minimum, we
> needed to restore the true/real/original semantics of match() to do
> "equality" instead of "overlap". But it's hard to do this for match()
> and not do it for %in% too. For more than 99% of R users, %in% is
> just a simple wrapper for 'match(x, table, nomatch = 0) > 0' (this
> is how it has been documented and implemented in base R for many
> years). Not maintaining this relationship between %in% and match()
> would only cause grief and frustration to newcomers to Bioconductor.
>
> H.
>
> On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>
> Hiya again,
>
> I am definitely a late comer to BioC, so I definitely easily defer to
> the tide of history.
>
> But I do think you miss my point Michael about the proposed change
> making the relationship between %in% and match for {G,I}Ranges{List}
> mimic that between other vectors, and I do think that changing the API
> would make other late-comers take to BioC easier/faster.
>
> That said, I NEVER use %in% so I really have no stake in the matter, and
> I DEFINITELY appreciate the argument to not changing the API just for
> semantic sweetness.
>
> That that said, Herve is _so good_ about deprecations and warnings
> that make such changes fairly easily digestible.
>
> That that that.... enough.... I bow out of this one....!!!!
>
> Always learning and Happy New Year to all lurkers,
>
> ~Malcolm
>
> From: Michael Lawrence [mailto:lawrence.michael@gene.com]
> Sent: Friday, January 04, 2013 5:11 PM
> To: Cook, Malcolm
> Cc: Sean Davis; Michael Lawrence; Hervé Pagès (hpages@fhcrc.org);
>     Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
> Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>
> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec@stowers.org> wrote:
>
> Hiya,
>
> For what it is worth...
>
> I think the change to %in% is warranted.
>
> If I understand correctly, this change restores the relationship between
> the semantics of `%in%` and the semantics of `match`.
>
> From the docs:
>
>   '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>
> Herve's change restores this relationship.
>
> match and %in% were initially consistent (both considering any overlap);
> Herve has changed both of them together. The whole idea behind IRanges
> is that ranges are special data types with special semantics. We have
> reimplemented much of the existing R vector API using those semantics;
> this extends beyond match/%in%. I am hesitant about making such sweeping
> changes to the API so late in the life-cycle of the package. There was a
> feature request for a way to count identical ranges in a set of ranges.
> Let's please not get carried away and start redesigning the API for this
> one, albeit useful, request. There are all sorts of inconsistencies in
> the API, and many of them were conscious decisions that considered
> practical use cases.
>
> Michael
>
> Herve, I suspect you were as a result able to completely drop
> all the `%in%,BiocClass1,BiocClass2` definitions and depend upon
> base::%in%
>
> Am I right?
>
> If so, may I suggest that Herve stay the course, with the addition of
>
>   '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L,
>        minoverlap=1L, type='any', select='all') > 0'
>
> This would provide a perspicacious idiom, thereby optimizing the API
> for Michael's observed common use case.
>
> Just sayin'
>
> ~Malcolm
>
> .-----Original Message-----
> .From: bioconductor-bounces@r-project.org
> .      [mailto:bioconductor-bounces@r-project.org] On Behalf Of Sean Davis
> .Sent: Friday, January 04, 2013 3:37 PM
> .To: Michael Lawrence
> .Cc: Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
> .Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
> .
> .On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence
> .<lawrence.michael@gene.com> wrote:
> .> The change to the behavior of %in% is a pretty big one. Are you thinking
> .> that all set-based operations should behave this way? For example, setdiff
> .> and intersect? I really liked the syntax of "peaks %in% genes". In my
> .> experience, it's way more common to ask questions about overlap than about
> .> equality, so I'd rather optimize the API for that use case. But again,
> .> that's just my personal bias.
> .
> .For what it is worth, I share Michael's personal bias here.
> .
> .Sean
> .
> .> Michael
> .>
> .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
> .>
> .>> Hi,
> .>>
> .>> I added findMatches() and countMatches() to the latest IRanges /
> .>> GenomicRanges packages (in BioC devel only).
> .>>
> .>>   findMatches(x, table): An enhanced version of ‘match’ that
> .>>       returns all the matches in a Hits object.
> .>>
> .>>   countMatches(x, table): Returns an integer vector of the length
> .>>       of ‘x’, containing the number of matches in ‘table’ for
> .>>       each element in ‘x’.
> .>>
> .>> countMatches() is what you can use to tally/count/tabulate (choose your
> .>> preferred term) the unique elements in a GRanges object:
> .>>
> .>>   library(GenomicRanges)
> .>>   set.seed(33)
> .>>   gr <- GRanges("chr1", IRanges(sample(15,20,replace=TRUE), width=5))
> .>>
> .>> Then:
> .>>
> .>>   > gr_levels <- sort(unique(gr))
> .>>   > countMatches(gr_levels, gr)
> .>>    [1] 1 1 1 2 4 2 2 1 2 2 2
> .>>
> .>> Note that findMatches() and countMatches() also work on IRanges and
> .>> DNAStringSet objects, as well as on ordinary atomic vectors:
> .>>
> .>>   library(hgu95av2probe)
> .>>   library(Biostrings)
> .>>   probes <- DNAStringSet(hgu95av2probe)
> .>>   unique_probes <- unique(probes)
> .>>   count <- countMatches(unique_probes, probes)
> .>>   max(count)  # 7
> .>>
> .>> I made other changes in IRanges/GenomicRanges so that the notion
> .>> of "match" between elements of a vector-like object now consistently
> .>> means "equality" instead of "overlap", even for range-based objects
> .>> like IRanges or GRanges objects. This notion of "equality" is the
> .>> same that is used by ==. The most visible consequence of those
> .>> changes is that using %in% between 2 IRanges or GRanges objects
> .>> 'query' and 'subject' in order to do overlaps was replaced by
> .>> overlapsAny(query, subject).
> .>>
> .>>   overlapsAny(query, subject): Finds the ranges in ‘query’ that
> .>>       overlap any of the ranges in ‘subject’.
> .>>
> .>> There are warnings and deprecation messages in place to help smooth
> .>> the transition.
> .>>
> .>> Cheers,
> .>> H.
written 4.9 years ago by Michael Lawrence (9.8k)
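The operator proposals in this exchange can be sketched as follows. In the thread %over% and %within% are still only suggestions, so this assumes a later Bioconductor release in which IRanges/GenomicRanges provide them; `peaks` and `genes` are made-up toy ranges. The checks mirror the identities Tim proposed, with each operator agreeing with countOverlaps(...) > 0 for the matching 'type'.

```r
# Sketch only; assumes %over% and %within% are available (not guaranteed
# for the Bioconductor version discussed in this thread).
library(GenomicRanges)

# Hypothetical toy data.
peaks <- GRanges("chr1", IRanges(start = c(1, 20, 40), end = c(10, 25, 60)))
genes <- GRanges("chr1", IRanges(start = c(5, 18), end = c(15, 30)))

# Any overlap: convenience form of overlapsAny(peaks, genes).
over_any <- peaks %over% genes
over_any      # TRUE TRUE FALSE

# Peak entirely inside a gene: overlapsAny(peaks, genes, type = "within").
over_within <- peaks %within% genes
over_within   # FALSE TRUE FALSE

# The identities suggested in the thread, written with these operators.
stopifnot(identical(over_any,
                    countOverlaps(peaks, genes, type = "any") > 0))
stopifnot(identical(over_within,
                    countOverlaps(peaks, genes, type = "within") > 0))
```

The first peak overlaps a gene without being contained in it, which is the kind of case where 'peaks %in% genes' used to be ambiguous and the explicit operators are not.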
+1

On Tue, Jan 8, 2013 at 3:07 PM, Michael Lawrence <lawrence.michael@gene.com> wrote:

> I think %over% and maybe %within% are all that's needed. Could go to
> %start% and %end%.
>
> Michael
>
> On Tue, Jan 8, 2013 at 2:59 PM, Cook, Malcolm <mec@stowers.org> wrote:
>
>> If we’re voting/brainstorming, I’d go for one operator for each value
>> that the ‘type’ arg of overlap can take on.
>>
>> Thus:
>>
>>   %olStart%
>>   %olEnd%
>>   %olWithin%
>>   %olAny% (perhaps with alias of just ‘%ol%’)
>>   %olEqual% (which should be same as %in%, right)
>>
>> Doh, I can’t stay away from this issue for some reason..... Anyway, my 2
>> cents
>>
>> ~Malcolm
>>
>> From: Tim Triche, Jr. [mailto:tim.triche@gmail.com]
>> Sent: Tuesday, January 08, 2013 4:12 PM
>> To: Michael Lawrence
>> Cc: Hervé Pagès; Cook, Malcolm; Sean Davis; Vedran Franke;
>> bioconductor@r-project.org
>> Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>>
>> Michael: your suggestion is both clearer and more concise than mine was.
>> +1
>>
>> (I prefer x %i% y %i% z rather than intersect(x, intersect(y, z)) for the
>> same reason)
>>
>> On Tue, Jan 8, 2013 at 2:03 PM, Michael Lawrence
>> <lawrence.michael@gene.com> wrote:
>>
>> I would vote for %over% instead of %ov%. Just 2 more characters but way
>> clearer, at least to me. The hardest thing to type are the %'s.
>>
>> Michael
>>
>> On Tue, Jan 8, 2013 at 11:09 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
>>
>> Thanks Tim, Malcolm for the feedback.
>>
>> @Tim, I won't comment on the variants of %ov% you are proposing for
>> doing "within" or "equal" instead of "any" (but if people want them,
>> I'll add them too). For now I just want to focus on restoring the
>> convenience of the old %in%, whose removal is understandably causing
>> some frustration. And so we can move on.
>>
>> Cheers,
>> H.
>>
>> On 01/08/2013 09:50 AM, Tim Triche, Jr. wrote:
>>
>> hell, I'll add the operators if there's support for them. obviously
>> they're not a big deal and a patch would take 5 minutes flat.
>>
>> my hope was to be very explicit about what each type of operation meant,
>> so that when a newcomer to the Ranges API sees
>>
>>   peaks %overlapping% promoters(someGroupOfGenesWeCareAbout)
>>
>> it cannot be confused with
>>
>>   peaks %within% rangesThatCorrespondToSomeChromatinState
>>
>> or
>>
>>   peaks %equal% aBunchOfDNAseFootprints
>>
>> or
>>
>>   DMRs %in% genes  ## what the hell does this really mean, anyways?
>>                    ## it's so bad on so many levels
>>
>> because whenever someone says "what is the advantage of Ranges-based
>> analyses?", these are the archetypal sorts of queries that come to mind.
>> Except that usually in my examples they are based on posterior
>> probabilities, but perhaps that could stand to change.
>>
>> Anyways, that's just my bias, and you're doing the heavy lifting. But
>> if people agree with the motivations I will write the patch today.
>>
>> Cheers,
>>
>> --t
>>
>> On Tue, Jan 8, 2013 at 9:20 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
>>
>> Hi Tim,
>>
>> I could add the %ov% operator as a replacement for the old %in%. So you
>> would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just
>> be a convenience wrapper for 'overlapsAny(peaks, genes)'.
>>
>> Cheers,
>> H.
>>
>> On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:
>>
>> So why not leave %in% as it was and transition everything forward to
>> explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%` }
>> such that
>>
>>   identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
>>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>>   identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE
>>
>> and for the time being,
>>
>>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>>   ## but with a noisy nastygram that will halt if options("warn"=2)
>>
>> No breakage for %in% methods until such time as a full deprecation cycle
>> has passed, and if the maintainers can't be arsed to do anything at all
>> about the warnings by the second full release, then perhaps they don't
>> really care that much after all. Just a thought?
>>
>> From someone (me) who has their own issues with keeping everything up
>> to date and should know better. If you want to use %in% for
>>
>>   peaks %in% genes (why on earth would you do this rather than
>>   peaks %in% promoters(genes), anyways?)
>>
>> then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS
>> DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more
>> or less) happy.
>>
>> On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence
>> <lawrence.michael@gene.com> wrote:
>>
>> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
>>
>> Hi Michael,
>>
>> I don't think "match" (the word) always has to mean "equality" either.
>> However having match() (the function) do "whole exact matching" (aka
>> "equality") for any kind of vector-like object has the advantage of:
>>
>>   (a) making it consistent with base::match() (?base::match is pretty
>>       explicit about what the contract of match() is)
>>
>> (a) alone is obviously not enough. We have many methods, like the
>> set operations, that treat ranges specially. Are we going to start
>> moving everything toward the base behavior? And have rangeIntersect,
>> rangeSetdiff, etc?
>>
>>   (b) preserving its relationship with ==, duplicated(), unique(),
>>       etc...
>>
>> So it becomes consistent with duplicated/unique, but we lose
>> consistency with the set operations.
>>
>>   (c) not frustrating the user who needs something to do exact
>>       matching on ranges (as I mentioned previously, if you take
>>       match() away from him/her, s/he'll be left with nothing).
>>
>> No one has ever asked for match() to behave this way. There was a
>> request for a way to tabulate identical ranges. It was a nice idea
>> to extract the general "outer equal" findMatches function. But the
>> changes seem to be snow-balling. These types of changes mean a lot
>> of maintenance work for the users. A deprecation cycle does not
>> circumvent that.
>>
>> IMO those advantages counterbalance *by far* the very little
>> convenience you get from having 'match(query, subject)' do
>> 'findOverlaps(query, subject, select="first")' on
>> IRanges/GRanges objects. If you need to do that, just use the
>> latter, or, if you think that's still too much typing, define
>> a wrapper e.g. 'ovmatch(query, subject)'.
>>
>> There are plenty of specialized tools around for doing
>> inexact/fuzzy/partial/overlap matching for many particular types
>> of vector-like objects: grep() and family, pmatch(), charmatch(),
>> agrep(), grepRaw(), matchPattern() and family, findOverlaps() and
>> family, findInterval(), etc... For the reasons I mentioned
>> above, none of them should hijack match() to make it do some
>> particular type of inexact matching on some particular type of
>> objects. Even if, for that particular type of objects, doing that
>> particular type of inexact matching is more common than doing
>> exact matching.
>>
>> H.
>>
>> On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>>
>> I think having overlapsAny is a nice addition and helps make the API
>> more complete and explicit. Are you sure we need to change the behavior
>> of the match method for this relatively uncommon use case?
>>
>> Yes because otherwise users with a use case of doing match()
>>
>> even if it's uncommon,
>>
>> I don't think "match" always has to mean "equality". It is a more
>> general concept in my mind. The most common use case for matching
>> ranges is overlap.
>>
>> Of course "match" doesn't always have to mean equality. But of base
>>
>> Michael
>>
>> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
>>
>> Yes 'peaks %in% genes' is cute and was probably doing the right thing
>> for most users (although not all). But 'exons %in% genes' is cute too
>> and was probably doing the wrong thing for all users. Advanced users
>> like you guys would have no problem switching to
>>
>>   !is.na(findOverlaps(peaks, genes, type="within", select="any"))
>>
>> or
>>
>>   !is.na(findOverlaps(peaks, genes, type="equal", select="any"))
>>
>> in case 'peaks %in% genes' was not doing exactly what you wanted,
>> but most users would not find this particularly friendly. Even
>> worse, some users probably didn't realize that 'peaks %in% genes'
>> was not doing exactly what they thought it did because "peaks in
>> genes" in English suggests that the peaks are within the genes,
>> but it's not what 'peaks %in% genes' does.
>>
>> Having overlapsAny(), with exactly the same extra arguments as
>> countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap',
>> 'type', 'ignore.strand'), all of them documented (and with most
>> users more or less familiar with them already) has the virtue to
>> expose the user to all the options from the very start, and to
>> help him/her make the right choice. Of course there will be users
>> that don't want or don't have the time to read/think about all the
>> options. Not a big deal: they'll just do 'overlapsAny(query, subject)',
>> which is not a lot more typing than 'query %in% subject', especially
>> if they use tab completion.
>>
>> It's true that it's more common to ask questions about overlap than
>> about equality but there are some use cases for the latter (as the
>> original thread shows). Until now, when you had such a use case, you
>> could not use match() or %in%, which would have been the natural things
>> to use, because they got hijacked to do something else, and you were
>> left with nothing. Not a satisfying situation.
>> So at a minimum, we needed to restore the true/real/original semantic
>> of match() to do "equality" instead of "overlap". But it's hard to do
>> this for match() and not do it for %in% too. For more than 99% of R
>> users, %in% is just a simple wrapper for 'match(x, table, nomatch = 0)
>> > 0' (this is how it has been documented and implemented in base R
>> for many years). Not maintaining this relationship between %in%
>> and match() would only cause grief and frustration to newcomers to
>> Bioconductor.
>>
>> H.
>>
>> On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>>
>> Hiya again,
>>
>> I am definitely a late comer to BioC, so I definitely easily defer to
>> the tide of history.
>>
>> But I do think you miss my point Michael about the proposed change
>> making the relationship between %in% and match for {G,I}Ranges{List}
>> mimic that between other vectors, and I do think that changing the API
>> would make other late-comers take to BioC easier/faster.
>>
>> That said, I NEVER use %in% so I really have no stake in the matter, and
>> I DEFINITELY appreciate the argument to not changing the API just for
>> semantic sweetness.
>>
>> That that said, Herve is _/so good/_ about deprecations and warnings
>> that make such changes fairly easily digestible.
>>
>> That that that.... enough.... I bow out of this one....!!!!
>>
>> Always learning and Happy New Year to all lurkers,
>>
>> ~Malcolm
>>
>> From: Michael Lawrence [mailto:lawrence.michael@gene.com]
>> Sent: Friday, January 04, 2013 5:11 PM
>> To: Cook, Malcolm
>> Cc: Sean Davis; Michael Lawrence; Hervé Pagès (hpages@fhcrc.org); Tim
>> Triche, Jr.; Vedran Franke; bioconductor@r-project.org
>> Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>>
>> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec@stowers.org> wrote:
>>
>> Hiya,
>>
>> For what it is worth...
>>
>> I think the change to %in% is warranted.
>>
>> If I understand correctly, this change restores the relationship between
>> the semantics of `%in%` and the semantics of `match`.
>>
>> From the docs:
>>
>>   '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>>
>> Herve's change restores this relationship.
>>
>> match and %in% were initially consistent (both considering any overlap);
>> Herve has changed both of them together. The whole idea behind IRanges
>> is that ranges are special data types with special semantics. We have
>> reimplemented much of the existing R vector API using those semantics;
>> this extends beyond match/%in%. I am hesitant about making such sweeping
>> changes to the API so late in the life-cycle of the package. There was a
>> feature request for a way to count identical ranges in a set of ranges.
>> Let's please not get carried away and start redesigning the API for this
>> one, albeit useful, request. There are all sorts of inconsistencies in
>> the API, and many of them were conscious decisions that considered
>> practical use cases.
>>
>> Michael
>>
>> Herve, I suspect you were, as a result, able to completely drop
>> all the `%in%,BiocClass1,BiocClass2` definitions and depend upon
>> base::%in%
>>
>> Am I right?
>>
>> If so, may I suggest that Herve stay the course, with the addition of
>>
>>   '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L,
>>   minoverlap=1L, type='any', select='all') > 0'
>>
>> This would provide a perspicacious idiom, thereby optimizing the API
>> for Michael's observed common use case.
>>
>> Just sayin'
>>
>> ~Malcolm
>>
>> .-----Original Message-----
>> .From: bioconductor-bounces@r-project.org
>> [mailto:bioconductor-bounces@r-project.org] On Behalf Of Sean Davis
>> .Sent: Friday, January 04, 2013 3:37 PM
>> .To: Michael Lawrence
>> .Cc: Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
>> .Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>> .
>> .On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence
>> .<lawrence.michael@gene.com> wrote:
>> .> The change to the behavior of %in% is a pretty big one. Are you thinking
>> .> that all set-based operations should behave this way? For example, setdiff
>> .> and intersect? I really liked the syntax of "peaks %in% genes". In my
>> .> experience, it's way more common to ask questions about overlap than about
>> .> equality, so I'd rather optimize the API for that use case. But again,
>> .> that's just my personal bias.
>> .
>> .For what it is worth, I share Michael's personal bias here.
>> .
>> .Sean
>> .
>> .> Michael
>> .>
>> .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
>> .>
>> .>> Hi,
>> .>>
>> .>> I added findMatches() and countMatches() to the latest IRanges /
>> .>> GenomicRanges packages (in BioC devel only).
>> .>>
>> .>>   findMatches(x, table): An enhanced version of ‘match’ that
>> .>>     returns all the matches in a Hits object.
>> .>>
>> .>>   countMatches(x, table): Returns an integer vector of the length
>> .>>     of ‘x’, containing the number of matches in ‘table’ for
>> .>>     each element in ‘x’.
>> .>>
>> .>> countMatches() is what you can use to tally/count/tabulate (choose your
>> .>> preferred term) the unique elements in a GRanges object:
>> .>>
>> .>>   library(GenomicRanges)
>> .>>   set.seed(33)
>> .>>   gr <- GRanges("chr1", IRanges(sample(15,20,replace=TRUE), width=5))
>> .>>
>> .>> Then:
>> .>>
>> .>>   > gr_levels <- sort(unique(gr))
>> .>>   > countMatches(gr_levels, gr)
>> .>>   [1] 1 1 1 2 4 2 2 1 2 2 2
>> .>>
>> .>> Note that findMatches() and countMatches() also work on IRanges and
>> .>> DNAStringSet objects, as well as on ordinary atomic vectors:
>> .>>
>> .>>   library(hgu95av2probe)
>> .>>   library(Biostrings)
>> .>>   probes <- DNAStringSet(hgu95av2probe)
>> .>>   unique_probes <- unique(probes)
>> .>>   count <- countMatches(unique_probes, probes)
>> .>>   max(count) # 7
>> .>>
>> .>> I made other changes in IRanges/GenomicRanges so that the notion
>> .>> of "match" between elements of a vector-like object now consistently
>> .>> means "equality" instead of "overlap", even for range-based objects
>> .>> like IRanges or GRanges objects. This notion of "equality" is the
>> .>> same that is used by ==. The most visible consequence of those
>> .>> changes is that using %in% between 2 IRanges or GRanges objects
>> .>> 'query' and 'subject' in order to do overlaps was replaced by
>> .>> overlapsAny(query, subject).
>> .>>
>> .>>   overlapsAny(query, subject): Finds the ranges in ‘query’ that
>> .>>     overlap any of the ranges in ‘subject’.
>> .>>
>> .>> There are warnings and deprecation messages in place to help smooth
>> .>> the transition.
>> .>>
>> .>> Cheers,
>> .>> H.
>> .>>
>> .>> --
>> .>> Hervé Pagès
>> .>>
>> .>> Program in Computational Biology
>> .>> Division of Public Health Sciences
>> .>> Fred Hutchinson Cancer Research Center
>> .>> 1100 Fairview Ave. N, M1-B514
>> .>> P.O. Box 19024
>> .>> Seattle, WA 98109-1024
>> .>>
>> .>> E-mail: hpages@fhcrc.org
>> .>> Phone: (206) 667-5791
>> .>> Fax: (206) 667-1319

--
A model is a lie that helps you see the truth.
Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
written 4.9 years ago by Tim Triche 4.2k
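The operators proposed in the messages above can be sketched as thin wrappers around overlapsAny(), in the spirit of the "convenience wrapper" wording used for %ov%. This is only an illustration under stated assumptions: the two definitions below are hypothetical local functions written for this example (not the code that ended up in IRanges/GenomicRanges), and `peaks` and `genes` are toy ranges with made-up coordinates. It assumes a GenomicRanges installation that provides overlapsAny() with a `type` argument, as described in the thread.

```r
library(GenomicRanges)

# Illustrative local definitions, not the package implementation.
`%over%`   <- function(query, subject) overlapsAny(query, subject)
`%within%` <- function(query, subject) overlapsAny(query, subject, type = "within")

# Toy data: three peaks and two genes on chr1 (made-up coordinates).
peaks <- GRanges("chr1", IRanges(start = c(1, 20, 50), width = 5))
genes <- GRanges("chr1", IRanges(start = c(3, 45), end = c(30, 60)))

# Any overlap at all: every peak touches a gene.
peaks %over% genes
#> [1] TRUE TRUE TRUE

# Fully contained: the first peak (1-5) starts before the gene at 3-30.
peaks %within% genes
#> [1] FALSE  TRUE  TRUE
```

The first peak shows why the two operators are worth keeping apart: it overlaps a gene but is not inside one, which is exactly the ambiguity the thread attributes to reading 'peaks %in% genes' as plain English.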
Thanks all for the feedback. Will do %over% and %within%. Hopefully we can consider this the end of the thread :-b I'll just post a quick note on Bioc-devel when this is ready.

Cheers,
H.

On 01/08/2013 03:07 PM, Michael Lawrence wrote:
> I think %over% and maybe %within% are all that's needed. Could go to
> %start% and %end%.
>
> Michael
>
> On Tue, Jan 8, 2013 at 2:59 PM, Cook, Malcolm <mec at stowers.org> wrote:
>
> If we’re voting/brainstorming, I’d go for one operator for each value
> that the ‘type’ arg of overlap can take on.
>
> Thus:
>
>   %olStart%
>   %olEnd%
>   %olWithin%
>   %olAny% (perhaps with alias of just ‘%ol%’)
>   %olEqual% (which should be same as %in%, right)
>
> Doh, I can’t stay away from this issue for some reason..... Anyway,
> my 2 cents
>
> ~Malcolm
>
> From: Tim Triche, Jr. [mailto:tim.triche at gmail.com]
> Sent: Tuesday, January 08, 2013 4:12 PM
> To: Michael Lawrence
> Cc: Hervé Pagès; Cook, Malcolm; Sean Davis; Vedran Franke;
> bioconductor at r-project.org
> Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>
> Michael: your suggestion is both clearer and more concise than mine
> was. +1
>
> (I prefer x %i% y %i% z rather than intersect(x, intersect(y, z))
> for the same reason)
>
> On Tue, Jan 8, 2013 at 2:03 PM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>
> I would vote for %over% instead of %ov%. Just 2 more characters but
> way clearer, at least to me. The hardest thing to type are the %'s.
>
> Michael
>
> On Tue, Jan 8, 2013 at 11:09 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Thanks Tim, Malcolm for the feedback.
>
> @Tim, I won't comment on the variants of %ov% you are proposing for
> doing "within" or "equal" instead of "any" (but if people want them,
> I'll add them too). For now I just want to focus on restoring the
> convenience of the old %in%, whose removal is understandably causing
> some frustration. And so we can move on.
>
> Cheers,
> H.
>
> On 01/08/2013 09:50 AM, Tim Triche, Jr. wrote:
>
> hell, I'll add the operators if there's support for them. obviously
> they're not a big deal and a patch would take 5 minutes flat.
>
> my hope was to be very explicit about what each type of operation meant,
> so that when a newcomer to the Ranges API sees
>
>   peaks %overlapping% promoters(someGroupOfGenesWeCareAbout)
>
> it cannot be confused with
>
>   peaks %within% rangesThatCorrespondToSomeChromatinState
>
> or
>
>   peaks %equal% aBunchOfDNAseFootprints
>
> or
>
>   DMRs %in% genes  ## what the hell does this really mean, anyways?
>                    ## it's so bad on so many levels
>
> because whenever someone says "what is the advantage of Ranges-based
> analyses?", these are the archetypal sorts of queries that come to mind.
> Except that usually in my examples they are based on posterior
> probabilities, but perhaps that could stand to change.
>
> Anyways, that's just my bias, and you're doing the heavy lifting. But
> if people agree with the motivations I will write the patch today.
>
> Cheers,
>
> --t
>
> On Tue, Jan 8, 2013 at 9:20 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Hi Tim,
>
> I could add the %ov% operator as a replacement for the old %in%. So you
> would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just
> be a convenience wrapper for 'overlapsAny(peaks, genes)'.
>
> Cheers,
> H.
>
> On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:
>
> So why not leave %in% as it was and transition everything forward to
> explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%` }
> such that
>
>   identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>   identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE
>
> and for the time being,
>
>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>   ## but with a noisy nastygram that will halt if options("warn"=2)
>
> No breakage for %in% methods until such time as a full deprecation cycle
> has passed, and if the maintainers can't be arsed to do anything at all
> about the warnings by the second full release, then perhaps they don't
> really care that much after all. Just a thought?
>
> From someone (me) who has their own issues with keeping everything up
> to date and should know better. If you want to use %in% for
>
>   peaks %in% genes (why on earth would you do this rather than
>   peaks %in% promoters(genes), anyways?)
>
> then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS
> DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more
> or less) happy.
>
> On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>
> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Hi Michael,
>
> I don't think "match" (the word) always has to mean "equality" either.
> However having match() (the function) do "whole exact matching" (aka
> "equality") for any kind of vector-like object has the advantage of:
>
>   (a) making it consistent with base::match() (?base::match is pretty
>       explicit about what the contract of match() is)
>
> (a) alone is obviously not enough. We have many methods, like the
> set operations, that treat ranges specially. Are we going to start
> moving everything toward the base behavior? And have rangeIntersect,
> rangeSetdiff, etc?
>
>   (b) preserving its relationship with ==, duplicated(), unique(),
>       etc...
>
> So it becomes consistent with duplicated/unique, but we lose
> consistency with the set operations.
>
>   (c) not frustrating the user who needs something to do exact
>       matching on ranges (as I mentioned previously, if you take
>       match() away from him/her, s/he'll be left with nothing).
>
> No one has ever asked for match() to behave this way. There was a
> request for a way to tabulate identical ranges. It was a nice idea
> to extract the general "outer equal" findMatches function. But the
> changes seem to be snow-balling. These types of changes mean a lot
> of maintenance work for the users. A deprecation cycle does not
> circumvent that.
>
> IMO those advantages counterbalance *by far* the very little
> convenience you get from having 'match(query, subject)' do
> 'findOverlaps(query, subject, select="first")' on
> IRanges/GRanges objects. If you need to do that, just use the
> latter, or, if you think that's still too much typing, define
> a wrapper e.g. 'ovmatch(query, subject)'.
>
> There are plenty of specialized tools around for doing
> inexact/fuzzy/partial/overlap matching for many particular types
> of vector-like objects: grep() and family, pmatch(), charmatch(),
> agrep(), grepRaw(), matchPattern() and family, findOverlaps() and
> family, findInterval(), etc... For the reasons I mentioned
> above, none of them should hijack match() to make it do some
> particular type of inexact matching on some particular type of
> objects. Even if, for that particular type of objects, doing that
> particular type of inexact matching is more common than doing
> exact matching.
>
> H.
>
> On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>
> I think having overlapsAny is a nice addition and helps make the API
> more complete and explicit. Are you sure we need to change the behavior
> of the match method for this relatively uncommon use case?
>
> Yes because otherwise users with a use case of doing match()
>
> even if it's uncommon,
>
> I don't think "match" always has to mean "equality". It is a more
> general concept in my mind. The most common use case for matching
> ranges is overlap.
>
> Of course "match" doesn't always have to mean equality. But of base
>
> Michael
>
> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Yes 'peaks %in% genes' is cute and was probably doing the right thing
> for most users (although not all). But 'exons %in% genes' is cute too
> and was probably doing the wrong thing for all users. Advanced users
> like you guys would have no problem switching to
>
>   !is.na(findOverlaps(peaks, genes, type="within", select="any"))
>
> or
>
>   !is.na(findOverlaps(peaks, genes, type="equal", select="any"))
>
> in case 'peaks %in% genes' was not doing exactly what you wanted,
> but most users would not find this particularly friendly. Even
> worse, some users probably didn't realize that 'peaks %in% genes'
> was not doing exactly what they thought it did because "peaks in
> genes" in English suggests that the peaks are within the genes,
> but it's not what 'peaks %in% genes' does.
>
> Having overlapsAny(), with exactly the same extra arguments as
> countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap',
> 'type', 'ignore.strand'), all of them documented (and with most
> users more or less familiar with them already) has the virtue to
> expose the user to all the options from the very start, and to
> help him/her make the right choice. Of course there will be users
> that don't want or don't have the time to read/think about all the
> options. Not a big deal: they'll just do 'overlapsAny(query, subject)',
> which is not a lot more typing than 'query %in% subject', especially
> if they use tab completion.
>
> It's true that it's more common to ask questions about overlap than
> about equality but there are some use cases for the latter (as the
> original thread shows). Until now, when you had such a use case, you
> could not use match() or %in%, which would have been the natural things
> to use, because they got hijacked to do something else, and you were
> left with nothing. Not a satisfying situation. So at a minimum, we
> needed to restore the true/real/original semantic of match() to do
> "equality" instead of "overlap". But it's hard to do this for match()
> and not do it for %in% too. For more than 99% of R users, %in% is
> just a simple wrapper for 'match(x, table, nomatch = 0) > 0' (this
> is how it has been documented and implemented in base R for many
> years). Not maintaining this relationship between %in% and match()
> would only cause grief and frustration to newcomers to Bioconductor.
>
> H.
>
> On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>
> Hiya again,
>
> I am definitely a late comer to BioC, so I definitely easily defer to
> the tide of history.
>
> But I do think you miss my point Michael about the proposed change
> making the relationship between %in% and match for {G,I}Ranges{List}
> mimic that between other vectors, and I do think that changing the API
> would make other late-comers take to BioC easier/faster.
>
> That said, I NEVER use %in% so I really have no stake in the matter, and
> I DEFINITELY appreciate the argument to not changing the API just for
> sematic sweetness.
>
> That that said, Herve is _/so good/_ about deprecations and warnings
> that make such changes fairly easily digestible.
>
> That that that.... enough.... I bow out of this one....!!!!
>
> Always learning and Happy New Year to all lurkers,
>
> ~Malcolm
>
> *From:* Michael Lawrence [mailto:lawrence.michael at gene.com]
> *Sent:* Friday, January 04, 2013 5:11 PM
> *To:* Cook, Malcolm
> *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès (hpages at fhcrc.org);
> Tim Triche, Jr.; Vedran Franke; bioconductor at r-project.org
> *Subject:* Re: [BioC] countMatches() (was: table for GenomicRanges)
>
> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec at stowers.org> wrote:
>
> Hiya,
>
> For what it is worth...
>
> I think the change to %in% is warranted.
>
> If I understand correctly, this change restores the relationship between
> the semantics of `%in` and the semantics of `match`.
>
> From the docs:
>
> '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>
> Herve's change restores this relationship.
>
> match and %in% were initially consistent (both considering any overlap);
> Herve has changed both of them together. The whole idea behind IRanges
> is that ranges are special data types with special semantics. We have
> reimplemented much of the existing R vector API using those semantics;
> this extends beyond match/%in%. I am hesitant about making such sweeping
> changes to the API so late in the life-cycle of the package. There was a
> feature request for a way to count identical ranges in a set of ranges.
> Let's please not get carried away and start redesigning the API for this
> one, albeit useful, request. There are all sorts of inconsistencies in
> the API, and many of them were conscious decisions that considered
> practical use cases.
>
> Michael
>
> Herve, I suspect you were you as a result able to completely drop
> all the `%in%,BiocClass1,BiocClass2` definitions and depend upon
> base::%in%
>
> Am I right?
>
> If so, may I suggest that Herve stay the course, with the addition of
>
> '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L,
> minoverlap=1L, type='any', select='all') > 0'
>
> This would provide a perspicacious idiom, thereby optimizing the API
> for Michaels observed common use case.
>
> Just sayin'
>
> ~Malcolm
>
> .-----Original Message-----
> .From: bioconductor-bounces at r-project.org
> [mailto:bioconductor-bounces at r-project.org] On Behalf Of Sean Davis
> .Sent: Friday, January 04, 2013 3:37 PM
> .To: Michael Lawrence
> .Cc: Tim Triche, Jr.; Vedran Franke; bioconductor at r-project.org
> .Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
> .
> .On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence
> .<lawrence.michael at gene.com> wrote:
> .> The change to the behavior of %in% is a pretty big one. Are you thinking
> .> that all set-based operations should behave this way? For example, setdiff
> .> and intersect? I really liked the syntax of "peaks %in% genes". In my
> .> experience, it's way more common to ask questions about overlap than about
> .> equality, so I'd rather optimize the API for that use case. But again,
> .> that's just my personal bias.
> .
> .For what it is worth, I share Michael's personal bias here.
> .
> .Sean
> .
> .
> .> Michael
> .>
> .>
> .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> .>
> .>> Hi,
> .>>
> .>> I added findMatches() and countMatches() to the latest IRanges /
> .>> GenomicRanges packages (in BioC devel only).
> .>>
> .>>   findMatches(x, table): An enhanced version of ‘match’ that
> .>>           returns all the matches in a Hits object.
> .>>
> .>>   countMatches(x, table): Returns an integer vector of the length
> .>>           of ‘x’, containing the number of matches in ‘table’ for
> .>>           each element in ‘x’.
> .>>
> .>> countMatches() is what you can use to tally/count/tabulate (choose your
> .>> preferred term) the unique elements in a GRanges object:
> .>>
> .>>   library(GenomicRanges)
> .>>   set.seed(33)
> .>>   gr <- GRanges("chr1", IRanges(sample(15,20,replace=TRUE), width=5))
> .>>
> .>> Then:
> .>>
> .>>   > gr_levels <- sort(unique(gr))
> .>>   > countMatches(gr_levels, gr)
> .>>    [1] 1 1 1 2 4 2 2 1 2 2 2
> .>>
> .>> Note that findMatches() and countMatches() also work on IRanges and
> .>> DNAStringSet objects, as well as on ordinary atomic vectors:
> .>>
> .>>   library(hgu95av2probe)
> .>>   library(Biostrings)
> .>>   probes <- DNAStringSet(hgu95av2probe)
> .>>   unique_probes <- unique(probes)
> .>>   count <- countMatches(unique_probes, probes)
> .>>   max(count)  # 7
> .>>
> .>> I made other changes in IRanges/GenomicRanges so that the notion
> .>> of "match" between elements of a vector-like object now consistently
> .>> means "equality" instead of "overlap", even for range-based objects
> .>> like IRanges or GRanges objects. This notion of "equality" is the
> .>> same that is used by ==. The most visible consequence of those
> .>> changes is that using %in% between 2 IRanges or GRanges objects
> .>> 'query' and 'subject' in order to do overlaps was replaced by
> .>> overlapsAny(query, subject).
> .>>
> .>>   overlapsAny(query, subject): Finds the ranges in ‘query’ that
> .>>      overlap any of the ranges in ‘subject’.
> .>>
> .>> There are warnings and deprecation messages in place to help smooth
> .>> the transition.
> .>>
> .>> Cheers,
> .>> H.
> .>>
> .>> --
> .>> Hervé Pagès
> .>>
> .>> Program in Computational Biology
> .>> Division of Public Health Sciences
> .>> Fred Hutchinson Cancer Research Center
> .>> 1100 Fairview Ave. N, M1-B514
> .>> P.O. Box 19024
> .>> Seattle, WA 98109-1024
> .>>
> .>> E-mail: hpages at fhcrc.org
> .>> Phone: (206) 667-5791
> .>> Fax: (206) 667-1319
> .>>
> .>
> .>
> .>       [[alternative HTML version deleted]]
> .>
> .>
> .> _______________________________________________
> .> Bioconductor mailing list
> .> Bioconductor at r-project.org
> .> https://stat.ethz.ch/mailman/listinfo/bioconductor
> .> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> ...
>
> [Message clipped]

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
written 4.9 years ago by Hervé Pagès ♦♦ 13k
On 01/08/2013 02:59 PM, Cook, Malcolm wrote:
> If we’re voting/brainstorming, I’d go for one operator for value that
> the ‘type’ arg of overlap can take on
>
> Thus:
>
> %olStart%
>
> %olEnd%
>
> %olWithin%
>
> %olAny% (perhaps with alias of just ‘%ol%’)
>
> %olEqual% (which should be same as %in%, right)

Except for zero-width ranges: they never overlap with anything, but
2 zero-width ranges with the same start are considered equal:

  > ir <- IRanges(start=5:7, width=0:2)
  > ir
  IRanges of length 3
      start end width
  [1]     5   4     0
  [2]     6   6     1
  [3]     7   8     2

  > overlapsAny(ir, ir, type="equal")
  [1] FALSE  TRUE  TRUE

  > suppressWarnings(ir %in% ir)
  [1] TRUE TRUE TRUE

Also I believe the new %in% should generally be faster than
overlapsAny( , type="equal"), and also perhaps more memory efficient,
but I didn't do enough testing to quantify this.

H.

>
> Doh, I can’t stay away from this issue for some reason..... Anyway, my 2
> cents
>
> ~Malcolm
>
> *From:* Tim Triche, Jr. [mailto:tim.triche at gmail.com]
> *Sent:* Tuesday, January 08, 2013 4:12 PM
> *To:* Michael Lawrence
> *Cc:* Hervé Pagès; Cook, Malcolm; Sean Davis; Vedran Franke;
> bioconductor at r-project.org
> *Subject:* Re: [BioC] countMatches() (was: table for GenomicRanges)
>
> Michael: your suggestion is both clearer and more concise than mine was.
> +1
>
> (I prefer x %i% y %i% z rather than intersect(x, intersect(y, z)) for
> the same reason)
>
> On Tue, Jan 8, 2013 at 2:03 PM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>
> I would vote for %over% instead of %ov%. Just 2 more characters but way
> clearer, at least to me. The hardest thing to type are the %'s.
>
> Michael
>
> On Tue, Jan 8, 2013 at 11:09 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Thanks Tim, Malcolm for the feedback.
>
> @Tim, I won't comment on the variants of %ov% you are proposing for
> doing "within" or "equal" instead of "any" (but if people want them,
> I'll add them too). For now I just want to focus on restoring the
> convenience of the old %in%, whose removal is understandably causing
> some frustration. And so we can move on.
>
> Cheers,
> H.
>
> On 01/08/2013 09:50 AM, Tim Triche, Jr. wrote:
>
> hell, I'll add the operators if there's support for them. obviously
> they're not a big deal and a patch would take 5 minutes flat.
>
> my hope was to be very explicit about what each type of operation meant,
> so that when a newcomer to the Ranges API sees
>
>     peaks %overlapping% promoters(someGroupOfGenesWeCareAbout)
>
> it cannot be confused with
>
>     peaks %within% rangesThatCorrespondToSomeChromatinState
>
> or
>
>     peaks %equal% aBunchOfDNAseFootprints
>
> or
>
>     DMRs %in% genes ## what the hell does this really mean, anyways?
>                     ## it's so bad on so many levels
>
> because whenever someone says "what is the advantage of Ranges-based
> analyses?", these are the archetypal sorts of queries that come to mind.
> Except that usually in my examples they are based on posterior
> probabilities, but perhaps that could stand to change.
>
> Anyways, that's just my bias, and you're doing the heavy lifting. But
> if people agree with the motivations I will write the patch today.
>
> Cheers,
>
> --t
>
> On Tue, Jan 8, 2013 at 9:20 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Hi Tim,
>
> I could add the %ov% operator as a replacement for the old %in%. So you
> would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just
> be a convenience wrapper for 'overlapsAny(peaks, genes)'.
>
> Cheers,
> H.
>
> On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:
>
> So why not leave %in% as it was and transition everything forward to
> explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%` }
> such that
>
>     identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
>     identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>     identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE
>
> and for the time being,
>
>     identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
>     ## but with a noisy nastygram that will halt if options("warn"=2)
>
> No breakage for %in% methods until such time as a full deprecation cycle
> has passed, and if the maintainers can't be arsed to do anything at all
> about the warnings by the second full release, then perhaps they don't
> really care that much after all. Just a thought?
>
> From someone (me) who has their own issues with keeping everything up
> to date and should know better. If you want to use %in% for
>
>     peaks %in% genes (why on earth would you do this rather than
>     peaks %in% promoters(genes), anyways?)
>
> then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS
> DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more
> or less) happy.
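
[Editorial aside, not part of the original message] The overlap operators proposed in this exchange can be sketched as thin wrappers around overlapsAny(), each fixing a different value of its 'type' argument. This is only a rough illustration: the operator names %ovAny%, %ovWithin% and %ovEqual% are hypothetical, picked so they should not mask anything a package may already export, and the example ranges are made up. It assumes the IRanges package is installed.

```r
# Sketch only: the operator names below are hypothetical, chosen for
# illustration; they are not claimed to be what was added to the package.
library(IRanges)

# Each operator fixes the 'type' argument of overlapsAny().
`%ovAny%`    <- function(query, subject) overlapsAny(query, subject, type = "any")
`%ovWithin%` <- function(query, subject) overlapsAny(query, subject, type = "within")
`%ovEqual%`  <- function(query, subject) overlapsAny(query, subject, type = "equal")

# Made-up example ranges.
query   <- IRanges(start = c(1, 10, 20), width = 5)   # 1-5, 10-14, 20-24
subject <- IRanges(start = c(3, 8), end = c(6, 16))   # 3-6, 8-16

query %ovAny% subject     # TRUE TRUE FALSE: the first two ranges touch a subject range
query %ovWithin% subject  # FALSE TRUE FALSE: only 10-14 lies entirely inside 8-16
query %ovEqual% subject   # FALSE FALSE FALSE: no query range equals a subject range

# The identity proposed in the message above, for the "within" case.
identical(query %ovWithin% subject,
          countOverlaps(query, subject, type = "within") > 0)

# Zero-width caveat raised at the top of this reply. No expected output is
# given here because the result may depend on the installed package version.
ir <- IRanges(start = 5:7, width = 0:2)
ir %ovEqual% ir
```

Keeping one operator per 'type' value makes the intent visible at the call site, at the cost of more names to remember; calling overlapsAny() directly keeps all the other arguments ('maxgap', 'minoverlap') within reach.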
> > > > On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence > <lawrence.michael at="" gene.com=""> <mailto:lawrence.michael at="" gene.com=""> > <mailto:lawrence.michael at="" gene.com=""> <mailto:lawrence.michael at="" gene.com="">> > > <mailto:lawrence.michael at="" gene.=""> <mailto:lawrence.michael at="" gene.="">__com > <mailto:lawrence.michael at="" gene.com=""> <mailto:lawrence.michael at="" gene.com="">>>> wrote: > > > > On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès > <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>>> wrote: > > Hi Michael, > > I don't think "match" (the word) always has to > mean > "equality" > either. > However having match() (the function) do > "whole exact > matching" (aka > "equality") for any kind of vector-like object > has the > advantage of: > > (a) making it consistent with base::match() > (?base::match is > pretty > explicit about what the contract of > match() is) > > > (a) alone is obviously not enough. We have many > methods, > like the > set operations, that treat ranges specially. Are > we going > to start > moving everything toward the base behavior? And have > rangeIntersect, > rangeSetdiff, etc? > > (b) preserving its relationship with ==, > duplicated(), unique(), > etc... > > > So it becomes consistent with duplicated/unique, > but we lose > consistency with the set operations. > > (c) not frustrating the user who needs > something to > do exact > matching on ranges (as I mentioned > previously, > if you take > match() away from him/her, s/he'll be > left with > nothing). > > > No one has ever asked for match() to behave this > way. There > was a > request for a way to tabulate identical ranges. It > was a > nice idea > to extract the general "outer equal" findMatches > function. 
> But the > changes seem to be snow-balling. These types of > changes > mean a lot > of maintenance work for the users. A deprecation > cycle does not > circumvent that. > > > IMO those advantages counterbalance *by far* > the very > little > convenience you get from having 'match(query, > subject)' do > 'findOverlaps(query, subject, select="first")' on > IRanges/GRanges objects. If you need to do > that, just > use the > latter, or, if you think that's still too much > typing, > define > a wrapper e.g. 'ovmatch(query, subject)'. > > There are plenty of specialized tools around > for doing > inexact/fuzzy/partial/overlap matching for many > particular types > of vector-like objects: grep() and family, > pmatch(), > charmatch(), > agrep(), grepRaw(), matchPattern() and family, > findOverlaps() and > family, findIntervals(), etc... For the > reasons I mentioned > above, none of them should hijack match() to > make it do > some > particular type of inexact matching on some > particular > type of > objects. Even if, for that particular type of > objects, > doing that > particular type of inexact matching is more > common than > doing > exact matching. > > H. > > > > On 01/06/2013 05:39 PM, Michael Lawrence wrote: > > I think having overlapsAny is a nice > addition and > helps make > the API > more complete and explicit. Are you sure > we need to > change > the behavior > of the match method for this relatively > uncommon > use case? > > > Yes because otherwise users with a use case of > doing > match() > > even if it's uncommon, > > > I don't think > "match" always has to mean "equality". It > is a more > general > concept in > my mind. The most common use case for matching > ranges is > overlap. > > > Of course "match" doesn't always have to mean > equality. 
> But of base
>
> Michael
>
> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Yes 'peaks %in% genes' is cute and was probably doing the right thing for most users (although not all). But 'exons %in% genes' is cute too and was probably doing the wrong thing for all users. Advanced users like you guys would have no problem switching to
>
>   !is.na(findOverlaps(peaks, genes, type="within", select="any"))
>
> or
>
>   !is.na(findOverlaps(peaks, genes, type="equal", select="any"))
>
> in case 'peaks %in% genes' was not doing exactly what you wanted, but most users would not find this particularly friendly. Even worse, some users probably didn't realize that 'peaks %in% genes' was not doing exactly what they thought it did because "peaks in genes" in English suggests that the peaks are within the genes, but it's not what 'peaks %in% genes' does.
>
> Having overlapsAny(), with exactly the same extra arguments as countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap', 'type', 'ignore.strand'), all of them documented (and with most users more or less familiar with them already) has the virtue to expose the user to all the options from the very start, and to help him/her make the right choice. Of course there will be users that don't want or don't have the time to read/think about all the options. Not a big deal: they'll just do 'overlapsAny(query, subject)', which is not a lot more typing than 'query %in% subject', especially if they use tab completion.
>
> It's true that it's more common to ask questions about overlap than about equality but there are some use cases for the latter (as the original thread shows). Until now, when you had such a use case, you could not use match() or %in%, which would have been the natural things to use, because they got hijacked to do something else, and you were left with nothing. Not a satisfying situation. So at a minimum, we needed to restore the true/real/original semantic of match() to do "equality" instead of "overlap". But it's hard to do this for match() and not do it for %in% too. For more than 99% of R users, %in% is just a simple wrapper for 'match(x, table, nomatch = 0) > 0' (this is how it has been documented and implemented in base R for many years). Not maintaining this relationship between %in% and match() would only cause grief and frustration to newcomers to Bioconductor.
>
> H.
>
> On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>
> Hiya again,
>
> I am definitely a late comer to BioC, so I definitely easily defer to the tide of history.
>
> But I do think you miss my point Michael about the proposed change making the relationship between %in% and match for {G,I}Ranges{List} mimic that between other vectors, and I do think that changing the API would make other late-comers take to BioC easier/faster.
>
> That said, I NEVER use %in% so I really have no stake in the matter, and I DEFINITELY appreciate the argument to not changing the API just for semantic sweetness.
>
> That that said, Herve is _/so good/_ about deprecations and warnings that make such changes fairly easily digestible.
>
> That that that.... enough.... I bow out of this one....!!!!
>
> Always learning and Happy New Year to all lurkers,
>
> ~Malcolm
>
> *From:* Michael Lawrence [mailto:lawrence.michael at gene.com]
> *Sent:* Friday, January 04, 2013 5:11 PM
> *To:* Cook, Malcolm
> *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès (hpages at fhcrc.org); Tim Triche, Jr.; Vedran Franke; bioconductor at r-project.org
> *Subject:* Re: [BioC] countMatches() (was: table for GenomicRanges)
>
> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec at stowers.org> wrote:
>
> Hiya,
>
> For what it is worth...
>
> I think the change to %in% is warranted.
>
> If I understand correctly, this change restores the relationship between the semantics of `%in%` and the semantics of `match`.
>
> From the docs:
>
> '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>
> Herve's change restores this relationship.
>
> match and %in% were initially consistent (both considering any overlap); Herve has changed both of them together. The whole idea behind IRanges is that ranges are special data types with special semantics. We have reimplemented much of the existing R vector API using those semantics; this extends beyond match/%in%. I am hesitant about making such sweeping changes to the API so late in the life-cycle of the package. There was a feature request for a way to count identical ranges in a set of ranges. Let's please not get carried away and start redesigning the API for this one, albeit useful, request. There are all sorts of inconsistencies in the API, and many of them were conscious decisions that considered practical use cases.
>
> Michael
>
> Herve, I suspect you were as a result able to completely drop all the `%in%,BiocClass1,BiocClass2` definitions and depend upon base::%in%
>
> Am I right?
>
> If so, may I suggest that Herve stay the course, with the addition of
>
> '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L, minoverlap=1L, type='any', select='all') > 0'
>
> This would provide a perspicacious idiom, thereby optimizing the API for Michael's observed common use case.
>
> Just sayin'
>
> ~Malcolm
>
> .-----Original Message-----
> .From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Sean Davis
> .Sent: Friday, January 04, 2013 3:37 PM
> .To: Michael Lawrence
> .Cc: Tim Triche, Jr.; Vedran Franke; bioconductor at r-project.org
> .Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
> .
> .On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
> .> The change to the behavior of %in% is a pretty big one. Are you thinking
> .> that all set-based operations should behave this way? For example, setdiff
> .> and intersect? I really liked the syntax of "peaks %in% genes". In my
> .> experience, it's way more common to ask questions about overlap than about
> .> equality, so I'd rather optimize the API for that use case. But again,
> .> that's just my personal bias.
> .
> .For what it is worth, I share Michael's personal bias here.
> .
> .Sean
> .
> .> Michael
> .>
> .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> .>
> .>> Hi,
> .>>
> .>> I added findMatches() and countMatches() to the latest IRanges /
> .>> GenomicRanges packages (in BioC devel only).
> .>>
> .>>   findMatches(x, table): An enhanced version of 'match' that
> .>>     returns all the matches in a Hits object.
> .>>
> .>>   countMatches(x, table): Returns an integer vector of the length
> .>>     of 'x', containing the number of matches in 'table' for
> .>>     each element in 'x'.
> .>>
> .>> countMatches() is what you can use to tally/count/tabulate (choose your
> .>> preferred term) the unique elements in a GRanges object:
> .>>
> .>>   library(GenomicRanges)
> .>>   set.seed(33)
> .>>   gr <- GRanges("chr1", IRanges(sample(15,20,replace=TRUE), width=5))
> .>>
> .>> Then:
> .>>
> .>>   > gr_levels <- sort(unique(gr))
> .>>   > countMatches(gr_levels, gr)
> .>>    [1] 1 1 1 2 4 2 2 1 2 2 2
> .>>
> .>> Note that findMatches() and countMatches() also work on IRanges and
> .>> DNAStringSet objects, as well as on ordinary atomic vectors:
> .>>
> .>>   library(hgu95av2probe)
> .>>   library(Biostrings)
> .>>   probes <- DNAStringSet(hgu95av2probe)
> .>>   unique_probes <- unique(probes)
> .>>   count <- countMatches(unique_probes, probes)
> .>>   max(count)  # 7
> .>>
> .>> I made other changes in IRanges/GenomicRanges so that the notion
> .>> of "match" between elements of a vector-like object now consistently
> .>> means "equality" instead of "overlap", even for range-based objects
> .>> like IRanges or GRanges objects. This notion of "equality" is the
> .>> same that is used by ==. The most visible consequence of those
> .>> changes is that using %in% between 2 IRanges or GRanges objects
> .>> 'query' and 'subject' in order to do overlaps was replaced by
> .>> overlapsAny(query, subject).
> .>>
> .>>   overlapsAny(query, subject): Finds the ranges in 'query' that
> .>>     overlap any of the ranges in 'subject'.
> .>>
> .>> There are warnings and deprecation messages in place to help smooth
> .>> the transition.
> .>>
> .>> Cheers,
> .>> H.
> .>>
> .>> --
> .>> Hervé Pagès
> .>>
> .>> Program in Computational Biology
> .>> Division of Public Health Sciences
> .>> Fred Hutchinson Cancer Research Center
> .>> 1100 Fairview Ave. N, M1-B514
> .>> P.O. Box 19024
> .>> Seattle, WA 98109-1024
> .>>
> .>> E-mail: hpages at fhcrc.org
> .>> Phone:  (206) 667-5791
> .>> Fax:    (206) 667-1319
> .>>
> .>
> .>         [[alternative HTML version deleted]]
> .>
> .> _______________________________________________
> .> Bioconductor mailing list
> .> Bioconductor at r-project.org
> .> https://stat.ethz.ch/mailman/listinfo/bioconductor
> .> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> .
>
> ...
>
> [Message clipped]
>
> --
> /A model is a lie that helps you see the truth./
>
> Howard Skipper
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
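For readers skimming the quoted exchange, the distinction being argued over (equality versus overlap) can be seen on a toy example. This is only a sketch, assuming a Bioconductor release in which match() and %in% on GRanges already use the equality semantics described above; the object names (query, subject, eq_idx, ...) are made up for illustration.

```r
library(GenomicRanges)  # attaches IRanges and S4Vectors as well

# Toy ranges, all on chr1 with unspecified strand.
query   <- GRanges("chr1", IRanges(start = c(1, 10, 13), width = 5))   # 1-5, 10-14, 13-17
subject <- GRanges("chr1", IRanges(start = c(10, 12, 50), width = 5))  # 10-14, 12-16, 50-54

# Equality semantics: only an identical range counts as a match.
eq_idx  <- match(query, subject)         # NA 1 NA
eq_in   <- query %in% subject            # FALSE TRUE FALSE
n_equal <- countMatches(query, subject)  # 0 1 0

# Overlap semantics have to be requested explicitly.
ov_any   <- overlapsAny(query, subject)                      # FALSE TRUE TRUE
ov_first <- findOverlaps(query, subject, select = "first")   # NA 1 1
```

The third query range (13-17) is the interesting one: it equals nothing in 'subject' but overlaps two of its ranges, so the two notions of "match" disagree on it.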
written 4.9 years ago by Hervé Pagès
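The convenience operators floated in this thread (%ov%, %ol%, %overlaps%) could be sketched as thin wrappers around overlapsAny(). The version below is an assumption-laden illustration, not what the packages ship: `%ov%` is defined locally, and the toy 'peaks' and 'genes' objects are invented. Note that findOverlaps(..., select='all') returns a Hits object rather than one value per query element, so a per-element test is more naturally written with overlapsAny() or with countOverlaps(...) > 0.

```r
library(GenomicRanges)

# Locally defined for illustration only; not taken from the packages.
`%ov%` <- function(query, subject) overlapsAny(query, subject)

peaks <- GRanges("chr1", IRanges(start = c(5, 40, 200), width = 10))  # 5-14, 40-49, 200-209
genes <- GRanges("chr1", IRanges(start = c(1, 48), end = c(45, 60)))  # 1-45, 48-60

any_hit    <- peaks %ov% genes                            # TRUE TRUE FALSE
within_hit <- overlapsAny(peaks, genes, type = "within")  # TRUE FALSE FALSE

# The per-element results agree with the countOverlaps() formulation.
agree_any    <- all(any_hit    == (countOverlaps(peaks, genes, type = "any") > 0))
agree_within <- all(within_hit == (countOverlaps(peaks, genes, type = "within") > 0))
```

The second peak (40-49) straddles the gap between the two genes: it overlaps both but lies within neither, which is exactly the "peaks in genes" ambiguity raised in the discussion.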
.Hi Tim,
.
.I could add the %ov% operator as a replacement for the old %in%. So you
.would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just
.be a convenience wrapper for 'overlapsAny(peaks, genes)'.

[cloak off]

Herve, I think this is the BEST course, and except for one letter, is what I hoped I meant back when I wrote:

> If so, may I suggest that Herve stay the course, with the addition of
> '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L, minoverlap=1L, type='any', select='all') > 0'

Stay the course, captain.

[cloak on]

.
.Cheers,
.H.
.
.On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:
.> So why not leave %in% as it was and transition everything forward to
.> explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%` }
.> such that
.>
.>   identical( x %within% table, countOverlaps(x, table, type='within') > 0 ) == TRUE
.>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
.>   identical( x %equals% table, countOverlaps(x, table, type='equal') > 0 ) == TRUE
.>
.> and for the time being,
.>
.>   identical( x %overlaps% table, countOverlaps(x, table, type='any') > 0 ) == TRUE
.>   ## but with a noisy nastygram that will halt if options("warn"=2)
.>
.> No breakage for %in% methods until such time as a full deprecation cycle
.> has passed, and if the maintainers can't be arsed to do anything at all
.> about the warnings by the second full release, then perhaps they don't
.> really care that much after all. Just a thought?
.>
.> From someone (me) who has their own issues with keeping everything up
.> to date and should know better. If you want to use %in% for
.>
.>   peaks %in% genes (why on earth would you do this rather than
.>   peaks %in% promoters(genes), anyways?)
.>
.> then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS
.> DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more
.> or less) happy.
.> .> .> .> On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence .> <lawrence.michael at="" gene.com="" <mailto:lawrence.michael="" at="" gene.com="">> wrote: .> .> .> .> .> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages at="" fhcrc.org="" .=""> <mailto:hpages at="" fhcrc.org="">> wrote: .> .> Hi Michael, .> .> I don't think "match" (the word) always has to mean "equality" .> either. .> However having match() (the function) do "whole exact matching" (aka .> "equality") for any kind of vector-like object has the advantage of: .> .> (a) making it consistent with base::match() (?base::match is .> pretty .> explicit about what the contract of match() is) .> .> .> (a) alone is obviously not enough. We have many methods, like the .> set operations, that treat ranges specially. Are we going to start .> moving everything toward the base behavior? And have rangeIntersect, .> rangeSetdiff, etc? .> .> (b) preserving its relationship with ==, duplicated(), unique(), .> etc... .> .> .> So it becomes consistent with duplicated/unique, but we lose .> consistency with the set operations. .> .> (c) not frustrating the user who needs something to do exact .> matching on ranges (as I mentioned previously, if you take .> match() away from him/her, s/he'll be left with nothing). .> .> .> No one has ever asked for match() to behave this way. There was a .> request for a way to tabulate identical ranges. It was a nice idea .> to extract the general "outer equal" findMatches function. But the .> changes seem to be snow-balling. These types of changes mean a lot .> of maintenance work for the users. A deprecation cycle does not .> circumvent that. .> .> .> IMO those advantages counterbalance *by far* the very little .> convenience you get from having 'match(query, subject)' do .> 'findOverlaps(query, subject, select="first")' on .> IRanges/GRanges objects. If you need to do that, just use the .> latter, or, if you think that's still too much typing, define .> a wrapper e.g. 
'ovmatch(query, subject)'. .> .> There are plenty of specialized tools around for doing .> inexact/fuzzy/partial/overlap matching for many particular types .> of vector-like objects: grep() and family, pmatch(), charmatch(), .> agrep(), grepRaw(), matchPattern() and family, findOverlaps() and .> family, findIntervals(), etc... For the reasons I mentioned .> above, none of them should hijack match() to make it do some .> particular type of inexact matching on some particular type of .> objects. Even if, for that particular type of objects, doing that .> particular type of inexact matching is more common than doing .> exact matching. .> .> H. .> .> .> .> On 01/06/2013 05:39 PM, Michael Lawrence wrote: .> .> I think having overlapsAny is a nice addition and helps make .> the API .> more complete and explicit. Are you sure we need to change .> the behavior .> of the match method for this relatively uncommon use case? .> .> .> Yes because otherwise users with a use case of doing match() .> .> even if it's uncommon, .> .> .> I don't think .> "match" always has to mean "equality". It is a more general .> concept in .> my mind. The most common use case for matching ranges is .> overlap. .> .> .> Of course "match" doesn't always have to mean equality. But of base .> .> .> Michael .> .> .> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès .> <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> .> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> wrote: .> .> Yes 'peaks %in% genes' is cute and was probably doing .> the right thing .> for most users (although not all). But 'exons %in% .> genes' is cute too .> and was probably doing the wrong thing for all users. 
.> Advanced users .> like you guys would have no problem switching to .> .> !is.na <http: is.na=""> .> <http: is.na="">(findOverlaps(__peaks, genes, type="within", .> select="any")) .> .> or .> .> !is.na <http: is.na=""> .> <http: is.na="">(findOverlaps(__peaks, genes, type="equal", .> .> select="any")) .> .> in case 'peaks %in% genes' was not doing exactly what .> you wanted, .> but most users would not find this particularly .> friendly. Even .> worse, some users probably didn't realize that 'peaks .> %in% genes' .> was not doing exactly what they thought it did because .> "peaks in .> genes" in English suggests that the peaks are within .> the genes, .> but it's not what 'peaks %in% genes' does. .> .> Having overlapsAny(), with exactly the same extra .> arguments as .> countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', .> 'minoverlap', .> 'type', 'ignore.strand'), all of them documented (and .> with most .> users more or less familiar with them already) has the .> virtue to .> expose the user to all the options from the very start, .> and to .> help him/her make the right choice. Of course there .> will be users .> that don't want or don't have the time to read/think .> about all the .> options. Not a big deal: they'll just do .> 'overlapsAny(query, subject)', .> which is not a lot more typing than 'query %in% .> subject', especially .> if they use tab completion. .> .> It's true that it's more common to ask questions about .> overlap than .> about equality but there are some use cases for the .> latter (as the .> original thread shows). Until now, when you had such a .> use case, you .> could not use match() or %in%, which would have been .> the natural things .> to use, because they got hijacked to do something else, .> and you were .> left with nothing. Not a satisfying situation. So at a .> minimum, we .> needed to restore the true/real/original semantic of .> match() to do .> "equality" instead of "overlap". 
But it's hard to do .> this for match() .> and not do it for %in% too. For more than 99% of R .> users, %in% is .> just a simple wrapper for 'match(x, table, nomatch = 0) .> > 0' (this .> is how it has been documented and implemented in base R .> for many .> years). Not maintaining this relationship between %in% .> and match() .> would only cause grief and frustration to newcomers to .> Bioconductor. .> .> H. .> .> .> .> On 01/04/2013 03:32 PM, Cook, Malcolm wrote: .> .> Hiya again, .> .> I am definitely a late comer to BioC, so I .> definitely easily .> defer to .> the tide of history. .> .> But I do think you miss my point Michael about the .> proposed change .> making the relationship between %in% and match for .> {G,I}Ranges{List} .> mimic that between other vectors, and I do think .> that changing .> the API .> would make other late-comers take to BioC .> easier/faster. .> .> That said, I NEVER use %in% so I really have no .> stake in the .> matter, and .> I DEFINITELY appreciate the argument to not .> changing the API .> just for .> sematic sweetness. .> .> That that said, Herve is _/so good/_ about .> deprecations and warnings .> .> that make such changes fairly easily digestible. .> .> That that that.... enough.... I bow out of this .> one....!!!! .> .> Always learning and Happy New Year to all lurkers, .> .> ~Malcolm .> .> *From:*Michael Lawrence .> [mailto:lawrence.michael at gene. 
.> <mailto:lawrence.michael at="" gene.="">____com .> .> <mailto:lawrence.michael at="" gene.__com="" .=""> <mailto:lawrence.michael at="" gene.com="">>] .> *Sent:* Friday, January 04, 2013 5:11 PM .> *To:* Cook, Malcolm .> *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès .> (hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> .> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>); Tim .> .> .> Triche, Jr.; Vedran Franke; .> bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> .> <mailto:bioconductor at="" r-__project.org="" .=""> <mailto:bioconductor at="" r-project.org="">> .> *Subject:* Re: [BioC] countMatches() (was: table .> for GenomicRanges) .> .> .> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm .> <mec at="" stowers.org="" <mailto:mec="" at="" stowers.org=""> .> <mailto:mec at="" stowers.org="" <mailto:mec="" at="" stowers.org="">> .> <mailto:mec at="" stowers.org="" <mailto:mec="" at="" stowers.org=""> .> <mailto:mec at="" stowers.org="" <mailto:mec="" at="" stowers.org="">>>> wrote: .> .> Hiya, .> .> For what it is worth... .> .> I think the change to %in% is warranted. .> .> If I understand correctly, this change restores the .> relationship .> between .> the semantics of `%in` and the semantics of `match`. .> .> From the docs: .> .> '"%in%" <- function(x, table) match(x, table, .> nomatch = 0) > 0' .> .> Herve's change restores this relationship. .> .> .> match and %in% were initially consistent (both .> considering any .> overlap); .> Herve has changed both of them together. The whole .> idea behind .> IRanges .> is that ranges are special data types with special .> semantics. We .> have .> reimplemented much of the existing R vector API .> using those .> semantics; .> this extends beyond match/%in%. I am hesitant about .> making such .> sweeping .> changes to the API so late in the life-cycle of the .> package. .> There was a .> feature request for a way to count identical ranges .> in a set of .> ranges. 
.> Let's please not get carried away and start .> redesigning the API .> for this .> one, albeit useful, request. There are all sorts of .> inconsistencies in .> the API, and many of them were conscious decisions .> that considered .> practical use cases. .> .> Michael .> .> .> Herve, I suspect you were you as a result able to .> completely drop .> all the `%in%,BiocClass1,BiocClass2` .> definitions and depend .> upon .> base::%in% .> .> Am I right? .> .> If so, may I suggest that Herve stay the .> course, with the .> addition of .> '"%ol%" <- function(a, b) findOverlaps(a, .> b, maxgap=0L, .> minoverlap=1L, type='any', select='all') > 0' .> .> This would provide a perspicacious idiom, thereby .> optimizing the API .> for Michaels observed common use case. .> .> Just sayin' .> .> ~Malcolm .> .> .> .-----Original Message----- .> .From: .> bioconductor-bounces at r-____project.org .> <mailto:bioconductor-bounces at="" r-__project.org=""> .> <mailto:bioconductor-bounces at="" __r-="" project.org="" .=""> <mailto:bioconductor-bounces at="" r-project.org="">> .> <mailto:bioconductor-bounces@ .=""> <mailto:bioconductor-bounces@>____r-project.org .> <http: r-project.org=""> .> <mailto:bioconductor-bounces at="" __r-="" project.org="" .=""> <mailto:bioconductor-bounces at="" r-project.org="">>> .> [mailto:bioconductor-bounces@ .> <mailto:bioconductor-bounces@>____r-project.org .> <http: r-project.org=""> .> <mailto:bioconductor-bounces at="" __r-="" project.org="" .=""> <mailto:bioconductor-bounces at="" r-project.org="">> .> .> <mailto:bioconductor-bounces@ .=""> <mailto:bioconductor-bounces@>____r-project.org .> <http: r-project.org=""> .> .> <mailto:bioconductor-bounces at="" __r-="" project.org="" .=""> <mailto:bioconductor-bounces at="" r-project.org="">>>] On Behalf Of Sean .> Davis .> .Sent: Friday, January 04, 2013 3:37 PM .> .To: Michael Lawrence .> .Cc: Tim Triche, Jr.; Vedran Franke; .> bioconductor at r-project.org .> <mailto:bioconductor at="" 
r-project.org=""> .> <mailto:bioconductor at="" r-__project.org="" .=""> <mailto:bioconductor at="" r-project.org="">> .> <mailto:bioconductor at="" r-____project.org="" .=""> <mailto:bioconductor at="" r-__project.org=""> .> .> <mailto:bioconductor at="" r-__project.org="" .=""> <mailto:bioconductor at="" r-project.org="">>> .> .> .Subject: Re: [BioC] countMatches() (was: .> table for .> GenomicRanges) .> . .> .On Fri, Jan 4, 2013 at 4:32 PM, Michael .> Lawrence .> .<lawrence.michael at="" gene.com="" .=""> <mailto:lawrence.michael at="" gene.com=""> .> <mailto:lawrence.michael at="" gene.__com="" .=""> <mailto:lawrence.michael at="" gene.com="">> .> <mailto:lawrence.michael at="" gene.="" .=""> <mailto:lawrence.michael at="" gene.="">____com .> .> <mailto:lawrence.michael at="" gene.__com="" .=""> <mailto:lawrence.michael at="" gene.com="">>>> wrote: .> .> The change to the behavior of %in% is a .> pretty big .> one. Are you .> thinking .> .> that all set-based operations should .> behave this way? For .> example, setdiff .> .> and intersect? I really liked the syntax .> of "peaks .> %in% genes". .> In my .> .> experience, it's way more common to ask .> questions .> about overlap .> than about .> .> equality, so I'd rather optimize the API .> for that use .> case. But .> again, .> .> that's just my personal bias. .> . .> .For what it is worth, I share Michael's .> personal bias here. .> . .> .Sean .> . .> . .> .> Michael .> .> .> .> .> .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès .> <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> .> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> .> <mailto:hpages at="" fhcrc.org="" .=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="" .=""> <mailto:hpages at="" fhcrc.org="">>>> wrote: .> .> .> .>> Hi, .> .>> .> .>> I added findMatches() and countMatches() .> to the .> latest IRanges / .> .>> GenomicRanges packages (in BioC devel only). 
.> .>> .> .>> findMatches(x, table): An enhanced .> version of .> 'match' that .> .>> returns all the matches in a .> Hits object. .> .>> .> .>> countMatches(x, table): Returns an .> integer vector .> of the length .> .>> of 'x', containing the number .> of matches in .> 'table' for .> .>> each element in 'x'. .> .>> .> .> .>> countMatches() is what you can use to .> tally/count/tabulate .> (choose your .> .> .>> preferred term) the unique elements in a .> GRanges object: .> .>> .> .>> library(GenomicRanges) .> .>> set.seed(33) .> .>> gr <- GRanges("chr1", .> IRanges(sample(15,20,replace=*____*TRUE), .> .> width=5)) .> .>> .> .>> Then: .> .>> .> .>> > gr_levels <- sort(unique(gr)) .> .>> > countMatches(gr_levels, gr) .> .>> [1] 1 1 1 2 4 2 2 1 2 2 2 .> .>> .> .>> Note that findMatches() and .> countMatches() also work on .> IRanges and .> .>> DNAStringSet objects, as well as on .> ordinary atomic .> vectors: .> .>> .> .>> library(hgu95av2probe) .> .>> library(Biostrings) .> .>> probes <- DNAStringSet(hgu95av2probe) .> .>> unique_probes <- unique(probes) .> .>> count <- countMatches(unique_probes, .> probes) .> .>> max(count) # 7 .> .>> .> .>> I made other changes in .> IRanges/GenomicRanges so that .> the notion .> .>> of "match" between elements of a .> vector-like object now .> consistently .> .>> means "equality" instead of "overlap", .> even for .> range-based .> objects .> .>> like IRanges or GRanges objects. This .> notion of .> "equality" is the .> .>> same that is used by ==. The most .> visible consequence .> of those .> .>> changes is that using %in% between 2 .> IRanges or .> GRanges objects .> .>> 'query' and 'subject' in order to do .> overlaps was .> replaced by .> .>> overlapsAny(query, subject). .> .>> .> .>> overlapsAny(query, subject): Finds the .> ranges in .> 'query' that .> .>> overlap any of the ranges in 'subject'. .> .>> .> .> .>> There are warnings and deprecation .> messages in place .> to help .> smooth .> .> .>> the transition. 
.> .>> .> .>> Cheers, .> .>> H. .> .>> .> .>> -- .> .>> Hervé Pagès .> .>> .> .>> Program in Computational Biology .> .>> Division of Public Health Sciences .> .>> Fred Hutchinson Cancer Research Center .> .>> 1100 Fairview Ave. N, M1-B514 .> .>> P.O. Box 19024 .> .>> Seattle, WA 98109-1024 .> .>> .> .>> E-mail: hpages at fhcrc.org .> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="" .=""> <mailto:hpages at="" fhcrc.org="">> .> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> .> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> .> .> .>> Phone: (206) 667-5791 .> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> .> <tel:%28206%29%20667-5791> .> .>> Fax: (206) 667-1319 .> <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319> .> <tel:%28206%29%20667-1319> .> .> .>> .> .> .> .> [[alternative HTML version deleted]] .> .> .> .> .> .> .> ___________________________________________________ .> .> .> Bioconductor mailing list .> .> Bioconductor at r-project.org .> <mailto:bioconductor at="" r-project.org=""> .> <mailto:bioconductor at="" r-__project.org="" .=""> <mailto:bioconductor at="" r-project.org="">> .> <mailto:bioconductor at="" r-____project.org="" .=""> <mailto:bioconductor at="" r-__project.org=""> .> <mailto:bioconductor at="" r-__project.org="" .=""> <mailto:bioconductor at="" r-project.org="">>> .> .> .> .> https://stat.ethz.ch/mailman/____listinfo/bioconductor .> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> .> .> .> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="" .=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> .> .> Search the archives: .> http://news.gmane.org/gmane.____science.biology.inform atics.____conductor .> <http: news.gmane.org="" gmane.__science.biology.informa="" tics.__conductor=""> .> .> <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="" .=""> <http: news.gmane.org="" 
gmane.science.biology.informatics.conductor="">> .> . .> .> .___________________________________________________ .> .> .Bioconductor mailing list .> .Bioconductor at r-project.org .> <mailto:bioconductor at="" r-project.org=""> .> <mailto:bioconductor at="" r-__project.org="" .=""> <mailto:bioconductor at="" r-project.org="">> .> <mailto:bioconductor at="" r-____project.org="" .=""> <mailto:bioconductor at="" r-__project.org=""> .> <mailto:bioconductor at="" r-__project.org="" .=""> <mailto:bioconductor at="" r-project.org="">>> .> .> .> .https://stat.ethz.ch/mailman/____listinfo/bioconductor .> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> .> .> .> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="" .=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> .> .Search the archives: .> http://news.gmane.org/gmane.____science.biology.inform atics.____conductor .> <http: news.gmane.org="" gmane.__science.biology.informa="" tics.__conductor=""> .> .> .> <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="" .=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> .> .> .> -- .> Hervé Pagès .> .> Program in Computational Biology .> Division of Public Health Sciences .> Fred Hutchinson Cancer Research Center .> 1100 Fairview Ave. N, M1-B514 .> P.O. Box 19024 .> Seattle, WA 98109-1024 .> .> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> .> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> .> .> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> .> <tel:%28206%29%20667-5791> .> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> .> <tel:%28206%29%20667-1319> .> .> .> .> -- .> Hervé Pagès .> .> Program in Computational Biology .> Division of Public Health Sciences .> Fred Hutchinson Cancer Research Center .> 1100 Fairview Ave. N, M1-B514 .> P.O. 
Box 19024 .> Seattle, WA 98109-1024 .> .> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> .> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> .> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> .> .> .> .> .> .> -- .> /A model is a lie that helps you see the truth./ .> / .> / .> Howard Skipper .> <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> . .-- .Hervé Pagès . .Program in Computational Biology .Division of Public Health Sciences .Fred Hutchinson Cancer Research Center .1100 Fairview Ave. N, M1-B514 .P.O. Box 19024 .Seattle, WA 98109-1024 . .E-mail: hpages at fhcrc.org .Phone: (206) 667-5791 .Fax: (206) 667-1319
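The convenience wrapper discussed in this post can be sketched roughly as below. This is only an illustration, assuming a GenomicRanges version that provides overlapsAny(); the operator name `%ov%` is the one floated in the thread, and the toy `peaks` and `genes` ranges are made up for the example.

```r
library(GenomicRanges)

# Thin wrapper proposed in the thread: 'peaks %ov% genes' is just
# shorthand for overlapsAny(peaks, genes) with its default arguments.
`%ov%` <- function(query, subject) overlapsAny(query, subject)

# Made-up toy data (not from the thread).
peaks <- GRanges("chr1", IRanges(c(1, 50, 200), width = 10))  # 1-10, 50-59, 200-209
genes <- GRanges("chr1", IRanges(c(5, 300), width = 100))     # 5-104, 300-399

peaks %ov% genes
# [1]  TRUE  TRUE FALSE

# The explicit function keeps the other overlap types one argument away,
# e.g. peaks lying entirely inside a gene:
overlapsAny(peaks, genes, type = "within")
# [1] FALSE  TRUE FALSE
```

The second call shows why the explicit form was argued for: "any overlap" and "within" give different answers for the first peak, and an operator hides that choice.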
written 4.9 years ago by Malcolm Cook 1.4k
On 01/07/2013 11:33 AM, Michael Lawrence wrote:
> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>     Hi Michael,
>
>     I don't think "match" (the word) always has to mean "equality" either. However having match() (the function) do "whole exact matching" (aka "equality") for any kind of vector-like object has the advantage of:
>
>     (a) making it consistent with base::match() (?base::match is pretty explicit about what the contract of match() is)
>
> (a) alone is obviously not enough. We have many methods, like the set operations, that treat ranges specially. Are we going to start moving everything toward the base behavior? And have rangeIntersect, rangeSetdiff, etc?
>
>     (b) preserving its relationship with ==, duplicated(), unique(), etc...
>
> So it becomes consistent with duplicated/unique, but we lose consistency with the set operations.

Nope, we don't lose anything, because match()/%in% were NOT consistent with the set operations anyway: 'intersect(x, y)' on IRanges/GRanges objects was not doing 'x[x %in% y]' (%in% here being the old %in%).

>     (c) not frustrating the user who needs something to do exact matching on ranges (as I mentioned previously, if you take match() away from him/her, s/he'll be left with nothing).
>
> No one has ever asked for match() to behave this way.

Here is my use case: internally findMatches()/countMatches() are implemented on top of match(), the fixed match(). They work on any object for which match() works. They would also run on objects for which match() does the wrong thing, but they would then return something wrong. They could be ordinary functions rather than generics (and they will be; for now they need to be generics with methods, just to smooth the transition), because dispatch happens inside the function when match() is called.
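To make the "built on top of match()" point concrete, here is a minimal base-R sketch. It is not the actual IRanges implementation; the helper names `in_sketch` and `count_matches_sketch` are hypothetical and exist only for this illustration. The only assumption is that match() does exact matching on the objects passed in.

```r
# %in% as documented in base R: a thin wrapper around match().
in_sketch <- function(x, table) match(x, table, nomatch = 0L) > 0L

# Hypothetical stand-in for countMatches(): for each element of 'x',
# count how many elements of 'table' are equal to it.
count_matches_sketch <- function(x, table) {
  # Position in 'x' of the first exact match of each 'table' element
  # (NA when there is none).
  hits <- match(table, x)
  # Count the hits per position of 'x'; tabulate() ignores NAs.
  counts <- tabulate(hits, nbins = length(x))
  # match(x, x) maps duplicated elements of 'x' to their first
  # occurrence, so duplicates report the same count.
  counts[match(x, x)]
}

x     <- c("a", "b", "c")
table <- c("b", "a", "b", "d", "b")

in_sketch(x, table)
# [1]  TRUE  TRUE FALSE
count_matches_sketch(x, table)
# [1] 1 3 0
```

Neither helper knows anything about the class of its arguments. All type-specific behaviour comes from whatever match() does for them, which is why a match() method that means "overlap" instead of "equality" would silently change what both helpers return.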
In the man page for those functions I can just say:

    findMatches(x, table): An enhanced version of 'match' that returns all the matches in a Hits object.

and I'm done. It's clear and concise. The implementation/documentation of findMatches()/countMatches() is the typical illustration of why having methods that respect the contract of the generic is a must. The idea is to build on top of some basic building blocks whose behavior is well-defined, consistent, and predictable. It's sooo much easier, and it's very healthy.

> There was a request for a way to tabulate identical ranges. It was a nice idea to extract the general "outer equal" findMatches function.

It's also a nice idea to have findMatches() and countMatches() aligned with match().

> But the changes seem to be snow-balling.

No snow-balling. You cannot snow-ball too far anyway when you restore consistency. But you can easily snow-ball very far when you go in the other direction (there are no limits). Do I need to say that aiming for consistency/predictability is a good goal in software design? It can only make the software *better* in all the meanings of the term: fewer bugs, easier to maintain, easier to document, and easier to use in the long run. Everybody wins. Even if you don't realize it now. Convenience is also important, but less important than consistency/predictability. As a matter of fact, an interesting and not immediately obvious side effect of going consistent is that, in the long run (i.e. when the software becomes bigger and more complex), it also gives you a form of convenience for the end-user: documentation is simpler and easier to read, and there are fewer special cases to remember.

> These types of changes mean a lot of maintenance work for the users. A deprecation cycle does not circumvent that.

I don't see why this change would be more work for the users than any other change.
Making RangedData fade away will certainly be a much bigger one, will take much more time (maybe 2-3 years), and will require a lot more maintenance work from us (mostly me) and from the users.

FWIW, the change to match()/%in% probably means more work for me than for the users. There is a *lot* of stuff I had to put in place in IRanges/GenomicRanges to make this transition smooth. But I truly believe it was worth it. I also fixed all the BioC packages I found that were affected by those changes (surprisingly, there were very few: only 5). I could have missed some. Please let me know if that is the case and I'll fix them too.

Thanks,
H.

>     IMO those advantages counterbalance *by far* the very little convenience you get from having 'match(query, subject)' do 'findOverlaps(query, subject, select="first")' on IRanges/GRanges objects. If you need to do that, just use the latter, or, if you think that's still too much typing, define a wrapper e.g. 'ovmatch(query, subject)'.
>
>     There are plenty of specialized tools around for doing inexact/fuzzy/partial/overlap matching for many particular types of vector-like objects: grep() and family, pmatch(), charmatch(), agrep(), grepRaw(), matchPattern() and family, findOverlaps() and family, findIntervals(), etc... For the reasons I mentioned above, none of them should hijack match() to make it do some particular type of inexact matching on some particular type of objects. Even if, for that particular type of objects, doing that particular type of inexact matching is more common than doing exact matching.
>
>     H.
N, M1-B514 > .>> P.O. Box 19024 > .>> Seattle, WA 98109-1024 > .>> > .>> E-mail: hpages at fhcrc.org > <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> > > .>> Phone: (206) 667-5791 > <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> > <tel:%28206%29%20667-5791> > .>> Fax: (206) 667-1319 > <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319> > <tel:%28206%29%20667-1319> > > .>> > .> > .> [[alternative HTML version deleted]] > .> > .> > .> > ___________________________________________________ > > .> Bioconductor mailing list > .> Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > <mailto:bioconductor at="" r-____project.org=""> <mailto:bioconductor at="" r-__project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>> > > .> > https://stat.ethz.ch/mailman/____listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > .> Search the archives: > http://news.gmane.org/gmane.____science.biology.informatics. ____conductor > <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor=""> > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > . 
> .___________________________________________________ > > .Bioconductor mailing list > .Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > <mailto:bioconductor at="" r-____project.org=""> <mailto:bioconductor at="" r-__project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>> > > > .https://stat.ethz.ch/mailman/____listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > .Search the archives: > http://news.gmane.org/gmane.____science.biology.informatics. ____conductor > <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor=""> > > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > <tel:%28206%29%20667-1319> > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. 
Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
written 4.9 years ago by Hervé Pagès ♦♦ 13k
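The relationship between `%in%` and `match()` that the thread keeps returning to can be checked in base R alone. The sketch below uses only atomic vectors; `in_via_match` and `count_exact` are hypothetical helper names introduced here for illustration, and `count_exact` is just one way to tally exact matches on top of `match()`, not the implementation used by the packages.

```r
# Base R documents %in% as a thin wrapper around match().
in_via_match <- function(x, table) match(x, table, nomatch = 0L) > 0L

x     <- c("a", "b", "z")
table <- c("b", "a", "a", "c")

in_via_match(x, table)   # same answer as x %in% table
match(x, table)          # position of the first exact match, NA if none

# Hypothetical helper: count exact matches per element of x,
# built only on match() (an illustration, not the package code).
count_exact <- function(x, table) {
  ux <- unique(x)
  # tabulate() silently ignores the NAs produced for unmatched table elements
  per_unique <- tabulate(match(table, ux), nbins = length(ux))
  per_unique[match(x, ux)]
}

count_exact(x, table)
```

Because `count_exact` only calls `match()`, it would inherit whatever notion of "match" the method for a given class implements, which is the point made above about respecting the contract of the generic.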
This is basically an argument against incorporating range-based semantics into the R vector API. I always thought it was interesting/cool how IRanges considered ranges to be a special data type, with special semantics. The %in% operator in particular has many fans. But it's hard to argue against consistency with the base R behavior. That point is not lost on me and it drove the design of DataFrame, Rle, etc.

I'm still not sure we even need the findMatches function. There are very few times I've used outer(x, y, "=="). The feature request (and it was a good one) was for tabulating ranges. At some point after so many years one has to acknowledge that the IRanges API has been empirically shown to be reasonable, despite its theoretical inconsistencies. This is why I am resistant to such changes. But maybe I'm just suffering from my own personal biases.

One other point: most of the code using IRanges is in scripts outside of the Bioc repository, so it is easy to underestimate the significance of some changes.

Michael

On Mon, Jan 7, 2013 at 1:46 PM, Hervé Pagès <hpages@fhcrc.org> wrote:

> On 01/07/2013 11:33 AM, Michael Lawrence wrote:
>
>> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages@fhcrc.org> wrote:
>>
>> Hi Michael,
>>
>> I don't think "match" (the word) always has to mean "equality" either.
>> However having match() (the function) do "whole exact matching" (aka
>> "equality") for any kind of vector-like object has the advantage of:
>>
>> (a) making it consistent with base::match() (?base::match is pretty
>> explicit about what the contract of match() is)
>>
>> (a) alone is obviously not enough. We have many methods, like the set
>> operations, that treat ranges specially. Are we going to start moving
>> everything toward the base behavior? And have rangeIntersect,
>> rangeSetdiff, etc?
>>
>> (b) preserving its relationship with ==, duplicated(), unique(),
>> etc...
>>
>> So it becomes consistent with duplicated/unique, but we lose consistency
>> with the set operations.
>
> Nope, we don't lose anything. Because match()/%in% were NOT consistent
> with the set operations anyway, that is, 'intersect(x, y)' on
> IRanges/GRanges objects was not doing 'x[x %in% y]' (%in% here being
> the old %in%).
>
>> (c) not frustrating the user who needs something to do exact
>> matching on ranges (as I mentioned previously, if you take
>> match() away from him/her, s/he'll be left with nothing).
>>
>> No one has ever asked for match() to behave this way.
>
> Here is my use case: internally findMatches()/countMatches() are
> implemented on top of match(), the fixed match(). They work on any
> object for which match() works. They would also work on objects for
> which match() does the wrong thing but they would return something
> wrong. They could be made ordinary functions, not generic (and they
> will, but they temporarily need to be made generics with methods,
> just to smooth the transition), because dispatch happens inside the
> function when match() is called. In the man page for those functions
> I can just say:
>
> findMatches(x, table): An enhanced version of ‘match’ that returns
> all the matches in a Hits object.
>
> and I'm done. It's clear and concise.
>
> The implementation/documentation of findMatches()/countMatches() is
> the typical illustration of why having methods that respect the
> contract of the generic is a must.
>
> The idea is to build on top of some basic building-blocks for which
> the behavior is well-defined, consistent, predictable. It's sooo much
> easier, and it's very healthy.
>
>> There was a
>> request for a way to tabulate identical ranges. It was a nice idea to
>> extract the general "outer equal" findMatches function.
>
> It's also a nice idea to have findMatches() and countMatches() aligned
> with match().
>
>> But the changes seem to be snow-balling.
>
> No snow-balling. You cannot snow-ball too far anyway when you restore
> consistency. But you can easily snow-ball very far when you go in the
> other direction (there are no limits). Do I need to say that aiming for
> consistency/predictability is a good goal in software design? It can
> only make it *better* in all the meanings of the term: fewer bugs,
> easier to maintain, easier to document, and easier to use in the long
> run. Everybody wins. Even if you don't realize it now. Convenience is
> also important, but less important than consistency/predictability.
> As a matter of fact, an interesting and not immediately obvious side
> effect of going consistent is that, in the long run (i.e. when the
> software becomes bigger and more complex), it also gives you a form of
> convenience for the end-user: documentation is simpler and easier to
> read, and there are fewer special cases to remember.
>
>> These types of changes mean a lot of
>> maintenance work for the users. A deprecation cycle does not circumvent
>> that.
>
> I don't see why this change would be more work for the users than any
> other change. Making RangedData fade away will certainly be a much
> bigger one, will take much more time (maybe 2-3 years), and will
> require a lot more maintenance work from us (mostly me) and from
> the users.
>
> FWIW, the change to match()/%in% probably means more work for me than
> for the users. There is a *lot* of stuff I had to put in place in
> IRanges/GenomicRanges to make this transition smooth. But I truly
> believe it was worth it. I also fixed all the BioC packages I found
> that were affected by those changes (surprisingly, there were very
> few: only 5). I could have missed some. Please let me know if that
> is the case and I'll fix them too.
>
> Thanks,
> H.
>
>> IMO those advantages counterbalance *by far* the very little
>> convenience you get from having 'match(query, subject)' do
>> 'findOverlaps(query, subject, select="first")' on
>> IRanges/GRanges objects. If you need to do that, just use the
>> latter, or, if you think that's still too much typing, define
>> a wrapper e.g. 'ovmatch(query, subject)'.
>>
>> There are plenty of specialized tools around for doing
>> inexact/fuzzy/partial/overlap matching for many particular types
>> of vector-like objects: grep() and family, pmatch(), charmatch(),
>> agrep(), grepRaw(), matchPattern() and family, findOverlaps() and
>> family, findInterval(), etc... For the reasons I mentioned
>> above, none of them should hijack match() to make it do some
>> particular type of inexact matching on some particular type of
>> objects. Even if, for that particular type of objects, doing that
>> particular type of inexact matching is more common than doing
>> exact matching.
>>
>> H.
>>
>> On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>>
>> I think having overlapsAny is a nice addition and helps make the API
>> more complete and explicit. Are you sure we need to change the behavior
>> of the match method for this relatively uncommon use case?
>>
>> Yes because otherwise users with a use case of doing match()
>>
>> even if it's uncommon,
>>
>> I don't think "match" always has to mean "equality". It is a more
>> general concept in my mind. The most common use case for matching
>> ranges is overlap.
>>
>> Of course "match" doesn't always have to mean equality. But of base
>>
>> Michael
>>
>> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
>>
>> Yes 'peaks %in% genes' is cute and was probably doing the right thing
>> for most users (although not all). But 'exons %in% genes' is cute too
>> and was probably doing the wrong thing for all users. Advanced users
>> like you guys would have no problem switching to
>>
>> !is.na(findOverlaps(peaks, genes, type="within", select="any"))
>>
>> or
>>
>> !is.na(findOverlaps(peaks, genes, type="equal", select="any"))
>>
>> in case 'peaks %in% genes' was not doing exactly what you wanted,
>> but most users would not find this particularly friendly. Even
>> worse, some users probably didn't realize that 'peaks %in% genes'
>> was not doing exactly what they thought it did because "peaks in
>> genes" in English suggests that the peaks are within the genes,
>> but it's not what 'peaks %in% genes' does.
>>
>> Having overlapsAny(), with exactly the same extra arguments as
>> countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap',
>> 'type', 'ignore.strand'), all of them documented (and with most
>> users more or less familiar with them already) has the virtue to
>> expose the user to all the options from the very start, and to
>> help him/her make the right choice. Of course there will be users
>> that don't want or don't have the time to read/think about all the
>> options. Not a big deal: they'll just do 'overlapsAny(query, subject)',
>> which is not a lot more typing than 'query %in% subject', especially
>> if they use tab completion.
>>
>> It's true that it's more common to ask questions about overlap than
>> about equality but there are some use cases for the latter (as the
>> original thread shows). Until now, when you had such a use case, you
>> could not use match() or %in%, which would have been the natural things
>> to use, because they got hijacked to do something else, and you were
>> left with nothing. Not a satisfying situation. So at a minimum, we
>> needed to restore the true/real/original semantics of match() to do
>> "equality" instead of "overlap".
>> But it's hard to do this for match() and not do it for %in% too.
>> For more than 99% of R users, %in% is just a simple wrapper for
>> 'match(x, table, nomatch = 0) > 0' (this is how it has been
>> documented and implemented in base R for many years). Not maintaining
>> this relationship between %in% and match() would only cause grief and
>> frustration to newcomers to Bioconductor.
>>
>> H.
>>
>> On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>>
>> Hiya again,
>>
>> I am definitely a late comer to BioC, so I definitely easily defer to
>> the tide of history.
>>
>> But I do think you miss my point Michael about the proposed change
>> making the relationship between %in% and match for {G,I}Ranges{List}
>> mimic that between other vectors, and I do think that changing the API
>> would make other late-comers take to BioC easier/faster.
>>
>> That said, I NEVER use %in% so I really have no stake in the matter, and
>> I DEFINITELY appreciate the argument to not changing the API just for
>> semantic sweetness.
>>
>> That that said, Herve is _/so good/_ about deprecations and warnings
>> that make such changes fairly easily digestible.
>>
>> That that that.... enough.... I bow out of this one....!!!!
>>
>> Always learning and Happy New Year to all lurkers,
>>
>> ~Malcolm
>>
>> *From:* Michael Lawrence [mailto:lawrence.michael@gene.com]
>> *Sent:* Friday, January 04, 2013 5:11 PM
>> *To:* Cook, Malcolm
>> *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès (hpages@fhcrc.org);
>> Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
>> *Subject:* Re: [BioC] countMatches() (was: table for GenomicRanges)
>>
>> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec@stowers.org> wrote:
>>
>> Hiya,
>>
>> For what it is worth...
>>
>> I think the change to %in% is warranted.
>>
>> If I understand correctly, this change restores the relationship
>> between the semantics of `%in%` and the semantics of `match`.
>>
>> From the docs:
>>
>> '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>>
>> Herve's change restores this relationship.
>>
>> match and %in% were initially consistent (both considering any
>> overlap); Herve has changed both of them together. The whole idea
>> behind IRanges is that ranges are special data types with special
>> semantics. We have reimplemented much of the existing R vector API
>> using those semantics; this extends beyond match/%in%. I am hesitant
>> about making such sweeping changes to the API so late in the
>> life-cycle of the package. There was a feature request for a way to
>> count identical ranges in a set of ranges. Let's please not get
>> carried away and start redesigning the API for this one, albeit
>> useful, request. There are all sorts of inconsistencies in the API,
>> and many of them were conscious decisions that considered practical
>> use cases.
>>
>> Michael
>>
>> Herve, I suspect you were, as a result, able to completely drop all
>> the `%in%,BiocClass1,BiocClass2` definitions and depend upon
>> base::%in%
>>
>> Am I right?
>>
>> If so, may I suggest that Herve stay the course, with the addition of
>>
>> '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L,
>> minoverlap=1L, type='any', select='all') > 0'
>>
>> This would provide a perspicacious idiom, thereby optimizing the API
>> for Michael's observed common use case.
>>
>> Just sayin'
>>
>> ~Malcolm
>>
>> .-----Original Message-----
>> .From: bioconductor-bounces@r-project.org
>> [mailto:bioconductor-bounces@r-project.org] On Behalf Of Sean Davis
>> .Sent: Friday, January 04, 2013 3:37 PM
>> .To: Michael Lawrence
>> .Cc: Tim Triche, Jr.; Vedran Franke; bioconductor@r-project.org
>> .Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>> .
>> .On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence
>> .<lawrence.michael@gene.com> wrote:
>> .> The change to the behavior of %in% is a pretty big one. Are you thinking
>> .> that all set-based operations should behave this way? For example, setdiff
>> .> and intersect? I really liked the syntax of "peaks %in% genes". In my
>> .> experience, it's way more common to ask questions about overlap than about
>> .> equality, so I'd rather optimize the API for that use case. But again,
>> .> that's just my personal bias.
>> .
>> .For what it is worth, I share Michael's personal bias here.
>> .
>> .Sean
>> .
>> .> Michael
>> .>
>> .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
>> .>
>> .>> Hi,
>> .>>
>> .>> I added findMatches() and countMatches() to the latest IRanges /
>> .>> GenomicRanges packages (in BioC devel only).
>> .>>
>> .>> findMatches(x, table): An enhanced version of ‘match’ that
>> .>> returns all the matches in a Hits object.
>> .>>
>> .>> countMatches(x, table): Returns an integer vector of the length
>> .>> of ‘x’, containing the number of matches in ‘table’ for
>> .>> each element in ‘x’.
>> .>>
>> .>> countMatches() is what you can use to tally/count/tabulate (choose your
>> .>> preferred term) the unique elements in a GRanges object:
>> .>>
>> .>> library(GenomicRanges)
>> .>> set.seed(33)
>> .>> gr <- GRanges("chr1", IRanges(sample(15,20,replace=TRUE), width=5))
>> .>>
>> .>> Then:
>> .>>
>> .>> > gr_levels <- sort(unique(gr))
>> .>> > countMatches(gr_levels, gr)
>> .>> [1] 1 1 1 2 4 2 2 1 2 2 2
>> .>>
>> .>> Note that findMatches() and countMatches() also work on IRanges and
>> .>> DNAStringSet objects, as well as on ordinary atomic vectors:
>> .>>
>> .>> library(hgu95av2probe)
>> .>> library(Biostrings)
>> .>> probes <- DNAStringSet(hgu95av2probe)
>> .>> unique_probes <- unique(probes)
>> .>> count <- countMatches(unique_probes, probes)
>> .>> max(count) # 7
>> .>>
>> .>> I made other changes in IRanges/GenomicRanges so that the notion
>> .>> of "match" between elements of a vector-like object now consistently
>> .>> means "equality" instead of "overlap", even for range-based objects
>> .>> like IRanges or GRanges objects. This notion of "equality" is the
>> .>> same that is used by ==. The most visible consequence of those
>> .>> changes is that using %in% between 2 IRanges or GRanges objects
>> .>> 'query' and 'subject' in order to do overlaps was replaced by
>> .>> overlapsAny(query, subject).
>> .>>
>> .>> overlapsAny(query, subject): Finds the ranges in ‘query’ that
>> .>> overlap any of the ranges in ‘subject’.
>> .>>
>> .>> There are warnings and deprecation messages in place to help smooth
>> .>> the transition.
>> .>>
>> .>> Cheers,
>> .>> H.
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages@fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
written 4.9 years ago by Michael Lawrence 9.8k
On 01/07/2013 04:20 PM, Michael Lawrence wrote:
> This is basically an argument against incorporating range-based semantics into the R vector API. I always thought it was interesting/cool how IRanges considered ranges to be a special data type, with special semantics. The %in% operator in particular has many fans. But it's hard to argue against consistency with the base R behavior. That point is not lost on me and it drove the design of DataFrame, Rle, etc.
>
> I'm still not sure we even need the findMatches function. There are very few times I've used outer(x, y, "=="). The feature request (and it was a good one) was for tabulating ranges.

which you can do with countMatches(). I've added findMatches() for completeness (as the natural companion of countMatches()), and I'm not charging extra money for this. So we have a nice parallel between findMatches()/countMatches() on one side (for doing exact matching), and findOverlaps()/countOverlaps() on the other side (for doing overlaps).

> At some point after so many years one has to acknowledge that the IRanges API has been empirically shown to be reasonable, despite its theoretical inconsistencies. This is why I am resistant to such changes. But maybe I'm just suffering from my own personal biases.
>
> One other point: most of the code using IRanges is in scripts outside of the Bioc repository, so it is easy to underestimate the significance of some changes.

or to overestimate it? Is it unreasonable to assume that the level of usage in the Bioc repository reflects the amount of usage outside of it? And I forgot to mention that, in addition to having only 5 packages to fix in the repo, fixing them couldn't have been easier.

H.
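For readers who want to see the tabulation idea on an ordinary vector: the sketch below tallies exact matches using nothing but base::match() and tabulate(). `count_matches` is a made-up name for illustration and is not the code in IRanges; it only assumes that match() does exact matching on the input.

```r
# Hypothetical stand-in for countMatches(), built only on base::match().
count_matches <- function(x, table) {
  # For each element of 'table', index of its first exact match in 'x' (NA if none).
  first_hit <- match(table, x)
  # Tally the hits per position of 'x'; tabulate() silently ignores NAs.
  counts <- tabulate(first_hit, nbins = length(x))
  # Duplicated elements of 'x' share the count of their first occurrence.
  counts[match(x, x)]
}

# Same toy setup as in the original post, but on plain integer start positions.
set.seed(33)
starts <- sample(15, 20, replace = TRUE)
lev <- sort(unique(starts))
tally <- count_matches(lev, starts)
```

Because dispatch would happen inside match(), the same body should work for any class with a match() method that does exact matching, which is the design point being argued in this thread.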
> Michael
>
> On Mon, Jan 7, 2013 at 1:46 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>> On 01/07/2013 11:33 AM, Michael Lawrence wrote:
>>
>>> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> I don't think "match" (the word) always has to mean "equality" either. However having match() (the function) do "whole exact matching" (aka "equality") for any kind of vector-like object has the advantage of:
>>>>
>>>> (a) making it consistent with base::match() (?base::match is pretty explicit about what the contract of match() is)
>>>
>>> (a) alone is obviously not enough. We have many methods, like the set operations, that treat ranges specially. Are we going to start moving everything toward the base behavior? And have rangeIntersect, rangeSetdiff, etc?
>>>
>>>> (b) preserving its relationship with ==, duplicated(), unique(), etc...
>>>
>>> So it becomes consistent with duplicated/unique, but we lose consistency with the set operations.
>>
>> Nope, we don't lose anything. Because match()/%in% were NOT consistent with the set operations anyway, that is, 'intersect(x, y)' on IRanges/GRanges objects was not doing 'x[x %in% y]' (%in% here being the old %in%).
>>
>>>> (c) not frustrating the user who needs something to do exact matching on ranges (as I mentioned previously, if you take match() away from him/her, s/he'll be left with nothing).
>>>
>>> No one has ever asked for match() to behave this way.
>>
>> Here is my use case: internally findMatches()/countMatches() are implemented on top of match(), the fixed match(). They work on any object for which match() works. They would also work on objects for which match() does the wrong thing but they would return something wrong. They could be made ordinary functions, not generic (and they will, but they temporarily need to be made generics with methods, just to smooth the transition), because dispatch happens inside the function when match() is called. In the man page for those functions I can just say:
>>
>>   findMatches(x, table): An enhanced version of 'match' that returns all the matches in a Hits object.
>>
>> and I'm done. It's clear and concise.
>>
>> The implementation/documentation of findMatches()/countMatches() is the typical illustration of why having methods that respect the contract of the generic is a must.
>>
>> The idea is to build on top of some basic building-blocks for which the behavior is well-defined, consistent, predictable. It's sooo much easier, and it's very healthy.
>>
>>> There was a request for a way to tabulate identical ranges. It was a nice idea to extract the general "outer equal" findMatches function.
>>
>> It's also a nice idea to have findMatches() and countMatches() aligned with match().
>>
>>> But the changes seem to be snow-balling.
>>
>> No snow-balling. You cannot snow-ball too far anyway when you restore consistency. But you can easily snow-ball very far when you go in the other direction (there are no limits). Do I need to say that aiming for consistency/predictability is a good goal in software design? It can only make it *better* in all the meanings of the term: fewer bugs, easier to maintain, easier to document, and easier to use in the long run. Everybody wins. Even if you don't realize it now. Convenience is also important, but less important than consistency/predictability. As a matter of fact, an interesting and not immediately obvious side effect of going consistent is that, in the long run (i.e. when the software becomes bigger and more complex), it also gives you a form of convenience for the end-user: documentation is simpler and easier to read, and there are fewer special cases to remember.
>>
>>> These types of changes mean a lot of maintenance work for the users. A deprecation cycle does not circumvent that.
>>
>> I don't see why this change would be more work for the users than any other change. Making RangedData fade away will certainly be a much bigger one, will take much more time (maybe 2-3 years), and will require a lot more maintenance work from us (mostly me) and from the users.
>>
>> FWIW, the change to match()/%in% probably means more work for me than for the users. There is a *lot* of stuff I had to put in place in IRanges/GenomicRanges to make this transition smooth. But I truly believe it was worth it. I also fixed all the BioC packages I found that were affected by those changes (surprisingly, there were very few: only 5). I could have missed some. Please let me know if that is the case and I'll fix them too.
>>
>> Thanks,
>> H.
>>
>>>> IMO those advantages counterbalance *by far* the very little convenience you get from having 'match(query, subject)' do 'findOverlaps(query, subject, select="first")' on IRanges/GRanges objects. If you need to do that, just use the latter, or, if you think that's still too much typing, define a wrapper e.g. 'ovmatch(query, subject)'.
>>>>
>>>> There are plenty of specialized tools around for doing inexact/fuzzy/partial/overlap matching for many particular types of vector-like objects: grep() and family, pmatch(), charmatch(), agrep(), grepRaw(), matchPattern() and family, findOverlaps() and family, findInterval(), etc... For the reasons I mentioned above, none of them should hijack match() to make it do some particular type of inexact matching on some particular type of objects. Even if, for that particular type of objects, doing that particular type of inexact matching is more common than doing exact matching.
>>>>
>>>> H.
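To illustrate the "build on top of match()" point quoted above, here is a rough base R sketch that returns every (query, subject) pair of exact matches. `find_matches` and the data frame with `queryHits`/`subjectHits` columns are assumptions made for this example; the real findMatches() returns a Hits object, and this is not its implementation.

```r
# Hypothetical sketch: all exact matches between 'x' and 'table',
# relying only on base::match() for the notion of equality.
find_matches <- function(x, table) {
  # Position in 'x' of the first element equal to each element of 'table'.
  tb_idx <- match(table, x)
  # Group the positions of 'table' by the element of 'x' they match;
  # unmatched (NA) entries are dropped by split().
  by_x <- split(seq_along(table), factor(tb_idx, levels = seq_along(x)))
  # Duplicated elements of 'x' reuse the hits of their first occurrence.
  hits <- unname(by_x[match(x, x)])
  data.frame(
    queryHits   = rep(seq_along(x), lengths(hits)),
    subjectHits = as.integer(unlist(hits, use.names = FALSE))
  )
}

pairs <- find_matches(c("a", "b", "c", "a"), c("a", "a", "b", "d"))
```

Counting the rows per `queryHits` value gives the same tallies that a countMatches()-style function would return, which is why the two are natural companions.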
>>>> On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>>>>
>>>>> I think having overlapsAny is a nice addition and helps make the API more complete and explicit. Are you sure we need to change the behavior of the match method for this relatively uncommon use case?
>>>>
>>>> Yes because otherwise users with a use case of doing match() even if it's uncommon,
>>>>
>>>>> I don't think "match" always has to mean "equality". It is a more general concept in my mind. The most common use case for matching ranges is overlap.
>>>>
>>>> Of course "match" doesn't always have to mean equality. But of base
>>>>
>>>>> Michael
>>>>>
>>>>> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>>
>>>>>> Yes 'peaks %in% genes' is cute and was probably doing the right thing for most users (although not all). But 'exons %in% genes' is cute too and was probably doing the wrong thing for all users. Advanced users like you guys would have no problem switching to
>>>>>>
>>>>>>   !is.na(findOverlaps(peaks, genes, type="within", select="any"))
>>>>>>
>>>>>> or
>>>>>>
>>>>>>   !is.na(findOverlaps(peaks, genes, type="equal", select="any"))
>>>>>>
>>>>>> in case 'peaks %in% genes' was not doing exactly what you wanted, but most users would not find this particularly friendly. Even worse, some users probably didn't realize that 'peaks %in% genes' was not doing exactly what they thought it did because "peaks in genes" in English suggests that the peaks are within the genes, but it's not what 'peaks %in% genes' does.
>>>>>>
>>>>>> Having overlapsAny(), with exactly the same extra arguments as countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap', 'type', 'ignore.strand'), all of them documented (and with most users more or less familiar with them already) has the virtue of exposing the user to all the options from the very start, and of helping him/her make the right choice. Of course there will be users that don't want or don't have the time to read/think about all the options. Not a big deal: they'll just do 'overlapsAny(query, subject)', which is not a lot more typing than 'query %in% subject', especially if they use tab completion.
>>>>>>
>>>>>> It's true that it's more common to ask questions about overlap than about equality but there are some use cases for the latter (as the original thread shows). Until now, when you had such a use case, you could not use match() or %in%, which would have been the natural things to use, because they got hijacked to do something else, and you were left with nothing. Not a satisfying situation. So at a minimum, we needed to restore the true/real/original semantic of match() to do "equality" instead of "overlap". But it's hard to do this for match() and not do it for %in% too. For more than 99% of R users, %in% is just a simple wrapper for 'match(x, table, nomatch = 0) > 0' (this is how it has been documented and implemented in base R for many years). Not maintaining this relationship between %in% and match() would only cause grief and frustration to newcomers to Bioconductor.
>>>>>>
>>>>>> H.
>>>>>>
>>>>>> On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>>>>>>
>>>>>>> Hiya again,
>>>>>>>
>>>>>>> I am definitely a late comer to BioC, so I definitely easily defer to the tide of history.
>>>>>>>
>>>>>>> But I do think you miss my point Michael about the proposed change making the relationship between %in% and match for {G,I}Ranges{List} mimic that between other vectors, and I do think that changing the API would make other late-comers take to BioC easier/faster.
>>>>>>>
>>>>>>> That said, I NEVER use %in% so I really have no stake in the matter, and I DEFINITELY appreciate the argument to not changing the API just for semantic sweetness.
>>>>>>>
>>>>>>> That that said, Herve is _/so good/_ about deprecations and warnings that make such changes fairly easily digestible.
>>>>>>>
>>>>>>> That that that.... enough.... I bow out of this one....!!!!
>>>>>>>
>>>>>>> Always learning and Happy New Year to all lurkers,
>>>>>>>
>>>>>>> ~Malcolm
>>>>>>>
>>>>>>> *From:* Michael Lawrence [mailto:lawrence.michael at gene.com]
>>>>>>> *Sent:* Friday, January 04, 2013 5:11 PM
>>>>>>> *To:* Cook, Malcolm
>>>>>>> *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès (hpages at fhcrc.org); Tim Triche, Jr.; Vedran Franke; bioconductor at r-project.org
>>>>>>> *Subject:* Re: [BioC] countMatches() (was: table for GenomicRanges)
>>>>>>>
>>>>>>> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <mec at stowers.org> wrote:
>>>>>>>
>>>>>>>> Hiya,
>>>>>>>>
>>>>>>>> For what it is worth...
>>>>>>>>
>>>>>>>> I think the change to %in% is warranted.
>>>>>>>>
>>>>>>>> If I understand correctly, this change restores the relationship between the semantics of `%in%` and the semantics of `match`.
>>>>>>>>
>>>>>>>> From the docs:
>>>>>>>>
>>>>>>>>   '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>>>>>>>>
>>>>>>>> Herve's change restores this relationship.
>>>>>>>
>>>>>>> match and %in% were initially consistent (both considering any overlap); Herve has changed both of them together. The whole idea behind IRanges is that ranges are special data types with special semantics. We have reimplemented much of the existing R vector API using those semantics; this extends beyond match/%in%. I am hesitant about making such sweeping changes to the API so late in the life-cycle of the package. There was a feature request for a way to count identical ranges in a set of ranges. Let's please not get carried away and start redesigning the API for this one, albeit useful, request. There are all sorts of inconsistencies in the API, and many of them were conscious decisions that considered practical use cases.
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>> Herve, I suspect you were as a result able to completely drop all the `%in%,BiocClass1,BiocClass2` definitions and depend upon base::%in%
>>>>>>>>
>>>>>>>> Am I right?
>>>>>>>>
>>>>>>>> If so, may I suggest that Herve stay the course, with the addition of
>>>>>>>>
>>>>>>>>   '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L, minoverlap=1L, type='any', select='all') > 0'
>>>>>>>>
>>>>>>>> This would provide a perspicacious idiom, thereby optimizing the API for Michael's observed common use case.
>>>>>>>>
>>>>>>>> Just sayin'
>>>>>>>>
>>>>>>>> ~Malcolm
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Sean Davis
>>>>>>>>> Sent: Friday, January 04, 2013 3:37 PM
>>>>>>>>> To: Michael Lawrence
>>>>>>>>> Cc: Tim Triche, Jr.; Vedran Franke; bioconductor at r-project.org
>>>>>>>>> Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>>>>>>>>>
>>>>>>>>> On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
>>>>>>>>>> The change to the behavior of %in% is a pretty big one. Are you thinking that all set-based operations should behave this way? For example, setdiff and intersect? I really liked the syntax of "peaks %in% genes". In my experience, it's way more common to ask questions about overlap than about equality, so I'd rather optimize the API for that use case. But again, that's just my personal bias.
>>>>>>>>>
>>>>>>>>> For what it is worth, I share Michael's personal bias here.
>>>>>>>>>
>>>>>>>>> Sean
>>>>>>>>>
>>>>>>>>>> Michael
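The relationship between %in% and match() quoted from the base R docs can be checked directly, and the equality versus overlap distinction shows up even on toy intervals. The sketch below uses plain data frames with `start`/`end` columns rather than IRanges objects, and its `%ol%` is a hypothetical base R operator written for this example (it is not Malcolm's findOverlaps()-based definition and not part of any package).

```r
# %in% as documented in base R: a thin wrapper around match().
x <- c(3L, 7L, 9L)
tbl <- c(9L, 3L, 3L)
in_via_match <- match(x, tbl, nomatch = 0L) > 0L

# Hypothetical overlap operator for closed integer intervals stored as
# data frames with 'start' and 'end' columns (an assumption of this sketch).
`%ol%` <- function(a, b) {
  vapply(seq_len(nrow(a)), function(i) {
    # Interval i of 'a' overlaps an interval of 'b' if each starts
    # no later than the other one ends.
    any(a$start[i] <= b$end & b$start <= a$end[i])
  }, logical(1))
}

exons <- data.frame(start = c(5L, 40L), end = c(10L, 45L))
genes <- data.frame(start = 1L, end = 20L)

# Equality semantics: is the exon identical to some gene range?
exact <- paste(exons$start, exons$end) %in% paste(genes$start, genes$end)
# Overlap semantics: does the exon touch any gene range?
overlap <- exons %ol% genes
```

Here `exact` is FALSE for both exons while `overlap` is TRUE only for the first one, which is the kind of difference that made 'exons %in% genes' ambiguous under the old behavior.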
written 4.9 years ago by Hervé Pagès ♦♦ 13k