table for GenomicRanges

0

Entering edit mode

Vedran Franke ▴ 20

@vedran-franke-5644

Last seen 11.4 years ago

Dear All, Is there an equivalent of the table function for a GenomicRanges object? Best, Vedran -- Vedran Franke Bioinformatics Group, Department of Molecular Biology, Faculty of Science, Zagreb

• 1.7k views

ADD COMMENT • link updated 13.2 years ago by Michael Lawrence ★ 11k • written 13.2 years ago by Vedran Franke ▴ 20

0

Entering edit mode

Tim Triche ★ 4.2k

@tim-triche-3561

Last seen 5.4 years ago

United States

What would its output look like? Certainly you can do it with the metadata columns: R> segmentation(AML200)$CD14_MONO GRanges with 619643 ranges and 1 metadata column: seqnames ranges strand | state <rle> <iranges> <rle> | <factor> [1] chr1 [ 1, 9800] * | Quiescent [2] chr1 [ 9801, 10800] * | Quiescent [3] chr1 [10801, 11000] * | Quiescent [4] chr1 [11001, 12200] * | CTCF ... ... ... ... ... ... R> table(segmentation(AML200)$CD14_MONO$state) StrongEnhancer WeakEnhancer PoisedPromoter PolycombRepressed 65990 74465 18219 52787 Quiescent Transcription Elongation CTCF 176451 79811 59474 29116 StrongPromoter WeakPromoter ActivePromoter 31322 19628 37694 But I assume that is not what you are after. On Tue, Dec 4, 2012 at 4:25 PM, Vedran Franke <vfranke@bioinfo.hr> wrote: > Dear All, > > Is there an equivalent of the table function for a GenomicRanges object? > > Best, > > Vedran > > > -- > Vedran Franke > > Bioinformatics Group, > Department of Molecular Biology, > Faculty of Science, Zagreb > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD COMMENT • link 13.2 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 4.2 years ago

United States

I've often thought that this would be a really useful feature, that is, counting the combination of seqname, start, end and strand. This would follow the convention of the duplicated() method. All someone (Herve) needs to do is make an IRanges:::tableIntegerQuads. In the past, I have hacked this together by tabulating a string representation. Given the way R internalizes strings, that's not an ideal solution. Michael On Tue, Dec 4, 2012 at 4:25 PM, Vedran Franke <vfranke@bioinfo.hr> wrote: > Dear All, > > Is there an equivalent of the table function for a GenomicRanges object? > > Best, > > Vedran > > > -- > Vedran Franke > > Bioinformatics Group, > Department of Molecular Biology, > Faculty of Science, Zagreb > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 13.2 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi, As Tim pointed out, there is the question of what the output should be. Proposal 1: Output is a "table" object. tableGenomicRanges1 <- function(x) { x_levels <- sort(unique(x)) y <- IRanges:::matchIntegerQuads(as.factor(seqnames(x)), as.factor(strand(x)), start(x), width(x), as.factor(seqnames(x_levels)), as.factor(strand(x_levels)), start(x_levels), width(x_levels)) ans_names <- paste(seqnames(x_levels), "[", start(x_levels), ",", end(x_levels), "]", strand(x_levels), sep="") ans <- table(factor(y, levels=seq_along(x_levels))) names(ans) <- ans_names ans } > tableGenomicRanges1(gr) chr1[6,10]+ chr1[1,5]- chr1[1,10]- chr1[2,6]- chr1[3,7]- 303 333 345 337 345 chr1[101,110]- chr1[102,111]- chr1[103,112]- chr1[5,10]* chr2[2,10]+ 327 308 343 308 338 chr2[3,10]+ chr2[4,8]- chr2[5,9]- chr2[6,10]- chr2[104,113]- 324 344 350 355 342 chr2[105,114]- chr2[106,115]- chr2[107,116]- chr2[4,10]* chr3[7,10]+ 354 338 318 336 351 chr3[8,10]+ chr3[7,11]- chr3[8,12]- chr3[9,10]- chr3[9,13]- 355 360 357 324 325 chr3[10,10]- chr3[10,14]- chr3[108,117]- chr3[109,118]- chr3[110,119]- 346 307 277 320 330 but this output is not very convenient... Proposal 2: Output is a GenomicRanges object containing the sorted "levels". The counts are stored as a metadata col. tableGenomicRanges2 <- function(x) { ans <- sort(unique(x)) names(ans) <- mcols(ans) <- NULL y <- IRanges:::matchIntegerQuads(as.factor(seqnames(x)), as.factor(strand(x)), start(x), width(x), as.factor(seqnames(x_levels)), as.factor(strand(x_levels)), start(x_levels), width(x_levels)) mcols(ans)$count <- tabulate(y, nbins=length(x_levels)) ans } > tableGenomicRanges2(gr) GRanges with 30 ranges and 1 metadata column: seqnames ranges strand | count <rle> <iranges> <rle> | <integer> [1] chr1 [ 6, 10] + | 303 [2] chr1 [ 1, 5] - | 333 [3] chr1 [ 1, 10] - | 345 [4] chr1 [ 2, 6] - | 337 [5] chr1 [ 3, 7] - | 345 [6] chr1 [101, 110] - | 327 [7] chr1 [102, 111] - | 308 [8] chr1 [103, 112] - | 343 [9] chr1 [ 5, 10] * | 308 ... ... ... ... ... ... [22] chr3 [ 7, 11] - | 360 [23] chr3 [ 8, 12] - | 357 [24] chr3 [ 9, 10] - | 324 [25] chr3 [ 9, 13] - | 325 [26] chr3 [ 10, 10] - | 346 [27] chr3 [ 10, 14] - | 307 [28] chr3 [108, 117] - | 277 [29] chr3 [109, 118] - | 320 [30] chr3 [110, 119] - | 330 --- seqlengths: chr1 chr2 chr3 1000 2000 1500 Since the "table" method for GenomicRanges objects should really return a "table" object it should do tableGenomicRanges1(). So what do we do for tableGenomicRanges2? Should we provide this functionality thru a new generic function e.g. count()? Note that tableGenomicRanges2() not only produces more convenient output, it's also 4x faster or more than tableGenomicRanges1() on big objects. Thanks, H. On 12/05/2012 11:18 AM, Michael Lawrence wrote: > I've often thought that this would be a really useful feature, that is, > counting the combination of seqname, start, end and strand. This would > follow the convention of the duplicated() method. All someone (Herve) needs > to do is make an IRanges:::tableIntegerQuads. In the past, I have hacked > this together by tabulating a string representation. Given the way R > internalizes strings, that's not an ideal solution. > > Michael > > On Tue, Dec 4, 2012 at 4:25 PM, Vedran Franke <vfranke at="" bioinfo.hr=""> wrote: > >> Dear All, >> >> Is there an equivalent of the table function for a GenomicRanges object? >> >> Best, >> >> Vedran >> >> >> -- >> Vedran Franke >> >> Bioinformatics Group, >> Department of Molecular Biology, >> Faculty of Science, Zagreb >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.2 years ago Hervé Pagès 16k

0

Entering edit mode

The question is whether there is ever a use case to have a simple table. This is analogous to base R's table and data.frame. For example, if you call xtabs(), you get a table, then you have to call as.data.frame to get back into the data.frame context. This is sort of clean, and we could create an extension of table that for efficiency stores the associated GRanges along with the counts in the .Data slot. Then as(x, "GRanges") on that would generate the GRanges with the count column. That would be complicated though. Another issue is that table() cannot be used in the general way, due to restrictions on dispatch with "...". So I think I'm in favor of a new "count" generic. That naming is consistent with countOverlaps, countSubjects, countQueries, etc. Michael On Wed, Dec 5, 2012 at 12:55 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi, > > As Tim pointed out, there is the question of what the output should > be. > > Proposal 1: Output is a "table" object. > > tableGenomicRanges1 <- function(x) > { > x_levels <- sort(unique(x)) > y <- IRanges:::matchIntegerQuads(**as.factor(seqnames(x)), > as.factor(strand(x)), > start(x), width(x), > as.factor(seqnames(x_levels)), > as.factor(strand(x_levels)), > start(x_levels), width(x_levels)) > ans_names <- paste(seqnames(x_levels), > "[", start(x_levels), ",", end(x_levels), "]", > strand(x_levels), sep="") > ans <- table(factor(y, levels=seq_along(x_levels))) > names(ans) <- ans_names > ans > } > > > tableGenomicRanges1(gr) > chr1[6,10]+ chr1[1,5]- chr1[1,10]- chr1[2,6]- chr1[3,7]- > 303 333 345 337 345 > chr1[101,110]- chr1[102,111]- chr1[103,112]- chr1[5,10]* chr2[2,10]+ > 327 308 343 308 338 > chr2[3,10]+ chr2[4,8]- chr2[5,9]- chr2[6,10]- chr2[104,113]- > 324 344 350 355 342 > chr2[105,114]- chr2[106,115]- chr2[107,116]- chr2[4,10]* chr3[7,10]+ > 354 338 318 336 351 > chr3[8,10]+ chr3[7,11]- chr3[8,12]- chr3[9,10]- chr3[9,13]- > 355 360 357 324 325 > chr3[10,10]- chr3[10,14]- chr3[108,117]- chr3[109,118]- chr3[110,119]- > 346 307 277 320 330 > > but this output is not very convenient... > > Proposal 2: Output is a GenomicRanges object containing the > sorted "levels". The counts are stored as a metadata col. > > tableGenomicRanges2 <- function(x) > { > ans <- sort(unique(x)) > names(ans) <- mcols(ans) <- NULL > y <- IRanges:::matchIntegerQuads(**as.factor(seqnames(x)), > as.factor(strand(x)), > start(x), width(x), > as.factor(seqnames(x_levels)), > as.factor(strand(x_levels)), > start(x_levels), width(x_levels)) > mcols(ans)$count <- tabulate(y, nbins=length(x_levels)) > ans > } > > > tableGenomicRanges2(gr) > GRanges with 30 ranges and 1 metadata column: > seqnames ranges strand | count > <rle> <iranges> <rle> | <integer> > [1] chr1 [ 6, 10] + | 303 > [2] chr1 [ 1, 5] - | 333 > [3] chr1 [ 1, 10] - | 345 > [4] chr1 [ 2, 6] - | 337 > [5] chr1 [ 3, 7] - | 345 > [6] chr1 [101, 110] - | 327 > [7] chr1 [102, 111] - | 308 > [8] chr1 [103, 112] - | 343 > [9] chr1 [ 5, 10] * | 308 > > ... ... ... ... ... ... > [22] chr3 [ 7, 11] - | 360 > [23] chr3 [ 8, 12] - | 357 > [24] chr3 [ 9, 10] - | 324 > [25] chr3 [ 9, 13] - | 325 > [26] chr3 [ 10, 10] - | 346 > [27] chr3 [ 10, 14] - | 307 > [28] chr3 [108, 117] - | 277 > [29] chr3 [109, 118] - | 320 > [30] chr3 [110, 119] - | 330 > --- > seqlengths: > chr1 chr2 chr3 > 1000 2000 1500 > > Since the "table" method for GenomicRanges objects should really return > a "table" object it should do tableGenomicRanges1(). > So what do we do for tableGenomicRanges2? Should we provide this > functionality thru a new generic function e.g. count()? > Note that tableGenomicRanges2() not only produces more convenient > output, it's also 4x faster or more than tableGenomicRanges1() on > big objects. > > Thanks, > H. > > > > On 12/05/2012 11:18 AM, Michael Lawrence wrote: > >> I've often thought that this would be a really useful feature, that is, >> counting the combination of seqname, start, end and strand. This would >> follow the convention of the duplicated() method. All someone (Herve) >> needs >> to do is make an IRanges:::tableIntegerQuads. In the past, I have hacked >> this together by tabulating a string representation. Given the way R >> internalizes strings, that's not an ideal solution. >> >> Michael >> >> On Tue, Dec 4, 2012 at 4:25 PM, Vedran Franke <vfranke@bioinfo.hr> wrote: >> >> Dear All, >>> >>> Is there an equivalent of the table function for a GenomicRanges object? >>> >>> Best, >>> >>> Vedran >>> >>> >>> -- >>> Vedran Franke >>> >>> Bioinformatics Group, >>> Department of Molecular Biology, >>> Faculty of Science, Zagreb >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> http://news.gmane.org/gmane.**science.biology.informatics.**conduc tor<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >>> >> [[alternative HTML version deleted]] >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.2 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi, On Wed, Dec 5, 2012 at 5:50 PM, Michael Lawrence <lawrence.michael at="" gene.com=""> wrote: > The question is whether there is ever a use case to have a simple table. > This is analogous to base R's table and data.frame. For example, if you > call xtabs(), you get a table, then you have to call as.data.frame to get > back into the data.frame context. This is sort of clean, and we could > create an extension of table that for efficiency stores the associated > GRanges along with the counts in the .Data slot. Then as(x, "GRanges") on > that would generate the GRanges with the count column. That would be > complicated though. > > Another issue is that table() cannot be used in the general way, due to > restrictions on dispatch with "...". > > So I think I'm in favor of a new "count" generic. That naming is consistent > with countOverlaps, countSubjects, countQueries, etc. Or maybe `tally`? Somehow I have a mental association w/ that being closer to `tabulate`, but I guess it's really not and maybe it's just me that has a mind map that puts `tally` closer to `table` than `count` is . -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

that goes along nicely with BamTally in gmapR, which is a damn useful function IMHO... actually I think I wrote a tally() function or something like it for Rsubread, can't remember whether I sent it in with the patch though. Anyways, a running tally is a good mental image for this functionality On Wed, Dec 5, 2012 at 2:59 PM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi, > > On Wed, Dec 5, 2012 at 5:50 PM, Michael Lawrence > <lawrence.michael@gene.com> wrote: > > The question is whether there is ever a use case to have a simple table. > > This is analogous to base R's table and data.frame. For example, if you > > call xtabs(), you get a table, then you have to call as.data.frame to get > > back into the data.frame context. This is sort of clean, and we could > > create an extension of table that for efficiency stores the associated > > GRanges along with the counts in the .Data slot. Then as(x, "GRanges") on > > that would generate the GRanges with the count column. That would be > > complicated though. > > > > Another issue is that table() cannot be used in the general way, due to > > restrictions on dispatch with "...". > > > > So I think I'm in favor of a new "count" generic. That naming is > consistent > > with countOverlaps, countSubjects, countQueries, etc. > > Or maybe `tally`? Somehow I have a mental association w/ that being > closer to `tabulate`, but I guess it's really not and maybe it's just > me that has a mind map that puts `tally` closer to `table` than > `count` is . > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 13.2 years ago Tim Triche ★ 4.2k

0

Entering edit mode

So with tally, what would be the name of the metadata col? Would "tally" be OK? It's kind of neat to use the same name for the metadata col as for the function itself. That makes the code more readable: mcols(count(gr))$count Thanks for the feedback, H. On 12/05/2012 03:05 PM, Tim Triche, Jr. wrote: > that goes along nicely with BamTally in gmapR, which is a damn useful > function IMHO... actually I think I wrote a tally() function or > something like it for Rsubread, can't remember whether I sent it in with > the patch though. Anyways, a running tally is a good mental image for > this functionality > > > > > On Wed, Dec 5, 2012 at 2:59 PM, Steve Lianoglou > <mailinglist.honeypot at="" gmail.com="" <mailto:mailinglist.honeypot="" at="" gmail.com="">> > wrote: > > Hi, > > On Wed, Dec 5, 2012 at 5:50 PM, Michael Lawrence > <lawrence.michael at="" gene.com="" <mailto:lawrence.michael="" at="" gene.com="">> wrote: > > The question is whether there is ever a use case to have a simple > table. > > This is analogous to base R's table and data.frame. For example, > if you > > call xtabs(), you get a table, then you have to call > as.data.frame to get > > back into the data.frame context. This is sort of clean, and we could > > create an extension of table that for efficiency stores the > associated > > GRanges along with the counts in the .Data slot. Then as(x, > "GRanges") on > > that would generate the GRanges with the count column. That would be > > complicated though. > > > > Another issue is that table() cannot be used in the general way, > due to > > restrictions on dispatch with "...". > > > > So I think I'm in favor of a new "count" generic. That naming is > consistent > > with countOverlaps, countSubjects, countQueries, etc. > > Or maybe `tally`? Somehow I have a mental association w/ that being > closer to `tabulate`, but I guess it's really not and maybe it's just > me that has a mind map that puts `tally` closer to `table` than > `count` is . > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > > > > -- > /A model is a lie that helps you see the truth./ > / > / > Howard Skipper > <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.2 years ago Hervé Pagès 16k

0

Entering edit mode

that's cool, then it's consistent with score() and friends... this sounds like a grand scheme On Wed, Dec 5, 2012 at 4:31 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > So with tally, what would be the name of the metadata col? Would "tally" > be OK? > > It's kind of neat to use the same name for the metadata col as for the > function itself. That makes the code more readable: > > mcols(count(gr))$count > > Thanks for the feedback, > H. > > > > On 12/05/2012 03:05 PM, Tim Triche, Jr. wrote: > >> that goes along nicely with BamTally in gmapR, which is a damn useful >> function IMHO... actually I think I wrote a tally() function or >> something like it for Rsubread, can't remember whether I sent it in with >> the patch though. Anyways, a running tally is a good mental image for >> this functionality >> >> >> >> >> On Wed, Dec 5, 2012 at 2:59 PM, Steve Lianoglou >> <mailinglist.honeypot@gmail.**com <mailinglist.honeypot@gmail.com=""><mailto:>> mailinglist.honeypot@**gmail.com <mailinglist.honeypot@gmail.com>>> >> >> wrote: >> >> Hi, >> >> On Wed, Dec 5, 2012 at 5:50 PM, Michael Lawrence >> <lawrence.michael@gene.com <mailto:lawrence.michael@gene.**com<lawrence.michael@gene.com="">>> >> wrote: >> > The question is whether there is ever a use case to have a simple >> table. >> > This is analogous to base R's table and data.frame. For example, >> if you >> > call xtabs(), you get a table, then you have to call >> as.data.frame to get >> > back into the data.frame context. This is sort of clean, and we >> could >> > create an extension of table that for efficiency stores the >> associated >> > GRanges along with the counts in the .Data slot. Then as(x, >> "GRanges") on >> > that would generate the GRanges with the count column. That would >> be >> > complicated though. >> > >> > Another issue is that table() cannot be used in the general way, >> due to >> > restrictions on dispatch with "...". >> > >> > So I think I'm in favor of a new "count" generic. That naming is >> consistent >> > with countOverlaps, countSubjects, countQueries, etc. >> >> Or maybe `tally`? Somehow I have a mental association w/ that being >> closer to `tabulate`, but I guess it's really not and maybe it's just >> me that has a mind map that puts `tally` closer to `table` than >> `count` is . >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/**contact<http: cb="" io.mskcc.org="" ~lianos="" contact=""> >> >> >> >> >> -- >> /A model is a lie that helps you see the truth./ >> / >> / >> Howard Skipper >> <http: cancerres.**aacrjournals.org="" content="" 31="" 9="" **1173.full.pdf<h="" ttp:="" cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> > >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 13.2 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Hi Tim, On 12/05/2012 04:55 PM, Tim Triche, Jr. wrote: > that's cool, then it's consistent with score() and friends... this > sounds like a grand scheme It's not clear to me if you are voting for tally() + mcols(.)$tally or for count(.) + mcols()$count? Note that this functionality doesn't have to be restricted to GenomicRanges and can be extended to any Vector object 'x' for which unique(), sort() and match() work and "do the right thing". Then the implementation is simply: ans <- sort(unique(x)) names(ans) <- mcols(ans) <- NULL y <- match(x, ans) mcols(ans)$count <- tabulate(y, nbins=length(ans)) ans Unfortunately right now match() on GenomicRanges and Ranges objects reports a match in case of *overlap* instead of *equality* (this is why I needed to use IRanges:::matchIntegerQuads in the implementation of tableGenomicRanges2() I showed previously). Just a heads-up that I'd like to change this but with a transition plan e.g. with an extra argument to the match() generic for letting the user control what "match" means (default would be "overlaps" for now but should probably be changed to "equality" in the future). Cheers, H. > > > > On Wed, Dec 5, 2012 at 4:31 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > So with tally, what would be the name of the metadata col? Would "tally" > be OK? > > It's kind of neat to use the same name for the metadata col as for the > function itself. That makes the code more readable: > > mcols(count(gr))$count > > Thanks for the feedback, > H. > > > > On 12/05/2012 03:05 PM, Tim Triche, Jr. wrote: > > that goes along nicely with BamTally in gmapR, which is a damn > useful > function IMHO... actually I think I wrote a tally() function or > something like it for Rsubread, can't remember whether I sent it > in with > the patch though. Anyways, a running tally is a good mental > image for > this functionality > > > > > On Wed, Dec 5, 2012 at 2:59 PM, Steve Lianoglou > <mailinglist.honeypot at="" gmail.__com=""> <mailto:mailinglist.honeypot at="" gmail.com=""> > <mailto:mailinglist.honeypot at="" __gmail.com=""> <mailto:mailinglist.honeypot at="" gmail.com="">>> > > wrote: > > Hi, > > On Wed, Dec 5, 2012 at 5:50 PM, Michael Lawrence > <lawrence.michael at="" gene.com=""> <mailto:lawrence.michael at="" gene.com=""> > <mailto:lawrence.michael at="" gene.__com=""> <mailto:lawrence.michael at="" gene.com="">>> wrote: > > The question is whether there is ever a use case to have > a simple > table. > > This is analogous to base R's table and data.frame. For > example, > if you > > call xtabs(), you get a table, then you have to call > as.data.frame to get > > back into the data.frame context. This is sort of clean, > and we could > > create an extension of table that for efficiency stores the > associated > > GRanges along with the counts in the .Data slot. Then as(x, > "GRanges") on > > that would generate the GRanges with the count column. > That would be > > complicated though. > > > > Another issue is that table() cannot be used in the > general way, > due to > > restrictions on dispatch with "...". > > > > So I think I'm in favor of a new "count" generic. That > naming is > consistent > > with countOverlaps, countSubjects, countQueries, etc. > > Or maybe `tally`? Somehow I have a mental association w/ > that being > closer to `tabulate`, but I guess it's really not and maybe > it's just > me that has a mind map that puts `tally` closer to `table` than > `count` is . > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/__contact > <http: cbio.mskcc.org="" ~lianos="" contact=""> > > > > > -- > /A model is a lie that helps you see the truth./ > / > / > Howard Skipper > <http: cancerres.__aacrjournals.org="" content="" 31="" 9="" __1173.full.pdf="" <http:="" cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > > > -- > /A model is a lie that helps you see the truth./ > / > / > Howard Skipper > <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.2 years ago Hervé Pagès 16k

0

Entering edit mode

maybe (.)$count for tallying overlaps, and (.$)tally for counting equalities, by default? ha ha only serious. if the default is to countOverlaps then $count is a handy shortcut. if you want to tally up matching ranges $tally is more evocative On Wed, Dec 5, 2012 at 7:06 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Tim, > > > On 12/05/2012 04:55 PM, Tim Triche, Jr. wrote: > >> that's cool, then it's consistent with score() and friends... this >> sounds like a grand scheme >> > > It's not clear to me if you are voting for tally() + mcols(.)$tally > or for count(.) + mcols()$count? > > Note that this functionality doesn't have to be restricted to > GenomicRanges and can be extended to any Vector object 'x' for > which unique(), sort() and match() work and "do the right thing". > Then the implementation is simply: > > > ans <- sort(unique(x)) > names(ans) <- mcols(ans) <- NULL > y <- match(x, ans) > mcols(ans)$count <- tabulate(y, nbins=length(ans)) > ans > > Unfortunately right now match() on GenomicRanges and Ranges objects > reports a match in case of *overlap* instead of *equality* (this is > why I needed to use IRanges:::matchIntegerQuads in the implementation > of tableGenomicRanges2() I showed previously). Just a heads-up that > I'd like to change this but with a transition plan e.g. with an extra > argument to the match() generic for letting the user control what > "match" means (default would be "overlaps" for now but should probably > be changed to "equality" in the future). > > Cheers, > H. > > >> >> >> On Wed, Dec 5, 2012 at 4:31 PM, Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org>> wrote: >> >> So with tally, what would be the name of the metadata col? Would >> "tally" >> be OK? >> >> It's kind of neat to use the same name for the metadata col as for the >> function itself. That makes the code more readable: >> >> mcols(count(gr))$count >> >> Thanks for the feedback, >> H. >> >> >> >> On 12/05/2012 03:05 PM, Tim Triche, Jr. wrote: >> >> that goes along nicely with BamTally in gmapR, which is a damn >> useful >> function IMHO... actually I think I wrote a tally() function or >> something like it for Rsubread, can't remember whether I sent it >> in with >> the patch though. Anyways, a running tally is a good mental >> image for >> this functionality >> >> >> >> >> On Wed, Dec 5, 2012 at 2:59 PM, Steve Lianoglou >> <mailinglist.honeypot@gmail.__**com>> <mailto:mailinglist.honeypot@**gmail.com<mailinglist.honeyp ot@gmail.com=""> >> > >> <mailto:mailinglist.honeypot@_**_gmail.com>> >> <mailto:mailinglist.honeypot@**gmail.com<mailinglist.honeyp ot@gmail.com=""> >> >>> >> >> wrote: >> >> Hi, >> >> On Wed, Dec 5, 2012 at 5:50 PM, Michael Lawrence >> <lawrence.michael@gene.com>> <mailto:lawrence.michael@gene.**com <lawrence.michael@gene.com="">> >> <mailto:lawrence.michael@gene.**__com>> >> <mailto:lawrence.michael@gene.**com <lawrence.michael@gene.com="">>>> >> wrote: >> > The question is whether there is ever a use case to have >> a simple >> table. >> > This is analogous to base R's table and data.frame. For >> example, >> if you >> > call xtabs(), you get a table, then you have to call >> as.data.frame to get >> > back into the data.frame context. This is sort of clean, >> and we could >> > create an extension of table that for efficiency stores >> the >> associated >> > GRanges along with the counts in the .Data slot. Then >> as(x, >> "GRanges") on >> > that would generate the GRanges with the count column. >> That would be >> > complicated though. >> > >> > Another issue is that table() cannot be used in the >> general way, >> due to >> > restrictions on dispatch with "...". >> > >> > So I think I'm in favor of a new "count" generic. That >> naming is >> consistent >> > with countOverlaps, countSubjects, countQueries, etc. >> >> Or maybe `tally`? Somehow I have a mental association w/ >> that being >> closer to `tabulate`, but I guess it's really not and maybe >> it's just >> me that has a mind map that puts `tally` closer to `table` >> than >> `count` is . >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/**__contac t<http: cbio.mskcc.org="" ~lianos="" __contact=""> >> >> <http: cbio.mskcc.org="" ~**lianos="" contact<http:="" cbio.mskcc.="" org="" ~lianos="" contact=""> >> > >> >> >> >> >> -- >> /A model is a lie that helps you see the truth./ >> / >> / >> Howard Skipper >> <http: cancerres.__aacrjourna**ls.org="" content="" 31="" 9="" __1173.**="">> full.pdf <http: aacrjournals.org="" content="" 31="" 9="" __1173.full.pdf=""> < >> http://cancerres.**aacrjournals.org/content/31/9/**1173.full.pdf<ht tp:="" cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> >> >> >> >> >> -- >> /A model is a lie that helps you see the truth./ >> / >> / >> Howard Skipper >> <http: cancerres.**aacrjournals.org="" content="" 31="" 9="" **1173.full.pdf<h="" ttp:="" cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> > >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 13.2 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Now that I am rereading what I wrote, I am wondering whether a column labeled $count for a given combination of (seqname, start, end, strand) is more immediately obvious to a user, i.e. "there are mcols(GR)$count ranges that match this exact description in this here GR". Perhaps count() really is the way to go, especially from the standpoint of making it intuitive. There are 74465 segments exactly matching the state "WeakEnhancer" in the HMM that I summarized (don't bother looking for it in ENCODE or Roadmap data, it is a bespoke model with additional marks). Similarly, a "table" of unique reads placements that share a common range would summarize them by how many times we saw each. And then the user would see counts accordingly. -> I respectfully withdraw my proposal regarding $tally and now feel that $count is closer to what people would expect. Does that make sense? Example: mimatGR <- subsetByOverlaps(readsGR, tx.mimat) count(mimatGR) # table of appearances of unique reads, i.e., a library complexity surrogate or more interestingly, let's say there's an ncRNA and a coding RNA emerging from a bidirectional promoter. Then you could have maybeNcRNA <- reduce(c(tx.ncRNA, tx.mRNA)) # forward and reverse, maybe lots of exons, perhaps even a little trans splicing bidirectionalGR <- subsetByOverlaps(readsGR, maybeNcRNA) # what's really going on here? count() acts as a spot check count(bidirectionalGR) the latter, especially with a strand-specific RNAseq protocol, could tell you about some really interesting biology if conducted along a "theme". IMHO of course, since I'm not the one implementing it, but more thinking about the use cases I encounter. YMMV and I hope other people will point out the many ways in which I am heavily constipated. :-) Thanks Herve for your ninja programming skills, you code faster than I can think (at least clearly). --t On Wed, Dec 5, 2012 at 7:11 PM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > maybe (.)$count for tallying overlaps, and (.$)tally for counting > equalities, by default? > > ha ha only serious. if the default is to countOverlaps then $count is a > handy shortcut. if you want to tally up matching ranges $tally is more > evocative > > > > On Wed, Dec 5, 2012 at 7:06 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > >> Hi Tim, >> >> >> On 12/05/2012 04:55 PM, Tim Triche, Jr. wrote: >> >>> that's cool, then it's consistent with score() and friends... this >>> sounds like a grand scheme >>> >> >> It's not clear to me if you are voting for tally() + mcols(.)$tally >> or for count(.) + mcols()$count? >> >> Note that this functionality doesn't have to be restricted to >> GenomicRanges and can be extended to any Vector object 'x' for >> which unique(), sort() and match() work and "do the right thing". >> Then the implementation is simply: >> >> >> ans <- sort(unique(x)) >> names(ans) <- mcols(ans) <- NULL >> y <- match(x, ans) >> mcols(ans)$count <- tabulate(y, nbins=length(ans)) >> ans >> >> Unfortunately right now match() on GenomicRanges and Ranges objects >> reports a match in case of *overlap* instead of *equality* (this is >> why I needed to use IRanges:::matchIntegerQuads in the implementation >> of tableGenomicRanges2() I showed previously). Just a heads-up that >> I'd like to change this but with a transition plan e.g. with an extra >> argument to the match() generic for letting the user control what >> "match" means (default would be "overlaps" for now but should probably >> be changed to "equality" in the future). >> >> Cheers, >> H. >> >> >>> >>> >>> On Wed, Dec 5, 2012 at 4:31 PM, Hervé Pagès <hpages@fhcrc.org>>> <mailto:hpages@fhcrc.org>> wrote: >>> >>> So with tally, what would be the name of the metadata col? Would >>> "tally" >>> be OK? >>> >>> It's kind of neat to use the same name for the metadata col as for >>> the >>> function itself. That makes the code more readable: >>> >>> mcols(count(gr))$count >>> >>> Thanks for the feedback, >>> H. >>> >>> >>> >>> On 12/05/2012 03:05 PM, Tim Triche, Jr. wrote: >>> >>> that goes along nicely with BamTally in gmapR, which is a damn >>> useful >>> function IMHO... actually I think I wrote a tally() function or >>> something like it for Rsubread, can't remember whether I sent it >>> in with >>> the patch though. Anyways, a running tally is a good mental >>> image for >>> this functionality >>> >>> >>> >>> >>> On Wed, Dec 5, 2012 at 2:59 PM, Steve Lianoglou >>> <mailinglist.honeypot@gmail.__**com>>> <mailto:mailinglist.honeypot@**gmail.com<mailinglist.honey pot@gmail.com=""> >>> > >>> <mailto:mailinglist.honeypot@_**_gmail.com>>> >>> <mailto:mailinglist.honeypot@**gmail.com<mailinglist.honey pot@gmail.com=""> >>> >>> >>> >>> wrote: >>> >>> Hi, >>> >>> On Wed, Dec 5, 2012 at 5:50 PM, Michael Lawrence >>> <lawrence.michael@gene.com>>> <mailto:lawrence.michael@gene.**com <lawrence.michael@gene.com="">> >>> <mailto:lawrence.michael@gene.**__com>>> >>> <mailto:lawrence.michael@gene.**com <lawrence.michael@gene.com="">>>> >>> wrote: >>> > The question is whether there is ever a use case to have >>> a simple >>> table. >>> > This is analogous to base R's table and data.frame. For >>> example, >>> if you >>> > call xtabs(), you get a table, then you have to call >>> as.data.frame to get >>> > back into the data.frame context. This is sort of clean, >>> and we could >>> > create an extension of table that for efficiency stores >>> the >>> associated >>> > GRanges along with the counts in the .Data slot. Then >>> as(x, >>> "GRanges") on >>> > that would generate the GRanges with the count column. >>> That would be >>> > complicated though. >>> > >>> > Another issue is that table() cannot be used in the >>> general way, >>> due to >>> > restrictions on dispatch with "...". >>> > >>> > So I think I'm in favor of a new "count" generic. That >>> naming is >>> consistent >>> > with countOverlaps, countSubjects, countQueries, etc. >>> >>> Or maybe `tally`? Somehow I have a mental association w/ >>> that being >>> closer to `tabulate`, but I guess it's really not and maybe >>> it's just >>> me that has a mind map that puts `tally` closer to `table` >>> than >>> `count` is . >>> >>> -- >>> Steve Lianoglou >>> Graduate Student: Computational Systems Biology >>> | Memorial Sloan-Kettering Cancer Center >>> | Weill Medical College of Cornell University >>> Contact Info: http://cbio.mskcc.org/~lianos/**__conta ct<http: cbio.mskcc.org="" ~lianos="" __contact=""> >>> >>> <http: cbio.mskcc.org="" ~**lianos="" contact<http:="" cbio.mskcc="" .org="" ~lianos="" contact=""> >>> > >>> >>> >>> >>> >>> -- >>> /A model is a lie that helps you see the truth./ >>> / >>> / >>> Howard Skipper >>> <http: cancerres.__aacrjourna**ls.org="" content="" 31="" 9="" __1173.**="">>> full.pdf <http: aacrjournals.org="" content="" 31="" 9="" __1173.full.pdf=""> < >>> http://cancerres.**aacrjournals.org/content/31/9/**1173.full.pdf<h ttp:="" cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>> >> >>> >>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >>> >>> >>> >>> >>> >>> -- >>> /A model is a lie that helps you see the truth./ >>> / >>> / >>> Howard Skipper >>> <http: cancerres.**aacrjournals.org="" content="" 31="" 9="" **1173.full.pdf<="" http:="" cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>> > >>> >>> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> > > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 13.2 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Thanks Tim for the feedback. I spent a little more time thinking about this and then realized that it was related to some other feature I've been thinking of a few months ago, then I forgot about. This other feature extends the capability of match() by finding *all* the matches (not just the first like match()). Surprisingly this is not in base R (but maybe it is, I'm just not aware of it). So I'd like to add the 2 following functions to IRanges: findMatches(query, subject) countMatches(query, subject) The parallel with findOverlaps/countOverlaps is obvious except that by "match" here we mean "exact match", so the concept is valid for any vector-like query and subject with elements that can be compared (i.e. query[i] == subject[j] is defined). Like findOverlaps(), findMatches() will return a Hits object. No 'select' arg for now but we could always add it later and findOverlaps(. , select="first") would be equivalent to match(.) I already have initial implementations for those 2 functions (with room for some performance improvements). I mention this here because once we have countMatches(), implementing the count/tally feature becomes very simple. It's just: x_levels <- sort(unique(x)) mcols(x_levels)$count <- countMatches(x_levels, x) Now it's so simple that I wonder if it's still worth providing a function for it. By using countMatches() the user can choose where to put the vector of counts, and how to name the metadata col in case s/he wants to stick it in 'x_levels'. I'll start by adding findMatches()/countMatches(). I think they are valuable per-se. Cheers, H. On 12/05/2012 07:36 PM, Tim Triche, Jr. wrote: > Now that I am rereading what I wrote, I am wondering whether a column > labeled $count for a given combination of (seqname, start, end, strand) > is more immediately obvious to a user, i.e. "there are mcols(GR)$count > ranges that match this exact description in this here GR". > > Perhaps count() really is the way to go, especially from the standpoint > of making it intuitive. There are 74465 segments exactly matching the > state "WeakEnhancer" in the HMM that I summarized (don't bother looking > for it in ENCODE or Roadmap data, it is a bespoke model with additional > marks). Similarly, a "table" of unique reads placements that share a > common range would summarize them by how many times we saw each. And > then the user would see counts accordingly. > > -> I respectfully withdraw my proposal regarding $tally and now feel > that $count is closer to what people would expect. Does that make sense? > > Example: > > mimatGR <- subsetByOverlaps(readsGR, tx.mimat) > count(mimatGR) # table of appearances of unique reads, i.e., a library > complexity surrogate > > or more interestingly, let's say there's an ncRNA and a coding RNA > emerging from a bidirectional promoter. Then you could have > > maybeNcRNA <- reduce(c(tx.ncRNA, tx.mRNA)) # forward and reverse, maybe > lots of exons, perhaps even a little trans splicing > bidirectionalGR <- subsetByOverlaps(readsGR, maybeNcRNA) # what's really > going on here? count() acts as a spot check > count(bidirectionalGR) > > the latter, especially with a strand-specific RNAseq protocol, could > tell you about some really interesting biology if conducted along a "theme". > > IMHO of course, since I'm not the one implementing it, but more thinking > about the use cases I encounter. > > YMMV and I hope other people will point out the many ways in which I am > heavily constipated. :-) > > Thanks Herve for your ninja programming skills, you code faster than I > can think (at least clearly). > > --t > > > > On Wed, Dec 5, 2012 at 7:11 PM, Tim Triche, Jr. <tim.triche at="" gmail.com=""> <mailto:tim.triche at="" gmail.com="">> wrote: > > maybe (.)$count for tallying overlaps, and (.$)tally for counting > equalities, by default? > > ha ha only serious. if the default is to countOverlaps then $count > is a handy shortcut. if you want to tally up matching ranges $tally > is more evocative > > > > On Wed, Dec 5, 2012 at 7:06 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Hi Tim, > > > On 12/05/2012 04:55 PM, Tim Triche, Jr. wrote: > > that's cool, then it's consistent with score() and > friends... this > sounds like a grand scheme > > > It's not clear to me if you are voting for tally() + mcols(.)$tally > or for count(.) + mcols()$count? > > Note that this functionality doesn't have to be restricted to > GenomicRanges and can be extended to any Vector object 'x' for > which unique(), sort() and match() work and "do the right thing". > Then the implementation is simply: > > > ans <- sort(unique(x)) > names(ans) <- mcols(ans) <- NULL > y <- match(x, ans) > mcols(ans)$count <- tabulate(y, nbins=length(ans)) > ans > > Unfortunately right now match() on GenomicRanges and Ranges objects > reports a match in case of *overlap* instead of *equality* (this is > why I needed to use IRanges:::matchIntegerQuads in the > implementation > of tableGenomicRanges2() I showed previously). Just a heads- up that > I'd like to change this but with a transition plan e.g. with an > extra > argument to the match() generic for letting the user control what > "match" means (default would be "overlaps" for now but should > probably > be changed to "equality" in the future). > > Cheers, > H. > > > > > On Wed, Dec 5, 2012 at 4:31 PM, Hervé Pagès > <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> wrote: > > So with tally, what would be the name of the metadata > col? Would "tally" > be OK? > > It's kind of neat to use the same name for the metadata > col as for the > function itself. That makes the code more readable: > > mcols(count(gr))$count > > Thanks for the feedback, > H. > > > > On 12/05/2012 03:05 PM, Tim Triche, Jr. wrote: > > that goes along nicely with BamTally in gmapR, > which is a damn > useful > function IMHO... actually I think I wrote a tally() > function or > something like it for Rsubread, can't remember > whether I sent it > in with > the patch though. Anyways, a running tally is a > good mental > image for > this functionality > > > > > On Wed, Dec 5, 2012 at 2:59 PM, Steve Lianoglou > <mailinglist.honeypot at="" gmail.____com=""> <mailto:mailinglist.honeypot at="" __gmail.com=""> <mailto:mailinglist.honeypot at="" gmail.com="">> > <mailto:mailinglist.honeypot@> <mailto:mailinglist.honeypot@>____gmail.com <http: gmail.com=""> > > <mailto:mailinglist.honeypot at="" __gmail.com=""> <mailto:mailinglist.honeypot at="" gmail.com="">>>> > > wrote: > > Hi, > > On Wed, Dec 5, 2012 at 5:50 PM, Michael Lawrence > <lawrence.michael at="" gene.com=""> <mailto:lawrence.michael at="" gene.com=""> > <mailto:lawrence.michael at="" gene.__com=""> <mailto:lawrence.michael at="" gene.com="">> > <mailto:lawrence.michael at="" gene.=""> <mailto:lawrence.michael at="" gene.="">____com > > <mailto:lawrence.michael at="" gene.__com=""> <mailto:lawrence.michael at="" gene.com="">>>> wrote: > > The question is whether there is ever a use > case to have > a simple > table. > > This is analogous to base R's table and > data.frame. For > example, > if you > > call xtabs(), you get a table, then you > have to call > as.data.frame to get > > back into the data.frame context. This is > sort of clean, > and we could > > create an extension of table that for > efficiency stores the > associated > > GRanges along with the counts in the .Data > slot. Then as(x, > "GRanges") on > > that would generate the GRanges with the > count column. > That would be > > complicated though. > > > > Another issue is that table() cannot be > used in the > general way, > due to > > restrictions on dispatch with "...". > > > > So I think I'm in favor of a new "count" > generic. That > naming is > consistent > > with countOverlaps, countSubjects, > countQueries, etc. > > Or maybe `tally`? Somehow I have a mental > association w/ > that being > closer to `tabulate`, but I guess it's really > not and maybe > it's just > me that has a mind map that puts `tally` > closer to `table` than > `count` is . > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: > http://cbio.mskcc.org/~lianos/____contact > <http: cbio.mskcc.org="" ~lianos="" __contact=""> > > <http: cbio.mskcc.org="" ~__lianos="" contact=""> <http: cbio.mskcc.org="" ~lianos="" contact="">> > > > > > -- > /A model is a lie that helps you see the truth./ > / > / > Howard Skipper > > <http: cancerres.__aacrjourna__ls.org="" content="" 31="" 9="" __1173.__full.pdf=""> <http: aacrjournals.org="" content="" 31="" 9="" __1173.full.pdf=""> > <http: cancerres.__aacrjournals.org="" content="" 31="" 9="" __1173.full.pdf=""> <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">>> > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > <tel:%28206%29%20667-1319> > > > > > > -- > /A model is a lie that helps you see the truth./ > / > / > Howard Skipper > <http: cancerres.__aacrjournals.org="" content="" 31="" 9="" __1173.full.pdf=""> <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > > > -- > /A model is a lie that helps you see the truth./ > / > / > Howard Skipper > <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > > > > -- > /A model is a lie that helps you see the truth./ > / > / > Howard Skipper > <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.2 years ago Hervé Pagès 16k

0

Entering edit mode

On 12/06/2012 12:20 PM, Hervé Pagès wrote: > Thanks Tim for the feedback. > > I spent a little more time thinking about this and then realized that > it was related to some other feature I've been thinking of a few months > ago, then I forgot about. This other feature extends the capability > of match() by finding *all* the matches (not just the first like > match()). Surprisingly this is not in base R (but maybe it is, I'm > just not aware of it). So I'd like to add the 2 following functions > to IRanges: > > findMatches(query, subject) > countMatches(query, subject) > > The parallel with findOverlaps/countOverlaps is obvious except that by > "match" here we mean "exact match", so the concept is valid for any > vector-like query and subject with elements that can be compared (i.e. > query[i] == subject[j] is defined). > > Like findOverlaps(), findMatches() will return a Hits object. > No 'select' arg for now but we could always add it later and > findOverlaps(. , select="first") would be equivalent to match(.) ^^^^^^^^^^^ I meant findMatches here, sorry. H. > > I already have initial implementations for those 2 functions (with > room for some performance improvements). > > I mention this here because once we have countMatches(), implementing > the count/tally feature becomes very simple. It's just: > > x_levels <- sort(unique(x)) > mcols(x_levels)$count <- countMatches(x_levels, x) > > Now it's so simple that I wonder if it's still worth providing a > function for it. By using countMatches() the user can choose where > to put the vector of counts, and how to name the metadata col in > case s/he wants to stick it in 'x_levels'. > > I'll start by adding findMatches()/countMatches(). I think they are > valuable per-se. > > Cheers, > H. > > > On 12/05/2012 07:36 PM, Tim Triche, Jr. wrote: >> Now that I am rereading what I wrote, I am wondering whether a column >> labeled $count for a given combination of (seqname, start, end, strand) >> is more immediately obvious to a user, i.e. "there are mcols(GR)$count >> ranges that match this exact description in this here GR". >> >> Perhaps count() really is the way to go, especially from the standpoint >> of making it intuitive. There are 74465 segments exactly matching the >> state "WeakEnhancer" in the HMM that I summarized (don't bother looking >> for it in ENCODE or Roadmap data, it is a bespoke model with additional >> marks). Similarly, a "table" of unique reads placements that share a >> common range would summarize them by how many times we saw each. And >> then the user would see counts accordingly. >> >> -> I respectfully withdraw my proposal regarding $tally and now feel >> that $count is closer to what people would expect. Does that make sense? >> >> Example: >> >> mimatGR <- subsetByOverlaps(readsGR, tx.mimat) >> count(mimatGR) # table of appearances of unique reads, i.e., a library >> complexity surrogate >> >> or more interestingly, let's say there's an ncRNA and a coding RNA >> emerging from a bidirectional promoter. Then you could have >> >> maybeNcRNA <- reduce(c(tx.ncRNA, tx.mRNA)) # forward and reverse, maybe >> lots of exons, perhaps even a little trans splicing >> bidirectionalGR <- subsetByOverlaps(readsGR, maybeNcRNA) # what's really >> going on here? count() acts as a spot check >> count(bidirectionalGR) >> >> the latter, especially with a strand-specific RNAseq protocol, could >> tell you about some really interesting biology if conducted along a >> "theme". >> >> IMHO of course, since I'm not the one implementing it, but more thinking >> about the use cases I encounter. >> >> YMMV and I hope other people will point out the many ways in which I am >> heavily constipated. :-) >> >> Thanks Herve for your ninja programming skills, you code faster than I >> can think (at least clearly). >> >> --t >> >> >> >> On Wed, Dec 5, 2012 at 7:11 PM, Tim Triche, Jr. <tim.triche at="" gmail.com="">> <mailto:tim.triche at="" gmail.com="">> wrote: >> >> maybe (.)$count for tallying overlaps, and (.$)tally for counting >> equalities, by default? >> >> ha ha only serious. if the default is to countOverlaps then $count >> is a handy shortcut. if you want to tally up matching ranges $tally >> is more evocative >> >> >> >> On Wed, Dec 5, 2012 at 7:06 PM, Hervé Pagès <hpages at="" fhcrc.org="">> <mailto:hpages at="" fhcrc.org="">> wrote: >> >> Hi Tim, >> >> >> On 12/05/2012 04:55 PM, Tim Triche, Jr. wrote: >> >> that's cool, then it's consistent with score() and >> friends... this >> sounds like a grand scheme >> >> >> It's not clear to me if you are voting for tally() + >> mcols(.)$tally >> or for count(.) + mcols()$count? >> >> Note that this functionality doesn't have to be restricted to >> GenomicRanges and can be extended to any Vector object 'x' for >> which unique(), sort() and match() work and "do the right thing". >> Then the implementation is simply: >> >> >> ans <- sort(unique(x)) >> names(ans) <- mcols(ans) <- NULL >> y <- match(x, ans) >> mcols(ans)$count <- tabulate(y, nbins=length(ans)) >> ans >> >> Unfortunately right now match() on GenomicRanges and Ranges >> objects >> reports a match in case of *overlap* instead of *equality* >> (this is >> why I needed to use IRanges:::matchIntegerQuads in the >> implementation >> of tableGenomicRanges2() I showed previously). Just a heads-up >> that >> I'd like to change this but with a transition plan e.g. with an >> extra >> argument to the match() generic for letting the user control what >> "match" means (default would be "overlaps" for now but should >> probably >> be changed to "equality" in the future). >> >> Cheers, >> H. >> >> >> >> >> On Wed, Dec 5, 2012 at 4:31 PM, Hervé Pagès >> <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> >> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> wrote: >> >> So with tally, what would be the name of the metadata >> col? Would "tally" >> be OK? >> >> It's kind of neat to use the same name for the metadata >> col as for the >> function itself. That makes the code more readable: >> >> mcols(count(gr))$count >> >> Thanks for the feedback, >> H. >> >> >> >> On 12/05/2012 03:05 PM, Tim Triche, Jr. wrote: >> >> that goes along nicely with BamTally in gmapR, >> which is a damn >> useful >> function IMHO... actually I think I wrote a tally() >> function or >> something like it for Rsubread, can't remember >> whether I sent it >> in with >> the patch though. Anyways, a running tally is a >> good mental >> image for >> this functionality >> >> >> >> >> On Wed, Dec 5, 2012 at 2:59 PM, Steve Lianoglou >> <mailinglist.honeypot at="" gmail.____com="">> <mailto:mailinglist.honeypot at="" __gmail.com="">> <mailto:mailinglist.honeypot at="" gmail.com="">> >> <mailto:mailinglist.honeypot@>> <mailto:mailinglist.honeypot@>____gmail.com >> <http: gmail.com=""> >> >> <mailto:mailinglist.honeypot at="" __gmail.com="">> <mailto:mailinglist.honeypot at="" gmail.com="">>>> >> >> wrote: >> >> Hi, >> >> On Wed, Dec 5, 2012 at 5:50 PM, Michael >> Lawrence >> <lawrence.michael at="" gene.com="">> <mailto:lawrence.michael at="" gene.com=""> >> <mailto:lawrence.michael at="" gene.__com="">> <mailto:lawrence.michael at="" gene.com="">> >> <mailto:lawrence.michael at="" gene.="">> <mailto:lawrence.michael at="" gene.="">____com >> >> <mailto:lawrence.michael at="" gene.__com="">> <mailto:lawrence.michael at="" gene.com="">>>> wrote: >> > The question is whether there is ever a use >> case to have >> a simple >> table. >> > This is analogous to base R's table and >> data.frame. For >> example, >> if you >> > call xtabs(), you get a table, then you >> have to call >> as.data.frame to get >> > back into the data.frame context. This is >> sort of clean, >> and we could >> > create an extension of table that for >> efficiency stores the >> associated >> > GRanges along with the counts in the .Data >> slot. Then as(x, >> "GRanges") on >> > that would generate the GRanges with the >> count column. >> That would be >> > complicated though. >> > >> > Another issue is that table() cannot be >> used in the >> general way, >> due to >> > restrictions on dispatch with "...". >> > >> > So I think I'm in favor of a new "count" >> generic. That >> naming is >> consistent >> > with countOverlaps, countSubjects, >> countQueries, etc. >> >> Or maybe `tally`? Somehow I have a mental >> association w/ >> that being >> closer to `tabulate`, but I guess it's really >> not and maybe >> it's just >> me that has a mind map that puts `tally` >> closer to `table` than >> `count` is . >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: >> http://cbio.mskcc.org/~lianos/____contact >> <http: cbio.mskcc.org="" ~lianos="" __contact=""> >> >> <http: cbio.mskcc.org="" ~__lianos="" contact="">> <http: cbio.mskcc.org="" ~lianos="" contact="">> >> >> >> >> >> -- >> /A model is a lie that helps you see the truth./ >> / >> / >> Howard Skipper >> >> >> <http: cancerres.__aacrjourna__ls.org="" content="" 31="" 9="" __1173.__full.pdf="">> <http: aacrjournals.org="" content="" 31="" 9="" __1173.full.pdf=""> >> >> <http: cancerres.__aacrjournals.org="" content="" 31="" 9="" __1173.full.pdf="">> >> <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">>> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> >> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> <tel:%28206%29%20667-1319> >> >> >> >> >> >> -- >> /A model is a lie that helps you see the truth./ >> / >> / >> Howard Skipper >> >> <http: cancerres.__aacrjournals.org="" content="" 31="" 9="" __1173.full.pdf="">> >> <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf="">> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> >> >> >> >> -- >> /A model is a lie that helps you see the truth./ >> / >> / >> Howard Skipper >> <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> >> >> >> >> -- >> /A model is a lie that helps you see the truth./ >> / >> / >> Howard Skipper >> <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.2 years ago Hervé Pagès 16k

Login before adding your answer.