Matrix-like object with Rle columns

Kasper Daniel Hansen ★ 6.5k


Do we have a matrix-like object, but where the columns are Rle's? Kasper


ADD COMMENT • link updated 12.5 years ago by Michael Lawrence ★ 11k • written 12.5 years ago by Kasper Daniel Hansen ★ 6.5k

Michael Lawrence ★ 11k


Patrick and I had talked about this a long time ago (essentially putting a "dim" attribute on an Rle), but the closest thing today is a DataFrame with Rle columns.

Use case?

Michael

ADD COMMENT • link 12.5 years ago Michael Lawrence ★ 11k

On Mon, Jun 25, 2012 at 11:36 PM, Michael Lawrence wrote:
> Patrick and I had talked about this a long time ago (essentially putting a
> "dim" attribute on an Rle), but the closest thing today is a DataFrame with
> Rle columns.
>
> Use case?

Say I have whole-genome data (for example, coverage) on multiple samples. In my opinion this is usually far easier to think of as a matrix with ~3B rows, and I often want to do rowSums(), colSums(), etc. (in fact, probably the whole API from matrixStats). This is especially nice when you have multiple coverage-like tracks on each sample, so you could have

  trackA : genome by samples
  trackB : genome by samples
  ...

You could think of this as a SummarizedExperiment, but with _extremely_ big matrices in the assay slot.

I want to take advantage of the Rle structure to store the data more efficiently and also to do potentially faster computations.

This is actually closer to my use case, where I currently use matrices with ~30M rows (which works fine), but I would like to expand to ~800M rows (which would suck a bit).

You could also think of a matrix-like object with Rle columns as an alternative sparse matrix structure. In a typical sparse matrix you only store the non-zero entries; here we only store the change-points. Depending on the structure of the matrix, this could be an efficient way to store an otherwise dense matrix.

So essentially, what I want is to have mathematical operations on this object, where I would exploit the fact that all entries are numbers, so the typical matrix operations make sense.

[ Side question which could be relevant in this discussion: for a numeric Rle, is there some notion of precision? Say I have truly numeric values with tons of digits, and I want to consider two numbers part of the same run if |x1 - x2| < epsilon. ]

Kasper

ADD REPLY • link 12.5 years ago Kasper Daniel Hansen ★ 6.5k
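The change-point storage idea and the precision side question in the post above can be illustrated with a minimal sketch. It assumes base R's `rle()` as a stand-in for the Bioconductor Rle class, so it runs without extra packages; `rle_tol()` and its `eps` argument are hypothetical names made up for this illustration, not an existing API.

```r
# Base R rle() stands in for a Bioconductor Rle (assumption for illustration).
cov <- c(0, 0, 0, 5, 5, 5, 5, 2, 2, 0)   # toy coverage vector
r <- rle(cov)
r$values    # one value per run
r$lengths   # run lengths; the change-points are cumsum(r$lengths)

# Hypothetical helper: a new run starts only when a value differs from the
# first value of the current run by eps or more.
rle_tol <- function(x, eps) {
  stopifnot(is.numeric(x), length(x) > 0, eps >= 0)
  vals <- x[1]
  lens <- 1L
  for (xi in x[-1]) {
    k <- length(vals)
    if (abs(xi - vals[k]) < eps) {
      lens[k] <- lens[k] + 1L          # extend the current run
    } else {
      vals <- c(vals, xi)              # open a new run
      lens <- c(lens, 1L)
    }
  }
  structure(list(lengths = lens, values = vals), class = "rle")
}

noisy <- c(1.0000001, 1.0000002, 1.0000003, 2.5, 2.5000001)
rt <- rle_tol(noisy, eps = 1e-3)
```

Comparing against the first value of each run (rather than the previous element) keeps slow drift from being absorbed into one long run. Note that the encoding is lossy: `inverse.rle(rt)` returns the representative value of each run, not the original numbers.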

You can see that Pete has had similar thoughts in genoset/R/DataFrame-methods.R, although he only has colMeans (which is the easy one).

Kasper

ADD REPLY • link 12.5 years ago Kasper Daniel Hansen ★ 6.5k
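Column summaries are "the easy one" because each column can be reduced from its runs alone. The following is a minimal sketch of that idea, again assuming base R `rle()` objects in place of Rle columns; `rle_sum()` and `rle_mean()` are hypothetical helper names, and this is not the genoset code referred to above.

```r
# Columns stored as base R rle objects (stand-in for Rle columns; assumption).
cols <- list(
  s1 = rle(c(0, 0, 0, 5, 5, 2)),
  s2 = rle(c(1, 1, 1, 1, 4, 7))
)

# Hypothetical helpers: they touch one number per run and never expand the
# column to its full length.
rle_sum  <- function(r) sum(r$values * r$lengths)
rle_mean <- function(r) rle_sum(r) / sum(r$lengths)

col_sums  <- vapply(cols, rle_sum,  numeric(1))
col_means <- vapply(cols, rle_mean, numeric(1))
```

The cost is proportional to the number of runs per column rather than the number of rows, which is the potential speed-up mentioned earlier in the thread.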

Seems like it could be a nice thing to have. Presumably one would create an Array subclass of Vector that would add a "dim" attribute. Then Matrix could extend that to constrain dim to length two (unfortunately colliding with the Matrix class in the Matrix package). Then RleMatrix extends Matrix to implement the actual data storage and many of the accelerated methods. As you said, row-oriented methods would be tough.

Any takers?

Michael

ADD REPLY • link 12.5 years ago Michael Lawrence ★ 11k
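The class hierarchy proposed above could be outlined roughly as follows. This is only a sketch under stated assumptions: the class names (`ArraySketch`, `MatrixSketch`, `RleMatrixSketch`) and slot names (`dims`, `values`, `lengths`) are hypothetical, the Vector parent class is left out so that the code runs with base R alone, and none of this is an existing Bioconductor design.

```r
library(methods)

# Hypothetical classes; the "Sketch" suffix avoids any clash with
# Matrix::Matrix or with real Bioconductor classes.
setClass("ArraySketch", representation("VIRTUAL", dims = "integer"))

# Constrain the dimensions to length two.
setClass("MatrixSketch", representation("VIRTUAL"), contains = "ArraySketch",
         validity = function(object) {
           if (length(object@dims) != 2L) "dims must have length two" else TRUE
         })

# Actual storage: one run-length encoding of the column-major data.
setClass("RleMatrixSketch", contains = "MatrixSketch",
         representation(values = "numeric", lengths = "integer"),
         validity = function(object) {
           if (sum(object@lengths) != prod(object@dims))
             "run lengths must add up to prod(dims)"
           else TRUE
         })

setMethod("dim", "ArraySketch", function(x) x@dims)

m <- new("RleMatrixSketch", dims = c(4L, 2L),
         values = c(0, 3, 1), lengths = c(3L, 2L, 3L))
dim(m)
```

In this layout a run may cross a column boundary (the middle run above covers the last row of column 1 and the first row of column 2), which is part of why row-oriented and even per-column methods need care with a single encoding.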

One comment: since a matrix is a vector with a dim attribute, I see that the natural parallel is doing the same for Rle. Nevertheless, that would put an upper limit on the number of runLengths in the entire matrix. My impression (which could be wrong) is that we would need to implement essentially all matrix-like numeric operations from scratch anyway, so it may be worthwhile to consider using a list of Rle's where each Rle is a column, instead of a single Rle to represent all columns. Clearly that depends on implementation details, but if we really need to do everything from scratch, a list of columns might be more flexible (and perhaps even easier to code).

Kasper

ADD REPLY • link 12.5 years ago Kasper Daniel Hansen ★ 6.5k
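With a list of run-length encoded columns, a row-oriented operation such as rowSums() can still work on runs by merging the change-points of all columns. The following is a minimal sketch of that approach, assuming base R `rle()` objects as the columns; `rle_row_sums()` is a hypothetical helper written for this illustration.

```r
# Columns as base R rle objects (stand-in for Rle columns; assumption).
cols <- list(
  s1 = rle(c(0, 0, 0, 5, 5, 2)),
  s2 = rle(c(1, 1, 1, 1, 4, 7))
)

# Hypothetical helper: row sums returned as an rle, computed on runs only.
rle_row_sums <- function(cols) {
  ends <- lapply(cols, function(r) cumsum(r$lengths))   # run end positions
  n <- vapply(cols, function(r) as.numeric(sum(r$lengths)), numeric(1))
  stopifnot(length(unique(n)) == 1L)                    # equal column lengths
  breaks <- sort(unique(unlist(ends)))                  # merged change-points
  total <- numeric(length(breaks))
  for (j in seq_along(cols)) {
    # index of the run in column j that covers the segment ending at each break
    idx <- findInterval(breaks - 1, ends[[j]]) + 1L
    total <- total + cols[[j]]$values[idx]
  }
  # merge neighbouring segments that happen to have the same sum
  out <- rle(total)
  seg_len <- diff(c(0, breaks))
  grp <- rep(seq_along(out$lengths), out$lengths)
  out$lengths <- as.integer(tapply(seg_len, grp, sum))
  out
}

rs <- rle_row_sums(cols)
```

The work is proportional to the number of distinct change-points across columns times the number of columns, not to the number of rows, and the result stays run-length encoded.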

On Wed, Jun 27, 2012 at 8:07 AM, Kasper Daniel Hansen wrote:
> One comment: since matrix is a vector with a dim attribute I see that
> the natural parallel is doing the same for Rle.

Right, in the original plan, the Array class would bring the dim attribute, and RleMatrix would contain both Matrix and Rle.

> Nevertheless, that
> would put an upper limit on the number of runLengths in the entire
> matrix. My impression (which could be wrong) is that we would need to
> implement essentially all matrix-like numeric operations from scratch
> anyway, so it may be worthwhile to consider using a list of Rle's
> where each Rle is a column, instead of a single Rle to represent all
> columns. Clearly that depends on implementation details, but if we
> really need to do everything from scratch, a list of columns might be
> more flexible (and perhaps even easier to code).

This would make it harder to treat RleMatrix as an Rle (which is a nice feature of base R matrices). If the problem is the vector length limit, then I'd rather wait for Luke's fix, which apparently is coming along.

ADD REPLY • link 12.5 years ago Michael Lawrence ★ 11k

I would love this feature, and would use it all the time, if it existed.

Jeff

ADD REPLY • link 12.5 years ago Jeff Leek ▴ 650

Hi guys,

Note that some of the things in the "matrix API" seem to work on standard data frames:

  > df <- data.frame(aa=1:5, bb=100)
  > rowSums(df)
  [1] 101 102 103 104 105
  > colSums(df)
   aa  bb
   15 500
  > max(df)
  [1] 100
  > min(df)
  [1] 1
  > range(df)
  [1]   1 100
  > df + df
    aa  bb
  1  2 200
  2  4 200
  3  6 200
  4  8 200
  5 10 200
  > df <= 3
          aa    bb
  [1,]  TRUE FALSE
  [2,]  TRUE FALSE
  [3,]  TRUE FALSE
  [4,] FALSE FALSE
  [5,] FALSE FALSE

etc...

But none of them work on DataFrame. Maybe if they did, we wouldn't need RleMatrix? Using DataFrame instead of RleMatrix would be nice because it reuses what we already have. It would also avoid the pitfall of the length of an RleMatrix not being representable with a 32-bit int when, let's say, the number of rows is 800M and there are a few columns (like in Kasper's use case). No need to wait for Luke's "big vector" hack.

Cheers,
H.

ADD REPLY • link 12.5 years ago Hervé Pagès 16k
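The 32-bit length pitfall mentioned above comes down to simple arithmetic, shown here as a short sketch. The variable names are made up for illustration; `.Machine$integer.max` is the largest value R's integer type can hold.

```r
n_rows  <- 800e6                      # ~800M rows, as in Kasper's use case
int_max <- .Machine$integer.max       # largest value of R's integer type

# Number of columns a single-vector layout could hold before
# n_rows * n_cols no longer fits in a 32-bit integer.
max_cols <- floor(int_max / n_rows)

three_cols_overflow <- n_rows * 3 > int_max   # total length of a 3-column matrix
one_col_fits        <- n_rows <= int_max      # each column on its own
```

With 800M rows, a single vector indexed by a 32-bit integer tops out at two columns, while a column-per-object layout such as a DataFrame only needs each column's own length to fit.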

On Wed, Jun 27, 2012 at 3:30 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi guys,
>
> Note that some of the things in the "matrix API" seem to work on
> standard data frames:
>
>     > df <- data.frame(aa=1:5, bb=100)
>     > rowSums(df)
>     [1] 101 102 103 104 105
>     > colSums(df)
>      aa  bb
>      15 500
>     > max(df)
>     [1] 100
>     > min(df)
>     [1] 1
>     > range(df)
>     [1]   1 100
>     > df + df
>       aa  bb
>     1  2 200
>     2  4 200
>     3  6 200
>     4  8 200
>     5 10 200
>     > df <= 3
>             aa    bb
>     [1,]  TRUE FALSE
>     [2,]  TRUE FALSE
>     [3,]  TRUE FALSE
>     [4,] FALSE FALSE
>     [5,] FALSE FALSE
>
> etc...
>
> But none of them work on DataFrame. Maybe if they were we wouldn't need
> RleMatrix? Using DataFrame instead of RleMatrix would be nice because it
> reuses what we already have. It would also avoid the pitfall of having
> the length of an RleMatrix not being representable with a 32-bit int
> when let's say the nb of rows is 800M and there are a few nb of cols
> (like in Kasper's use case). No need to wait for Luke's "big vector"
> hack.

This is totally fine with me, as long as coercion from Rle to a normal vector is avoided.

But it might make sense to have a derivative class ensuring that all columns are numeric in nature.

Kasper

> Cheers,
> H.
>
> On 06/27/2012 10:46 AM, Jeff Leek wrote:
>> I would love/use all the time this feature if it existed.
>>
>> Jeff
>>
>> On Wed, Jun 27, 2012 at 11:21 AM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>>> On Wed, Jun 27, 2012 at 8:07 AM, Kasper Daniel Hansen
>>> <kasperdanielhansen at gmail.com> wrote:
>>>> One comment: since matrix is a vector with a dim attribute I see that
>>>> the natural parallel is doing the same for Rle.
>>>
>>> Right, in the original plan, the Array class would bring the dim
>>> attribute, and RleMatrix would contain both Matrix and Rle.
>>>
>>>> Nevertheless, that would put an upper limit on the number of
>>>> runLengths in the entire matrix. My impression (which could be wrong)
>>>> is that we would need to implement essentially all matrix-like numeric
>>>> operations from scratch anyway, so it may be worthwhile to consider
>>>> using a list of Rle's where each Rle is a column, instead of a single
>>>> Rle to represent all columns. Clearly that depends on implementation
>>>> details, but if we really need to do everything from scratch, a list
>>>> of columns might be more flexible (and perhaps even easier to code).
>>>
>>> This would make it harder to treat RleMatrix as an Rle (which is a nice
>>> feature of base R matrices). If the problem is the vector length limit,
>>> then I'd rather wait for Luke's fix, which apparently is coming along.
>>>
>>>> Kasper
>>>>
>>>> On Tue, Jun 26, 2012 at 6:41 AM, Michael Lawrence
>>>> <lawrence.michael at gene.com> wrote:
>>>>> Seems like it could be a nice thing to have. Presumably one would
>>>>> create an Array subclass of Vector that would add a "dim" attribute.
>>>>> Then Matrix could extend that to constrain dim to length two
>>>>> (unfortunately colliding with the Matrix class in the Matrix
>>>>> package). Then RleMatrix extends Matrix to implement the actual data
>>>>> storage and many of the accelerated methods. As you said,
>>>>> row-oriented methods would be tough.
>>>>>
>>>>> Any takers?
>>>>>
>>>>> Michael
>>>>>
>>>>> On Mon, Jun 25, 2012 at 9:11 PM, Kasper Daniel Hansen
>>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>>> [...]
>>>>>> You can see that Pete has had similar thoughts in
>>>>>> genoset/R/DataFrame-methods.R, although he only has colMeans (which
>>>>>> is the easy one).
>>>>>>
>>>>>> Kasper
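
To make the two requests in Kasper's reply concrete (no coercion of run-length encoded columns to ordinary vectors, and a guarantee that every column is numeric), here is a minimal base-R sketch. It uses base::rle() rather than the IRanges Rle class so that it runs without Bioconductor, and the helper names rle_columns() and col_sums() are hypothetical, not part of any package.

```r
# Sketch only: a "matrix" stored as a named list of run-length encoded columns.
# rle_columns() and col_sums() are hypothetical helpers, not IRanges API.
rle_columns <- function(...) {
  cols <- lapply(list(...), function(x) {
    stopifnot(is.numeric(x))            # enforce "all columns are numeric"
    rle(x)
  })
  nrows <- vapply(cols, function(r) as.numeric(sum(r$lengths)), numeric(1))
  stopifnot(length(unique(nrows)) == 1L)  # every column needs the same number of rows
  cols
}

# Column sums computed from the runs alone; no column is ever expanded.
col_sums <- function(cols) {
  vapply(cols, function(r) sum(r$values * r$lengths), numeric(1))
}

m <- rle_columns(aa = c(0, 0, 0, 2, 2), bb = rep(100, 5))
col_sums(m)
```

The cost of col_sums() is proportional to the number of runs rather than the number of rows, which is the property that makes the run-length representation attractive for coverage-like data.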

ADD REPLY • link 12.5 years ago Kasper Daniel Hansen ★ 6.5k

It does seem that the R data.frame tries its best to act like a matrix. For example, df + 1 and paste(df, "foo") act as one would expect. I've never used it in this way, probably because it's best to use matrices for that sort of thing.

Kasper, maybe you could list out the types of operations you would require? For example, colSums, rowSums, ...

In terms of performance, is the number of columns small enough for iteration in R? We could fast-path the special case where every column is an Rle, but in general it would be hard to push that down into C.

Michael

On Wed, Jun 27, 2012 at 12:54 PM, Kasper Daniel Hansen <kasperdanielhansen@gmail.com> wrote:
> This is totally fine with me, as long as coercion from Rle to a normal
> vector is avoided.
>
> But it might make sense to have a derivative class ensuring that all
> columns are numeric in nature.
>
> Kasper
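
As a rough illustration of "iteration over columns in R" for a row-oriented operation, the sketch below computes row sums across run-length encoded columns and returns the result in run-length form. It is an assumption-laden sketch: it uses base::rle() objects instead of IRanges Rle objects, and row_sums_rle() is a hypothetical name. The idea is to take the union of all run end positions, look up each column's value on each resulting segment, and add them up column by column.

```r
# Sketch only: row sums over a list of base::rle columns of equal total length.
# row_sums_rle() is a hypothetical helper; the loop runs over columns, never over rows.
row_sums_rle <- function(cols) {
  ends <- lapply(cols, function(r) cumsum(r$lengths))    # run end positions per column
  bp <- sort(unique(unlist(ends, use.names = FALSE)))    # union of all run ends
  total <- numeric(length(bp))
  for (j in seq_along(cols)) {
    # index of the run in column j that covers each segment ending at bp
    idx <- findInterval(bp, ends[[j]], left.open = TRUE) + 1L
    total <- total + cols[[j]]$values[idx]
  }
  # merge neighbouring segments that ended up with the same sum
  keep <- c(total[-1L] != total[-length(total)], TRUE)
  structure(list(lengths = diff(c(0L, bp[keep])), values = total[keep]),
            class = "rle")
}

cols <- list(aa = rle(c(0, 0, 0, 2, 2)), bb = rle(c(1, 1, 5, 5, 5)))
rs <- row_sums_rle(cols)
inverse.rle(rs)
```

The work grows with the number of columns times the number of distinct run ends, so it stays cheap when there are few samples and long runs; with many columns or very fragmented runs a C-level implementation would likely be needed.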

ADD REPLY • link 12.5 years ago Michael Lawrence ★ 11k

Hi Kasper,

On 06/25/2012 08:56 PM, Kasper Daniel Hansen wrote:
[...]
> [ side question which could be relevant in this discussion: for a
> numeric Rle is there some notion of precision - say I have truly
> numeric values with tons of digits, and I want to consider two numbers
> part of the same run if |x1 - x2| < epsilon? ]

The comparison of 2 doubles is done at the C level with ==, which AFAIK is the same as doing == in R (as long as we deal with non-NA and non-NaN values). See the _fill_Rle_slots_with_double_vals() helper function in IRanges/src/Rle_class.c for the details.

Therefore:

    > all.equal(sqrt(3)^2, 3)
    [1] TRUE
    > sqrt(3)^2 == 3
    [1] FALSE
    > Rle(c(sqrt(3)^2, 3))
    'numeric' Rle of length 2 with 2 runs
      Lengths: 1 1
      Values : 3 3

Note that base::rle() does the same:

    > rle(c(sqrt(3)^2, 3))
    Run Length Encoding
      lengths: int [1:2] 1 1
      values : num [1:2] 3 3

I can see that using a "|x1 - x2| < epsilon" criterion would in general give better compression (fewer runs) but then the compression would not be lossless as it is right now:

    > x <- c(sqrt(3)^2, 3)
    > identical(as.vector(Rle(x)), x)
    [1] TRUE
    > identical(inverse.rle(rle(x)), x)
    [1] TRUE

Also the "|x1 - x2| < epsilon" approach would introduce some subtle complications due to the fact that the criterion is not transitive anymore, i.e. you can have |x1 - x2| < epsilon and |x2 - x3| < epsilon without having |x1 - x3| < epsilon. Because of that, finding the runs becomes some kind of clustering problem with several possible strategies, some of them very simple but not necessarily with the "good properties".

H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
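
The non-transitivity problem described above can be seen with three made-up numbers. In the base-R sketch below, the values, the tolerance, and the two strategy functions (runs_chain() and runs_anchor()) are all hypothetical and only meant to show that two equally simple rules for "same run" give different answers on the same input.

```r
# Sketch only: two naive ways of finding runs under an |x1 - x2| < eps criterion.
eps <- 1e-3
x <- c(0, 0.0006, 0.0012)  # neighbours are within eps, first and last are not

# Strategy A: start a new run when a value is not within eps of the PREVIOUS value.
runs_chain <- function(x, eps) cumsum(c(TRUE, abs(diff(x)) >= eps))

# Strategy B: start a new run when a value is not within eps of the FIRST value
# of the current run. Assumes x is non-empty and free of NA/NaN.
runs_anchor <- function(x, eps) {
  id <- integer(length(x))
  anchor <- x[1]
  current <- 1L
  for (i in seq_along(x)) {
    if (abs(x[i] - anchor) >= eps) {
      current <- current + 1L
      anchor <- x[i]
    }
    id[i] <- current
  }
  id
}

runs_chain(x, eps)   # run ids: 1 1 1 (a single run)
runs_anchor(x, eps)  # run ids: 1 1 2 (two runs)
```

Strategy A can chain arbitrarily distant values into one run through a sequence of small steps, while Strategy B depends on where each run happens to start, so neither gives the order-independent, lossless behaviour of exact == comparison.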

ADD REPLY • link 12.5 years ago Hervé Pagès 16k

On Wed, Jun 27, 2012 at 1:37 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
> [...]
> Also the "|x1 - x2| < epsilon" approach would introduce some subtle
> complications due to the fact that the criterion is not transitive
> anymore, i.e. you can have |x1 - x2| < epsilon and |x2 - x3| < epsilon
> without having |x1 - x3| < epsilon. Because of that, finding the runs
> becomes some kind of clustering problem with several possible
> strategies, some of them very simple but not necessarily with
> the "good properties".

One simple "clustering" would be to round to some fixed level of precision. One could multiply by some power of 10 and coerce to integer to avoid any floating point issues.
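
A minimal base-R sketch of the "multiply by a power of 10 and coerce to integer" idea follows. The helper name scale_to_int() and the choice of 4 decimal digits are assumptions for illustration, and base::rle() stands in for the IRanges Rle class. Note that R integers are 32-bit, so the scaled values must stay within .Machine$integer.max.

```r
# Sketch only: fixed-precision run-length encoding via integer scaling.
# scale_to_int() is a hypothetical helper, not part of IRanges.
scale_to_int <- function(x, digits = 4L) {
  scaled <- round(x * 10^digits)
  # guard against overflow of R's 32-bit integers
  stopifnot(all(abs(scaled) <= .Machine$integer.max, na.rm = TRUE))
  as.integer(scaled)
}

x <- c(sqrt(3)^2, 3, 3.00004, 3.5)
r <- rle(scale_to_int(x))
r$lengths          # 3 1
r$values / 10^4    # decoded values at the chosen precision
```

Because the runs are found on integers, the comparison is exact and transitive again, at the price of losing everything beyond the chosen number of digits.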

ADD REPLY • link 12.5 years ago Michael Lawrence ★ 11k

Hi Michael,

On 06/27/2012 01:58 PM, Michael Lawrence wrote:
> One simple "clustering" would be to round to some fixed level of
> precision. One could multiply by some power of 10 and coerce to integer
> to avoid any floating point issues.

Like for example Rle(round(x, digits=4)). If people feel that this would be useful, we could add a 'digits' arg to the Rle() constructor so the rounding is taken care of by the constructor itself. It would default to NA for no rounding at all (like now), so the good properties are preserved, e.g. lossless compression and the fact that unique, duplicated, is.unsorted, sort, order, rank etc. (anything involving comparison between doubles) will behave exactly the same way on x and Rle(x) (there is code around that relies on such behavior).

Also maybe we could consider doing signif() instead of round().

Cheers,
H.
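
To show what the proposed 'digits' argument might look like, here is a hedged base-R sketch built on base::rle(). The wrapper rle_digits() and its 'rounding' argument are hypothetical; nothing here is the actual Rle() constructor, which (as described above) does no rounding.

```r
# Sketch only: a wrapper mimicking a possible 'digits' argument.
# digits = NA (the default) means no rounding, so encoding stays lossless.
rle_digits <- function(x, digits = NA, rounding = c("round", "signif")) {
  rounding <- match.arg(rounding)
  if (!is.na(digits)) {
    x <- if (rounding == "round") round(x, digits) else signif(x, digits)
  }
  rle(x)
}

x <- c(sqrt(3)^2, 3, 12345.678, 12345.679)

length(rle_digits(x)$lengths)                                    # exact comparison
length(rle_digits(x, digits = 2)$lengths)                        # 2 decimal places
length(rle_digits(x, digits = 4, rounding = "signif")$lengths)   # 4 significant digits
identical(inverse.rle(rle_digits(x)), x)                         # default is lossless
```

round() fixes the number of decimal places, which suits data on a known scale, whereas signif() fixes the relative precision, which may be preferable when values span several orders of magnitude.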

ADD REPLY • link 12.5 years ago Hervé Pagès 16k
