Integer overflow when summing an 'integer' Rle
@valerie-obenchain-4275
United States
Hi Nico,

The following fixes have been applied to IRanges 1.15.43.

(1) The 'Integer overflow' warning thrown by sum() on an integer-Rle is now more appropriate:

    > library(IRanges)
    > x <- Rle(values=as.integer(c(1, 2^31 - 1, 1)))
    > sum(x)
    [1] NA
    Warning message:
    In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
      Integer overflow - use runValue(.) <- as.numeric(runValue(.))

(2) Integers are coerced to numeric when calling mean() on an integer-Rle:

    > mean(x)
    [1] 715827883

Valerie

## Paste of original correspondence between Nico and Herve

[BioC] Integer overflow when summing an 'integer' Rle
Nicolas Delhomme delhomme at embl.de
Tue Feb 14 17:35:48 CET 2012

Hi Hervé,

Happy New Year! Well, we're already mid-Feb, but most of the year is still ahead of us ;-)

On 10 Feb 2012, at 19:30, Hervé Pagès wrote:

> Hi Nico,
>
> On 02/10/2012 08:04 AM, Nicolas Delhomme wrote:
>> Hi all,
>>
>> While calculating some statistics for an RNA-seq experiment I stumbled
>> upon the following problem. Applying the IRanges coverage function to
>> my IRanges, I get back an integer Rle object. However, trying to get
>> the mean or sum of that Rle object results in an integer overflow.
>> The following example illustrates that overflow.
>>
>> library(IRanges)
>> rC <- Rle(values=as.integer(c(1,(2^31)-1,1)))
>> sum(rC)
>> mean(rC)
>>
>> Both result in an integer overflow.
>>
>> [1] NA
>> Warning message:
>> In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
>>   Integer overflow - use sum(as.numeric(.))
>>
>> The solution to that is to do the following:
>>
>> sum(as.numeric(runLength(rC) * runValue(rC)))
>
> Another solution is to convert the 'integer' Rle into a 'numeric' Rle
> before doing sum(). Unfortunately, since we don't have separate
> classes for those (like for example an IntegerRle and a DoubleRle
> class) it cannot be done using direct coercion, i.e. with something
> like:
>
>   as(rC, "DoubleRle")
>
> (Maybe we should have individual Rle subclasses for 'integer' Rle,
> 'numeric' Rle, 'logical' Rle, 'character' Rle, 'factor' Rle etc...)

That could be useful. A few times I have had to do quite a few conversions to go back and forth between different Rle "kinds". Having subclasses would be great.

> So for now, this conversion must be done with:
>
>   > class(runValue(rC)) <- "double"
>   > rC
>   'numeric' Rle of length 3 with 3 runs
>     Lengths:          1          1          1
>     Values :          1 2147483647          1
>
> This works fine with an Rle, but not so much with an RleList where
> one needs to do some ugly contortions in order to succeed.

Well, I ended up doing that in an lapply and it works just fine. Not the most memory-efficient approach, though.

> Alternatively to having individual Rle subclasses maybe we could have
> an accessor e.g. rleValueType(), with getter and setters, so we could
> do:
>
>   > rleValueType(rC)
>   [1] "integer"
>   > rleValueType(rC) <- "double"
>
> and that would work on Rle and RleList objects.

That would indeed be very useful and probably easier to implement.

> Anyway, even though I think having an easy/unified way for changing
> the type of the values in Rle/RleList objects is important, maybe
> I'm going slightly off-topic.
>
> What we should definitely do now is replace this warning:
>
>   Warning message:
>   In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
>     Integer overflow - use sum(as.numeric(.))
>
> by a more appropriate one (doing as.numeric() on an Rle is not a good
> idea).

Indeed.

>> but IMO it should be handled at the Rle level code; i.e. an integer
>> Rle can clearly have a sum, a mean, etc. whose calculation involves
>> values outside the integer range.
>
> I agree for mean() so I'll fix that.
>
> But for sum()... "calculating values outside the integer range",
> even if the result of this calculation itself is not in the
> integer range? base::sum() will return NA if the result is not in
> the integer range and I think that's the right thing to do.
> I don't like the idea of sum() returning a double when the input
> is integer.

I'm on the same page here. Consistency (especially for R) is crucial. Under these conditions, having a meaningful warning would indeed be the best.

Thanks for the detailed answer and for the slightly off-topic "diversion".

Cheers,

Nico

> Cheers,
> H.
>
>> Is there anything that speaks against having these functions
>> internally convert the integer values to numeric before calculating
>> the sum or mean?
>>
>> Looking forward to hearing your thoughts on this,
>>
>> Cheers,
>>
>> Nico
>>
>> sessionInfo()
>> R Under development (unstable) (2012-02-07 r58290)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] IRanges_1.13.24 BiocGenerics_0.1.4
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.15.0
>>
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>> Genome Biology Computational Support
>> European Molecular Biology Laboratory
>
> --
> Hervé Pagès
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
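The overflow discussed in this thread can be reproduced without IRanges. The following is only a minimal sketch in base R: it uses base `rle()` as a stand-in for an IRanges Rle (an assumption made for illustration, not how IRanges is implemented), and the variable names are hypothetical. It shows that the integer sum of values times run lengths overflows to NA, while converting the values to double first gives the expected sum and the mean reported above.

```r
# Same values as in the thread: 1, 2^31 - 1, 1 stored as integers.
x <- c(1L, .Machine$integer.max, 1L)

# Base R run-length encoding, used here only as a stand-in for an Rle.
r <- rle(x)

# Integer arithmetic: the true total (2147483649) does not fit in an
# R integer, so sum() returns NA and emits an overflow warning.
total_int <- suppressWarnings(sum(r$values * r$lengths))

# Convert the run values to double BEFORE multiplying, so neither the
# products nor the sum can overflow the integer range.
total_dbl <- sum(as.numeric(r$values) * r$lengths)

# Mean computed from the double total and the total number of elements.
mean_dbl <- total_dbl / sum(r$lengths)

print(total_int)                  # NA
print(format(total_dbl, digits = 12))  # "2147483649"
print(mean_dbl)                   # 715827883
```

Note that the conversion happens before the multiplication. Converting only the product, as in `as.numeric(lengths * values)`, would still overflow whenever a single value times its run length exceeds the integer range.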
@delhommeemblde-3232
Great!

Thanks,

Nico

On 4 Sep 2012, at 22:16, Valerie Obenchain wrote:

> Hi Nico,
>
> The following fixes have been applied to IRanges 1.15.43
>
> (1) The 'Integer overflow' warning thrown by sum() on an integer-Rle is now more appropriate,
>
>   > library(IRanges)
>   > x <- Rle(values=as.integer(c(1, 2^31 - 1, 1)))
>   > sum(x)
>   [1] NA
>   Warning message:
>   In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
>     Integer overflow - use runValue(.) <- as.numeric(runValue(.))
>
> (2) integers are coerced to numeric when calling mean() on an integer-Rle
>
>   > mean(x)
>   [1] 715827883
>
> Valerie