Integer overflow when summing an 'integer' Rle
@delhommeemblde-3232
Hi all,

While calculating some statistics for an RNA-seq experiment, I stumbled upon the following problem. Applying the IRanges coverage function to my IRanges, I get back an integer Rle object. However, trying to get the mean or sum of that Rle object results in an integer overflow. The following example illustrates that overflow.

    library(IRanges)
    rC <- Rle(values=as.integer(c(1,(2^31)-1,1)))
    sum(rC)
    mean(rC)

Both result in an integer overflow.

    [1] NA
    Warning message:
    In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
      Integer overflow - use sum(as.numeric(.))

The workaround is to do the following:

    sum(as.numeric(runLength(rC) * runValue(rC)))

but IMO it should be handled in the Rle-level code; i.e. an integer Rle can clearly have a sum, a mean, etc. whose calculation involves values outside the integer range. Is there anything that speaks against having these functions internally convert the integer values to numeric before calculating the sum or mean?

Looking forward to hearing your thoughts on this,

Cheers,

Nico

    sessionInfo()
    R Under development (unstable) (2012-02-07 r58290)
    Platform: x86_64-apple-darwin10.8.0 (64-bit)

    locale:
    [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] IRanges_1.13.24    BiocGenerics_0.1.4

    loaded via a namespace (and not attached):
    [1] tools_2.15.0

---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
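The overflow described above can be reproduced with plain integer vectors. The following is a minimal sketch in base R, assuming that `lens` and `vals` (names chosen here purely for illustration) stand in for runLength(rC) and runValue(rC); it does not call IRanges itself. It also suggests why converting to double before multiplying is the safer order: in sum(as.numeric(runLength(rC) * runValue(rC))) the product is still computed in integer arithmetic and could itself overflow for long runs.

```r
# Illustrative stand-ins for runLength(rC) and runValue(rC); not IRanges objects
lens <- c(1L, 1L, 1L)
vals <- as.integer(c(1, 2^31 - 1, 1))

# Summing in integer arithmetic overflows: base R returns NA with a warning
int_sum <- suppressWarnings(sum(vals * lens))

# Converting to double before multiplying keeps all arithmetic in double
dbl_sum <- sum(as.numeric(vals) * lens)
dbl_mean <- dbl_sum / sum(lens)

# A single long run can overflow in the product itself,
# i.e. before as.numeric() ever sees the value
long_len <- as.integer(2^30)
late <- suppressWarnings(as.numeric(long_len * 2L))  # converted too late: NA
early <- as.numeric(long_len) * 2L                   # converted first: 2^31

print(int_sum)
print(dbl_sum)
print(dbl_mean)
print(c(late = late, early = early))
```

Under these assumptions, the integer sum is NA while the double-based sum and mean are finite.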
@herve-pages-1542
Seattle, WA, United States
Hi Nico,

On 02/10/2012 08:04 AM, Nicolas Delhomme wrote:
> Hi all,
>
> While calculating some statistics for an RNA-seq experiment, I stumbled upon the following problem. Applying the IRanges coverage function to my IRanges, I get back an integer Rle object. However, trying to get the mean or sum of that Rle object results in an integer overflow. The following example illustrates that overflow.
>
>     library(IRanges)
>     rC <- Rle(values=as.integer(c(1,(2^31)-1,1)))
>     sum(rC)
>     mean(rC)
>
> Both result in an integer overflow.
>
>     [1] NA
>     Warning message:
>     In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
>       Integer overflow - use sum(as.numeric(.))
>
> The workaround is to do the following:
>
>     sum(as.numeric(runLength(rC) * runValue(rC)))

Another solution is to convert the 'integer' Rle into a 'numeric' Rle before doing sum(). Unfortunately, since we don't have separate classes for those (for example an IntegerRle and a DoubleRle class), it cannot be done using direct coercion, i.e. with something like:

    as(rC, "DoubleRle")

(Maybe we should have individual Rle subclasses for 'integer' Rle, 'numeric' Rle, 'logical' Rle, 'character' Rle, 'factor' Rle, etc.)

So for now, this conversion must be done with:

    > class(runValue(rC)) <- "double"
    > rC
    'numeric' Rle of length 3 with 3 runs
      Lengths:          1          1          1
      Values :          1 2147483647          1

This works fine with an Rle, but not so well with an RleList, where one needs to do some ugly contortions in order to succeed.

As an alternative to having individual Rle subclasses, maybe we could have an accessor, e.g. rleValueType(), with a getter and a setter, so we could do:

    > rleValueType(rC)
    [1] "integer"
    > rleValueType(rC) <- "double"

and that would work on Rle and RleList objects.

Anyway, even though I think having an easy/unified way of changing the type of the values in Rle/RleList objects is important, maybe I'm going slightly off-topic.

What we should definitely do now is replace this warning:

    Warning message:
    In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
      Integer overflow - use sum(as.numeric(.))

with a more appropriate one (doing as.numeric() on an Rle is not a good idea).

> but IMO it should be handled in the Rle-level code; i.e. an integer Rle can clearly have a sum, a mean, etc. whose calculation involves values outside the integer range.

I agree for mean() so I'll fix that.

But for sum()... "calculating values outside the integer range", even if the result of this calculation itself is not in the integer range? base::sum() will return NA if the result is not in the integer range and I think that's the right thing to do. I don't like the idea of sum() returning a double when the input is integer.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
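The behaviour discussed in this reply (mean() computed in double, sum() on integer input returning NA with a clearer warning, and conversion of the run values to double as the way out) can be sketched in base R. This is only an illustration under stated assumptions: rle_mean() and rle_sum() are hypothetical helpers, base R "rle" objects stand in for Rle, and the warning text is invented here. None of it is the actual IRanges implementation.

```r
# Hypothetical helpers; base R "rle" objects stand in for IRanges Rle objects

rle_mean <- function(x) {
    # Accumulate in double so the intermediate sum cannot overflow
    sum(as.numeric(x$values) * x$lengths) / sum(x$lengths)
}

rle_sum <- function(x) {
    total <- sum(as.numeric(x$values) * x$lengths)
    if (!is.integer(x$values)) {
        return(total)
    }
    # Integer input: keep an integer result, as base::sum() does
    if (is.na(total)) {
        return(NA_integer_)
    }
    if (abs(total) > .Machine$integer.max) {
        # Illustrative wording only, not the message used by IRanges
        warning("sum of 'integer' run values is outside the integer range; ",
                "convert the run values to double first", call. = FALSE)
        return(NA_integer_)
    }
    as.integer(total)
}

x_int <- rle(as.integer(c(1, 2^31 - 1, 1)))

# Same object with its run values converted to double, analogous to
# class(runValue(rC)) <- "double" in the thread
x_dbl <- x_int
x_dbl$values <- as.numeric(x_dbl$values)

int_total <- suppressWarnings(rle_sum(x_int))  # NA, with a warning if not suppressed
dbl_total <- rle_sum(x_dbl)
avg <- rle_mean(x_int)

print(int_total)
print(dbl_total)
print(avg)
```

In this sketch the integer input keeps an integer result type, which matches the consistency argument made above, while the mean is always returned as a double.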
Hi Hervé,

Happy New Year! Well, we're already mid-Feb, but most of the year is still ahead of us ;-)

On 10 Feb 2012, at 19:30, Hervé Pagès wrote:
> Another solution is to convert the 'integer' Rle into a 'numeric' Rle
> before doing sum(). Unfortunately, since we don't have separate
> classes for those (for example an IntegerRle and a DoubleRle
> class), it cannot be done using direct coercion, i.e. with something
> like:
>
>     as(rC, "DoubleRle")
>
> (Maybe we should have individual Rle subclasses for 'integer' Rle,
> 'numeric' Rle, 'logical' Rle, 'character' Rle, 'factor' Rle, etc.)

That could be useful. A few times I have had to do quite a few conversions to go back and forth between different Rle "kinds". Having subclasses would be great.

> So for now, this conversion must be done with:
>
>     > class(runValue(rC)) <- "double"
>     > rC
>     'numeric' Rle of length 3 with 3 runs
>       Lengths:          1          1          1
>       Values :          1 2147483647          1
>
> This works fine with an Rle, but not so well with an RleList, where
> one needs to do some ugly contortions in order to succeed.

Well, I ended up doing that in an lapply and it works just fine. It is not the most memory-efficient approach, though.

> As an alternative to having individual Rle subclasses, maybe we could
> have an accessor, e.g. rleValueType(), with a getter and a setter, so
> we could do:
>
>     > rleValueType(rC)
>     [1] "integer"
>     > rleValueType(rC) <- "double"
>
> and that would work on Rle and RleList objects.

That would indeed be very useful and probably easier to implement.

> Anyway, even though I think having an easy/unified way of changing
> the type of the values in Rle/RleList objects is important, maybe
> I'm going slightly off-topic.
>
> What we should definitely do now is replace this warning:
>
>     Warning message:
>     In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
>       Integer overflow - use sum(as.numeric(.))
>
> with a more appropriate one (doing as.numeric() on an Rle is not a good
> idea).

Indeed.

>> but IMO it should be handled in the Rle-level code; i.e. an integer Rle can clearly have a sum, a mean, etc. whose calculation involves values outside the integer range.
>
> I agree for mean() so I'll fix that.
>
> But for sum()... "calculating values outside the integer range",
> even if the result of this calculation itself is not in the
> integer range? base::sum() will return NA if the result is not in
> the integer range and I think that's the right thing to do.
> I don't like the idea of sum() returning a double when the input
> is integer.

I'm on the same page here. Consistency (especially for R) is crucial. Under these conditions, having a meaningful warning would indeed be the best option.

Thanks for the detailed answer and for the slightly off-topic "diversion".

Cheers,

Nico
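The lapply approach mentioned in this reply can be sketched as follows. The example rests on several assumptions: a plain named list of base R "rle" objects stands in for an RleList, and the element names and values are invented for illustration. With real Rle objects, the conversion inside the function would be the class(runValue(x)) <- "double" step shown earlier in the thread.

```r
# A named list of base R "rle" objects standing in for an RleList
# (names and values are made up for this sketch)
cov_list <- list(
    chr1 = rle(c(1L, 1L, 5L)),
    chr2 = rle(rep(.Machine$integer.max, 2))
)

# Convert the run values of every element to double, one element at a time.
# Each element is copied, which is the memory cost mentioned in the thread.
cov_dbl <- lapply(cov_list, function(x) {
    x$values <- as.numeric(x$values)
    x
})

# Per-element sums, now computed entirely in double
totals <- vapply(cov_dbl, function(x) sum(x$values * x$lengths), numeric(1))

print(vapply(cov_dbl, function(x) typeof(x$values), character(1)))
print(totals)
```

The second element would overflow if it were summed as integers; after the conversion its total is simply returned as a double.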