IRanges coverage integer limit?
1
0
Entering edit mode
@delhommeemblde-3232
Last seen 9.6 years ago
Hi, I've just discovered that the IRanges coverage function would "overflow" without warnings. Below is an example that reproduce it: library(IRanges) rngs <- IRanges(c(1:100),width=100) coverage(rngs) 'integer' Rle of length 199 with 199 runs Lengths: 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1 Values : 1 2 3 4 5 6 7 8 9 10 11 ... 10 9 8 7 6 5 4 3 2 1 coverage(rngs,weight=1e9) 'integer' Rle of length 200 with 200 runs Lengths: 1 1 1 ... 1 1 Values : 1000000000 2000000000 -1294967296 ... 1000000000 0 runValue(coverage(rngs,weight=1e9)) [1] 1000000000 2000000000 -1294967296 -294967296 705032704 1705032704 [7] -1589934592 -589934592 410065408 1410065408 -1884901888 -884901888 ... Clearly, the third position that has a coverage of 3 (not weighted) has a 3e9 weighted one which is > 2^31 (signed integer limit on most machine). I'm just surprised that it is silently ignored. For NGS, getting a bp coverage > 2^31 is unlikely, although I've already seen extremely high coverage for Ribosomal-like protein that were only 10 order of magnitude away (~2M X). This limits the ranges of weights that can be used (weight as of now can only be integers), i.e. a weight of 100 would already be borderline. Is there a way around this, coverage being such a very handy function? I understand that weight being integers probably makes computation faster, but what could be the overhead of allowing numeric instead? And I don't mind looking under the hood if that helps. Cheers, Nico sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C/UTF-8/C/C/C/C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] IRanges_1.15.17 BiocGenerics_0.3.0 loaded via a namespace (and not attached): [1] stats4_2.15.1 tools_2.15.1 --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany
Coverage IRanges Coverage IRanges • 1.5k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States
On 07/03/2012 09:40 AM, Nicolas Delhomme wrote: > Hi, > > I've just discovered that the IRanges coverage function would "overflow" without warnings. Below is an example that reproduce it: > > library(IRanges) > rngs <- IRanges(c(1:100),width=100) > coverage(rngs) > > 'integer' Rle of length 199 with 199 runs > Lengths: 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1 > Values : 1 2 3 4 5 6 7 8 9 10 11 ... 10 9 8 7 6 5 4 3 2 1 > > coverage(rngs,weight=1e9) > > 'integer' Rle of length 200 with 200 runs > Lengths: 1 1 1 ... 1 1 > Values : 1000000000 2000000000 -1294967296 ... 1000000000 0 > > runValue(coverage(rngs,weight=1e9)) > [1] 1000000000 2000000000 -1294967296 -294967296 705032704 1705032704 > [7] -1589934592 -589934592 410065408 1410065408 -1884901888 -884901888 > ... > > Clearly, the third position that has a coverage of 3 (not weighted) has a 3e9 weighted one which is > 2^31 (signed integer limit on most machine). I'm just surprised that it is silently ignored. > > For NGS, getting a bp coverage > 2^31 is unlikely, although I've already seen extremely high coverage for Ribosomal-like protein that were only 10 order of magnitude away (~2M X). This limits the ranges of weights that can be used (weight as of now can only be integers), i.e. a weight of 100 would already be borderline. > > Is there a way around this, coverage being such a very handy function? I understand that weight being integers probably makes computation faster, but what could be the overhead of allowing numeric instead? And I don't mind looking under the hood if that helps. Thanks Nico for catching this other one. I will keep operations in the int space for now (so an 'integer' Rle is always returned) but will make sure a warning is issued and NAs are returned in case of overflow. H. > > Cheers, > > Nico > > sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] C/UTF-8/C/C/C/C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] IRanges_1.15.17 BiocGenerics_0.3.0 > > loaded via a namespace (and not attached): > [1] stats4_2.15.1 tools_2.15.1 > > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Great, thanks! Herv? - how much effort is it to extend it to numeric? I'm willing to do it, I just do not want to start on something where YOU would say it's though ;-) Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On Jul 3, 2012, at 8:00 PM, Hervé Pagès wrote: > On 07/03/2012 09:40 AM, Nicolas Delhomme wrote: >> Hi, >> >> I've just discovered that the IRanges coverage function would "overflow" without warnings. Below is an example that reproduce it: >> >> library(IRanges) >> rngs <- IRanges(c(1:100),width=100) >> coverage(rngs) >> >> 'integer' Rle of length 199 with 199 runs >> Lengths: 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1 >> Values : 1 2 3 4 5 6 7 8 9 10 11 ... 10 9 8 7 6 5 4 3 2 1 >> >> coverage(rngs,weight=1e9) >> >> 'integer' Rle of length 200 with 200 runs >> Lengths: 1 1 1 ... 1 1 >> Values : 1000000000 2000000000 -1294967296 ... 1000000000 0 >> >> runValue(coverage(rngs,weight=1e9)) >> [1] 1000000000 2000000000 -1294967296 -294967296 705032704 1705032704 >> [7] -1589934592 -589934592 410065408 1410065408 -1884901888 -884901888 >> ... >> >> Clearly, the third position that has a coverage of 3 (not weighted) has a 3e9 weighted one which is > 2^31 (signed integer limit on most machine). I'm just surprised that it is silently ignored. >> >> For NGS, getting a bp coverage > 2^31 is unlikely, although I've already seen extremely high coverage for Ribosomal-like protein that were only 10 order of magnitude away (~2M X). This limits the ranges of weights that can be used (weight as of now can only be integers), i.e. a weight of 100 would already be borderline. >> >> Is there a way around this, coverage being such a very handy function? I understand that weight being integers probably makes computation faster, but what could be the overhead of allowing numeric instead? And I don't mind looking under the hood if that helps. > > Thanks Nico for catching this other one. I will keep operations in the > int space for now (so an 'integer' Rle is always returned) but will make > sure a warning is issued and NAs are returned in case of overflow. > > H. > >> >> Cheers, >> >> Nico >> >> sessionInfo() >> R version 2.15.1 (2012-06-22) >> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >> >> locale: >> [1] C/UTF-8/C/C/C/C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] IRanges_1.15.17 BiocGenerics_0.3.0 >> >> loaded via a namespace (and not attached): >> [1] stats4_2.15.1 tools_2.15.1 >> >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delhomme at embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > >
ADD REPLY
0
Entering edit mode
Hi Nico, The overflow issue is addressed in IRanges 1.15.18 (devel). On 07/04/2012 02:16 AM, Nicolas Delhomme wrote: > Great, thanks! > > Herv? - how much effort is it to extend it to numeric? I'm willing to do it, I just do not want to start on something where YOU would say it's though ;-) I don't think it would be tough at all. The real question is: do we want coverage() to always return a numeric-Rle instead of integer-Rle? This will make the Rle 50% bigger in memory, probably not a big deal. On the other hand this would allow treating numeric weights really as numerics instead of truncating them like we do right now: > coverage(IRanges(1:3, width=10)) 'integer' Rle of length 12 with 5 runs Lengths: 1 1 8 1 1 Values : 1 2 3 2 1 > coverage(IRanges(1:3, width=10), weight=2.86) 'integer' Rle of length 12 with 5 runs Lengths: 1 1 8 1 1 Values : 2 4 6 4 2 Maybe one option would be to return an integer-Rle when 'weight' is integer and a numeric-Rle when it's numeric. So by default (i.e. when no weights are supplied) it would still return an integer-Rle (because the default for 'weight' is 1L). But coverage(IRanges(1:3, width=10), weight=2) would return a numeric-Rle and coverage(IRanges(1:3, width=10), weight=2L) an integer-Rle. How does that sound? H. > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On Jul 3, 2012, at 8:00 PM, Hervé Pagès wrote: > >> On 07/03/2012 09:40 AM, Nicolas Delhomme wrote: >>> Hi, >>> >>> I've just discovered that the IRanges coverage function would "overflow" without warnings. Below is an example that reproduce it: >>> >>> library(IRanges) >>> rngs <- IRanges(c(1:100),width=100) >>> coverage(rngs) >>> >>> 'integer' Rle of length 199 with 199 runs >>> Lengths: 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1 >>> Values : 1 2 3 4 5 6 7 8 9 10 11 ... 10 9 8 7 6 5 4 3 2 1 >>> >>> coverage(rngs,weight=1e9) >>> >>> 'integer' Rle of length 200 with 200 runs >>> Lengths: 1 1 1 ... 1 1 >>> Values : 1000000000 2000000000 -1294967296 ... 1000000000 0 >>> >>> runValue(coverage(rngs,weight=1e9)) >>> [1] 1000000000 2000000000 -1294967296 -294967296 705032704 1705032704 >>> [7] -1589934592 -589934592 410065408 1410065408 -1884901888 -884901888 >>> ... >>> >>> Clearly, the third position that has a coverage of 3 (not weighted) has a 3e9 weighted one which is > 2^31 (signed integer limit on most machine). I'm just surprised that it is silently ignored. >>> >>> For NGS, getting a bp coverage > 2^31 is unlikely, although I've already seen extremely high coverage for Ribosomal-like protein that were only 10 order of magnitude away (~2M X). This limits the ranges of weights that can be used (weight as of now can only be integers), i.e. a weight of 100 would already be borderline. >>> >>> Is there a way around this, coverage being such a very handy function? I understand that weight being integers probably makes computation faster, but what could be the overhead of allowing numeric instead? And I don't mind looking under the hood if that helps. >> >> Thanks Nico for catching this other one. I will keep operations in the >> int space for now (so an 'integer' Rle is always returned) but will make >> sure a warning is issued and NAs are returned in case of overflow. >> >> H. >> >>> >>> Cheers, >>> >>> Nico >>> >>> sessionInfo() >>> R version 2.15.1 (2012-06-22) >>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>> >>> locale: >>> [1] C/UTF-8/C/C/C/C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] IRanges_1.15.17 BiocGenerics_0.3.0 >>> >>> loaded via a namespace (and not attached): >>> [1] stats4_2.15.1 tools_2.15.1 >>> >>> >>> --------------------------------------------------------------- >>> Nicolas Delhomme >>> >>> Genome Biology Computational Support >>> >>> European Molecular Biology Laboratory >>> >>> Tel: +49 6221 387 8310 >>> Email: nicolas.delhomme at embl.de >>> Meyerhofstrasse 1 - Postfach 10.2209 >>> 69102 Heidelberg, Germany >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
Hi Herv?, On Jul 10, 2012, at 7:44 PM, Hervé Pagès wrote: > Hi Nico, > > The overflow issue is addressed in IRanges 1.15.18 (devel). Thanks! > > On 07/04/2012 02:16 AM, Nicolas Delhomme wrote: >> Great, thanks! >> >> Herv? - how much effort is it to extend it to numeric? I'm willing to do it, I just do not want to start on something where YOU would say it's though ;-) > > I don't think it would be tough at all. The real question is: do we want > coverage() to always return a numeric-Rle instead of integer-Rle? This > will make the Rle 50% bigger in memory, probably not a big deal. On the > other hand this would allow treating numeric weights really as numerics > instead of truncating them like we do right now: > > > coverage(IRanges(1:3, width=10)) > 'integer' Rle of length 12 with 5 runs > Lengths: 1 1 8 1 1 > Values : 1 2 3 2 1 > > coverage(IRanges(1:3, width=10), weight=2.86) > 'integer' Rle of length 12 with 5 runs > Lengths: 1 1 8 1 1 > Values : 2 4 6 4 2 > > Maybe one option would be to return an integer-Rle when 'weight' is > integer and a numeric-Rle when it's numeric. So by default (i.e. when > no weights are supplied) it would still return an integer-Rle (because > the default for 'weight' is 1L). > But coverage(IRanges(1:3, width=10), weight=2) would return a > numeric-Rle and coverage(IRanges(1:3, width=10), weight=2L) > an integer-Rle. > > How does that sound? That sounds really great! I find it actually really intuitive, i.e. that's how I would expect it to behave. Let me know if there's something I can do to help with the changes, Nico > > H. > >> >> Nico >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delhomme at embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> --------------------------------------------------------------- >> >> >> >> >> >> On Jul 3, 2012, at 8:00 PM, Hervé Pagès wrote: >> >>> On 07/03/2012 09:40 AM, Nicolas Delhomme wrote: >>>> Hi, >>>> >>>> I've just discovered that the IRanges coverage function would "overflow" without warnings. Below is an example that reproduce it: >>>> >>>> library(IRanges) >>>> rngs <- IRanges(c(1:100),width=100) >>>> coverage(rngs) >>>> >>>> 'integer' Rle of length 199 with 199 runs >>>> Lengths: 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1 >>>> Values : 1 2 3 4 5 6 7 8 9 10 11 ... 10 9 8 7 6 5 4 3 2 1 >>>> >>>> coverage(rngs,weight=1e9) >>>> >>>> 'integer' Rle of length 200 with 200 runs >>>> Lengths: 1 1 1 ... 1 1 >>>> Values : 1000000000 2000000000 -1294967296 ... 1000000000 0 >>>> >>>> runValue(coverage(rngs,weight=1e9)) >>>> [1] 1000000000 2000000000 -1294967296 -294967296 705032704 1705032704 >>>> [7] -1589934592 -589934592 410065408 1410065408 -1884901888 -884901888 >>>> ... >>>> >>>> Clearly, the third position that has a coverage of 3 (not weighted) has a 3e9 weighted one which is > 2^31 (signed integer limit on most machine). I'm just surprised that it is silently ignored. >>>> >>>> For NGS, getting a bp coverage > 2^31 is unlikely, although I've already seen extremely high coverage for Ribosomal-like protein that were only 10 order of magnitude away (~2M X). This limits the ranges of weights that can be used (weight as of now can only be integers), i.e. a weight of 100 would already be borderline. >>>> >>>> Is there a way around this, coverage being such a very handy function? I understand that weight being integers probably makes computation faster, but what could be the overhead of allowing numeric instead? And I don't mind looking under the hood if that helps. >>> >>> Thanks Nico for catching this other one. I will keep operations in the >>> int space for now (so an 'integer' Rle is always returned) but will make >>> sure a warning is issued and NAs are returned in case of overflow. >>> >>> H. >>> >>>> >>>> Cheers, >>>> >>>> Nico >>>> >>>> sessionInfo() >>>> R version 2.15.1 (2012-06-22) >>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>>> >>>> locale: >>>> [1] C/UTF-8/C/C/C/C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> other attached packages: >>>> [1] IRanges_1.15.17 BiocGenerics_0.3.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] stats4_2.15.1 tools_2.15.1 >>>> >>>> >>>> --------------------------------------------------------------- >>>> Nicolas Delhomme >>>> >>>> Genome Biology Computational Support >>>> >>>> European Molecular Biology Laboratory >>>> >>>> Tel: +49 6221 387 8310 >>>> Email: nicolas.delhomme at embl.de >>>> Meyerhofstrasse 1 - Postfach 10.2209 >>>> 69102 Heidelberg, Germany >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> >>> >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > >
ADD REPLY
0
Entering edit mode
Hi Nico, On 07/11/2012 02:29 AM, Nicolas Delhomme wrote: > Hi Herv?, > > On Jul 10, 2012, at 7:44 PM, Hervé Pagès wrote: > >> Hi Nico, >> >> The overflow issue is addressed in IRanges 1.15.18 (devel). > > Thanks! > >> >> On 07/04/2012 02:16 AM, Nicolas Delhomme wrote: >>> Great, thanks! >>> >>> Herv? - how much effort is it to extend it to numeric? I'm willing to do it, I just do not want to start on something where YOU would say it's though ;-) >> >> I don't think it would be tough at all. The real question is: do we want >> coverage() to always return a numeric-Rle instead of integer-Rle? This >> will make the Rle 50% bigger in memory, probably not a big deal. On the >> other hand this would allow treating numeric weights really as numerics >> instead of truncating them like we do right now: >> >> > coverage(IRanges(1:3, width=10)) >> 'integer' Rle of length 12 with 5 runs >> Lengths: 1 1 8 1 1 >> Values : 1 2 3 2 1 >> > coverage(IRanges(1:3, width=10), weight=2.86) >> 'integer' Rle of length 12 with 5 runs >> Lengths: 1 1 8 1 1 >> Values : 2 4 6 4 2 >> >> Maybe one option would be to return an integer-Rle when 'weight' is >> integer and a numeric-Rle when it's numeric. So by default (i.e. when >> no weights are supplied) it would still return an integer-Rle (because >> the default for 'weight' is 1L). >> But coverage(IRanges(1:3, width=10), weight=2) would return a >> numeric-Rle and coverage(IRanges(1:3, width=10), weight=2L) >> an integer-Rle. >> >> How does that sound? > > That sounds really great! I find it actually really intuitive, i.e. that's how I would expect it to behave. This is done in IRanges 1.15.20: - with integer weights: > coverage(IRanges(1:3, width=10), weight=2L) integer-Rle of length 12 with 5 runs Lengths: 1 1 8 1 1 Values : 2 4 6 4 2 - width numeric weights: > coverage(IRanges(1:3, width=10), weight=2) numeric-Rle of length 12 with 5 runs Lengths: 1 1 8 1 1 Values : 2 4 6 4 2 > coverage(IRanges(1:3, width=10), weight=2.86) numeric-Rle of length 12 with 5 runs Lengths: 1 1 8 1 1 Values : 2.86 5.72 8.58 5.72 2.86 > coverage(IRanges(1:3, width=10), weight=1e9) numeric-Rle of length 12 with 5 runs Lengths: 1 1 8 1 1 Values : 1e+09 2e+09 3e+09 2e+09 1e+09 Cheers, H. > > Let me know if there's something I can do to help with the changes, > > Nico > >> >> H. >> >>> >>> Nico >>> >>> --------------------------------------------------------------- >>> Nicolas Delhomme >>> >>> Genome Biology Computational Support >>> >>> European Molecular Biology Laboratory >>> >>> Tel: +49 6221 387 8310 >>> Email: nicolas.delhomme at embl.de >>> Meyerhofstrasse 1 - Postfach 10.2209 >>> 69102 Heidelberg, Germany >>> --------------------------------------------------------------- >>> >>> >>> >>> >>> >>> On Jul 3, 2012, at 8:00 PM, Hervé Pagès wrote: >>> >>>> On 07/03/2012 09:40 AM, Nicolas Delhomme wrote: >>>>> Hi, >>>>> >>>>> I've just discovered that the IRanges coverage function would "overflow" without warnings. Below is an example that reproduce it: >>>>> >>>>> library(IRanges) >>>>> rngs <- IRanges(c(1:100),width=100) >>>>> coverage(rngs) >>>>> >>>>> 'integer' Rle of length 199 with 199 runs >>>>> Lengths: 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1 >>>>> Values : 1 2 3 4 5 6 7 8 9 10 11 ... 10 9 8 7 6 5 4 3 2 1 >>>>> >>>>> coverage(rngs,weight=1e9) >>>>> >>>>> 'integer' Rle of length 200 with 200 runs >>>>> Lengths: 1 1 1 ... 1 1 >>>>> Values : 1000000000 2000000000 -1294967296 ... 1000000000 0 >>>>> >>>>> runValue(coverage(rngs,weight=1e9)) >>>>> [1] 1000000000 2000000000 -1294967296 -294967296 705032704 1705032704 >>>>> [7] -1589934592 -589934592 410065408 1410065408 -1884901888 -884901888 >>>>> ... >>>>> >>>>> Clearly, the third position that has a coverage of 3 (not weighted) has a 3e9 weighted one which is > 2^31 (signed integer limit on most machine). I'm just surprised that it is silently ignored. >>>>> >>>>> For NGS, getting a bp coverage > 2^31 is unlikely, although I've already seen extremely high coverage for Ribosomal-like protein that were only 10 order of magnitude away (~2M X). This limits the ranges of weights that can be used (weight as of now can only be integers), i.e. a weight of 100 would already be borderline. >>>>> >>>>> Is there a way around this, coverage being such a very handy function? I understand that weight being integers probably makes computation faster, but what could be the overhead of allowing numeric instead? And I don't mind looking under the hood if that helps. >>>> >>>> Thanks Nico for catching this other one. I will keep operations in the >>>> int space for now (so an 'integer' Rle is always returned) but will make >>>> sure a warning is issued and NAs are returned in case of overflow. >>>> >>>> H. >>>> >>>>> >>>>> Cheers, >>>>> >>>>> Nico >>>>> >>>>> sessionInfo() >>>>> R version 2.15.1 (2012-06-22) >>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>>>> >>>>> locale: >>>>> [1] C/UTF-8/C/C/C/C >>>>> >>>>> attached base packages: >>>>> [1] stats graphics grDevices utils datasets methods base >>>>> >>>>> other attached packages: >>>>> [1] IRanges_1.15.17 BiocGenerics_0.3.0 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] stats4_2.15.1 tools_2.15.1 >>>>> >>>>> >>>>> --------------------------------------------------------------- >>>>> Nicolas Delhomme >>>>> >>>>> Genome Biology Computational Support >>>>> >>>>> European Molecular Biology Laboratory >>>>> >>>>> Tel: +49 6221 387 8310 >>>>> Email: nicolas.delhomme at embl.de >>>>> Meyerhofstrasse 1 - Postfach 10.2209 >>>>> 69102 Heidelberg, Germany >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> >>>> -- >>>> Hervé Pagès >>>> >>>> Program in Computational Biology >>>> Division of Public Health Sciences >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, M1-B514 >>>> P.O. Box 19024 >>>> Seattle, WA 98109-1024 >>>> >>>> E-mail: hpages at fhcrc.org >>>> Phone: (206) 667-5791 >>>> Fax: (206) 667-1319 >>>> >>>> >>> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
Thanks! That's awesome! --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On Jul 13, 2012, at 3:36 AM, Hervé Pagès wrote: > Hi Nico, > > On 07/11/2012 02:29 AM, Nicolas Delhomme wrote: >> Hi Herv?, >> >> On Jul 10, 2012, at 7:44 PM, Hervé Pagès wrote: >> >>> Hi Nico, >>> >>> The overflow issue is addressed in IRanges 1.15.18 (devel). >> >> Thanks! >> >>> >>> On 07/04/2012 02:16 AM, Nicolas Delhomme wrote: >>>> Great, thanks! >>>> >>>> Herv? - how much effort is it to extend it to numeric? I'm willing to do it, I just do not want to start on something where YOU would say it's though ;-) >>> >>> I don't think it would be tough at all. The real question is: do we want >>> coverage() to always return a numeric-Rle instead of integer-Rle? This >>> will make the Rle 50% bigger in memory, probably not a big deal. On the >>> other hand this would allow treating numeric weights really as numerics >>> instead of truncating them like we do right now: >>> >>> > coverage(IRanges(1:3, width=10)) >>> 'integer' Rle of length 12 with 5 runs >>> Lengths: 1 1 8 1 1 >>> Values : 1 2 3 2 1 >>> > coverage(IRanges(1:3, width=10), weight=2.86) >>> 'integer' Rle of length 12 with 5 runs >>> Lengths: 1 1 8 1 1 >>> Values : 2 4 6 4 2 >>> >>> Maybe one option would be to return an integer-Rle when 'weight' is >>> integer and a numeric-Rle when it's numeric. So by default (i.e. when >>> no weights are supplied) it would still return an integer-Rle (because >>> the default for 'weight' is 1L). >>> But coverage(IRanges(1:3, width=10), weight=2) would return a >>> numeric-Rle and coverage(IRanges(1:3, width=10), weight=2L) >>> an integer-Rle. >>> >>> How does that sound? >> >> That sounds really great! I find it actually really intuitive, i.e. that's how I would expect it to behave. > > This is done in IRanges 1.15.20: > > - with integer weights: > > > coverage(IRanges(1:3, width=10), weight=2L) > integer-Rle of length 12 with 5 runs > Lengths: 1 1 8 1 1 > Values : 2 4 6 4 2 > > - width numeric weights: > > > coverage(IRanges(1:3, width=10), weight=2) > numeric-Rle of length 12 with 5 runs > Lengths: 1 1 8 1 1 > Values : 2 4 6 4 2 > > > coverage(IRanges(1:3, width=10), weight=2.86) > numeric-Rle of length 12 with 5 runs > Lengths: 1 1 8 1 1 > Values : 2.86 5.72 8.58 5.72 2.86 > > > coverage(IRanges(1:3, width=10), weight=1e9) > numeric-Rle of length 12 with 5 runs > Lengths: 1 1 8 1 1 > Values : 1e+09 2e+09 3e+09 2e+09 1e+09 > > Cheers, > H. > >> >> Let me know if there's something I can do to help with the changes, >> >> Nico >> >>> >>> H. >>> >>>> >>>> Nico >>>> >>>> --------------------------------------------------------------- >>>> Nicolas Delhomme >>>> >>>> Genome Biology Computational Support >>>> >>>> European Molecular Biology Laboratory >>>> >>>> Tel: +49 6221 387 8310 >>>> Email: nicolas.delhomme at embl.de >>>> Meyerhofstrasse 1 - Postfach 10.2209 >>>> 69102 Heidelberg, Germany >>>> --------------------------------------------------------------- >>>> >>>> >>>> >>>> >>>> >>>> On Jul 3, 2012, at 8:00 PM, Hervé Pagès wrote: >>>> >>>>> On 07/03/2012 09:40 AM, Nicolas Delhomme wrote: >>>>>> Hi, >>>>>> >>>>>> I've just discovered that the IRanges coverage function would "overflow" without warnings. Below is an example that reproduce it: >>>>>> >>>>>> library(IRanges) >>>>>> rngs <- IRanges(c(1:100),width=100) >>>>>> coverage(rngs) >>>>>> >>>>>> 'integer' Rle of length 199 with 199 runs >>>>>> Lengths: 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1 >>>>>> Values : 1 2 3 4 5 6 7 8 9 10 11 ... 10 9 8 7 6 5 4 3 2 1 >>>>>> >>>>>> coverage(rngs,weight=1e9) >>>>>> >>>>>> 'integer' Rle of length 200 with 200 runs >>>>>> Lengths: 1 1 1 ... 1 1 >>>>>> Values : 1000000000 2000000000 -1294967296 ... 1000000000 0 >>>>>> >>>>>> runValue(coverage(rngs,weight=1e9)) >>>>>> [1] 1000000000 2000000000 -1294967296 -294967296 705032704 1705032704 >>>>>> [7] -1589934592 -589934592 410065408 1410065408 -1884901888 -884901888 >>>>>> ... >>>>>> >>>>>> Clearly, the third position that has a coverage of 3 (not weighted) has a 3e9 weighted one which is > 2^31 (signed integer limit on most machine). I'm just surprised that it is silently ignored. >>>>>> >>>>>> For NGS, getting a bp coverage > 2^31 is unlikely, although I've already seen extremely high coverage for Ribosomal-like protein that were only 10 order of magnitude away (~2M X). This limits the ranges of weights that can be used (weight as of now can only be integers), i.e. a weight of 100 would already be borderline. >>>>>> >>>>>> Is there a way around this, coverage being such a very handy function? I understand that weight being integers probably makes computation faster, but what could be the overhead of allowing numeric instead? And I don't mind looking under the hood if that helps. >>>>> >>>>> Thanks Nico for catching this other one. I will keep operations in the >>>>> int space for now (so an 'integer' Rle is always returned) but will make >>>>> sure a warning is issued and NAs are returned in case of overflow. >>>>> >>>>> H. >>>>> >>>>>> >>>>>> Cheers, >>>>>> >>>>>> Nico >>>>>> >>>>>> sessionInfo() >>>>>> R version 2.15.1 (2012-06-22) >>>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] C/UTF-8/C/C/C/C >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] IRanges_1.15.17 BiocGenerics_0.3.0 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] stats4_2.15.1 tools_2.15.1 >>>>>> >>>>>> >>>>>> --------------------------------------------------------------- >>>>>> Nicolas Delhomme >>>>>> >>>>>> Genome Biology Computational Support >>>>>> >>>>>> European Molecular Biology Laboratory >>>>>> >>>>>> Tel: +49 6221 387 8310 >>>>>> Email: nicolas.delhomme at embl.de >>>>>> Meyerhofstrasse 1 - Postfach 10.2209 >>>>>> 69102 Heidelberg, Germany >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>> >>>>> >>>>> -- >>>>> Hervé Pagès >>>>> >>>>> Program in Computational Biology >>>>> Division of Public Health Sciences >>>>> Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Ave. N, M1-B514 >>>>> P.O. Box 19024 >>>>> Seattle, WA 98109-1024 >>>>> >>>>> E-mail: hpages at fhcrc.org >>>>> Phone: (206) 667-5791 >>>>> Fax: (206) 667-1319 >>>>> >>>>> >>>> >>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> >>> >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > >
ADD REPLY

Login before adding your answer.

Traffic: 541 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6