duplicated on IRanges object
@manuela-hummel-4312
Last seen 11.1 years ago
Hi,

There seems to be a numerical issue when applying 'duplicated' to an IRanges object. When two ranges are almost the same, and the IRanges object also contains other ranges with a huge width, 'duplicated' identifies the two "almost the same" ranges as "the same".

Take for example these two ranges:

> ir <- IRanges(start=rep(1000000000, 2), width=200:201)
> ir
IRanges of length 2
         start        end width
[1] 1000000000 1000000199   200
[2] 1000000000 1000000200   201

They are obviously not the same:

> duplicated(ir)
[1] FALSE FALSE

But when we now add another range with a huge width:

> ir2
IRanges of length 3
         start        end    width
[1] 1000000000 1000000199      200
[2] 1000000000 1000000200      201
[3]    5000000  100000000 95000001

... the second range is detected as a duplicate of the first:

> duplicated(ir2)
[1] FALSE  TRUE FALSE

I guess the problem is that in .toNumericWithCompatibleOrder the variable max_width becomes so large that

start(x) + width(x)/(max_width+1.00)

becomes numerically identical for ranges like the first two in the example.

Best regards
Manuela

PS: By the way, thanks for the great IRanges package! It makes working with sequence data so much easier.

> sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252
[3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[1] IRanges_1.8.0

Manuela Hummel
Core Facilities - Microarrays Unit
Center for Genomic Regulation (CRG)
Dr. Aiguader 88, 4th floor, Office 439.01
08003 Barcelona
Phone: +34 93 316 0373
e-mail: manuela.hummel at crg.es
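The suspected rounding effect can be reproduced in plain R without IRanges. The sketch below is only an illustration: numeric_key is a hypothetical helper that applies the formula quoted in the post, and the actual .toNumericWithCompatibleOrder in IRanges 1.8.0 may differ in detail. Near 1e9, adjacent doubles are 2^-23 (about 1.2e-7) apart, while the two keys differ by only 1/95000002 (about 1.05e-8), so both round to the same double.

```r
# Hypothetical re-creation of the key from the post (an assumption, not the
# actual IRanges source): start + width / (max_width + 1).
numeric_key <- function(start, width) {
  max_width <- max(width)
  start + width / (max_width + 1.00)
}

# Two ranges only: max_width is 201, so the keys differ by about 0.005.
key2 <- numeric_key(start = c(1000000000, 1000000000),
                    width = c(200, 201))
duplicated(key2)
# [1] FALSE FALSE

# Adding the wide range makes max_width 95000001; the fractional parts of the
# first two keys now differ by less than the spacing of doubles near 1e9.
key3 <- numeric_key(start = c(1000000000, 1000000000, 5000000),
                    width = c(200, 201, 95000001))
duplicated(key3)
# [1] FALSE  TRUE FALSE

# Spacing of doubles in [2^29, 2^30) versus the gap between the two keys.
ulp_near_1e9 <- 2^-23
key_gap <- 1 / (95000001 + 1)
key_gap < ulp_near_1e9
# [1] TRUE
```

The key is monotone in width only as long as the fractional part survives rounding, which is why the problem shows up only for large starts combined with a very large max_width.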
IRanges IRanges • 970 views
@herve-pages-1542
Last seen 12 hours ago
Seattle, WA, United States
Hi Manuela,

Thanks for the report! The duplicated method for Ranges objects has been reimplemented in IRanges 1.8.1. The new implementation no longer uses the trick of converting the ranges into numerical values (there doesn't seem to be an easy/portable way to work around the rounding issues).

This new version of IRanges should become available through biocLite() in the next 12 hours.

Cheers,
H.

On 10/22/2010 07:44 AM, Manuela Hummel wrote:
> Hi,
>
> there seems to be a numerical issue when applying 'duplicated' on an IRanges object.
> [...]

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
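For readers who cannot upgrade right away, one way to avoid the floating-point key is to compare the integer (start, width) pairs directly. The sketch below is an assumption for illustration only: duplicated_ranges is a hypothetical helper built on base R's duplicated() method for data frames, and it is not the code used in IRanges 1.8.1, whose implementation is not described in this thread.

```r
# Hypothetical helper (not the IRanges 1.8.1 implementation): a range counts
# as duplicated when an earlier range has exactly the same start and width.
duplicated_ranges <- function(start, width) {
  stopifnot(length(start) == length(width))
  pairs <- data.frame(start = as.integer(start), width = as.integer(width))
  duplicated(pairs)  # row-wise comparison, no floating-point arithmetic
}

# The three ranges from the report: none is a duplicate of another.
dup_report <- duplicated_ranges(
  start = c(1000000000L, 1000000000L, 5000000L),
  width = c(200L, 201L, 95000001L)
)
dup_report
# [1] FALSE FALSE FALSE

# A true duplicate is still detected.
dup_true <- duplicated_ranges(start = c(10L, 10L), width = c(5L, 5L))
dup_true
# [1] FALSE  TRUE
```

With an actual IRanges object, the two vectors would come from start(x) and width(x).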