WARNING: difference in sorting order depending on computer platform?!?

Jenny Drnevich ★ 2.0k

United States

Hi all,

I just found a problem/discrepancy in running R on PC vs. Unix/Linux server. Maybe it's widely known, but I didn't know about it and it caused me big problems. I mostly use my desktop PC for running microarray analyses, but occasionally I have projects that require more memory. Then I run some of the memory-intensive steps on our Linux server (which has a lot more memory but is REALLY slow), save the objects, and go back to my PC to finish the analysis. Well, it turns out that the order of probe set IDs as returned by featureNames() is slightly different between the computer platforms. I first thought it might be due to a difference between the Windows binary of the chipnamecdf library and the *nix compilation of the source file, but I think it's just a difference in the way the computer platforms sort character data that contain numbers. I've put a full, reproducible example below (our sys admin hasn't upgraded R on the server yet, but I doubt that's the problem), but in short, my PC puts 177_at before 1773_at, while the server puts 1773_at before 177_at.

I guess this really isn't a "bug" that can be fixed, and I know it's not a good idea to run part of your R code on one computer and part on another computer, but don't you agree that this is undesirable behavior? Maybe I'm not computer-literate enough to have known that this is a well-known issue, so in part I'm posting this as a warning to others like me - I don't remember seeing anything like this in the 4+ years I've been following the BioC list. I'm also wondering, beyond however many of my analyses may have been messed up slightly (ARRRGGHH!!), whether this could cause problems in things like public repositories. I know databases don't depend on order, but I'd be surprised if it hasn't caused problems somewhere else. In this case, only 117 probe sets out of 22,277 don't match up, so it would be hard to notice!
Thanks,
Jenny

> library(affy)
Loading required package: Biobase

Welcome to Bioconductor

  Vignettes contain introductory material. To view, type
  'openVignette()'. To cite Bioconductor, see
  'citation("Biobase")' and for packages 'citation(pkgname)'.

> library(ArrayExpress)
>
> rawset = ArrayExpress("E-MEXP-1422")
trying URL 'http://www.ebi.ac.uk/microarray-as/ae/files/E-MEXP-1422/index.html'
Content type 'text/html;charset=ISO-8859-1' length unknown
opened URL
downloaded 7746 bytes

trying URL 'http://www.ebi.ac.uk/microarray-as/ae/files/E-MEXP-1422/E-MEXP-1422.raw.1.zip'
Content type 'application/zip' length 11200346 bytes (10.7 Mb)
opened URL
downloaded 10.7 Mb

Read 1 item
trying URL 'http://www.ebi.ac.uk/microarray-as/ae/files/E-MEXP-1422/E-MEXP-1422.sdrf.txt'
Content type 'text/plain' length 6679 bytes
opened URL
downloaded 6679 bytes

trying URL 'http://www.ebi.ac.uk/microarray-as/ae/files/A-AFFY-37/A-AFFY-37.adf.txt'
Content type 'text/plain' length 3590863 bytes (3.4 Mb)
opened URL
downloaded 3.4 Mb

trying URL 'http://www.ebi.ac.uk/microarray-as/ae/files/E-MEXP-1422/E-MEXP-1422.idf.txt'
Content type 'text/plain' length 5378 bytes
opened URL
downloaded 5378 bytes

Read 49 items

 The object containing experiment E-MEXP-1422 has been built.

> rawset
AffyBatch object
size of arrays=732x732 features (8499 kb)
cdf=HG-U133A_2 (22277 affyids)
number of samples=6
number of genes=22277
annotation=hgu133a2
notes=E-MEXP-1422
        E-MEXP-1422
        RNAi
        c("cellular_modification_design", "co-expression_design", "in_vitro_design", "RNAi")
        NULL
>
> PSnames.PC <- featureNames(rawset)
>
> all.equal(PSnames.PC, featureNames(rawset))
[1] TRUE
>
> save.image("NameOrderTest.RData")
>
> sessionInfo()
R version 2.10.1 (2009-12-14)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hgu133a2cdf_2.5.0  ArrayExpress_1.6.1 affy_1.24.2        Biobase_2.6.1

loaded via a namespace (and not attached):
[1] affyio_1.14.0        limma_3.2.1          preprocessCore_1.8.0
[4] tools_2.10.1         XML_2.6-0
>
> q()


# now move to Linux server:

> library(affy)
Loading required package: Biobase

Welcome to Bioconductor

  Vignettes contain introductory material. To view, type
  'openVignette()'. To cite Bioconductor, see
  'citation("Biobase")' and for packages 'citation(pkgname)'.

> load("NameOrderTest.RData")
>
> all.equal(PSnames.PC, featureNames(rawset))
[1] "117 string mismatches"
>
> x <- data.frame(PC=PSnames.PC, Linux=featureNames(rawset), stringsAsFactors=F)
>
> x[ x[,1] != x[,2] , ][ 1:5 , ]
            PC     Linux
17      177_at   1773_at
18     1773_at    177_at
2328 2028_s_at 202800_at
2329 202800_at 202801_at
2330 202801_at 202802_at
>
> all.equal(sort(PSnames.PC), featureNames(rawset))
[1] TRUE
>
> PSnames.linux <- featureNames(rawset)
>
> save.image("NameOrderTest.RData")
>
> sessionInfo()
R version 2.9.0 (2009-04-17)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hgu133a2cdf_2.4.0 affy_1.22.0       Biobase_2.4.0

loaded via a namespace (and not attached):
[1] affyio_1.8.1         preprocessCore_1.6.0 tools_2.9.0
>
> q()


# now move back to PC:

> library(affy)
Loading required package: Biobase

Welcome to Bioconductor

  Vignettes contain introductory material. To view, type
  'openVignette()'. To cite Bioconductor, see
  'citation("Biobase")' and for packages 'citation(pkgname)'.

> load("NameOrderTest.RData")
>
> all.equal(PSnames.PC, featureNames(rawset))
[1] TRUE
>
> all.equal(PSnames.linux, featureNames(rawset))
[1] "117 string mismatches"
>
> all.equal(sort(PSnames.linux), featureNames(rawset))
[1] TRUE

Jenny Drnevich, Ph.D.

Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA

ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at illinois.edu
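A quick way to tell "same IDs, different order" apart from a genuine mismatch is to compare the vectors both as ordered sequences and as sets. The sketch below uses two short toy vectors built from the IDs shown in the transcript; the orderings are illustrative only and are not taken from the real annotation.

```r
# Toy vectors: the same four probe set IDs in two different orders
# (illustrative orderings, not the real featureNames() output).
pc    <- c("177_at", "1773_at", "2028_s_at", "202800_at")
linux <- c("1773_at", "177_at", "202800_at", "2028_s_at")

same_order <- identical(pc, linux)  # element-by-element comparison
same_set   <- setequal(pc, linux)   # comparison ignoring order
n_mismatch <- sum(pc != linux)      # number of positions that differ

# Same set but different order points to a sorting issue,
# not to missing or extra probe sets.
print(c(same_order = same_order, same_set = same_set))
print(n_mismatch)
```

If same_set were FALSE, the two objects would actually contain different IDs, which is a different problem from collation.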


ADD COMMENT • link updated 14.3 years ago by Seth Falcon ★ 7.4k • written 14.3 years ago by Jenny Drnevich ★ 2.0k

Seth Falcon ★ 7.4k

Hi Jenny,

On 1/28/10 12:16 PM, Jenny Drnevich wrote:
> I just found a problem/discrepancy in running R on PC vs. Unix/Linux
> server. Maybe it's widely known, but I didn't know about it and it
> caused me big problems.

Ouch, that's not a fun problem to run into. The issue here is not so much platform as what's called locale. Locale settings determine such things as how numbers should be displayed ("," vs "."), time format, and indeed sorting of strings.

You can read up on locale on Wikipedia: http://en.wikipedia.org/wiki/Locale

Different locale settings impose different orderings of strings. Once you know this, the good news is that you can control the locale setting that R uses and should be able to obtain stable sorting across platforms.

Here's an example run on a Windows system:

> strsplit(Sys.getlocale(), ";")
[[1]]
[1] "LC_COLLATE=English_United States.1252"
[2] "LC_CTYPE=English_United States.1252"
[3] "LC_MONETARY=English_United States.1252"
[4] "LC_NUMERIC=C"
[5] "LC_TIME=English_United States.1252"

> v = c("177_at", "1773_at")
> sort(v)
[1] "177_at"  "1773_at"
> Sys.setlocale(locale="C")
[1] "C"
> sort(v)
[1] "1773_at" "177_at"

Note that not all locales are available on all systems, but the "C" locale is the basic common denominator -- but only supports ASCII not extended character sets.

In summary, I think you can continue to use your two different systems if you do Sys.setlocale(locale="C") at the start of your script.

+ seth

--
Seth Falcon
Bioconductor Core Team | FHCRC
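As a variation on the advice above, the collation category alone can be switched to "C" for a single sort and then restored, so the rest of the session keeps its locale. This is a minimal sketch; sort_c() is a hypothetical helper name, not a function from base R or Bioconductor. The last line relies on the documented behaviour of method = "radix" in current R versions (it sorts character vectors in the C locale), which was not available in the R 2.x versions used in this thread.

```r
# Hypothetical helper: sort a character vector using "C" collation,
# restoring the previous LC_COLLATE setting on exit.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

v <- c("177_at", "1773_at")

# In the C locale "3" (0x33) sorts before "_" (0x5F),
# so "1773_at" comes before "177_at".
res_c <- sort_c(v)
print(res_c)

# In current R versions, radix sorting of character vectors also uses
# C-locale order regardless of the session locale.
res_radix <- sort(v, method = "radix")
print(res_radix)
```

Either approach gives an order that does not depend on the machine's locale, but it only helps for objects that are sorted after the setting is in place; objects saved earlier keep the order they were created with.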

ADD COMMENT • link 14.3 years ago Seth Falcon ★ 7.4k

Hi Seth and Jenny,

not quite so... have a look at the "Details" section of the manual page "Comparison" in the base package (type: "?Comparison"):

    Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see 'locales'. The collating sequence of locales such as 'en_US' is normally different from 'C' (which should use ASCII) and can be surprising. Beware of making _any_ assumptions about the collation order: e.g. in Estonian 'Z' comes between 'S' and 'T', and collation is not necessarily character-by-character - in Danish 'aa' sorts as a single letter, after 'z'. In Welsh 'ng' may or may not be a single sorting unit: if it is it follows 'g'. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.

In Jenny's case, it is probably best not to rely on any sorting behaviour, and access the features based on their names.

Best wishes
Wolfgang

--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact
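Accessing features by name can be sketched with a plain matrix whose row names are probe set IDs. The values and the second ordering below are made up for illustration; with a real AffyBatch or ExpressionSet the same idea would apply to whatever matrix of values you extract from it.

```r
# Toy expression matrix; probe set IDs are the row names (values are made up).
ids  <- c("177_at", "1773_at", "2028_s_at", "202800_at")
expr <- matrix(seq_along(ids) * 10, ncol = 1,
               dimnames = list(ids, "sample1"))

# The same IDs in a different order, as another machine might produce them.
ids_other <- c("1773_at", "177_at", "202800_at", "2028_s_at")

# Indexing by name returns rows in the requested order, so the result
# does not depend on how either machine happened to sort the IDs.
expr_other <- expr[ids_other, , drop = FALSE]
print(expr_other)
```

Because the lookup is by name, each value stays attached to its own probe set ID no matter which order is requested.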

ADD REPLY • link 14.2 years ago Wolfgang Huber ★ 13k

Benilton Carvalho ★ 4.3k

Brazil/Campinas/UNICAMP

Jenny,

Say you run some stuff on your Linux machine and some other on a server in Denmark... It's likely you're going to get something like the following:

x <- c("a", "aaa", "z", "aa")
Sys.setlocale(locale="C") ## a machine in the US
sort(x)
## a aa aaa z
Sys.setlocale(locale="da_DK") ## a machine in Denmark
sort(x)
## a z aa aaa

There may be something more elegant, but when I had to handle this, I started using match() a lot to ensure the probesets were properly aligned.

You could force the locale to be the same on whatever machine you use, but I'm not sure this is a good idea.

You can check the locales you get on both machines using Sys.getlocale().

cheers,
b
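The match() approach mentioned above can be sketched as follows. The ID vectors and values are toy data for illustration; ids_pc stands for the order expected on one machine and ids_linux for the order in which results were produced on the other.

```r
# Toy data: IDs in the order expected on the PC, and results computed
# elsewhere in a different order (values are made up).
ids_pc       <- c("177_at", "1773_at", "2028_s_at", "202800_at")
ids_linux    <- c("1773_at", "177_at", "202800_at", "2028_s_at")
values_linux <- c(2.1, 1.4, 7.7, 5.0)

# For each PC ID, find its position in the Linux vector.
idx <- match(ids_pc, ids_linux)

# Fail early if any ID is missing instead of silently misaligning.
stopifnot(!anyNA(idx))

# Reorder the Linux results into PC order and label them.
values_aligned <- values_linux[idx]
names(values_aligned) <- ids_pc
print(values_aligned)
```

Checking for NA in the result of match() is worth keeping, since an unmatched ID would otherwise show up only as a missing value further down the analysis.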

ADD COMMENT • link 14.2 years ago Benilton Carvalho ★ 4.3k
