GEOquery returns error "scan() expected 'an integer'"


Timothée Flutre ▴ 20

@timothee-flutre-4896

Last seen 10.6 years ago

Hello,

I downloaded a dataset from the GEO at the NCBI and launched the following commands:

> library(GEOquery)
> gse <- getGEO(filename="GSE25935_family.soft.gz")

Here is the error message I got:

Parsing....
Found 465 entities...
GPL4133 (1 of 465 entities)
GSM636943 (2 of 465 entities)
...
GSM637180 (239 of 465 entities)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  scan() expected 'an integer', got '5.845752745'
Calls: getGEO ... .parseGSMWithLimits -> fastTabRead -> read.delim -> read.table -> scan

Is the input file badly formatted?

Thanks for any help,
TF

> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GEOquery_2.19.4 Biobase_2.10.0

loaded via a namespace (and not attached):
[1] RCurl_1.5-0 XML_3.2-0

• 4.5k views

ADD COMMENT • link updated 13.5 years ago by Sean Davis 21k • written 13.5 years ago by Timothée Flutre ▴ 20


Sean Davis 21k

@sean-davis-490

Last seen 6 weeks ago

United States

2011/10/2 Timothée Flutre:
> Is the input file badly formatted?

Sorry for the bug. In order to read some of the larger files in GEO, I borrowed a trick from the limma package: read just the first part of the file to get the column types, then read the entire file after telling R about the column types. This sometimes speeds up reading large files by an order of magnitude. That is the background.

In this case, the problem arises from a sample (GSM637180) that contains 178 missing values as the first records. Since I read only the first 100, R assumes that this column is full of integers. I'll need to fix the code for table reading, but in the meantime, I would suggest this as the workaround:

gse = getGEO('GSE25935',destdir='.')
gse = combine(gse[[1]],gse[[2]])

Using destdir in the getGEO call will allow you to reuse the downloaded files (in other words, they are cached in the current directory) if you have to run the code more than once. The combine() call is needed because NCBI GEO built the original series matrix format to have at most 255 columns per file, so two such files are needed to capture all the samples.

Hope that helps,
Sean
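For readers who hit the same error, here is a minimal base-R sketch of how sniffing column types from only the first rows can fail in the way described in this reply. It is not GEOquery's actual fastTabRead code; the toy file, the ID_REF/VALUE column names and the 100-row cutoff are assumptions chosen to mirror the thread. Here the first rows hold integers rather than missing values, which is enough to reproduce the same scan() message.

```r
# Toy tab-delimited table: VALUE looks like an integer column in the first
# 100 rows, but a decimal value shows up further down (hypothetical data).
tmp <- tempfile(fileext = ".txt")
toy <- data.frame(
  ID_REF = paste0("probe", 1:150),
  VALUE  = c(rep("7", 100), rep("5.845752745", 50)),
  stringsAsFactors = FALSE
)
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Step 1: guess the column classes from the first 100 data rows only.
head_rows <- read.delim(tmp, nrows = 100, stringsAsFactors = FALSE)
sniffed <- vapply(head_rows, function(x) class(x)[1], character(1))

# Step 2: read the whole file with the guessed classes.
# VALUE was guessed as integer, so scan() stops at the first decimal.
res <- tryCatch(
  read.delim(tmp, colClasses = sniffed),
  error = function(e) conditionMessage(e)
)
print(res)

# One possible guard: promote integer guesses to numeric before the full read.
safe <- sniffed
safe[safe == "integer"] <- "numeric"
full <- read.delim(tmp, colClasses = safe)
str(full)
```

Promoting integer to numeric keeps most of the speed benefit of passing colClasses while avoiding this particular failure; whether GEOquery's eventual fix took this route is not stated in the thread.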

ADD COMMENT • link 13.5 years ago Sean Davis 21k


Thanks a lot Sean for your help (and for providing us with GEOquery ;).

However, in this case, I am not sure that using "combine" is enough to effectively put the two files together into a single object:

> library(GEOquery)
> gse = getGEO('GSE25935',destdir='.')
Found 2 file(s)
GSE25935_series_matrix-1.txt.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.3M  100 50.3M    0     0  11.4M      0  0:00:04  0:00:04 --:--:-- 13.0M
File stored at:
/tmp/Rtmpz2pEno/GPL4133.soft
GSE25935_series_matrix-2.txt.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 44.5M  100 44.5M    0     0  1929k      0  0:00:23  0:00:23 --:--:-- 1920k
Using locally cached version of GPL4133 found here:
/tmp/Rtmpz2pEno/GPL4133.soft

=> There are some warnings "seek on a gzfile connection returned an internal error": can this cause a problem?

> gse = combine(gse[[1]],gse[[2]])
There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
  Lengths (255, 209) differ (string compare on first 209)209 string mismatches
2: data frame column 'title' levels not all.equal
3: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
  Lengths (255, 209) differ (string compare on first 209)209 string mismatches
4: data frame column 'geo_accession' levels not all.equal
5: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
  Lengths (255, 209) differ (string compare on first 209)209 string mismatches
6: data frame column 'source_name_ch1' levels not all.equal
7: In alleq(levels(x[[nm]]), levels(y[[nm]])) : 59 string mismatches
8: data frame column 'characteristics_ch1.2' levels not all.equal
9: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
  Lengths (50, 48) differ (string compare on first 48)47 string mismatches
10: data frame column 'characteristics_ch1.3' levels not all.equal
11: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
  Lengths (255, 209) differ (string compare on first 209)209 string mismatches
12: data frame column 'supplementary_file' levels not all.equal

=> These warnings seem to indicate that the resulting object won't be well defined.

And indeed:

> Meta(gse)
Error in function (classes, fdef, mtable) :
  unable to find an inherited method for function "Meta", for signature "ExpressionSet"

The same happens when I try to extract a matrix of samples x probes (which is really what I want here):

> m <- matrix(nrow=nrow(Table(GSMList(gse)[[1]])),
              ncol=length(names(GSMList(gse))),
              dimnames=list(probe=Table(GSMList(gse)[[1]])$ID_REF,
                            ind=names(GSMList(gse))))
Error in Table(GSMList(gse)[[1]]) :
  error in evaluating the argument 'object' in selecting a method for function 'Table': Error in function (classes, fdef, mtable) :
  unable to find an inherited method for function "GSMList", for signature "ExpressionSet"

I am not familiar with "combine", sorry. Do you think there is a simple way to fix this, i.e. to rebuild a single series record from these two files? Otherwise, I guess I may have to parse the files by other means.

Thanks,
Tim

ADD REPLY • link 13.5 years ago Timothée Flutre ▴ 20


2011/10/3 Timothée Flutre:
> => There are some warnings "seek on a gzfile connection returned an internal error": can this cause a problem?

These are warnings due to changes in base R. You can ignore them.

> And indeed:
>> Meta(gse)
> Error in function (classes, fdef, mtable) :
>   unable to find an inherited method for function "Meta", for signature "ExpressionSet"

gse is an ExpressionSet, so Meta will not work. You probably want something like:

pData(gse)

> Idem when I want to extract a matrix of samples x probes (which is really what I want here):
> Error in Table(GSMList(gse)[[1]]) : [...] unable to find an inherited method for function "GSMList", for signature "ExpressionSet"

No need to do any of this. Using GSEMatrix=TRUE, which has been the default for the last couple of years, removes the need for the GSMList()/Table() code you quote. If you want to get a samples x probes matrix, the data from the two approaches will be equivalent, but one is clearly simpler than the other.

Sean
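As an illustration of the ExpressionSet route suggested in this reply, here is a small sketch using toy data rather than the real GSE25935 download. The helper make_part(), the sample names and the probe names are hypothetical; ExpressionSet(), AnnotatedDataFrame(), combine(), exprs() and pData() are the Biobase functions, and the sketch assumes Biobase is installed.

```r
library(Biobase)

# Hypothetical helper: build a small ExpressionSet for a set of samples,
# standing in for one of the series matrix files returned by getGEO().
make_part <- function(samples, offset) {
  probes <- paste0("probe", 1:4)
  m <- matrix(offset + seq_len(4 * length(samples)), nrow = 4,
              dimnames = list(probes, samples))
  pd <- data.frame(title = paste("sample", samples),
                   row.names = samples, stringsAsFactors = FALSE)
  ExpressionSet(assayData = m, phenoData = AnnotatedDataFrame(pd))
}

part1 <- make_part(c("GSM_A", "GSM_B"), offset = 0)
part2 <- make_part("GSM_C", offset = 100)

# Same probes, different samples: combine() merges them into one object.
eset <- combine(part1, part2)

# Samples x probes matrix: exprs() is probes x samples, so transpose it.
m <- t(exprs(eset))

# Per-sample annotation, the ExpressionSet counterpart of the sample metadata.
pheno <- pData(eset)

dim(m)
pheno
```

With the real data, the same three accessor calls would presumably be applied to the object obtained from combine(gse[[1]], gse[[2]]); the factor-level warnings shown earlier in the thread concern the sample annotation columns, not the expression values.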

ADD REPLY • link 13.5 years ago Sean Davis 21k
