GEOquery: getGEO() doesn't work (error "invalid 'nlines' argument")
GEOquery
Hi Sean,

> The "system.file" part of your command above is not necessary (and is
> probably the problem). system.file is for locating files that came
> with a specific software package. So, you want something like:
>
> GSE19711 <- getGEO('mypath/GSE19711_family.soft.gz')

This works! Thanks a lot!

> Note that you will have to do a fair bit of work to get the data out
> of a SOFT format file. Instead, you should consider using a GSEMatrix
> file. Alternatively, download the raw data and use a
> platform-appropriate package to read in and analyze the data.
> Finally, note that you do not need to download files separately.

Well, my problem is that I am not quite sure about the "best" way to get the data I need. I'll try to give an example:

We have the GEO Series GSE19711. For all the samples of this series, I need some specific information. Let's use the first sample of GSE19711 as an example:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM491937

I need to know the age of the patient ("ageatdiagnosis"), whether it is a pre- or a post-treatment sample, the sex of the patient (in this case all samples are from women), and maybe some other information (in the case of other series). And of course I need the data matrix itself, so that I can finally create something similar to an ExpressionSet, but using the methylumi package, because all this is about methylation and not gene expression.

I have to deal with several thousand samples from many different GEO series, so I want to automate fetching the phenodata of the patients. Searching for a way to do this, I found the GEOquery package. I thought it would be the best way to deal with the SOFT files, because these files are available for all the series I want to analyze and, I thought, contain all the available information. (So far I have worked only with expression data, where I used RAW files, but phenodata files were always available too, so it was a lot easier.)

If you can think of any better way to get the data I need and to annotate the sample <-> phenodata information in an easy way, please tell me, I would be very happy.

Simone
> So when you use system.file() you are specifically telling GEOquery to
> look for a file that is in your GEOquery library directory, rather
> than telling GEOquery the actual directory.

Thank you for explaining the system.file() thing, I didn't know that this was about the package repository. I thought it would be necessary to be able to access the downloaded files, but now I understand what's happening.

> > mypath <- "C:/Users/bioinf_admin/Desktop/"
> > GSE19711 <- getGEO('GSE19711',destdir=mypath)
>
> This will result in a list of ExpressionSets

The problem is that here I work with methylation data, so I have to create MethyLumiSets instead of ExpressionSets.

My idea was to create phenodata.txt files using the data I get from getGEO():

> GSE19711 <- getGEO(filename="mypath/GSE19711_family.soft.gz")

(Btw, I always get warnings when doing this, but it seems to work anyway:

> warnings()
Warning messages:
1: In readLines(con, n = chunksize) :
  seek on a gzfile connection returned an internal error
...)

And then accessing the information with some code like this, for example:

> Meta(GSMList(GSE19711)[[1]])$characteristics_ch1[3]
[1] "ageatrecruitment: 68"

Then I would extract the relevant substrings and create a data.frame with all the information I need (age, sex, treatment, etc.), doing all this in an apply function for every GSE or something like this. Furthermore, I would get the data matrices from the SOFT files as well, and finally create MethyLumiSets out of the data matrices and the phenodata.txt files I created.

Maybe it would be better to first create ExpressionSets and convert them into MethyLumiSets somehow, but I would have to manipulate the objects anyway, because I can't use the phenodata information as it comes from GEO in these cases. I need the phenodata to be in the same style for all the GEO sets I have to analyze, so in any case I'll have to do the work to extract (only) the information I need for the different GEO sets.

But I'm still not quite sure about the best way to create the MethyLumiSets efficiently ...

Best,
Simone

On Tue, May 29, 2012 at 11:45 AM, <ecsi@gmx.net> wrote:

> But I'm still not quite sure about the best way to create the
> MethyLumiSets efficiently ...

From what you have described so far, an ExpressionSet will suffice. I don't think there is a need for a MethyLumiSet since you describe simply getting the normalized data from the GSE. I'd suggest going the route that Jim outlined using GSEMatrix files and moving forward from there. If you can fill in details of what downstream analysis you want to do with the data, perhaps we can be more directive on that point.

Sean

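The substring extraction Simone describes could be sketched roughly as follows. This is only an illustration in base R: `parse_characteristics` is a hypothetical helper name, and apart from "ageatrecruitment: 68" (taken from the post above) the example strings are made up. It assumes every entry has the form "key: value".

```r
# Hypothetical helper: turn "key: value" strings, as returned by
# Meta(gsm)$characteristics_ch1, into a one-row data.frame.
parse_characteristics <- function(x) {
  # Split at the first colon only, so values that contain ":" survive
  key   <- trimws(sub(":.*$", "", x))
  value <- trimws(sub("^[^:]*:", "", x))
  as.data.frame(as.list(setNames(value, key)), stringsAsFactors = FALSE)
}

# Example input: only the age entry comes from the thread,
# the other two entries are invented for illustration.
ch <- c("treatment: pre", "ageatrecruitment: 68", "gender: female")
pheno <- parse_characteristics(ch)
pheno$ageatrecruitment <- as.numeric(pheno$ageatrecruitment)
pheno
```

With a parsed GSE object, the same helper could presumably be applied to every sample via `lapply(GSMList(gse), function(g) parse_characteristics(Meta(g)$characteristics_ch1))` and the rows combined with `do.call(rbind, ...)`, provided all samples carry the same keys.
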
> > File stored at:
> C:\Users\Myuser\AppData\Local\Temp\RtmpOWBTrt/GPL8490.soft
> >
> Sorry, Simone. That is a bug. It has been fixed in GEOquery and will
> be available as a new version in a day or so.

Great, I will wait for the new version, thanks a lot!

Simone

Hello Sean,
I am having the same problem using getGEO().
Most of my queries download without error or interruption, but a few have incomplete downloads, and the error message is the same as Simone's.
The dataset giving me the error is GSE8650 (the first platform is fine, but the second, GPL97, is the problematic one).
The code is:

f.gse.ids <- unique(f.results$gse_ids)
for (id in f.gse.ids) {
  gse <- getGEO(id, destdir = "/home/uwakah/project_test_1/data_script/DISEASED/GEO_ARRAY_DISEASED", GSEMatrix = TRUE)
}

my sessionInfo()
-------------------
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)

locale:
[1] C

attached base packages:
[1] parallel  stats4    methods   stats     graphics  grDevices utils
[8] datasets  base

other attached packages:
[1] GOstats_2.34.0       graph_1.46.0         Category_2.34.2
[4] Matrix_1.2-3         globaltest_5.22.0    GO.db_3.1.2
[7] hgu133a.db_3.1.3     annotate_1.46.1      XML_3.98-1.3
[10] hgu133plus2.db_3.1.3 limma_3.24.15        simpleaffy_2.44.0
[13] gcrma_2.40.0         genefilter_1.50.0    affy_1.46.1
[16] GEOmetadb_1.28.0     GEOquery_2.34.0      org.Mm.eg.db_3.1.2
[19] org.Hs.eg.db_3.1.2   RSQLite_1.0.0        DBI_0.3.1
[22] AnnotationDbi_1.30.1 GenomeInfoDb_1.4.3   IRanges_2.2.9
[25] S4Vectors_0.6.6      Biobase_2.28.0       BiocGenerics_0.14.0
[28] R.utils_2.2.0        R.oo_1.19.0          R.methodsS3_1.7.0
[31] BiocInstaller_1.18.5

loaded via a namespace (and not attached):
[1] XVector_0.8.0          bitops_1.0-6           tools_3.2.2
[4] zlibbioc_1.14.0        lattice_0.20-33        preprocessCore_1.30.0
[7] Biostrings_2.36.4      grid_3.2.2             GSEABase_1.30.2
[10] RBGL_1.44.0            survival_2.38-3        splines_3.2.2
[13] AnnotationForge_1.10.1 xtable_1.8-2           RCurl_1.95-4.7
[16] affyio_1.36.0

Error message
-------------
Error in read.table(con, sep = "\t", header = FALSE, nrows = nseries) :
invalid 'nlines' argument

I would like to ask whether this bug has been fixed, or for any help. Thanks.

Regards

Innocentia

Try removing the cached GPL97 file and then try again.  That bug was fixed several years ago.
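
Removing the cached platform file might look like the sketch below. `remove_cached_gpl` is a hypothetical helper, not part of GEOquery. It assumes the cached files are named like "GPL97.soft" (as in the "File stored at" message earlier in the thread) or "GPL97.soft.gz", and that they sit in the directory passed to getGEO() as `destdir`. The demo uses empty dummy files so it runs without touching real data.

```r
# Hypothetical helper: delete cached platform files for one GPL accession
# from the directory that was passed to getGEO() as destdir.
remove_cached_gpl <- function(gpl, destdir = tempdir()) {
  # Matches e.g. "GPL97.soft" or "GPL97.soft.gz", but not "GPL970.soft"
  cached <- list.files(destdir, pattern = paste0("^", gpl, "\\."),
                       full.names = TRUE)
  removed <- file.remove(cached)
  invisible(cached[removed])
}

# Demo with empty dummy files in a scratch directory
demo_dir <- file.path(tempdir(), "geo_cache_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("GPL97.soft", "GPL970.soft")))

removed <- remove_cached_gpl("GPL97", destdir = demo_dir)
basename(removed)      # the file that was deleted
list.files(demo_dir)   # what is left in the directory
```

After the cached file is gone, calling getGEO() again should download a fresh copy of the platform.
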

Hi Sean,

Thanks a lot for your response. In my case the source of the problem was disk space: there wasn't sufficient space, so once I deleted some large files and freed up space, I ran the script again and it continued.
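
Since an incomplete download (for example from a full disk) stops the whole loop, one option is to record failures and carry on. The following is a minimal sketch under assumptions: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for getGEO() so the example runs without network access.

```r
# Hypothetical helper: call `fetch` on every accession and record
# failures instead of stopping at the first error.
fetch_all <- function(ids, fetch) {
  results <- list()
  failed  <- character(0)
  for (id in ids) {
    res <- tryCatch(fetch(id), error = function(e) {
      message("Failed for ", id, ": ", conditionMessage(e))
      NULL
    })
    if (is.null(res)) failed <- c(failed, id) else results[[id]] <- res
  }
  list(results = results, failed = failed)
}

# Stand-in for getGEO(): fails for one accession to mimic
# an incomplete download.
fake_fetch <- function(id) {
  if (id == "GSE8650") stop("invalid 'nlines' argument")
  paste("data for", id)
}

out <- fetch_all(c("GSE19711", "GSE8650"), fake_fetch)
out$failed
```

With GEOquery loaded, `fetch` could be something like `function(id) getGEO(id, destdir = mydir, GSEMatrix = TRUE)`, where `mydir` is your own download directory. The accessions in `out$failed` can then be retried after freeing disk space and removing any partially downloaded files.
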