GEOquery - was queryGEO fails on GDS files (GEO Datasets)
Peter ▴ 170
@peter-1556
Sean Davis wrote:
> Peter,
>
> I have recently uploaded a new package to bioconductor called GEOquery.

I've had a little play - very nice work. Cheers. Just a few queries/questions for you...

I never did work out how to load the package from the source files, but I noticed there is now a Windows binary package on the website...

http://www.bioconductor.org/packages/bioc/1.8/html/GEOquery.html

I downloaded the ZIP file and installed it on Windows XP with R 2.1.1 and got the following warning:

package 'GEOquery' successfully unpacked and MD5 sums checked
updating HTML package descriptions
Warning message:
no package 'file15658' was found in: packageDescription(i, fields = "Title", lib.loc = lib)

Question One
------------
Is the above "no package" warning important?

-------------------------------------------------------------------

Question Two
------------

> library(GEOquery)
Warning message:
package 'GEOquery' was built under R version 2.3.0

Does the version of R matter? I assume R version 2.3.0 is the development version of R, as 2.2.1 is the latest official release.

-------------------------------------------------------------------

Question Three
--------------

> gds37 <- getGEO('GDS37', destdir="c:/temp/geo")
trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_gz/GDS37.soft.gz'
ftp data connection made, file length 132384 bytes
opened URL
downloaded 129Kb

File stored at:
c:/temp/geo/GDS37.soft.gz
c:/temp/geo/GDS37.soft.gz
parsing geodata
parsing subsets
ready to return

Why does it print the file location twice?

-------------------------------------------------------------------

Question Four
-------------
If I repeat the command getGEO, why does it re-download the file?

> gds37 <- getGEO('GDS37', destdir="c:/temp/geo")

I would personally have written the getGEO code to check in the destination folder for the files GDS37.soft or GDS37.soft.gz and just load the local copy if it existed.

I know I should use the following instead:

> gds37 <- getGEO(filename="c:/temp/geo/gds37.soft.gz")

-------------------------------------------------------------------

Question Five
-------------
I like how you have handled converting subset information into phenotype data in GDS2eSet.

Have you considered also parsing the "description" to extract the "Alternative Sample Name" and the "Sample Source"?

As far as I can tell, all the current NCBI GDS files use the same format for the description lines:

"Value for SAMPLENAME: ALTNAME; src: SOURCE"

On the other hand, this is clearly not a "defined field" and is subject to change. Maybe automatically parse the lines if and only if it follows that format?

-------------------------------------------------------------------

Thanks again - GEOquery looks like it will be very handy...

Peter
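The "parse only if it matches" idea from Question Five could look roughly like the sketch below. It is an illustration only: `parse_gds_description` is a hypothetical helper and not part of GEOquery, the regular expression simply encodes the "Value for SAMPLENAME: ALTNAME; src: SOURCE" layout quoted above, and the example strings are invented. Lines that do not follow the layout come back as NA instead of being guessed at.

```r
# Hypothetical helper (not part of GEOquery): split GDS description lines of
# the form "Value for SAMPLENAME: ALTNAME; src: SOURCE" into their parts.
parse_gds_description <- function(description) {
  pattern <- "^Value for ([^:]+): (.*); src: (.*)$"
  hits <- regmatches(description, regexec(pattern, description))
  # regexec gives character(0) for a line that does not match; map it to NA
  parts <- t(vapply(hits, function(h) {
    if (length(h) == 4) h[2:4] else rep(NA_character_, 3)
  }, character(3)))
  data.frame(sample = parts[, 1],
             alt_name = parts[, 2],
             source = parts[, 3],
             stringsAsFactors = FALSE)
}

# Invented example lines: the first follows the layout, the second does not.
descriptions <- c("Value for GSM37063: Control 1; src: liver",
                  "some free-text description")
parsed <- parse_gds_description(descriptions)
print(parsed)
```

Because the layout is not a defined field, a caller would probably want to check `all(!is.na(parsed$sample))` before attaching the result to the phenotype data.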
@sean-davis-490
On 1/11/06 2:29 PM, "Peter" <bioconductor-mailinglist at maubp.freeserve.co.uk> wrote:

> Sean Davis wrote:
>> Peter,
>>
>> I have recently uploaded a new package to bioconductor called GEOquery.
>
> I've had a little play - very nice work. Cheers. Just a few
> queries/questions for you...
>
> I never did work out how to load the package from the source files, but
> I noticed there is now a Windows binary package on the website...
>
> http://www.bioconductor.org/packages/bioc/1.8/html/GEOquery.html
>
> I downloaded the ZIP file and installed it on Windows XP with R 2.1.1
> and got the following warning:
>
> package 'GEOquery' successfully unpacked and MD5 sums checked
> updating HTML package descriptions
> Warning message:
> no package 'file15658' was found in: packageDescription(i, fields =
> "Title", lib.loc = lib)
>
> Question One
> ------------
> Is the above "no package" warning important?

I don't know the answer to that one, but I will look into it.

> Question Two
> ------------
>
>> library(GEOquery)
> Warning message:
> package 'GEOquery' was built under R version 2.3.0
>
> Does the version of R matter? I assume R version 2.3.0 is the
> development version of R, as 2.2.1 is the latest official release.

By definition, the development versions of Bioconductor packages are built to work with the current development version of R. That said, I venture to say that most of them will work with relatively recent versions of R, GEOquery included.

> Question Three
> --------------
>
>> gds37 <- getGEO('GDS37', destdir="c:/temp/geo")
> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_gz/GDS37.soft.gz'
> ftp data connection made, file length 132384 bytes
> opened URL
> downloaded 129Kb
>
> File stored at:
> c:/temp/geo/GDS37.soft.gz
> c:/temp/geo/GDS37.soft.gz
> parsing geodata
> parsing subsets
> ready to return
>
> Why does it print the file location twice?

Sloppy debugging code that didn't get removed. Thanks for pointing this out.

> Question Four
> -------------
> If I repeat the command getGEO, why does it re-download the file?
>
>> gds37 <- getGEO('GDS37', destdir="c:/temp/geo")
>
> I would personally have written the getGEO code to check in the
> destination folder for the files GDS37.soft or GDS37.soft.gz and just
> load the local copy if it existed.

I can make that change, yes.

> I know I should use the following instead:
>
>> gds37 <- getGEO(filename="c:/temp/geo/gds37.soft.gz")

Obviously what I envisioned....

> Question Five
> -------------
> I like how you have handled converting subset information into phenotype
> data in GDS2eSet.
>
> Have you considered also parsing the "description" to extract the
> "Alternative Sample Name" and the "Sample Source"?
>
> As far as I can tell, all the current NCBI GDS files use the same format
> for the description lines:
>
> "Value for SAMPLENAME: ALTNAME; src: SOURCE"
>
> On the other hand, this is clearly not a "defined field" and is subject
> to change.

That is exactly why I don't parse it. I can ask the folks at GEO whether this is likely to change or not.

> Maybe automatically parse the lines if and only if it
> follows that format?

That is a possibility.

> Thanks again - GEOquery looks like it will be very handy...

Thanks for the feedback. Keep it coming....

Sean
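Until a local-copy check is built into getGEO, the behaviour discussed under Question Four can be approximated from user code. The sketch below is only one way to do it: `find_local_soft` is a hypothetical helper, not a GEOquery function, and it assumes files are stored under the names seen in the download log above (`GDS37.soft` or `GDS37.soft.gz`). The only GEOquery calls involved are the two `getGEO` forms already used in this thread, and they are left commented out because they need the package and, on a cache miss, network access.

```r
# Hypothetical helper (not part of GEOquery): return the path of a previously
# downloaded SOFT file for `geo_id` inside `destdir`, or NA if there is none.
find_local_soft <- function(geo_id, destdir) {
  candidates <- file.path(destdir, paste0(geo_id, c(".soft", ".soft.gz")))
  found <- candidates[file.exists(candidates)]
  if (length(found) > 0) found[1] else NA_character_
}

# Intended use, reusing the two calls shown earlier in the thread:
# local <- find_local_soft("GDS37", "c:/temp/geo")
# gds37 <- if (is.na(local)) {
#   getGEO("GDS37", destdir = "c:/temp/geo")   # not cached yet: download
# } else {
#   getGEO(filename = local)                   # cached: parse the local copy
# }
```

The uncompressed name is tried first so that a file the user has already unpacked by hand is preferred over the archive.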
Peter ▴ 170
@peter-1556
Sean Davis wrote:
> Peter,
>
> I have recently uploaded a new package to bioconductor called GEOquery.

A follow up question, if I generate an expression set like so:

gds <- getGEO(filename='GDS1099.soft')
eset <- GDS2eSet(gds, do.log2=TRUE)

Then this works:

> sampleNames(eset)
 [1] "GSM37063" "GSM37064" "GSM37065" "GSM37066" "GSM37067"
 [6] "GSM37068" "GSM37069" "GSM37070" "GSM37071" "GSM37072"
[11] "GSM37073" "GSM37074" "GSM37075" "GSM37076" "GSM37077"

But this does not:

> geneNames(eset)
NULL

getGEO seems to parse the ID (and the IDENTIFIER) columns fine, so I'm guessing that this is a problem in GDS2eSet.

I would expect to get "AFFX-BioB-5_at", "AFFX-BioC-3_at", ... back as the geneNames values.

I'm using:
GEOquery_1.5.2.zip
R Version 2.1.1 (2005-06-20)
Windows XP

Peter
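The expectation described above can be illustrated in base R without GEOquery. The sketch below uses a toy data frame standing in for a parsed GDS data table; the column name `ID_REF` and all numeric values are assumptions made for the example, and this is not how GDS2eSet is implemented. It only shows the probe IDs from the ID column ending up as row names of the expression matrix, which is what geneNames would be expected to report.

```r
# Toy stand-in for a parsed GDS data table. The column name ID_REF and the
# expression values are invented for this illustration.
gds_table <- data.frame(
  ID_REF     = c("AFFX-BioB-5_at", "AFFX-BioC-3_at"),
  IDENTIFIER = c("AFFX-BioB-5_at", "AFFX-BioC-3_at"),
  GSM37063   = c(120.5, 98.1),
  GSM37064   = c(110.2, 101.7),
  stringsAsFactors = FALSE
)

# Keep only the per-sample columns as the expression matrix ...
sample_cols <- grep("^GSM", names(gds_table), value = TRUE)
expr <- as.matrix(gds_table[, sample_cols])

# ... and carry the probe IDs over as row names, so they are not lost.
rownames(expr) <- gds_table$ID_REF

print(rownames(expr))
print(colnames(expr))
```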
On 1/11/06 3:21 PM, "Peter" <bioconductor-mailinglist at maubp.freeserve.co.uk> wrote:

> But this does not:
>
>> geneNames(eset)
> NULL
>
> getGEO seems to parse the ID (and the IDENTIFIER) columns fine, so I'm
> guessing that this is a problem in GDS2eSet.
>
> I would expect to get "AFFX-BioB-5_at", "AFFX-BioC-3_at", ... back as
> the geneNames values.

This is now fixed in the repository and should be fixed in the packaged files in 24 hours or so. Thanks.

Sean
@ting-yuan-liu-fhcrc-1221
Hi Peter,

For Question 2: this is because GEOquery is not in the BioC 1.7 release. It is currently in the BioC devel (1.8) repository, so it is built with the R devel (2.3) version. I think you can ignore the warning message at this stage. If you are really concerned about it, you can install the R devel version on your XP machine and then run GEOquery on it.

We recommend installing BioC devel packages on the R devel version, and BioC stable packages on the R stable version.

HTH,
Ting-Yuan

______________________________________
Ting-Yuan Liu
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
Seattle, WA, USA
______________________________________
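For anyone who wants to check this situation programmatically instead of relying on the warning, a rough sketch follows. `built_for_newer_r` is a hypothetical helper, not part of R or Bioconductor. It assumes the `Built` field of an installed package's description starts with "R x.y.z;", as in the warning quoted in this thread; the example string, including its date, is invented.

```r
# Hypothetical helper: TRUE if a package's "Built" field names a newer R
# version than the one given in `running` (by default the current session).
built_for_newer_r <- function(built, running = getRversion()) {
  # Assumed layout of the field: "R 2.3.0; <platform>; <date>; <os>"
  built_version <- sub("^R ([0-9.]+);.*$", "\\1", built)
  package_version(built_version) > running
}

# Invented Built string resembling the situation in this thread: a package
# built under R 2.3.0 being loaded into R 2.1.1.
example_built <- "R 2.3.0; ; 2006-01-10 12:00:00; windows"
print(built_for_newer_r(example_built, running = numeric_version("2.1.1")))

# With a real installation one would pass the installed package's field:
# built_for_newer_r(packageDescription("GEOquery")$Built)
```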