Question

GEOquery and GEO issues


Christian.Stratowa@vie.boehringer-ingel… ▴ 270

@christianstratowavieboehringer-ingelheimcom-545

Last seen 11.3 years ago

Dear Sean

While trying to find a parser for the GEO SOFT files I encountered your GEOquery package, which works great. Nevertheless, I would like to mention three issues which might be of general interest:

1, Memory problems:
I first downloaded the file 'GSE2109_family.soft.gz' from GEO (due to our proxy settings I cannot use getGEO for this purpose) and then imported it into R with:

    gse2109 <- getGEO(filename='GSE2109_family.soft.gz')

Although I succeeded in importing the file into R, it took 39.3 hours on a 64 bit Opteron machine with 16 GB RAM and used 9.7 GB RAM. The final .Rdata file has a size of 2.0 GB. Maybe a future version of GEOquery could reduce both time and memory consumption.

2, Non-unique GEO platforms:
I have also downloaded our own CLL dataset 'GSE2466_family.soft.gz', where we had to use both the Affymetrix HGU95A and HGU95Av2 chips. In my personal opinion it is a serious flaw of the GEO database that it declares both chips as the single platform GPL91.
In your description of the GEOquery package, chapter 4.3 Converting GSE to an exprSet, you supply code to make sure that all of the GSMs are from the same platform (see my small function below). Unfortunately, this is not sufficient in this case (and probably for other Affymetrix chips where two versions exist). Even though the Sample_data_row_count is different (12625 vs 12626), cbind simply recycles the rows.
In this case I could test whether Sample_data_row_count is identical for all chips, but in theory different chip versions may still have the same number of probe sets.
One possibility would be for GEO to force submitters to supply not only Sample_platform_id but also a "Sample_platform_title" containing the name of the chip as given by the manufacturer.

3, Sample descriptions:
Since most data are useless without the sample description, which contains the clinical data, it would be helpful if GEO supplied a fixed format for adding the clinical data, so that it would be possible to write a parser to extract these data automatically into a table.

Best regards
Christian

Attached function:

    #----------------------------------------------------------------------#
    table4GEO <- function(gse, column="VALUE", lg2=TRUE){
    # (c) Christian Stratowa   created: 01/19/2006   last modified: 01/19/2006
    # Get sample table of columns "column" for GEO Series GSExxxx
    # gse:    GEOquery class imported from GEO GSE file GSExxxx_family.soft (or soft.gz)
    # column: name of column to be extracted from data table
    # lg2:    if TRUE, return log2-transformed values

       # load libraries
       library(Biobase);
       library(GEOquery);

       # get list
       gsm <- GSMList(gse);

       # check number of platforms (must be one platform only)
       tmp <- unlist(lapply(gsm, function(x) {Meta(x)$platform}));
       if (length(unique(tmp)) != 1) {
          stop("Data must belong to one platform ID only!");
       }#if

       # number of samples
       size <- length(tmp);
       print(paste("Number of samples:", size))

       # check if all samples have the chosen column
       tmp <- unlist(lapply(gsm, function(x) {which(Columns(x)[,1] == column)}));
       if (length(tmp) != size) {
          stop(paste("Only <", length(tmp), "> of <", size, "> samples have column ", column));
       }#if

       # get "column" from all chips
       # (note: cbind recycles shorter columns, see issue 2 above)
       data <- do.call("cbind", lapply(gsm, function(x){Table(x)[,column]}));
       dimnames(data)[[1]] <- Table(gsm[[1]])$ID_REF

       if (lg2==TRUE) {
          data <- log2(data);
       }#if

       return(data);
    }#table4GEO

==============================================
Christian Stratowa, PhD
Boehringer Ingelheim Austria
Dept NCE Lead Discovery - Bioinformatics
Dr. Boehringergasse 5-11
A-1121 Vienna, Austria
Tel.: ++43-1-80105-2470
Fax: ++43-1-80105-2782
email: christian.stratowa at vie.boehringer-ingelheim.com
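The recycling behaviour described in point 2 can be reproduced in base R without any GEO data. The following is only a minimal sketch with made-up toy vectors; `cbindStrict` is a hypothetical helper name, not part of GEOquery. As noted above, equal row counts do not prove that two samples come from the same chip version, so a guard like this catches only the obvious case.

```r
# Toy stand-ins for two samples whose data tables differ by one row
# (hypothetical values, not real GEO data).
a <- c(1, 2, 3, 4)           # sample with 4 probe sets
b <- c(10, 20, 30, 40, 50)   # sample with 5 probe sets

# cbind() pads the shorter vector by recycling it and only emits a warning,
# which is suppressed here to show the silently misaligned result.
recycled <- suppressWarnings(cbind(a, b))
print(recycled)

# Hypothetical stricter helper: refuse to combine columns of unequal length.
cbindStrict <- function(columns) {
  lens <- vapply(columns, length, integer(1))
  if (length(unique(lens)) != 1) {
    stop("Samples have differing row counts: ",
         paste(sort(unique(lens)), collapse = ", "))
  }
  do.call("cbind", columns)
}

# Capture the error message instead of aborting the script.
res <- tryCatch(cbindStrict(list(a = a, b = b)),
                error = function(e) conditionMessage(e))
print(res)
```

In a function such as `table4GEO`, a check of this kind could sit just before the `do.call("cbind", ...)` step.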

hgu95a hgu95av2 GEOquery • 2.2k views

19.9 years ago Christian.Stratowa@vie.boehringer-ingel… ▴ 270

score 0 · Answer 1 · 2006-01-23

On 1/23/06 5:18 AM, "Christian.Stratowa at vie.boehringer-ingelheim.com" <christian.stratowa at vie.boehringer-ingelheim.com> wrote:

> Dear Sean
>
> While trying to find a parser for the GEO SOFT files I encountered your
> GEOquery package, which works great.
> Nevertheless, I would like to mention three issues which might be of
> general interest:
>
> 1, Memory problems:
> I first downloaded the file 'GSE2109_family.soft.gz' from GEO (due to
> our proxy settings I cannot use getGEO for this purpose) and then
> imported it into R with:
> gse2109 <- getGEO(filename='GSE2109_family.soft.gz')
> Although I succeeded in importing the file into R, it took 39.3 hours
> on a 64 bit Opteron machine with 16 GB RAM and used 9.7 GB RAM. The
> final .Rdata file has a size of 2.0 GB.
> Maybe a future version of GEOquery could reduce both time and memory
> consumption.

This is obviously a problem with large GSEs.

> 2, Non-unique GEO platforms:
> I have also downloaded our own CLL dataset 'GSE2466_family.soft.gz',
> where we had to use both the Affymetrix HGU95A and HGU95Av2 chips. In
> my personal opinion it is a serious flaw of the GEO database that it
> declares both chips as the single platform GPL91.
> In your description of the GEOquery package, chapter 4.3 Converting GSE
> to an exprSet, you supply code to make sure that all of the GSMs are
> from the same platform (see my small function below).
> Unfortunately, this is not sufficient in this case (and probably for
> other Affymetrix chips where two versions exist).
> Even though the Sample_data_row_count is different (12625 vs 12626),
> cbind simply recycles the rows.
> In this case I could test whether Sample_data_row_count is identical
> for all chips, but in theory different chip versions may still have
> the same number of probe sets.
> One possibility would be for GEO to force submitters to supply not only
> Sample_platform_id but also a "Sample_platform_title" containing the
> name of the chip as given by the manufacturer.

Just to clarify--I am in no way affiliated with GEO and have no control over the way their database functions or what is stored in it. I have simply tried to provide a means to easily parse as much of GEO data as possible.

As for your situation, this is easily remedied: instead of using 'cbind' blindly (which assumes that the GPL and the data are in the same order, which they need not be), use match first. In fact, that is probably the safest way to do things--I'll change the vignette. Something like this:

    probesets <- Table(GPLList(gse)[[1]])$ID
    dat <- do.call('cbind', lapply(GSMList(gse), function(x) {
        tab <- Table(x)
        mymatch <- match(probesets, tab$ID_REF)
        return(tab$VALUE[mymatch])
    }))

> 3, Sample descriptions:
> Since most data are useless without the sample description, which
> contains the clinical data, it would be helpful if GEO supplied a fixed
> format for adding the clinical data, so that it would be possible to
> write a parser to extract these data automatically into a table.

Again, I do not have any control over what GEO does with regard to clinical annotation. Where the clinical data is present, it should be possible to write a specific function or set of functions to extract it; writing a general function to do this is currently not possible for GSEs for the reason that you note--there isn't a specified format.

I hope this clarifies things a bit. Thanks for the constructive feedback.

Sean
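To see what the match-based alignment does without downloading anything, here is a small base R sketch. The probe IDs, sample names and values are invented for illustration; only the `ID_REF`/`VALUE` column names follow the snippet above. Each sample is reordered to the platform's probe order, and a probe set missing from a sample becomes `NA` instead of being filled by recycling.

```r
# Hypothetical platform probe list and two sample tables in different row
# order; the second sample lacks probe set "p4".
probesets <- c("p1", "p2", "p3", "p4")
samples <- list(
  GSM_A = data.frame(ID_REF = c("p3", "p1", "p4", "p2"),
                     VALUE  = c(30, 10, 40, 20)),
  GSM_B = data.frame(ID_REF = c("p2", "p1", "p3"),
                     VALUE  = c(2, 1, 3))
)

# For every sample, look up the row of each platform probe set and pull the
# values in that order; unmatched probe sets yield NA.
dat <- do.call("cbind", lapply(samples, function(tab) {
  tab$VALUE[match(probesets, tab$ID_REF)]
}))
rownames(dat) <- probesets
print(dat)
```

Because every column has exactly `length(probesets)` entries, `cbind` never has to recycle, and counting `NA`s per column gives a quick hint that samples from different chip versions have been mixed.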

score 0 · Answer 2 · 2006-01-23

Dear Sean

Thank you for this valuable suggestion, using match will be the way to go.

Sorry, I thought that you may have at least close contact to the GEO people.

Best regards
Christian

==============================================
Christian Stratowa, PhD
Boehringer Ingelheim Austria
Dept NCE Lead Discovery - Bioinformatics
Dr. Boehringergasse 5-11
A-1121 Vienna, Austria
Tel.: ++43-1-80105-2470
Fax: ++43-1-80105-2782
email: christian.stratowa at vie.boehringer-ingelheim.com

-----Original Message-----
From: Sean Davis [mailto:sdavis2@mail.nih.gov]
Sent: Monday, January 23, 2006 14:04
To: Stratowa,Dr.,Christian FEX BIG-AT-V; Bioconductor
Subject: Re: [BioC] GEOquery and GEO issues
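Sean's remark that a study-specific extraction function is feasible where clinical data is present can be illustrated with a rough base R sketch. Everything here is an assumption for illustration: the "key: value; key: value" layout, the sample names and the helper `parseDescription` are hypothetical, and real GEO descriptions are free text that would need a parser written for the study at hand.

```r
# Hypothetical free-text descriptions, one per sample, in an assumed
# "key: value; key: value" layout.
descriptions <- c(
  GSM_A = "age: 61; sex: M; stage: B",
  GSM_B = "age: 54; sex: F"
)

# Hypothetical helper: split one description into a named character vector.
parseDescription <- function(txt) {
  fields <- strsplit(txt, ";", fixed = TRUE)[[1]]
  kv <- strsplit(trimws(fields), ":", fixed = TRUE)
  values <- vapply(kv, function(p) trimws(p[2]), character(1))
  names(values) <- vapply(kv, function(p) trimws(p[1]), character(1))
  values
}

parsed <- lapply(descriptions, parseDescription)

# Collect all keys seen in any sample, then build a samples-by-keys table;
# keys missing from a sample become NA.
keys <- unique(unlist(lapply(parsed, names)))
clinical <- t(vapply(parsed, function(v) v[keys], character(length(keys))))
colnames(clinical) <- keys
print(clinical)
```

The resulting character matrix has one row per sample and could be converted with `as.data.frame` once the columns have been checked.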