Christian.Stratowa@vie.boehringer-ingel…
Dear Sean,
While looking for a parser for GEO SOFT files I encountered your GEOquery
package, which works great.
Nevertheless, I would like to mention three issues which might be of general
interest:
1, Memory problems:
I first downloaded the file 'GSE2109_family.soft.gz' from GEO (due to our
proxy settings I cannot use getGEO for this purpose) and then imported it
into R with:
gse2109 <- getGEO(filename='GSE2109_family.soft.gz')
Although the import succeeded, it took 39.3 hours on a 64-bit Opteron machine
with 16 GB RAM and used 9.7 GB RAM. The final .Rdata file has a size of 2.0 GB.
Perhaps a future version of GEOquery could reduce both time and memory
consumption.
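As an illustration of one way to keep memory use low, here is a minimal sketch that scans a SOFT family file in chunks rather than loading it whole. It relies only on the SOFT convention that each sample section starts with a line of the form "^SAMPLE = GSMxxxx". The function name list_soft_samples, the chunk size, and the placeholder IDs in the demo file are assumptions for illustration, and this is not how GEOquery itself is implemented.

```r
# Sketch: collect sample IDs from a SOFT family file chunk by chunk.
# Assumption: each sample section starts with "^SAMPLE = GSMxxxx".
list_soft_samples <- function(path, chunk_lines = 10000L) {
  con <- gzfile(path, open = "rt")  # gzfile() also reads uncompressed files
  on.exit(close(con))
  ids <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break  # end of file reached
    hits <- grep("^\\^SAMPLE = ", lines, value = TRUE)
    ids <- c(ids, sub("^\\^SAMPLE = ", "", hits))
  }
  ids
}

# Tiny stand-in file (made-up IDs) so the sketch runs without a download
demo <- tempfile(fileext = ".soft")
writeLines(c("^SERIES = GSE0000",
             "^SAMPLE = GSM0001",
             "!sample_table_begin",
             "ID_REF\tVALUE",
             "probe_1\t10.5",
             "!sample_table_end",
             "^SAMPLE = GSM0002",
             "!sample_table_begin",
             "ID_REF\tVALUE",
             "probe_1\t11.2",
             "!sample_table_end"), demo)
sample_ids <- list_soft_samples(demo, chunk_lines = 4L)
print(sample_ids)
```

Only one chunk of lines is held in memory at a time, so the same pattern could be extended to extract a single column per sample without keeping every full table.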
2, Non-unique GEO platforms:
I have also downloaded our own CLL dataset 'GSE2466_family.soft.gz', for which
we had to use both the Affymetrix HGU95A and HGU95Av2 chips. In my personal
opinion it is a serious flaw of the GEO database that it declares both chips
as a single platform, GPL91.
In chapter 4.3 of your GEOquery description, "Converting GSE to an exprSet",
you supply code to make sure that all of the GSMs are from the same platform
(see my small function below).
Unfortunately, this is not sufficient in this case (and probably for other
Affymetrix chips where two versions exist).
Even though the Sample_data_row_count differs (12625 vs 12626), cbind simply
recycles the rows.
In this case I could test whether Sample_data_row_count is identical for all
chips, but in theory different chip versions could still have the same number
of probe sets.
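To make the recycling problem concrete, here is a minimal sketch on plain vectors. It first shows cbind silently extending a shorter column, then a stricter check that compares the probe set IDs themselves rather than only the row counts. The helper same_probesets and the toy IDs are hypothetical; with GEOquery objects the ID list could presumably be built with something like lapply(GSMList(gse), function(x) Table(x)$ID_REF).

```r
# cbind() recycles a shorter vector (with only a warning), so a one-row
# difference between chip versions does not stop the merge.
short <- c(1, 2, 3)
long  <- c(1, 2, 3, 4)
merged <- suppressWarnings(cbind(short, long))
print(dim(merged))  # 4 rows: 'short' was recycled to fill the last row

# Hypothetical helper: require identical probe set IDs, in identical order,
# for all samples. 'id_list' holds one ID_REF vector per sample.
same_probesets <- function(id_list) {
  ref <- as.character(id_list[[1]])
  all(vapply(id_list,
             function(ids) identical(as.character(ids), ref),
             logical(1)))
}

ids_a <- c("p1", "p2", "p3")        # toy IDs standing in for one chip version
ids_b <- c("p1", "p2", "p3", "p4")  # toy IDs standing in for the other
print(same_probesets(list(ids_a, ids_a)))
print(same_probesets(list(ids_a, ids_b)))
```

Comparing the IDs also covers the case where two chip versions happen to have the same number of probe sets but different content.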
One possibility would be for GEO to require submitters to supply not only
Sample_platform_id but also a "Sample_platform_title" containing the name of
the chip as given by the manufacturer.
3, Sample descriptions:
Since most data are useless without the sample description, which contains
the clinical data, it would be helpful if GEO defined a standard format for
adding the clinical data, so that it would be possible to write a parser to
extract these data automatically into a table.
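To show what such a parser could look like, here is a minimal sketch. It assumes each sample description is a string of "key: value" pairs separated by semicolons. That format, the function name parse_description, and the sample IDs and values are purely hypothetical and are not a format defined by GEO.

```r
# Assumption: each description is "key: value; key: value; ...".
# This format is hypothetical, not one defined by GEO.
parse_description <- function(desc) {
  fields <- trimws(strsplit(desc, ";", fixed = TRUE)[[1]])
  fields <- fields[nzchar(fields)]               # drop empty fields
  keys   <- trimws(sub(":.*$", "", fields))      # text before the first ":"
  values <- trimws(sub("^[^:]*:", "", fields))   # text after the first ":"
  setNames(values, keys)
}

# Made-up descriptions for two made-up samples
descriptions <- c(GSM0001 = "age: 61; sex: M; stage: A",
                  GSM0002 = "age: 54; sex: F; stage: B")

parsed <- lapply(descriptions, parse_description)
keys <- unique(unlist(lapply(parsed, names)))

# One row per sample, one column per key; missing keys become NA
clinical <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) p[keys])),
  stringsAsFactors = FALSE
)
names(clinical) <- keys
print(clinical)
```

All values come back as character strings, so numeric fields such as age would still need an explicit conversion.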
Best regards
Christian
Attached function:
#---------------------------------------------------------------------
table4GEO <- function(gse, column = "VALUE", lg2 = TRUE) {
  # (c) Christian Stratowa  created: 01/19/2006  last modified: 01/19/2006
  # Get sample table of columns "column" for GEO Series GSExxxx
  # gse:    GEOquery class imported from GEO GSE file
  #         GSExxxx_family.soft (or soft.gz)
  # column: name of column to be extracted from data table
  # lg2:    if TRUE, return log2-transformed values

  # load libraries
  library(Biobase)
  library(GEOquery)

  # get list of samples (GSMs)
  gsm <- GSMList(gse)

  # check number of platforms (must be one platform only)
  tmp <- unlist(lapply(gsm, function(x) Meta(x)$platform))
  if (length(unique(tmp)) != 1) {
    stop("Data must belong to one platform ID only!")
  }

  # number of samples
  size <- length(tmp)
  print(paste("Number of samples:", size))

  # check if all samples have the chosen column
  tmp <- unlist(lapply(gsm, function(x) which(Columns(x)[, 1] == column)))
  if (length(tmp) != size) {
    stop(paste("Only <", length(tmp), "> of <", size,
               "> samples have column ", column))
  }

  # get "column" from all chips
  # (note: cbind recycles rows if the samples differ in length)
  data <- do.call("cbind", lapply(gsm, function(x) Table(x)[, column]))
  dimnames(data)[[1]] <- Table(gsm[[1]])$ID_REF
  if (lg2) {
    data <- log2(data)
  }
  return(data)
} # table4GEO
==============================================
Christian Stratowa, PhD
Boehringer Ingelheim Austria
Dept NCE Lead Discovery - Bioinformatics
Dr. Boehringergasse 5-11
A-1121 Vienna, Austria
Tel.: ++43-1-80105-2470
Fax: ++43-1-80105-2782
email: christian.stratowa at vie.boehringer-ingelheim.com