GEOquery: preprocessing log of imported data
fabian ▴ 40
@fabian-6215

Dear all

I am new to microarray analysis and have been assigned a rather challenging project. I'd like to re-analyze published microarray data from GEO. Essentially, I would like to do a differential gene expression meta-analysis between diseased and healthy samples, across different experiments, chips, and species (mouse, human).

I have only just started and found the Bioconductor package GEOquery very useful for importing the data. Unfortunately, I do not understand how the values it returns were produced. I import the series with

library(GEOquery)
series.hsa.GSE28475 <- getGEO(GEO = "GSE28475", GSEMatrix = TRUE)


and access the expression values via

eset <- series.hsa.GSE28475[[2]]
exprsVals <- exprs(eset)


In parallel I also downloaded the files directly, either via the browser or via

getGEOSuppFiles("GSE28475", fetch_files = TRUE)


In this case the files contain quantile-normalized data (according to the file name, which includes "quantile_normalized"). When I compare the values in the file with those in the expression set imported by GEOquery, they differ.

I therefore wonder how the data in the expression set returned by GEOquery were preprocessed in terms of background subtraction, transformation, normalization, etc., and whether this is either the same across all series or at least recorded within the expression set.

Later on, I would like to do the missing preprocessing steps on my own using lumi or affy (depending on the platform), correct for batch effects (ComBat), and eventually use limma for the differential gene expression analysis.

Thank you very much for all comments and help.

GEOquery microarray lumi affy limma
@sean-davis-490

GEOquery does NO PROCESSING of data coming from GEO. The data are taken as-is. To understand what processing the data have undergone, one must refer to the GEO website or to the manuscript. In this case, take a look at the "Data Processing" description here:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM703713
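The same description can usually be read from R as well. The following is only a rough sketch: it assumes the sample annotation of the imported series has one or more columns whose names start with `data_processing`, which is not guaranteed for every series. The helper `processing_notes()` and the toy annotation table are made up for illustration and do not contain the real GSE28475 metadata.

```r
# Hypothetical helper: collect the submitter's processing notes from a
# sample annotation table. `pheno` is any data.frame shaped like pData(eset).
# Assumption: the notes live in columns named "data_processing*".
processing_notes <- function(pheno) {
  cols <- grep("^data_processing", colnames(pheno), value = TRUE)
  if (length(cols) == 0) {
    return(character(0))  # nothing recorded under that name
  }
  unique(unlist(lapply(pheno[cols], as.character), use.names = FALSE))
}

# Toy annotation with placeholder text (NOT the real GSE28475 metadata).
toy_pheno <- data.frame(
  title = c("sample A", "sample B"),
  data_processing = c("placeholder: raw signal", "placeholder: raw signal"),
  stringsAsFactors = FALSE
)
print(processing_notes(toy_pheno))

# With a real series (needs the GEOquery package and network access):
# library(GEOquery)
# eset <- getGEO("GSE28475", GSEMatrix = TRUE)[[2]]
# processing_notes(pData(eset))
```

If the helper returns nothing for a given series, the GEO sample page and the manuscript remain the places to check.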

GEOquery, when used with getGEO(), returns the values supplied with the sample records. In this case, those are the raw values. The supplemental file includes the log2-transformed, quantile-normalized values.
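To see why the two sets of numbers cannot match, here is a minimal base R sketch of a log2 transform followed by quantile normalization on a toy matrix. The intensities, probe names and sample names are invented, and `quantile_normalize()` is a simplified illustration (no handling of ties or missing values), not the procedure the submitters actually ran.

```r
# Simplified quantile normalization: every column is forced onto the same
# reference distribution (the mean of the sorted columns).
# Assumption: no missing values; ties are broken by order of appearance.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each sample
  sorted <- apply(x, 2, sort)                         # sort each sample
  ref    <- rowMeans(sorted)                          # reference distribution
  out    <- apply(ranks, 2, function(r) ref[r])       # map ranks to reference
  dimnames(out) <- dimnames(x)
  out
}

# Toy "raw" intensities (invented numbers, 4 probes x 3 samples).
raw <- matrix(
  c(120,  850, 4000, 60,
    200,  900, 3500, 90,
    150, 1000, 5000, 40),
  nrow = 4,
  dimnames = list(paste0("probe", 1:4), paste0("sample", 1:3))
)

norm <- quantile_normalize(log2(raw))

print(range(raw))   # raw scale: tens to thousands
print(range(norm))  # log2 scale: single or low double digits
print(norm)
```

After normalization every sample shares the same set of values, only assigned to different probes, so a raw matrix from getGEO() and a normalized supplementary file are expected to differ.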


Thank you, Sean, for the quick and helpful answer, and especially for the link to the raw data. I was puzzled about how GEOquery could return the raw data when the Series entry provides only the normalized data. I always underestimate how many places there are in GEO to "hide" data. Thank you very much.


The Series record contains the "Value" column from the Sample records, which in this case we are told contain the raw data. The "files" attached to the Series record are UNRELATED to the actual data in the Series data values and are included as "extra" info. Confusing--you bet!!