Normalizing ExpressionSet exprs without raw CEL files
PyPer ▴ 20
@pyper-6819
Last seen 9.5 years ago
Australia

I know that typically you download the CEL files, read them with something like ReadAffy(), and then apply RMA normalization.

However, many datasets on GEO do not have the associated raw CEL files. Furthermore, downloading multiple CEL files is slow and tedious.

For some of the things I'm doing, normalizing ExpressionSet objects obtained directly with the GEOquery package would be much faster and preferable.

Which packages can normalize data taking an ExpressionSet directly as input? Ideally they would offer the same kind of normalization techniques, such as RMA.
 

geoquery normalization • 3.2k views

The GEOquery package has getGEOSuppFiles('GSEXXXX') that will do the downloading of raw data for you.


One cannot use full RMA-like normalization without access to probe-level data.  Probe-level data is not typically available in the data deposited in GEO, except when the .CEL files are available.
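For a series that does ship CEL files, the route described in the two replies above might look roughly like the sketch below. It assumes the series has exactly one `_RAW.tar` supplementary archive containing Affymetrix CEL files from a single platform supported by the affy package. The helper name `rma_from_geo` and the directory layout are illustrative only and are not part of any package.

```r
# Sketch only: needs the Bioconductor packages GEOquery and affy installed,
# plus network access when the function is actually called.
# rma_from_geo is a hypothetical helper name, not a package function.
rma_from_geo <- function(gse_id, base_dir = tempdir()) {
  # Download all supplementary files into <base_dir>/<gse_id>/.
  # The returned data frame has the local file paths as row names.
  supp <- GEOquery::getGEOSuppFiles(gse_id, baseDir = base_dir)

  # Assumption: the raw data come as a single *_RAW.tar archive.
  tar_file <- grep("_RAW\\.tar$", rownames(supp), value = TRUE)
  stopifnot(length(tar_file) == 1)

  # Unpack the CEL files into their own directory.
  cel_dir <- file.path(base_dir, gse_id, "cel")
  utils::untar(tar_file, exdir = cel_dir)

  # Read the probe-level data and run RMA (background correction,
  # quantile normalization, median-polish summarization).
  raw <- affy::ReadAffy(celfile.path = cel_dir)
  affy::rma(raw)
}
```

Calling `rma_from_geo("GSEXXXX")` with a real accession would return an ExpressionSet of RMA values. A series that mixes array platforms would need the CEL files split by platform first.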

@stephen-piccolo-6761
Last seen 3.6 years ago
United States

The SCAN.UPC package contains a function called UPC_Generic_ExpressionSet, which can be used to normalize an ExpressionSet object. It is designed to work with data from many expression-profiling platforms. Alternatively, if you want a solution specific to CEL files (and more analogous to RMA), you can use the SCAN function and specify a GEO ID rather than a path to a CEL file. The package will take care of downloading and normalizing the CEL files, and it can run in parallel.
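As a rough illustration of the two options just described, the sketch below wraps the two SCAN.UPC functions named in this answer. The wrapper names `upc_from_series` and `scan_from_geo` are hypothetical. The sketch also assumes the series maps to a single platform, so that the first element of the list returned by getGEO() is the ExpressionSet of interest.

```r
# Sketch only: needs the Bioconductor packages GEOquery and SCAN.UPC,
# plus network access when the functions are actually called.
# Both wrapper names are hypothetical.

# Option 1: UPC on the already-summarized values from the series matrix.
upc_from_series <- function(gse_id) {
  # getGEO(..., GSEMatrix = TRUE) returns a *list* of ExpressionSets,
  # one per platform; assumption: the first one is the one wanted.
  eset <- GEOquery::getGEO(gse_id, GSEMatrix = TRUE)[[1]]
  SCAN.UPC::UPC_Generic_ExpressionSet(eset)
}

# Option 2: SCAN on the raw CEL files, identified by a GEO ID
# instead of a local file path.
scan_from_geo <- function(geo_id) {
  SCAN.UPC::SCAN(geo_id)
}
```

Both functions return an ExpressionSet. Note that both methods work one sample at a time, so the output is not forced onto a common distribution across samples.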


I cannot get UPC_Generic_ExpressionSet to work properly. The code runs, but a boxplot() of the exprs() values shows data that do not look normalized.


What do you mean when you say it is not normalized? Can you provide an example?


So, for example, say I'm working on a breast cancer data set from GEO, GSE5764.

I run the usual code to grab it:
# getGEO() returns a list of ExpressionSets; take the first one
gse <- getGEO("GSE5764", GSEMatrix=TRUE)[[1]]

Then I attempt to normalize it:

t1 <- UPC_Generic_ExpressionSet(gse)

Then I visualize it simply with boxplot(exprs(t1)),
and what you see is that the boxplots are all over the place; even after a log2 transform they still look all over the place.

I might be missing something. What I'm really looking for is something like limma's normalizeBetweenArrays(), but which will also deal with the negative values that appear after the log transform.
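The kind of between-array step asked about here can be sketched in base R. This is a minimal illustration of quantile normalization with a guarded log2 transform, not limma's implementation: ties are broken by order of appearance rather than averaged, and missing values are not handled. The function names `safe_log2` and `quantile_normalize`, and the choice to floor non-positive values at 1, are assumptions made for the example.

```r
# Log2-transform summarized intensities, guarding against non-positive values.
# Assumption: values below floor_value carry no usable signal, so they are
# raised to floor_value (log2(1) = 0) instead of giving NaN or -Inf.
safe_log2 <- function(x, floor_value = 1) {
  log2(pmax(x, floor_value))
}

# Quantile-normalize the columns (samples) of a numeric matrix so that
# every sample ends up with the same empirical distribution.
quantile_normalize <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x), nrow(x) > 1, !anyNA(x))
  # Rank of each value within its own sample.
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the k-th smallest value across samples.
  reference <- rowMeans(apply(x, 2, sort))
  # Replace each value by the reference value of the same rank.
  out <- apply(ranks, 2, function(r) reference[r])
  dimnames(out) <- dimnames(x)
  out
}

# Toy data: 10 probes x 4 samples, including some negative values.
set.seed(1)
raw <- matrix(rexp(40, rate = 1 / 200) - 20, nrow = 10,
              dimnames = list(paste0("probe", 1:10), paste0("GSM", 1:4)))
norm <- quantile_normalize(safe_log2(raw))
```

With a real ExpressionSet, the same idea would be applied to `exprs(eset)`; limma's `normalizeBetweenArrays(x, method = "quantile")` is the packaged equivalent of the second function. Values in a GEO series matrix are often already log-transformed or normalized by the submitter, so it is worth checking the value range before transforming again.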


You are right that there will be variability in the distributions across the samples, because this method does not perform a multi-sample correction (as RMA does). There are logistical advantages to single-sample methods like SCAN/UPC, and this method performs well in many scenarios, even when there is variability across the samples. I'm not sure why, but the UPC method is having a hard time distinguishing between expressed and non-expressed genes for these samples, whether I start with the raw data or the preprocessed values. Most genes are showing up as background noise, possibly due to data-quality issues. It might be worth trying the method on another data set and seeing how that looks.
