Rather than starting at RMA, I'd like to load my Affy data into an expression set in R and use limma to do the full, proper fold change analysis. (My understanding is that RMA is quantile normalized, so fold change analysis won't be accurate.) My goal is to go from CEL files to the limma top table. Bioconductor has posted a very helpful page describing this (http://www.bioconductor.org/help/workflows/arrays/).
However, I don't have a clue what the right way is to build the phenoData object. I have seen very few examples of what the object even looks like. Here is the most detailed I have seen, located here: http://127.0.0.1:20212/library/Biobase/doc/ExpressionSetIntroduction.pdf.
> head(pData(phenoData))
gender type score A Female Control 0.75 B Male Case 0.40 C Male Control 0.73 D Male Case 0.42 E Female Case 0.93 F Male Control 0.22
But the way you are supposed to make this is using other components I don't understand. Code below is from the aforementioned Bioconductor page.
## import "phenotype" data, describing the experimental design
phenoData <-
read.AnnotatedDataFrame(system.file("extdata", "pdata.txt",
package="arrays"))
Would it be okay if I just read a tab-delimited table into R? Like this hypothetical text object:
treatment A sunshine B deadly poison C sunshine D deadly poison E sunshine F deadly poison
...Because I understand how to do that, but not the phenoData creation code outlined above. Please help. Thank you!
Thanks Guido! So I guess head(pheno) could look like:
Name Class1 Class2 Class 3
NameA 1 0 0
NameB 0 1 0
NameC 1 0 1
NameD 0 1 1
From there, you use Biobase or affy to make pheno into an ADF which is read into phenoData for MyArrayData.
Do you have a preference to use AffyBatch or ExpressionSet? If my goal is to get a limma topTable, I would think I would want the eSet because that is how I often see it posted. On the other hand I am worried that doing log2 fold change analysis post RMA is spurious due to the quantile normalization that has taken place.
...I am guessing those numbers acting as row names would refer to transcript cluster IDs?
Note that AffyBatch and ExpressionSet are not interchangable. Generally you use ReadAffy to get an AffyBatch object from your CEL files, then you use RMA (or whatever normalization algorithm you choose), which returns an ExpressionSet, which you can then analyze with limma. AffyBatch contains the individual probe intensities, while Expressionset contains the probeset summarized values.
I am interested in your reasoning about quantile normalization making fold change analysis spurious. The RMA algorithm has been the de facto normalization/summarization algorithm for over 10 years now, and you are the first that I can recall who has made this claim. Do you have a reference, or is this your own personal belief?
Personal belief. I'm happy to be corrected if I'm wrong about this. :) Admittedly, the difference between log2 alone and log2+quantile is minimal when I have made direct comparisons, but there is a difference. If it's something the entire field has decided is acceptable then I will defer to you all. :) I'm a biologist trying to teach myself bioinformatics, so I'm just learning most of the conventions. Furthermore I am picking up gene expression arrays from within a metabolomics heavy group, so sometimes what they say will differ from someone in the gene expression field.
To explain my claim I just did a quick fold change analysis on some of my own data, using only one feature. The raw results (average[raw])/(average[rawB]) = 2.194. After log2, I can still get this result by raising 2 to the power of each value and the result is identical. But after quantile normalization (from the whole 13,000 feature data set, obviously) the fold change result has now changed to 2.202. Very similar but not the same.
Right. But that is the whole purpose of any normalization.
Say you have 10 arrays that for whatever reason are brighter (have higher fluorescence values for all spots on the array) than another set of 10 arrays, due to some technical variability (they used different batch of reagents, the wash station was sort of plugged for that batch, different tech ran one set, whatever). Since a portion of the difference between these two sets of arrays is completely technical, rather than biological, you should want to remove as much of that technical variability, because it is potentially masking your true biological differences. This is what the normalization procedure (any normalization procedure, not just quantile normalization) is intended to do. If you don't normalize, you are introducing unwanted bias into your results.
In addition, when you compare groups, you are assuming that the intra-group variance represents the true biological variability in the measures you are comparing. If you have unwanted technical variability included in your estimate, then your variance estimate may be inflated, resulting in reduced power to detect true differences.
So you are correct that normalizing the data will affect the computed fold changes. But you are not correct when you think this is necessarily a bad thing. It's a feature, not a bug.
You could argue that a given normalization scheme is too 'strong' and removes some biological variability, but to do that well you need to back up your premise with a bunch of statistical blahblahblah, so we should leave that to the PhD statisticians amongst us. And they got bored of that argument in like 2006, at which time most people just said 'Eh, I'll just do RMA and call it a day'.