Proper way to build phenoData object?
chandlerjd58 ▴ 10
@chandlerjd58-7257
Last seen 9.2 years ago
United States

Rather than starting at RMA, I'd like to load my Affy data into an expression set in R and use limma to do the full, proper fold change analysis. (My understanding is that RMA is quantile normalized, so fold change analysis won't be accurate.) My goal is to go from CEL files to the limma top table. Bioconductor has posted a very helpful page describing this (http://www.bioconductor.org/help/workflows/arrays/)

However, I don't have a clue what the right way is to build the phenoData object, and I have seen very few examples of what the object even looks like. The most detailed example I have found is in the Biobase ExpressionSet introduction vignette (viewed here through R's local help server: http://127.0.0.1:20212/library/Biobase/doc/ExpressionSetIntroduction.pdf).

> head(pData(phenoData))
  gender    type score
A Female Control  0.75
B   Male    Case  0.40
C   Male Control  0.73
D   Male    Case  0.42
E Female    Case  0.93
F   Male Control  0.22

But the way you are supposed to build this uses other components I don't understand. The code below is from the aforementioned Bioconductor page.

## import "phenotype" data, describing the experimental design
phenoData <- 
    read.AnnotatedDataFrame(system.file("extdata", "pdata.txt",
    package="arrays"))

Would it be okay if I just read a tab-delimited table into R? Like this hypothetical text object:

    treatment
A   sunshine
B   deadly poison   
C   sunshine
D   deadly poison   
E   sunshine
F   deadly poison

...Because I understand how to do that, but not the phenoData creation code outlined above. Please help. Thank you!

limma affy biobase
Guido Hooiveld ★ 3.9k
@guido-hooiveld-2020
Last seen 40 minutes ago
Wageningen University, Wageningen, the …

This is how I usually do this:

library(Biobase)  # provides AnnotatedDataFrame and the phenoData<- setter

pheno <- read.delim(file="MyTabDelim.txt", row.names=1, colClasses="factor")
phenoData(MyArrayData) <- AnnotatedDataFrame(pheno)
Where:

- 'MyTabDelim.txt' is a file with sample names in rows and meta-data in columns.
- 'MyArrayData' is an AffyBatch (unnormalized data) or ExpressionSet (normalized data) object.

Note that in my case all meta-data are categorical (even the numbers), hence the use of colClasses="factor".
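Putting this together with the original goal (going from CEL files to a limma topTable), a minimal end-to-end sketch could look like the following. It assumes the CEL files sit in the working directory, that the row names of the hypothetical 'MyTabDelim.txt' match the CEL file names, and that the file contains a two-level 'treatment' column as in the question:

```r
library(affy)    # ReadAffy(), rma(); loads Biobase for AnnotatedDataFrame
library(limma)   # lmFit(), eBayes(), topTable()

# sample meta-data: row names must match the CEL file names
pheno <- read.delim("MyTabDelim.txt", row.names = 1, colClasses = "factor")

# raw probe-level data plus the meta-data
raw <- ReadAffy()                            # AffyBatch from ./*.CEL
phenoData(raw) <- AnnotatedDataFrame(pheno)

# background correction, quantile normalization, log2 summarization
eset <- rma(raw)                             # ExpressionSet

# two-group comparison on the hypothetical 'treatment' column
design <- model.matrix(~ treatment, data = pData(eset))
fit    <- eBayes(lmFit(eset, design))
topTable(fit, coef = 2)                      # log2 fold changes etc.
```

The ranked topTable output is the end point the question asks for; everything between ReadAffy and rma can be swapped for another normalization method if desired.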


Thanks Guido! So I guess head(pheno) could look like:

Name   Class1  Class2  Class3
NameA       1       0       0
NameB       0       1       0
NameC       1       0       1
NameD       0       1       1

From there, you use Biobase or affy to make pheno into an AnnotatedDataFrame, which is assigned as the phenoData of MyArrayData.

Do you have a preference to use AffyBatch or ExpressionSet? If my goal is to get a limma topTable, I would think I would want the eSet because that is how I often see it posted. On the other hand I am worried that doing log2 fold change analysis post RMA is spurious due to the quantile normalization that has taken place.

...I am guessing those numbers acting as row names would refer to transcript cluster IDs?


Note that AffyBatch and ExpressionSet are not interchangeable. Generally you use ReadAffy to get an AffyBatch object from your CEL files, then you run rma (or whatever normalization algorithm you choose), which returns an ExpressionSet that you can then analyze with limma. An AffyBatch contains the individual probe intensities, while an ExpressionSet contains the probeset-summarized values.
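The probe-level versus probeset-level distinction can be illustrated with a toy example in plain base R (a simple median stands in here for RMA's actual median-polish summarization, and the probe/probeset names are made up):

```r
# toy probe-level matrix: 6 probes, 2 arrays; probes map to 2 probesets
probes <- matrix(c(5, 6, 7, 20, 22, 21,
                   5, 7, 6, 40, 44, 42),
                 nrow = 6,
                 dimnames = list(paste0("probe", 1:6), c("arrayA", "arrayB")))
probeset <- rep(c("ps1", "ps2"), each = 3)

# summarize the probes within each probeset, one column (array) at a time
summarized <- apply(probes, 2, function(x) tapply(x, probeset, median))

dim(probes)      # 6 x 2 -- probe level, like an AffyBatch
dim(summarized)  # 2 x 2 -- probeset level, like an ExpressionSet
```

One row per probeset remains after summarization, which is why limma operates on the ExpressionSet rather than the AffyBatch.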


I am interested in your reasoning about quantile normalization making fold change analysis spurious. The RMA algorithm has been the de facto normalization/summarization algorithm for over 10 years now, and you are the first that I can recall who has made this claim. Do you have a reference, or is this your own personal belief?


Personal belief. I'm happy to be corrected if I'm wrong about this. :) Admittedly, the difference between log2 alone and log2 + quantile normalization is minimal when I have made direct comparisons, but there is a difference. If it's something the entire field has decided is acceptable, then I will defer to you all. :) I'm a biologist trying to teach myself bioinformatics, so I'm still learning most of the conventions. Furthermore, I am picking up gene expression arrays from within a metabolomics-heavy group, so sometimes what they say differs from what someone in the gene expression field would say.

To explain my claim, I just did a quick fold change analysis on some of my own data, using only one feature. The raw result was (average[rawA])/(average[rawB]) = 2.194. After log2 alone, I can still recover this result by raising 2 to the power of each value; the result is identical. But after quantile normalization (across the whole 13,000-feature data set, obviously) the fold change has changed to 2.202. Very similar, but not the same.
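For what it's worth, the same pattern is easy to reproduce on simulated data (hand-rolled quantile normalization, hypothetical numbers; the specific values 2.194/2.202 above come from my own data and are not reproduced here):

```r
set.seed(1)
# 100 features, 6 "arrays" of raw intensities
raw <- matrix(2^rnorm(600, mean = 8), nrow = 100,
              dimnames = list(NULL, paste0("s", 1:6)))

# fold change for one feature, samples 1-3 vs 4-6
fc_raw <- mean(raw[1, 1:3]) / mean(raw[1, 4:6])

# log2 alone is invertible, so the ratio comes back unchanged
fc_log <- mean(2^log2(raw[1, 1:3])) / mean(2^log2(raw[1, 4:6]))

# quantile normalization replaces each value by a cross-sample rank
# average, so per-feature ratios shift slightly
rank_means <- rowMeans(apply(raw, 2, sort))
qn <- apply(apply(raw, 2, rank, ties.method = "first"), 2,
            function(r) rank_means[r])
fc_qn <- mean(qn[1, 1:3]) / mean(qn[1, 4:6])

c(raw = fc_raw, log2 = fc_log, quantile = fc_qn)
```

The raw and log2-round-trip fold changes agree exactly, while the quantile-normalized one drifts a little, which matches what I see on real data.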


Right. But that is the whole purpose of any normalization.

Say you have 10 arrays that for whatever reason are brighter (have higher fluorescence values for all spots on the array) than another set of 10 arrays, due to some technical variability (they used a different batch of reagents, the wash station was partly plugged for that batch, a different tech ran one set, whatever). Since a portion of the difference between these two sets of arrays is completely technical rather than biological, you want to remove as much of that technical variability as possible, because it is potentially masking your true biological differences. This is what a normalization procedure (any normalization procedure, not just quantile normalization) is intended to do. If you don't normalize, you are introducing unwanted bias into your results.

In addition, when you compare groups, you are assuming that the intra-group variance represents the true biological variability in the measures you are comparing. If you have unwanted technical variability included in your estimate, then your variance estimate may be inflated, resulting in reduced power to detect true differences.

So you are correct that normalizing the data will affect the computed fold changes. But you are not correct when you think this is necessarily a bad thing. It's a feature, not a bug.
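The brighter-arrays example above can be sketched with a hand-rolled quantile normalization in base R (a simplified version for illustration; real implementations such as the one in preprocessCore handle ties and missing values more carefully):

```r
# quantile normalization by hand: replace each value by the mean of the
# values sharing its rank across arrays
quantile_normalize <- function(m) {
  rank_means <- rowMeans(apply(m, 2, sort))          # mean at each rank
  apply(apply(m, 2, rank, ties.method = "first"), 2, # per-column ranks
        function(r) rank_means[r])                   # put means back
}

# 'bright' array: same ordering of spots as arrayA, everything shifted up
m <- cbind(arrayA = c(2, 4, 6, 8),
           bright = c(3, 5, 7, 9))
quantile_normalize(m)  # the purely technical offset is gone:
                       # both columns become 2.5, 4.5, 6.5, 8.5
```

Because the biological ordering of the spots is identical on both arrays, normalization removes only the technical brightness offset, which is exactly the intended behavior.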

You could argue that a given normalization scheme is too 'strong' and removes some biological variability, but to do that well you need to back up your premise with a bunch of statistical blahblahblah, so we should leave that to the PhD statisticians amongst us. And they got bored of that argument in like 2006, at which time most people just said 'Eh, I'll just do RMA and call it a day'.

