Hi everyone,
I have microarray data from Affymetrix Human Genome U133 plus2.0 array to analyze from KO experiment with 2 different constructs and 3 replicates per condition. When I am doing DEG analysis with limma with all my probes (20422) I obtained only one significant gene (the one impacted by the KO) and all the others have the same adj.P.value (near 1) whereas pvalues are different and some have good FC (sup2). In the meantime if I filter my probes based on their STD and only keep fifty percent of the most variable (so 10 211 instead of 20 422) it completely changes statistics and now I have some significant DEG (not a lot, but with adj.p.value very low). I read in some papers that it could be possible to filter minimum variance across all samples, so if a gene isnt changing (constant expression) its inherently not interesting therefore no need to test. However, I am really not sure if I can do it from a mathematical point of view. Could you please confirm (or not) if it is possible or not ? Since the rest of the analysis (eg pathways) is based on this pvalue is really important for me !
Thank you very much for your help!
Sorry I had some trouble posting my question but now is good! So the only correct way to do is to keep all 20 422 probes ? Sorry I am new in this areas and I found everything and its opposite on papers and forums!
Presumably you are using the limma package to do the comparisons. If not, you should be. If you are, do note that the
eBayes
step is estimating a variance hyperparameter, which for purposes of this discussion we can think of as the average within-group variance over all genes. This hyperparameter is used as a variance prior, and all the observed variance estimates are then 'shrunken' towards the prior value (where 'shrunken' doesn't mean 'made smaller' but instead means 'made more similar to').If you remove all the low variance genes, then the hyperparameter will be larger than it should be, and you will now adjust all the observed variance estimates towards the large variance prior. This is not good!
You would probably be better off using
gcrma
to account for background as a function of GC content, and then filter based on the number of subjects that are above some level that you want to consider to be background binding. As an example, say you have two groups with five subjects per. You could look at the distribution of the average expression level to say what you think is a level where the genes are either not expressed, or expressed at a level that is so low that noise dominates signal. You could then filter genes based on that cutoff as well as the number of subjects that have to exceed the cutoff. Let's say you decide the gene expression level is 4, and the N that have to exceed that value is 3. You could doAnd now you have filtered the data in a way that won't affect the
eBayes
step, and since you usedgcrma
you can argue that the gene levels have been adjusted for GC content so a single arbitrary gene expression cutoff makes sense.Thanks for your elaborated your answer.
IMHO, the relevant article is Independent filtering increases detection power for high-throughput experiments from Bourgon et al.
That article is meant to support the
genefilter
package, which includes functionality to do row-wise t-tests. In that situation it's perfectly OK to filter on variance. But if the OP is using 'limma` (which is what the OP should be doing), then filtering on SD is a horrible idea.Thank you for your valuable comments pointing out the inconsistency of merging stages from different pipelines as I did. This might be the case for the OP.
It is ! Thank you to mention this important inconsistency indeed!
Thank you so much for all your explanation! I did what you mention (with rma instead of gcrma), and filter the data with gene expression level at least at 7.5 in 3 samples (I guess it is a lot but according the results I've obtained with the histogram of the median intensities, I have no choice I think). After this filtering, I obtained 10 995 on which I perform limma analysis and it result at almost exactly same result than without filtering (= no DEG).
I'd look at a PCA plot of your data, and probably a histogram of your p-values. It's likely there are no differences between the groups.
I have 9 different conditions (2 different KO + ctrl, each time in 3 different mouse organ) so it is a lot, in some I have big differences and in other it is not so obvious