Hello!
I've got a small question regarding limma and couldn't find a direct answer anywhere: Does it make sense to perform limma on only a subset of genes?
An example: I am only interested in if there are differences in gene expression between the groups of interest in 700 previously defined genes that had been identified (by limma) in another independent experiment (using a different experimental technology). So one could say it would make sense to test only the 700 genes in order to reduce the multiple testing burden. But I was wondering if this would make
sense, as limma (eBayes) borrows information across all genes. There are about 50 biological replicates per group available.
Would you recommend to 'feed' limma only with the 700 genes of interest, or to perform the analysis on all genes and do the filtering afterwards?
Best,
Simone
No cheating is involved. Under the null hypothesis, the moderated tests in the EB framework should produce uniformly distributed p-values. This is true regardless of the number of genes that are used (given that there are enough genes for stable estimation of some parameters). So long as your manner of selecting the subset is independent of the contrast of interest, the ensuing application of the BH method to the p-values in the subset is valid.
Thank you for the explanation, this is very helpful. Actually the subset of interest in my case is not really independent of the contrast of interest, because the contrast itself is exactly the same as the one used for defining the subset of genes, just on a different dataset of biologically different (but not independent either, I would say) data.
More clearly: I have one model testing for differential expression which identified a list of differentially expressed genes between different groups. For the same groups (and even the same individuals) I have DNA methylation data (transformed to be suitable for limma, following recommendations of Gordon Smyth and Alicia Oshlack in this context) and want to apply the same contrasts, but at this point I am only interested in the methylation profiles of the subset of my genes of interest. This is where the question of if and at which point of the analysis removing the rest of the genes makes sense (and is valid) appeared.
If I understand correctly, you've used the differential expression data to define a subset of DE genes. Now you're testing for differential methylation throughout the genes in that subset. I think that's fine, as the selection of your subset is blind to differential methylation p-values. Specifically, the p-values will be still be uniformly distributed for any true nulls in your subset, so application of the BH method will be valid.
More succintly, the independence I was talking about refers to the stochastic sampling of p-values under the null hypothesis. Otherwise, you might end up keeping only the small p-values, leading to an excess of false rejections and loss of type I error control. Independence between the underlying biological truths of each dataset (i.e., whether or not there is truly differential expression/methylation for a gene) is not required.
Yes, you understood correctly. I was already suspecting that the independence you wrote about was more of a technical nature regarding the way the p-values are obtained, nevertheless I wanted to be sure, so thank you very much for the further clarification and the helpful discussion.