Search
Question: Perform limma on a subset of genes of interest
0
3.6 years ago by
Simone170
Simone170 wrote:

Hello!

I've got a small question regarding limma and couldn't find a direct answer anywhere: Does it make sense to perform limma on only a subset of genes?

An example: I am only interested in if there are differences in gene expression between the groups of interest in 700 previously defined genes that had been identified (by limma) in another independent experiment (using a different experimental technology). So one could say it would make sense to test only the 700 genes in order to reduce the multiple testing burden. But I was wondering if this would make
sense, as limma (eBayes) borrows information across all genes. There are about 50 biological replicates per group available.

Would you recommend to 'feed' limma only with the 700 genes of interest, or to perform the analysis on all genes and do the filtering afterwards?

Best,
Simone

modified 3.6 years ago • written 3.6 years ago by Simone170
3
3.6 years ago by
Denali
Steve Lianoglou12k wrote:

In short: yes, it does make sense.

For some more detail, fortunately the first "hit" on the right "Similar posts" sidebar has your answer :-) The post was entitled "how to test for genes of interest?"

There, you'll find that Gordon suggests that you only filter the genes during the topTable step, like so:

  topTable(fit[indicesofinterest,]))

since it is desirable to keep all of the genes in during the eBayes step.

0
3.6 years ago by
Simone170
Simone170 wrote:

Okay, I missed that question (and answer), sorry! So actually the answer is it does not really make sense to perform limma on a subset of genes, or better said it is not a good way to do so, and the filtering should be done after obtaining the results by limma. This is what I imagined.

Nevertheless, the correction for multiple hypotheses testing can be applied only for the genes of interest, as far as I understand from your answers, because the adjusted p-values change when a subsetting of the fit is performed for the topTable function like suggested.

This is what I would have done intuitively as well, but isn't it a bit of "cheating" if I first include all genes in the statistical model (for a more robust Bayesian estimation) and then perform the BH correction on the results of differential expression as if I only had included a subset of the genes?

I read through the other "similar posts" right now and saw that there have been several questions like this already, but in many of them no clear conclusion was reached. I think I'll stick to Gordon's suggestion and will see how much the results change (regarding the adjusted p-values).

Simone

2

No cheating is involved. Under the null hypothesis, the moderated tests in the EB framework should produce uniformly distributed p-values. This is true regardless of the number of genes that are used (given that there are enough genes for stable estimation of some parameters). So long as your manner of selecting the subset is independent of the contrast of interest, the ensuing application of the BH method to the p-values in the subset is valid.

Thank you for the explanation, this is very helpful. Actually the subset of interest in my case is not really independent of the contrast of interest, because the contrast itself is exactly the same as the one used for defining the subset of genes, just on a different dataset of biologically different (but not independent either, I would say) data.

More clearly: I have one model testing for differential expression which identified a list of differentially expressed genes between different groups. For the same groups (and even the same individuals) I have DNA methylation data (transformed to be suitable for limma, following recommendations of Gordon Smyth and Alicia Oshlack in this context) and want to apply the same contrasts, but at this point I am only interested in the methylation profiles of the subset of my genes of interest. This is where the question of if and at which point of the analysis removing the rest of the genes makes sense (and is valid) appeared.

2

If I understand correctly, you've used the differential expression data to define a subset of DE genes. Now you're testing for differential methylation throughout the genes in that subset. I think that's fine, as the selection of your subset is blind to differential methylation p-values. Specifically, the p-values will be still be uniformly distributed for any true nulls in your subset, so application of the BH method will be valid.

More succintly, the independence I was talking about refers to the stochastic sampling of p-values under the null hypothesis. Otherwise, you might end up keeping only the small p-values, leading to an excess of false rejections and loss of type I error control. Independence between the underlying biological truths of each dataset (i.e., whether or not there is truly differential expression/methylation for a gene) is not required.