Question

Perform limma on a subset of genes of interest

1

Entering edit mode

Simone ▴ 190

@simone-5854

Last seen 7.6 years ago

Hello!

I've got a small question regarding limma and couldn't find a direct answer anywhere: Does it make sense to perform limma on only a subset of genes?

An example: I am only interested in if there are differences in gene expression between the groups of interest in 700 previously defined genes that had been identified (by limma) in another independent experiment (using a different experimental technology). So one could say it would make sense to test only the 700 genes in order to reduce the multiple testing burden. But I was wondering if this would make
sense, as limma (eBayes) borrows information across all genes. There are about 50 biological replicates per group available.

Would you recommend to 'feed' limma only with the 700 genes of interest, or to perform the analysis on all genes and do the filtering afterwards?

Best,
Simone

limma • 4.5k views

ADD COMMENT • link 11.2 years ago Simone ▴ 190

score 3 · Answer 1 · 2014-11-24

In short: yes, it does make sense.

For some more detail, fortunately the first "hit" on the right "Similar posts" sidebar has your answer :-) The post was entitled "how to test for genes of interest?"

There, you'll find that Gordon suggests that you only filter the genes during the topTable step, like so:

  topTable(fit[indicesofinterest,]))

since it is desirable to keep all of the genes in during the eBayes step.

score 0 · Answer 2 · 2014-11-24

0

Entering edit mode

Simone ▴ 190

@simone-5854

Last seen 7.6 years ago

Okay, I missed that question (and answer), sorry! So actually the answer is it does not really make sense to perform limma on a subset of genes, or better said it is not a good way to do so, and the filtering should be done after obtaining the results by limma. This is what I imagined.

Nevertheless, the correction for multiple hypotheses testing can be applied only for the genes of interest, as far as I understand from your answers, because the adjusted p-values change when a subsetting of the fit is performed for the topTable function like suggested.

This is what I would have done intuitively as well, but isn't it a bit of "cheating" if I first include all genes in the statistical model (for a more robust Bayesian estimation) and then perform the BH correction on the results of differential expression as if I only had included a subset of the genes?

I read through the other "similar posts" right now and saw that there have been several questions like this already, but in many of them no clear conclusion was reached. I think I'll stick to Gordon's suggestion and will see how much the results change (regarding the adjusted p-values).

Thank you for your help.

Simone

ADD COMMENT • link 11.2 years ago Simone ▴ 190

2

Entering edit mode

No cheating is involved. Under the null hypothesis, the moderated tests in the EB framework should produce uniformly distributed p-values. This is true regardless of the number of genes that are used (given that there are enough genes for stable estimation of some parameters). So long as your manner of selecting the subset is independent of the contrast of interest, the ensuing application of the BH method to the p-values in the subset is valid.

ADD REPLY • link 11.2 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thank you for the explanation, this is very helpful. Actually the subset of interest in my case is not really independent of the contrast of interest, because the contrast itself is exactly the same as the one used for defining the subset of genes, just on a different dataset of biologically different (but not independent either, I would say) data.

More clearly: I have one model testing for differential expression which identified a list of differentially expressed genes between different groups. For the same groups (and even the same individuals) I have DNA methylation data (transformed to be suitable for limma, following recommendations of Gordon Smyth and Alicia Oshlack in this context) and want to apply the same contrasts, but at this point I am only interested in the methylation profiles of the subset of my genes of interest. This is where the question of if and at which point of the analysis removing the rest of the genes makes sense (and is valid) appeared.

ADD REPLY • link 11.2 years ago Simone ▴ 190

2

Entering edit mode

If I understand correctly, you've used the differential expression data to define a subset of DE genes. Now you're testing for differential methylation throughout the genes in that subset. I think that's fine, as the selection of your subset is blind to differential methylation p-values. Specifically, the p-values will be still be uniformly distributed for any true nulls in your subset, so application of the BH method will be valid.

More succintly, the independence I was talking about refers to the stochastic sampling of p-values under the null hypothesis. Otherwise, you might end up keeping only the small p-values, leading to an excess of false rejections and loss of type I error control. Independence between the underlying biological truths of each dataset (i.e., whether or not there is truly differential expression/methylation for a gene) is not required.

ADD REPLY • link 11.2 years ago Aaron Lun ★ 29k

0

Entering edit mode

Yes, you understood correctly. I was already suspecting that the independence you wrote about was more of a technical nature regarding the way the p-values are obtained, nevertheless I wanted to be sure, so thank you very much for the further clarification and the helpful discussion.

ADD REPLY • link 11.2 years ago Simone ▴ 190