Question

Differential expression analysis: Permutation Test

Radek ▴ 90

Hello!

I have an RNA-seq experiment that I analyzed using DESeq2, limma and edgeR with the following design:

11 treatments applied to 2 different cell lines
1 control from each cell line.

The classical design matrix is the following:

Samples	Treat1	Treat2	...	Treat11	Control	CellLine
1	1	0		0	0	1
2	1	0		0	0	0
...
23	0	0		0	1	1
24	0	0		0	1	0

So to summarize, I have two biological replicates for each condition (11), with the cell line information as a blocking factor.

Due to small issues in the wet-lab experiment, I would like to exclude the possibility that the results I obtain are due to things unrelated to the treatment that was applied to the cells. A colleague suggested using a permutation test on the design matrix to check whether my targets were just background noise or truly differentially expressed genes.

From my reading of some articles and posts, I said that it was most probably not possible because of statistical issues with the assumptions made when doing a permutation test. Unfortunately, I don't have a background in statistics strong enough to easily explain the reason.

It is interesting to note here that my treatments have little effect on the transcriptome of my cells, since only a few genes are differentially expressed:

	Treat1	Treat2	Treat3	Treat4	Treat5	Treat6	Treat7	Treat8	Treat9	Treat10
DESeq2	196	7	40	0	33	0	9	0	1	18
Limma	83	0	5	3	9	1	8	1	9	9
edgeR	229	7	32	16	17	2	12	3	10	10

My questions are the following:

1) Can you do a permutation test at some step of the differential expression analysis (edgeR/limma/DESeq2) that could rule out effects unrelated to the treatment?

2) If the answer to 1) is no, could you explain why not in the simplest way possible? Which rules are we breaking when we try to permute the columns of the design matrix?

Thanks in advance,

Radek

PS: If you have relevant papers I should read to better understand the answer to my question I would be happy to read them.

deseq2 limma edger permutation • 5.1k views

updated 5 months ago by Gordon Smyth 50k • written 8.2 years ago by Radek ▴ 90

score 4 · Answer 1 · 2016-02-22

Permutations are most easily applied when you have two groups and nothing else, such that you can shuffle observations between groups under the null hypothesis that they're all from the same distribution (for simplicity, I'll just assume that all library sizes are equal, so we're dealing with the same NB distribution for all counts of each gene). In your case, it's complicated because you're blocking on the cell line. This would suggest that you can really only shuffle samples within the cell line (i.e., permute odd samples separately from even samples), otherwise you'd end up testing the null hypothesis that the two cell lines are the same. This is unlikely to be interesting or relevant, unless I've misunderstood the purpose of your experiment.

More problematic is the fact that you have multiple groups - if you shuffle samples across all groups, then your null hypothesis would be that counts from all groups come from the same distribution. This would not be useful if you're trying to identify DE genes between two particular treatments, or between a treatment and the control. In fact, if you wanted to test for DE between groups, you would be restricted to permuting only between those groups, e.g., if you wanted to test for differences between treatment 1 and control, you could swap samples 1 and 23 or samples 2 and 24 (keeping in mind the blocking on the cell line). That gives you a grand total of 4 permutations, 2 of which are redundant (each relabelling has a mirror image that gives the same two-sided statistic), and a minimum p-value of 0.5.
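To make the counting concrete, here is a minimal Python sketch that enumerates the blocked relabellings for treatment 1 versus control. The expression values and the mean-difference statistic are made up purely for illustration (they are not from the experiment, and this is not what edgeR, limma or DESeq2 compute); only the sample layout (1 and 23 in one cell line, 2 and 24 in the other) follows the design above.

```python
from itertools import product

# Hypothetical log-expression values for a single gene (illustrative only).
values = {"s1": 5.1, "s2": 4.8, "s23": 3.9, "s24": 4.1}

# One (treated, control) pair per cell line; labels may only be swapped within a pair.
blocks = [("s1", "s23"), ("s2", "s24")]

def mean_difference(swaps):
    """Mean treated-minus-control difference; swaps[i] flips the labels in block i."""
    diffs = []
    for (treated, control), swap in zip(blocks, swaps):
        d = values[treated] - values[control]
        diffs.append(-d if swap else d)
    return sum(diffs) / len(diffs)

# Every allowed relabelling: swap or don't swap, independently in each block.
assignments = list(product([False, True], repeat=len(blocks)))

observed = abs(mean_difference((False, False)))
perm_stats = [abs(mean_difference(a)) for a in assignments]

# Two-sided permutation p-value (small tolerance guards against float rounding).
p_value = sum(s >= observed - 1e-12 for s in perm_stats) / len(perm_stats)

print(len(assignments), "relabellings, p =", p_value)
```

Swapping both pairs only flips the sign of the statistic, so the observed labelling always ties with its mirror image, and the two-sided p-value cannot drop below 2/4 no matter how large the difference is.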

More generally, the controls should protect you from spurious DE unrelated to the treatment. Any experimental factors that could affect gene expression in the treatment samples should also apply to the controls, and cancel out in the DE analysis. If that's not the case, you should make some better controls. I don't see how permutation testing would offer any more protection.

score 3 · Answer 2 · 2016-02-22

See the section called "Parametric modelling versus permutation methods" in this article:

http://nar.oxfordjournals.org/content/43/7/e47

In brief:

1. Permuting tests the wrong null hypothesis.
2. Permuting is technically incorrect for RNA-seq because the samples are not exchangeable (they have different library sizes).
3. Permuting is computationally slow.
4. Permuting cannot possibly give any significant DE genes in a small genomic experiment because permutation is incapable of giving p-values small enough to be significant after multiple testing adjustments.
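Point 4 can be illustrated with a little arithmetic. The sketch below assumes a plain two-group comparison with n samples per group and no blocking (simpler than the design in the question), a hypothetical 20,000 genes, and a Bonferroni threshold as a simple stand-in for whatever multiple testing adjustment is actually used.

```python
from math import comb

def min_two_sided_p(n_per_group):
    """Smallest two-sided permutation p-value for an unblocked n-vs-n comparison.

    There are C(2n, n) ways to assign the group labels, and the observed
    labelling always ties with its mirror image, hence the factor of 2.
    """
    return 2 / comb(2 * n_per_group, n_per_group)

n_genes = 20_000               # hypothetical number of genes tested
alpha = 0.05
bonferroni = alpha / n_genes   # per-gene threshold under Bonferroni

for n in (2, 3, 5, 10, 12):
    p = min_two_sided_p(n)
    print(f"n = {n:2d} per group: min p = {p:.3g}, can pass threshold: {p <= bonferroni}")
```

Under these assumptions, two replicates per group give a smallest possible p-value of 2/6, far above the threshold of 2.5e-6, and roughly a dozen replicates per group are needed before any gene could reach it.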