Question

permutation test in edgeR

Netanel • 0

@91c8441e

Last seen 3 months ago

Israel

Hello,

I have a simple RNA-seq experiment with treatment and control, each with 3 biological replicates. I ran my data through edgeR and obtained differentially expressed genes (DEGs). Due to the low sample number and small effect size, there are likely more genes affected by the treatment that didn't meet the cutoff. For that reason, I want to try extracting more information from my data using a permutation test. I resampled my data by shuffling the columns and generated 1000 permuted dataframes. Next, I ran each dataframe through my edgeR pipeline, which produced results such as RowName, logFC, logCPM, LR, and PValue. My question is: which value (for example, logFC or PValue) do I take from each permuted dataframe to generate the distribution for each gene and calculate the p-value? Also, how is the p-value calculated?

Thank you!

rna-seq edgeR • 484 views

updated 3 months ago by Gordon Smyth 51k • written 3 months ago by Netanel • 0

Gordon Smyth 51k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

The approach you are taking doesn't make much sense. You have started with a method (edgeR) that is very powerful and able to extract as much as is possible from small sample sizes and small effect sizes, but you want to replace it with a much less powerful method (permutation) that will generally return much less significant DE. So I'm not following the logic of that. Far from alleviating the problems with small sample sizes and small effects, permutation will make these problems worse.

You have cross-posted this question to at least four help forums that I have seen. Scattering your question around like this means that you've got a variety of advice, some good but also some that is not correct. You have been told on some forums that you have to use logFC as the permutation statistic, but actually the likelihood ratio statistic would be preferable for this purpose because it has a null distribution that is more nearly constant between genes compared to the logFC. You have been given a p-value computation on Cross-Validated that isn't correct and which will give false positives. The p-value computation given above by James MacDonald is the correct one. If you undertake permutation tests for each gene individually, it is actually mathematically impossible for you to obtain any significant results after adjustment for multiple testing. So it's hard to see how that helps you. The only way to get significant results from a permutation approach is to use the LRT to estimate a global null distribution across all genes, but that is a much more sophisticated approach than you seem to be contemplating, and it still won't beat the standard parametric edgeR method.
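To make the multiple-testing point concrete, here is a minimal sketch, assuming a 3 vs 3 design and Benjamini-Hochberg adjustment. The `bh_adjust` helper and the gene count of 10,000 are hypothetical and chosen only for illustration; nothing here is edgeR code. With 20 distinct group labelings, a per-gene permutation p-value can be no smaller than 1/20, and a BH-adjusted p-value is never smaller than the smallest raw p-value.

```python
from math import comb

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (plain Python, illustrative helper)."""
    m = len(pvals)
    # Walk from the largest p-value down, keeping a running minimum.
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top  # rank of this p-value in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

n_labelings = comb(6, 3)   # 20 ways to pick 3 "treatment" samples out of 6
min_p = 1 / n_labelings    # smallest achievable per-gene permutation p-value

# Most favourable (hypothetical) case: every gene reaches the minimum p-value.
n_genes = 10_000
adjusted = bh_adjust([min_p] * n_genes)
best_adjusted = min(adjusted)
print(n_labelings, min_p, best_adjusted)
```

Even in this most favourable case the adjusted p-values stay at the per-gene floor, so no gene can drop below a 0.05 FDR cutoff.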

On top of all that, the basic assumptions of permutation tests are not even satisfied by RNA-seq data. Permutation tests assume that the samples are identically distributed under the null hypothesis, but that is never true for RNA-seq because different samples have different sequencing depths and hence different precisions. So even the most sophisticated permutation approaches fail to control the FDR correctly when there is an imbalance in the library sizes between the groups.

If you really want to explore DE in your data, you should be making exploratory plots using plotMDS and plotMD as shown in the edgeR case studies to check for things like outliers and batch effects. Alternatively, you could explore the DE results with a quantile-quantile probability plot using z-scores from the signed square-root likelihood ratio test statistics.
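As a rough sketch of the z-score idea, assuming you have exported the `logFC` and `LR` columns from an edgeR results table: the signed square root of the LR statistic is approximately standard normal under the null, so its sorted values can be compared against normal quantiles. The function names, the toy numbers, and the `(i + 0.5) / n` plotting positions are my own assumptions for illustration, not part of edgeR.

```python
from math import copysign, sqrt
from statistics import NormalDist

def signed_root_lr(log_fc, lr):
    """z-scores: square root of the LR statistic, signed by the direction of logFC."""
    return [copysign(sqrt(stat), fc) for fc, stat in zip(log_fc, lr)]

def normal_qq_points(z):
    """Pairs of (theoretical normal quantile, observed z) for a QQ plot."""
    n = len(z)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(z)))

# Hypothetical values standing in for the logFC and LR columns.
log_fc = [0.8, -1.2, 0.1, -0.3, 2.0]
lr = [1.44, 4.0, 0.04, 0.25, 9.0]

z = signed_root_lr(log_fc, lr)
qq = normal_qq_points(z)
for theo, obs in qq:
    print(f"{theo:7.3f} {obs:7.3f}")
```

Points that fall well away from the diagonal in the tails would be the genes worth a closer look; on real data you would plot these pairs rather than print them.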

See almost the same discussion 9 years ago: FDR vs Permutation approach in edgeR

3 months ago Gordon Smyth 51k

score 2 · Accepted Answer · 2024-04-08

You cannot do a permutation test with 1000 permutations if you only have six total samples. You will have many (many!) 'permutations' that are exactly the same thing, and those do not provide any useful information. Calling this sort of thing a permutation test is not correct, as what matters are the combinations, not the permutations. For example, a t-test from any of the following permutations is identical:

1,2,3 vs 4,5,6
1,3,2 vs 4,5,6
3,1,2 vs 4,5,6
3,2,1 vs 4,5,6
...
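A quick way to convince yourself of this, as a sketch with made-up expression values for samples 1 to 6 (the numbers and the `t_stat` helper are hypothetical): reorder the first group every possible way and check that the t-statistic never changes.

```python
from itertools import permutations
from math import sqrt
from statistics import mean, stdev

def t_stat(a, b):
    """Welch t-statistic for two groups (illustrative helper)."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

# Hypothetical expression values for one gene, keyed by sample number.
values = {1: 5.1, 2: 6.3, 3: 5.8, 4: 7.9, 5: 8.4, 6: 7.2}

group_b = [values[i] for i in (4, 5, 6)]
orderings = list(permutations((1, 2, 3)))

# Collect the statistic for every ordering of samples 1, 2, 3 within group A.
stats = {round(t_stat([values[i] for i in order], group_b), 12)
         for order in orderings}

n_orderings = len(orderings)
n_distinct = len(stats)
print(n_orderings, "orderings,", n_distinct, "distinct t-statistic")
```

All six orderings of the first group collapse to a single value of the statistic, which is why only the group membership carries information.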

In addition, you cannot do 1000 unique permutations with this number of samples. There are only 120 unique permutations (ordered choices of three samples out of six, 6 × 5 × 4) and 20 unique combinations. If you were to assume that those 120 permutations are all informative, your smallest possible p-value would be 0.008, and if you did it correctly, the smallest p-value would be 0.048, neither of which will survive any multiplicity correction. The p-value computation in this case is (B + 1)/(nperm + 1), where B is the number of permutations that produce a p-value < α.
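The arithmetic can be checked with a few lines, as a sketch only. I am reading the 120 as the number of ordered ways to pick three samples out of six, and taking B = 0 as the best case where no permuted result is as extreme as the observed one; both readings are assumptions on my part. The `permutation_p_value` name is hypothetical.

```python
from itertools import combinations
from math import perm

samples = [1, 2, 3, 4, 5, 6]

# Each combination is one way of choosing which three samples form the first group.
labelings = list(combinations(samples, 3))
n_comb = len(labelings)   # unordered choices of 3 out of 6
n_ordered = perm(6, 3)    # ordered choices: 6 * 5 * 4

def permutation_p_value(b, nperm):
    """(B + 1) / (nperm + 1), the formula quoted in the post."""
    return (b + 1) / (nperm + 1)

# Best case, B = 0 (assumed meaning: nothing as extreme as the observed data).
min_p_ordered = permutation_p_value(0, n_ordered)
min_p_comb = permutation_p_value(0, n_comb)

print(n_ordered, round(min_p_ordered, 3))
print(n_comb, round(min_p_comb, 3))
```

These reproduce the 0.008 and 0.048 floors quoted above.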

Your best bet is to use either edgeR or DESeq2 to do the analysis and call it good.