Question: FDR and padj in DESeq and edgeR
3.0 years ago by
tonja.r40
United Kingdom
tonja.r40 wrote:

Both tools give adjusted p-values for multiple testing.
DESeq uses the Benjamini-Hochberg correction and edgeR uses FDR. According to Wikipedia, Benjamini-Hochberg is the same as FDR. So the correction is the same and only the names differ, right?
Also, how do those tools calculate the FDR? According to Wikipedia it is false_positives / (false_positives + true_positives). But how can one tell whether the false positives really are false positives? There are just genes with their p-values, and one does not know whether the top genes are true positives or false positives. How can one know the state of the null hypothesis when a gene is declared significant?

modified 3.0 years ago by Gordon Smyth35k • written 3.0 years ago by tonja.r40
3.0 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson6.9k wrote:

See the adjust.method argument for edgeR::topTags, and similarly, the pAdjustMethod argument for DESeq2::results. The default p-value adjustment method for both of them is Benjamini-Hochberg, which is labeled as "FDR" in edgeR and "padj" in DESeq2.

Note that there are many variations on methods of computing a false discovery rate, so "FDR" by itself does not designate a specific method; it is just a blanket term that encompasses all of those methods. The formula you give from the Wikipedia article is how you would calculate the FDR of a method if you had perfect knowledge of the system and, as a result, knew which positives were true and which were false. (For instance, this is how you would benchmark a method on a synthetic data set that you generated yourself from known parameters.) Since you don't have that knowledge in a real system, many methods for approximating or putting a bound on this quantity have been developed, of which the Benjamini-Hochberg method is one (specifically, it computes an upper bound, not an approximation, as Gordon points out).
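To make the benchmarking remark concrete, here is a minimal sketch in plain Python (the function and variable names are illustrative, not part of edgeR or DESeq2): it computes the realised FDR of a gene list when the true null status of every gene is known, e.g. on simulated data.

```python
def empirical_fdr(called, is_null):
    """Realised FDR = false_positives / (false_positives + true_positives),
    computable only when the true null status (is_null) of each gene is known."""
    positives = sum(called)  # total genes called significant
    false_positives = sum(1 for c, n in zip(called, is_null) if c and n)
    return false_positives / positives if positives else 0.0
```

With `called = [True, True, True, False]` and `is_null = [False, True, False, False]`, one of the three calls is a false discovery, so the realised FDR is 1/3. On real data `is_null` is unknown, which is exactly the questioner's point.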


Just to clarify. The Benjamini-Hochberg algorithm doesn't estimate the FDR. Rather it computes an upper bound for the expected FDR. (That is why the limma and edgeR packages always say "controlling" the FDR rather than "estimating" it.) There is no approximation involved -- it is an exact mathematical result.

Thanks, that's an important clarification. I'll edit my answer.

3.0 years ago by
Gordon Smyth35k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth35k wrote:

edgeR and DESeq both use the same method: specifically, they call the "BH" method of the p.adjust() function in the stats package.

The FDR (aka adjusted p-value) given for each rank position in the topTags table is an upper bound for the expected FDR up to that point in the table. One does not need to know the true state of the null hypotheses to compute the expected FDR. The actual (unknown) FDR in any specific situation can be greater or less than the FDR value given, but the theorem guarantees that on average it will be less.
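For illustration, the adjustment performed by p.adjust(p, method = "BH") can be sketched as follows (a plain-Python re-implementation for exposition, not the stats package source):

```python
def bh_adjust(pvals):
    """BH adjusted p-values: cumulative minimum of p * m / rank,
    walking from the largest p-value down to the smallest."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by increasing p
    adj = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adj[i] = running_min  # monotone: never increases as rank decreases
    return adj
```

For example, `bh_adjust([0.01, 0.02, 0.03, 0.5])` gives `[0.04, 0.04, 0.04, 0.5]`, matching `p.adjust(c(0.01, 0.02, 0.03, 0.5), method = "BH")` in R.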

The BH adjusted p-values also have another, looser (unpublished) empirical Bayes interpretation. They give a conservative estimate of the posterior probability that the null hypothesis is true, supposing that one chooses at random either the specified gene or one above it in the top table; see:

Background: Benjamini and Hochberg (1995) defined the concept of FDR and created an algorithm to control the expected FDR below a specified level given a list of independent p-values. When I was writing the limma package in 2002, I re-interpreted Benjamini and Hochberg's algorithm in terms of adjusted p-values. Specifically, I considered each gene to define a list of p-values, being the p-values for all genes above and including that gene in the top table. Then I defined the adjusted p-value to be the smallest pre-specified FDR for which the gene list would be considered significant according to the BH algorithm. I contributed the code to the R project as part of the p.adjust function, and that's what we use now.
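That "smallest pre-specified FDR" definition can be illustrated with the step-up form of the BH algorithm (again a hedged Python sketch, not the p.adjust source): a gene is rejected at level alpha exactly when its BH adjusted p-value is at most alpha.

```python
def bh_rejects(pvals, alpha):
    """Step-up BH at level alpha: reject the k smallest p-values, where k is
    the largest rank r with p_(r) <= r * alpha / m (0 if no rank qualifies)."""
    m = len(pvals)
    ordered = sorted(pvals)
    k = max((r for r in range(1, m + 1) if ordered[r - 1] <= r * alpha / m),
            default=0)
    threshold = ordered[k - 1] if k else -1.0  # reject nothing when k == 0
    return [p <= threshold for p in pvals]
```

With p-values `[0.01, 0.02, 0.03, 0.5]` the first three genes are rejected at alpha = 0.05 but not at alpha = 0.039; their BH adjusted p-value is 0.04, the smallest alpha at which they enter the significant list.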