DiffBind: dba.report reports odd FDR values
1
1
Entering edit mode
biotom ▴ 10
@biotom-7371
Last seen 7.8 years ago
Germany

Dear all,

I am analysing ChIP-Seq data and used DiffBind to compare 3 wildtype vs 3 mutant samples. When calling dba.report(myDBAObject, th=1, bUsePval=TRUE) I receive odd FDR values, i.e. it seems they do not match the p-values since there are several peaks with quite low p-values having higher FDR-values than peaks with higher p-values.

When removing the bUsePval=TRUE option the FDR values seems to be OK again but the resulting list contains about four times less peaks than the initial one.

Thank you for your help.

 

Kind regards,

Tom

 

diffbind • 2.1k views
ADD COMMENT
1
Entering edit mode
Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 15 days ago
Cambridge, UK

Hi Tom-

Two things are going on here.

The interesting one concerns how DESeq2 computes multiple-testing adjusted confidence scores (called FDR in DiffBind). An "independent filter" is applied (using the genefilter package) such that only a subset of intervals will be eligible for multiple testing correction. The remainder of intervals will not have an adjusted p-value calculated (reporting a value of NA). In DiffBind, the NA values are converted to FDR values of 1.0 for consistency with edgeR (which uses a more standard multiple testing correction on all the p-values). I refer you to the man page for the results() function in DESeq2, with the relevant passage as follows:

  • "By default, independent filtering is performed to select a set of genes for multiple test correction which maximizes the number of adjusted p-values less than a given critical value alpha (by default 0.1). See the reference in this man page for details on independent filtering. The filter used for maximizing the number of rejections is the mean of normalized counts for all samples in the dataset. Several arguments from the filtered_p function of the genefilter package (used within the results function) are provided here to control the independent filtering behavior. In DESeq2 version >= 1.10, the threshold that is chosen is the lowest quantile of the filter for which the number of rejections is close to the peak of a curve fit to the number of rejections over the filter quantiles. 'Close to' is defined as within 1 residual standard deviation. The adjusted p-values for the genes which do not pass the filter threshold are set to NA."

In practice, DESeq2 can exclude sites with low p-values from being assigned an adjusted p-value while reporting one for sites with higher p-values. If you don't want to use DESeq2's filtering, you can apply a standard multiple testing correction such as p.adjust() to the full set of p-values. 

The second issue is an inconsistency with how thresholds are treated in DiffBind. I have checked in a fix (v2.0.3) to address this, so that sites will always be considered significantly differentially bound if their confidence score is less than or equal to the threshold (rather than only less than in certain cases). After this fix, you should get the same number of results (all the sites in the binding matrix) from dba.report() when th=1, regardless of how bUsePval is set.

ADD COMMENT
0
Entering edit mode

In practice, DESeq2 can exclude sites with low p-values from being assigned an adjusted p-value while reporting one for sites with higher p-values

I think you accidentally swapped your "low" and "higher" in the above sentence?

I mean, intuitively we always want to include the sites with the lowest of p-values, right?

ADD REPLY
0
Entering edit mode

No, this is what I meant. Stepping through the call to DESeq2::results(), I definitely see sites with lower p-values being excluded (adjusted p-value is NA) while ones with higher p-values are included in the multiple testing correction calculation. As DiffBind sorts the results by adjusted p-value (FDR), this is what is leading to the behavior you are seeing, where deep down the list the p-values get lower and the FDR values are 1. 

ADD REPLY
0
Entering edit mode

Weird -- this just doesn't square with intuition, but perhaps I'm not grasping the entirety of the situation ... I feel like this statement doesn't make sense "on its face", however if these low p-value genes (regions here, I guess) are being set to NA because of their low baseMean value, then that makes sense ... and maybe that's what you're saying.

I'm going to back away.  I was just doing some rubbernecking while "driving by", which most often just adds more confusion than clarity. Sorry for that :-)

ADD REPLY
0
Entering edit mode

It does seems to be related to baseMean values. Here's an example:

> deseq2.results <- DESeq2::results(deseq2.dataset)

# Summary of baseMean values for sites that do not pass filtering
> summary(deseq2.results$baseMean[is.na(deseq2.results$padj)])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  6.461  34.610  45.120  44.950  55.760 371.200 

# Summary of baseMean values for sites that pass filtering
> summary(deseq2.results$baseMean[!is.na(deseq2.results$padj)])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     70      76      86     646     105 7187000

The sites filtered out have much lower baseMean scores, but there is some overlap.

ADD REPLY

Login before adding your answer.

Traffic: 684 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6