Question: Does DESeq2 need to correct for overlapping regions?
gravatar for schwarzl
11 weeks ago by
schwarzl0 wrote:

Dear Bioconductor Support Forum,

we are using DESeq2 to call differentially expressed window for overlapping sliding window on transcriptomic regions (50nt, 20nt steps for testing, this can vary). Because the different properties of our data (eCLIP), csaw is not optimal for multiple reason (different preprocessing is needed, winding up in 1nt positions rather than reads, high-background signal of the data which leads to long stretches for SIME correction (although you can restrict that, this is suboptimal), RNA-binding proteins have different binding modes to transcription factors so the SIMES multiple hypothesis correction may not be appropriate, we do not want to test vs extended backgrounds (like windows + flanks) because transcriptomic backgrounds are so variable, etc). Therefore we see the need for a specialised package for iCLIP/eCLIP data instead of using the great csaw package.

Eventually our pre-processing will result in a count matrix having counts for each window for each replicate for samples (IPs) and input controls (SMI). We also have location information for all windows.

Recently, we got feedback and ended up in two approaches and would appreciate your feedback what you think is the right thing to do, or if you have any ideas or comments:

Approach 1) In short, we estimate size factors, perform DESeq2 dispersion estimation and wald test. We modified DESeq2 results function to filter windows with negative wald stat values (ie ve log2foldchanges) and do a right-tailed test on the remaining windows. For the overlapping windows, we are currently using Bonferroni to correct for overlaps (family wise error). In this case, we consider all windows around that overlaps with the current window, irrespective of whether the window is filtered out in the previous step or not. Then we perform multiple hypothesis correction on all corrected p-values. Significant windows will be reported, as well as we provide a function to merge neighbouring windows to enriched regions.

With this approach we can identify smaller binding sites, as well as long big stretches of binding. We do not see any unspecific length biases from the target (longer RNAs are not preferred). We could validate various hits of the results in the lab of three independent proteins and are very happy and confident about results so far. Of course this is a rather a conservative approach.

We got the feedback that DESeq2 and IHW can handle overlaps, as long as the covariate is independent, like baseMean and should try

Approach 2)
DESeq2 with right-tailed testing, but without any correction for overlaps. IHW for multiple hypothesis correction and again, the possibility to merge significant windows.

Problem: with this approach we do see slight length biases (longer RNAs are more targets) - however less than current peak callers. Naturally, the same top targets which we validated are found, but more windows are significant. It seems that this approach accumulates false positives.

We would like feedback if you think approach 1, 2 or none is correct and especially would appreciate ideas or comments.

Eventually we will work an approach with wavelets but for now approach 1 or 2 are showing promising results compared to current methods. Current gold standards for eCLIP analysis do not use replicates or just start using them and most of them do not use the control properly.

Thank you very much and we are happy to hear your comments.

deseq2 ihw • 108 views
ADD COMMENTlink modified 10 weeks ago by Michael Love26k • written 11 weeks ago by schwarzl0
Answer: Does DESeq2 need to correct for overlapping regions?
gravatar for Michael Love
10 weeks ago by
Michael Love26k
United States
Michael Love26k wrote:

I haven't done much work on DESeq2 for overlapping features, and defer to csaw for users who want to take this approach. I'd tend to go with the approaches recommended by someone who looked closely at the performance of the framework and performed extensive benchmarking on simulation and experimental data.

I don't see anything wrong per se with using DESeq2 to generate a test statistic on overlapping windows, e.g. people use t-test for this as well, but then you do have to figure out a way to perform region finding + multiple test correction, and usually the most interpretable type of correction is one producing a per-region FDR. Have you seen this paper:

ADD COMMENTlink written 10 weeks ago by Michael Love26k

Hi Mike, thanks so much for your answer. Thanks, I already had a look at that and had a chat with Simon recently. There are certain reasons why a window approach would be advantageous, because of different cross-linking and binding dynamics. We will try do simulate data and do permutation test to test for behaviour of our approaches. Thanks so much for your input.

ADD REPLYlink written 9 weeks ago by schwarzl0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 314 users visited in the last hour