Does DESeq2 need to correct for overlapping regions?
1
0
Entering edit mode
schwarzl • 0
@schwarzl-12811
Last seen 4.4 years ago

Dear Bioconductor Support Forum,

we are using DESeq2 to call differentially expressed window for overlapping sliding window on transcriptomic regions (50nt, 20nt steps for testing, this can vary). Because the different properties of our data (eCLIP), csaw is not optimal for multiple reason (different preprocessing is needed, winding up in 1nt positions rather than reads, high-background signal of the data which leads to long stretches for SIME correction (although you can restrict that, this is suboptimal), RNA-binding proteins have different binding modes to transcription factors so the SIMES multiple hypothesis correction may not be appropriate, we do not want to test vs extended backgrounds (like windows + flanks) because transcriptomic backgrounds are so variable, etc). Therefore we see the need for a specialised package for iCLIP/eCLIP data instead of using the great csaw package.

Eventually our pre-processing will result in a count matrix having counts for each window for each replicate for samples (IPs) and input controls (SMI). We also have location information for all windows.

Recently, we got feedback and ended up in two approaches and would appreciate your feedback what you think is the right thing to do, or if you have any ideas or comments:

Approach 1) In short, we estimate size factors, perform DESeq2 dispersion estimation and wald test. We modified DESeq2 results function to filter windows with negative wald stat values (ie ve log2foldchanges) and do a right-tailed test on the remaining windows. For the overlapping windows, we are currently using Bonferroni to correct for overlaps (family wise error). In this case, we consider all windows around that overlaps with the current window, irrespective of whether the window is filtered out in the previous step or not. Then we perform multiple hypothesis correction on all corrected p-values. Significant windows will be reported, as well as we provide a function to merge neighbouring windows to enriched regions.

With this approach we can identify smaller binding sites, as well as long big stretches of binding. We do not see any unspecific length biases from the target (longer RNAs are not preferred). We could validate various hits of the results in the lab of three independent proteins and are very happy and confident about results so far. Of course this is a rather a conservative approach.

We got the feedback that DESeq2 and IHW can handle overlaps, as long as the covariate is independent, like baseMean and should try

Approach 2)
DESeq2 with right-tailed testing, but without any correction for overlaps. IHW for multiple hypothesis correction and again, the possibility to merge significant windows.

Problem: with this approach we do see slight length biases (longer RNAs are more targets) - however less than current peak callers. Naturally, the same top targets which we validated are found, but more windows are significant. It seems that this approach accumulates false positives.

We would like feedback if you think approach 1, 2 or none is correct and especially would appreciate ideas or comments.

Eventually we will work an approach with wavelets but for now approach 1 or 2 are showing promising results compared to current methods. Current gold standards for eCLIP analysis do not use replicates or just start using them and most of them do not use the control properly.

Thank you very much and we are happy to hear your comments.

deseq2 ihw • 814 views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 14 hours ago
United States

I haven't done much work on DESeq2 for overlapping features, and defer to csaw for users who want to take this approach. I'd tend to go with the approaches recommended by someone who looked closely at the performance of the framework and performed extensive benchmarking on simulation and experimental data.

I don't see anything wrong per se with using DESeq2 to generate a test statistic on overlapping windows, e.g. people use t-test for this as well, but then you do have to figure out a way to perform region finding + multiple test correction, and usually the most interpretable type of correction is one producing a per-region FDR. Have you seen this paper:

https://www.ncbi.nlm.nih.gov/pubmed/29481604

ADD COMMENT
0
Entering edit mode

Hi Mike, thanks so much for your answer. Thanks, I already had a look at that and had a chat with Simon recently. There are certain reasons why a window approach would be advantageous, because of different cross-linking and binding dynamics. We will try do simulate data and do permutation test to test for behaviour of our approaches. Thanks so much for your input.

ADD REPLY

Login before adding your answer.

Traffic: 748 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6