Dear Bioconductor Support Forum,
we are using DESeq2 to call differentially expressed window for overlapping sliding window on transcriptomic regions (50nt, 20nt steps for testing, this can vary). Because the different properties of our data (eCLIP), csaw is not optimal for multiple reason (different preprocessing is needed, winding up in 1nt positions rather than reads, high-background signal of the data which leads to long stretches for SIME correction (although you can restrict that, this is suboptimal), RNA-binding proteins have different binding modes to transcription factors so the SIMES multiple hypothesis correction may not be appropriate, we do not want to test vs extended backgrounds (like windows + flanks) because transcriptomic backgrounds are so variable, etc). Therefore we see the need for a specialised package for iCLIP/eCLIP data instead of using the great csaw package.
Eventually our pre-processing will result in a count matrix having counts for each window for each replicate for samples (IPs) and input controls (SMI). We also have location information for all windows.
Recently, we got feedback and ended up in two approaches and would appreciate your feedback what you think is the right thing to do, or if you have any ideas or comments:
Approach 1) In short, we estimate size factors, perform DESeq2 dispersion estimation and wald test. We modified DESeq2 results function to filter windows with negative wald stat values (ie ve log2foldchanges) and do a right-tailed test on the remaining windows. For the overlapping windows, we are currently using Bonferroni to correct for overlaps (family wise error). In this case, we consider all windows around that overlaps with the current window, irrespective of whether the window is filtered out in the previous step or not. Then we perform multiple hypothesis correction on all corrected p-values. Significant windows will be reported, as well as we provide a function to merge neighbouring windows to enriched regions.
With this approach we can identify smaller binding sites, as well as long big stretches of binding. We do not see any unspecific length biases from the target (longer RNAs are not preferred). We could validate various hits of the results in the lab of three independent proteins and are very happy and confident about results so far. Of course this is a rather a conservative approach.
We got the feedback that DESeq2 and IHW can handle overlaps, as long as the covariate is independent, like baseMean and should try
DESeq2 with right-tailed testing, but without any correction for overlaps. IHW for multiple hypothesis correction and again, the possibility to merge significant windows.
Problem: with this approach we do see slight length biases (longer RNAs are more targets) - however less than current peak callers. Naturally, the same top targets which we validated are found, but more windows are significant. It seems that this approach accumulates false positives.
We would like feedback if you think approach 1, 2 or none is correct and especially would appreciate ideas or comments.
Eventually we will work an approach with wavelets but for now approach 1 or 2 are showing promising results compared to current methods. Current gold standards for eCLIP analysis do not use replicates or just start using them and most of them do not use the control properly.
Thank you very much and we are happy to hear your comments.