Question

csaw: when not to dedup

asif.zubair • 0

@asifzubair-6770

Last seen 7.6 years ago

In the csaw workflow and vignette, it is mentioned that duplicate removal is not always advisable. I couldn't quite understand why that would be. Shouldn't duplicate removal help prevent false positives?

csaw dedup • 1.1k views

updated 2.8 years ago by uli.arb • written 8.1 years ago by asif.zubair

score 3 · Answer 1 · 2016-05-04

Not in the case of differential binding (DB) analyses. Consider a large peak that is bound with the same intensity in two libraries. Say one library is sequenced at twice the depth of the other, so the peak naturally contains twice as many reads in that library. When you remove duplicates, you effectively apply a hard cap to the coverage of each peak, because each base position can retain only one forward read and one reverse read. For a peak that spans 200 bp, for example, the cap would be 200 reads on each strand. When you normalize by library size, the capped coverage becomes 200/1 in the smaller library and 200/2 in the larger library. This introduces a false positive with a two-fold difference in coverage, which would not have appeared if you had left the duplicates in.
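To make the arithmetic concrete, here is a minimal sketch of that scenario. The raw read counts (500 and 1000 per strand) and the function name are hypothetical values chosen for illustration; only the 200-read cap and the 2x depth ratio come from the example above.

```python
# Hypothetical example: a 200 bp peak bound equally in two libraries,
# where library B is sequenced at twice the depth of library A.
CAP = 200  # after deduplication, at most one read per base position per strand

def normalized_coverage(raw_reads, relative_depth, dedup):
    """Per-strand coverage scaled by relative library size."""
    reads = min(raw_reads, CAP) if dedup else raw_reads
    return reads / relative_depth

# Assumed raw counts: equal binding, so counts scale with sequencing depth.
raw_a, depth_a = 500, 1
raw_b, depth_b = 1000, 2

fold_without_dedup = (normalized_coverage(raw_a, depth_a, dedup=False)
                      / normalized_coverage(raw_b, depth_b, dedup=False))
fold_with_dedup = (normalized_coverage(raw_a, depth_a, dedup=True)
                   / normalized_coverage(raw_b, depth_b, dedup=True))

print(fold_without_dedup)  # 1.0, no difference, as expected for equal binding
print(fold_with_dedup)     # 2.0, a spurious two-fold difference
```

The cap only matters when the raw counts exceed it, which is why the problem is concentrated at strongly bound peaks.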

The other objection to duplicate removal is that you cap your power. Peaks that are strongly DB can't accumulate any more evidence once the coverage is capped in one of the conditions. Indeed, peaks that are DB and strongly bound in both conditions will have the same cap applied to both conditions, so they won't show up as DB at all. This is not ideal, especially if you end up missing large changes in the binding profile.
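The loss of power can be sketched the same way. In the snippet below, the baseline read counts and the list of true fold changes are hypothetical; the 200-read cap is carried over from the 200 bp peak example.

```python
CAP = 200  # per-strand cap imposed by duplicate removal on a 200 bp peak

def observed_fold_change(baseline_reads, true_fold_change, cap=CAP):
    """Fold change seen after both conditions are capped by deduplication."""
    cond1 = min(baseline_reads, cap)
    cond2 = min(baseline_reads * true_fold_change, cap)
    return cond2 / cond1

# A weakly bound peak (50 reads at baseline): evidence stops growing at the cap.
weak = [observed_fold_change(50, fc) for fc in (2, 4, 8)]
# A strongly bound peak (300 reads at baseline): both conditions hit the cap.
strong = [observed_fold_change(300, fc) for fc in (2, 4, 8)]

print(weak)    # [2.0, 4.0, 4.0]
print(strong)  # [1.0, 1.0, 1.0]
```

For the weak peak, a true 8-fold change looks no larger than a 4-fold change; for the strong peak, every change disappears entirely.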

Now, going back to the issue of false positives: the reason for duplicate removal is to avoid drawing inferences at read stacks caused by PCR duplicates. However, for DB analyses, protection against stacks is automatically provided by variance modelling in replicated experiments. You're unlikely to get random PCR duplicates affecting the same location on the genome in all of your replicates, so any window with duplication in only one replicate will have an inflated variance. This reduces the significance of DB and ensures that those windows don't dominate the final results. In practice, some stacks still show up around problematic regions (e.g., microsatellites); these can generally be discarded via blacklisting, or just ignored during interpretation of the results.
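A rough illustration of how replication protects against stacks is below. This is only a sketch: it uses a Welch-style t-statistic as a simple stand-in for variance modelling, not the count-based model that csaw actually relies on, and all window counts are made up.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch-style t-statistic: difference in means over its standard error."""
    mean_diff = statistics.mean(group_b) - statistics.mean(group_a)
    std_err = math.sqrt(statistics.variance(group_a) / len(group_a)
                        + statistics.variance(group_b) / len(group_b))
    return mean_diff / std_err

# Hypothetical read counts for one window across three replicates per condition.
control = [50, 52, 48]
consistent_db = [150, 155, 145]  # genuine change, seen in every replicate
single_stack = [51, 49, 250]     # PCR stack in one replicate only

t_consistent = welch_t(control, consistent_db)
t_stack = welch_t(control, single_stack)

# The stack inflates the within-group variance, so its statistic is far weaker
# than that of the reproducible change, despite the large count in one replicate.
print(round(t_consistent, 1), round(t_stack, 1))
```

The same reasoning explains why the protection disappears without replicates: with one library per condition there is no within-group variance to inflate.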

In general, I would reserve PCR duplicate removal for low-quality data, or in experimental designs where I don't have replicates (and hence the protection above doesn't work). I wouldn't do it as part of a routine analysis.