csaw: when not to dedup
@asifzubair-6770

The csaw workflow and vignette mention that duplicate removal is not always advisable. I couldn't quite understand why that would be. Shouldn't duplicate removal help prevent false positives?

csaw dedup • 1.1k views
Aaron Lun ★ 28k
@alun

Not in the case of differential binding (DB) analyses. Consider a large peak that is bound with the same intensity in two libraries. Suppose one library is sequenced at twice the depth, so the peak is naturally twice as large in that library as in the other. When you remove duplicates, you effectively apply a hard cap to the coverage of each peak, because each base position can retain only one forward and one reverse read. For a peak that spans 200 bp, for example, the cap would be 200 reads on each strand. When you normalize by library size, the capped coverage becomes 200/1 in the smaller library and 200/2 in the larger library. This introduces a false positive with a two-fold difference in coverage, which would not have appeared had you left the duplicates alone.
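The arithmetic can be sketched in a few lines of Python. This is only an illustration, not csaw code: the read counts (1000 and 2000) and the helper `normalized_coverage` are hypothetical, and the cap of 200 is the per-strand figure from the 200 bp example.

```python
def normalized_coverage(true_reads, lib_size_factor, cap=None):
    """Reads per unit of library size; `cap` mimics duplicate removal."""
    observed = true_reads if cap is None else min(true_reads, cap)
    return observed / lib_size_factor

# Hypothetical peak with equal binding; library B is sequenced at 2x depth.
cap = 200                      # 200 bp peak, one read per position per strand
reads_a, reads_b = 1000, 2000  # assumed raw read counts on one strand
size_a, size_b = 1.0, 2.0      # relative library sizes

no_dedup = (normalized_coverage(reads_a, size_a),
            normalized_coverage(reads_b, size_b))
dedup = (normalized_coverage(reads_a, size_a, cap),
         normalized_coverage(reads_b, size_b, cap))

print("without dedup:", no_dedup)  # equal normalized coverage
print("with dedup:   ", dedup)     # 200/1 versus 200/2
```

Without the cap the two libraries agree after normalization; with it, the smaller library appears to have twice the coverage.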

The other objection to duplicate removal is that it caps your power. Peaks that are strongly DB can't accumulate any more evidence once the coverage is capped in one of the conditions. Indeed, peaks that are DB and strongly bound in both conditions will have the same cap applied to both conditions, so they won't show up as DB at all. This is not ideal, especially if you end up missing large changes in the binding profile.
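A similarly rough sketch shows the loss of power. The counts and the `log2_fold_change` helper are made up for illustration, library sizes are assumed equal, and the cap of 200 reuses the figure from the earlier example.

```python
import math

def log2_fold_change(reads_x, reads_y, cap=None):
    """Log2 ratio of two counts; `cap` mimics duplicate removal."""
    if cap is not None:
        reads_x, reads_y = min(reads_x, cap), min(reads_y, cap)
    return math.log2(reads_y / reads_x)

cap = 200  # per-strand cap for a hypothetical 200 bp peak

# Strongly bound in both conditions, with a true four-fold change.
lfc_raw = log2_fold_change(1000, 4000)
lfc_capped = log2_fold_change(1000, 4000, cap=cap)

# Weakly bound in one condition: only the strong side hits the cap.
lfc_partial_raw = log2_fold_change(100, 4000)
lfc_partial_capped = log2_fold_change(100, 4000, cap=cap)

print(lfc_raw, lfc_capped)                  # the change vanishes entirely
print(lfc_partial_raw, lfc_partial_capped)  # the change is shrunk
```

When both conditions exceed the cap, the fold change collapses to zero; when only one does, the change survives but is understated.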

Now, going back to the issue of false positives: the reason for duplicate removal is to avoid drawing inferences at read stacks caused by PCR duplicates. However, for DB analyses, protection against stacks is automatically provided by variance modelling in replicated experiments. You're unlikely to get random PCR duplicates affecting the same location on the genome in all of your replicates, so any window with duplication in only one replicate will have an inflated variance. This reduces the significance of DB and ensures that those windows don't dominate the final results. In practice, some stacks still show up around problematic regions (e.g., microsatellites); these can generally be discarded via blacklisting, or just ignored during interpretation of the results.
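The variance argument can be illustrated with a toy comparison. This is a deliberately simplified sketch: it uses a Welch t-statistic on made-up window counts as a stand-in for the count-based models used in a real DB analysis, and the `welch_t` helper and all numbers are assumptions.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_x, group_y):
    """Welch t-statistic: difference in means over its standard error."""
    se = sqrt(variance(group_x) / len(group_x)
              + variance(group_y) / len(group_y))
    return (mean(group_y) - mean(group_x)) / se

# Hypothetical window counts across three replicates per condition.
control = [50, 52, 48]
stack_in_one_rep = [51, 49, 500]  # PCR stack inflates a single replicate
genuine_db = [150, 148, 152]      # consistent increase in every replicate

t_stack = welch_t(control, stack_in_one_rep)
t_genuine = welch_t(control, genuine_db)

print(f"stack in one replicate: t = {t_stack:.2f}")
print(f"genuine DB:             t = {t_genuine:.2f}")
```

The window with the stack has the larger mean difference, but its within-group variance is so large that the statistic stays small, whereas the consistent change yields a large one.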

In general, I would reserve PCR duplicate removal for low-quality data, or for experimental designs where I don't have replicates (and hence the protection above doesn't work). I wouldn't do it as part of a routine analysis.


Thank you for your detailed response! Would this reason not to dedup apply mainly to ChIP-seq, where you would expect reads to be shorter, as opposed to ATAC-seq? I can understand how, in theory, some capping still occurs, but even in a ~200 bp peak there could potentially be many more than 200 reads mapping to it if the required overlap is small enough. Maybe I'm misunderstanding how the reads are counted for each peak, or something else more obvious.
