8 months ago by
CRUK, Cambridge, UK
If you want to remove duplicates, you need to mark them in the BAM files before running
dba.count(), regardless of what bUseSummarizeOverlaps is set to. If duplicates are not marked,
no duplicates will be identified, even if you set bRemoveDuplicates=TRUE.
However, for differential analysis, we strongly recommend not removing duplicates. In a well-prepared ChIP-seq experiment, most of the duplicate reads will be "true" duplicates indicating high levels of enrichment. The degree to which this is true will depend on how the sequencing is done (single-end vs paired-end, read length, number of reads). If you remove duplicates, you are clipping the signal, so you might be unable to detect, for example, a difference between one sample group where 30% of the DNA is bound at a particular interval and one where 90% of the DNA is bound. It also helps to use blacklists and greylists, as many problematic duplicates are located in the blacklisted intervals.
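To make the clipping effect concrete, here is a rough back-of-the-envelope sketch in Python (not DiffBind code). It assumes single-end reads, where a duplicate is any read sharing a start position and strand with another, and it assumes reads land uniformly at random within the interval, which real ChIP reads do not. The interval width and read depth are made-up numbers chosen only for illustration.

```python
def expected_unique_reads(n_reads: int, n_slots: int) -> float:
    """Expected number of distinct (start, strand) slots occupied when
    n_reads reads are placed uniformly at random over n_slots slots."""
    return n_slots * (1.0 - (1.0 - 1.0 / n_slots) ** n_reads)


def counts_for(bound_fraction: float, n_slots: int, reads_at_full: int):
    """Return (raw, deduplicated) read counts for one sample group.

    Assumption: the raw read count scales linearly with the fraction
    of DNA bound at the interval.
    """
    raw = round(bound_fraction * reads_at_full)
    return raw, expected_unique_reads(raw, n_slots)


# Hypothetical interval: 200 bp wide, two strands -> 400 possible
# distinct single-end read starts. Hypothetical depth: 3000 reads
# would fall in the interval if 100% of the DNA were bound.
N_SLOTS = 2 * 200
READS_AT_FULL = 3000

raw_low, dedup_low = counts_for(0.30, N_SLOTS, READS_AT_FULL)
raw_high, dedup_high = counts_for(0.90, N_SLOTS, READS_AT_FULL)

raw_ratio = raw_high / raw_low
dedup_ratio = dedup_high / dedup_low

print(f"raw counts:          {raw_low} vs {raw_high} (ratio {raw_ratio:.2f})")
print(f"deduplicated counts: {dedup_low:.0f} vs {dedup_high:.0f} (ratio {dedup_ratio:.2f})")
```

Under these assumptions the raw counts keep the threefold difference between the two groups, while the deduplicated counts both saturate near the 400 available positions, so the ratio falls close to 1 and the difference becomes very hard to detect.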
If your ChIP reads have a high proportion of duplicates (say, greater than 50%), there may be issues with the ChIP, leaving more artifactual duplicates, which you may be better off removing (after marking them in the BAM).