Question

Opinions on DESeqDataSetFromMatrix rowSums minimum for ChIP and/or ATAC-seq

0

Entering edit mode

rbronste ▴ 60

@rbronste-12189

Last seen 4.4 years ago

So Im hoping to get some opinions on what people generally set as the rowSums minimum as follows:

dds <- dds[ rowSums(counts(dds)) > 10, ]

I obtain the count matrices from ChIP-seq and ATAC-seq experiments through DiffBind and then use DESeq2 for the differential binding analysis. I guess its a bit of an arbitrary cutoff, but wondering if there is any additional experimental information people employ to set this boundary? Thanks!

deseq2 atac-seq chipseq differential binding analysis • 1.2k views

ADD COMMENT • link updated 6.4 years ago by Aaron Lun ★ 28k • written 6.4 years ago by rbronste ▴ 60

score 2 · Accepted Answer · 2017-12-14

2

Entering edit mode

Michael Love 41k

@mikelove

Last seen 14 hours ago

United States

I don't analyze much ChIP-seq so I'll let others weigh in on that. I don't think you will be losing many informative features with this filter, e.g. if you have 3 vs 3 samples that would be 10 reads distributed over 6 samples, so the best looking feature might be [3 3 4] vs [0 0 0]. It's hard to imagine even with incredibly minimal technical variability (e.g. UMI dedup) that this feature would survive multiple test correction. So it seems a "safe" filter in terms of false negatives.

ADD COMMENT • link 6.4 years ago Michael Love 41k

0

Entering edit mode

Thanks for the info. Another question I had was how to do it on a cell by cell basis, limiting it to > 10 in all cells of each triplet in the matrix.

ADD REPLY • link 6.4 years ago rbronste ▴ 60

score 2 · Accepted Answer · 2017-12-14

I'll chip in on the ChIP-seq part of it (ho, ho, ho).

I use the background level of enrichment to guide the filter threshold for the average count. There's a number of ways to do this, but the easiest is to cut your genome into large bins, count the number of reads in each bin, calculate the median of the average counts across all bins and then scale that average count based on the ratio of the bin width to your median peak width. This gives you a reasonable indication of what average count you might expect in the absence of binding, assuming most genomic locations are not bound.

There are some more sophisticated strategies using local or control-based background estimates, but I find that a global estimate works pretty well. You can increase the stringency of the filter by asking for all regions that are some X-fold change above the expected background coverage, where a "good" value of X might be 3-5, depending on the strength of enrichment for your protein target (e.g., higher for TFs, lower for histone marks).

Honestly, though, any half-decent peak caller should not be calling peaks in regions that have very low coverage, so I would be surprised if there is any filtering that needs to be done at this stage. In other words, the peak-caller should have implicitly filtered for you (though there are difficult challenges with making this implicit filtering compatible with differential analysis tools like DESeq(2) and edgeR).