Question: Opinions on DESeqDataSetFromMatrix rowSums minimum for ChIP and/or ATAC-seq
0
22 months ago by
rbronste60
rbronste60 wrote:

So Im hoping to get some opinions on what people generally set as the rowSums minimum as follows:

dds <- dds[ rowSums(counts(dds)) > 10, ]

I obtain the count matrices from ChIP-seq and ATAC-seq experiments through DiffBind and then use DESeq2 for the differential binding analysis. I guess its a bit of an arbitrary cutoff, but wondering if there is any additional experimental information people employ to set this boundary? Thanks!

modified 22 months ago by Aaron Lun25k • written 22 months ago by rbronste60
Answer: Opinions on DESeqDataSetFromMatrix rowSums minimum for ChIP and/or ATAC-seq
2
22 months ago by
Michael Love25k
United States
Michael Love25k wrote:

I don't analyze much ChIP-seq so I'll let others weigh in on that. I don't think you will be losing many informative features with this filter, e.g. if you have 3 vs 3 samples that would be 10 reads distributed over 6 samples, so the best looking feature might be [3 3 4] vs [0 0 0]. It's hard to imagine even with incredibly minimal technical variability (e.g. UMI dedup) that this feature would survive multiple test correction. So it seems a "safe" filter in terms of false negatives.

Thanks for the info. Another question I had was how to do it on a cell by cell basis, limiting it to > 10 in all cells of each triplet in the matrix.

Answer: Opinions on DESeqDataSetFromMatrix rowSums minimum for ChIP and/or ATAC-seq
2
22 months ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

I'll chip in on the ChIP-seq part of it (ho, ho, ho).

I use the background level of enrichment to guide the filter threshold for the average count. There's a number of ways to do this, but the easiest is to cut your genome into large bins, count the number of reads in each bin, calculate the median of the average counts across all bins and then scale that average count based on the ratio of the bin width to your median peak width. This gives you a reasonable indication of what average count you might expect in the absence of binding, assuming most genomic locations are not bound.

There are some more sophisticated strategies using local or control-based background estimates, but I find that a global estimate works pretty well. You can increase the stringency of the filter by asking for all regions that are some X-fold change above the expected background coverage, where a "good" value of X might be 3-5, depending on the strength of enrichment for your protein target (e.g., higher for TFs, lower for histone marks).

Honestly, though, any half-decent peak caller should not be calling peaks in regions that have very low coverage, so I would be surprised if there is any filtering that needs to be done at this stage. In other words, the peak-caller should have implicitly filtered for you (though there are difficult challenges with making this implicit filtering compatible with differential analysis tools like DESeq(2) and edgeR).