Question: Opinions on DESeqDataSetFromMatrix rowSums minimum for ChIP and/or ATAC-seq
0
gravatar for rbronste
16 months ago by
rbronste60
rbronste60 wrote:

So Im hoping to get some opinions on what people generally set as the rowSums minimum as follows:

dds <- dds[ rowSums(counts(dds)) > 10, ]

I obtain the count matrices from ChIP-seq and ATAC-seq experiments through DiffBind and then use DESeq2 for the differential binding analysis. I guess its a bit of an arbitrary cutoff, but wondering if there is any additional experimental information people employ to set this boundary? Thanks! 

ADD COMMENTlink modified 16 months ago by Aaron Lun23k • written 16 months ago by rbronste60
Answer: Opinions on DESeqDataSetFromMatrix rowSums minimum for ChIP and/or ATAC-seq
2
gravatar for Michael Love
16 months ago by
Michael Love23k
United States
Michael Love23k wrote:

I don't analyze much ChIP-seq so I'll let others weigh in on that. I don't think you will be losing many informative features with this filter, e.g. if you have 3 vs 3 samples that would be 10 reads distributed over 6 samples, so the best looking feature might be [3 3 4] vs [0 0 0]. It's hard to imagine even with incredibly minimal technical variability (e.g. UMI dedup) that this feature would survive multiple test correction. So it seems a "safe" filter in terms of false negatives.

ADD COMMENTlink written 16 months ago by Michael Love23k

Thanks for the info. Another question I had was how to do it on a cell by cell basis, limiting it to > 10 in all cells of each triplet in the matrix.

ADD REPLYlink written 16 months ago by rbronste60
Answer: Opinions on DESeqDataSetFromMatrix rowSums minimum for ChIP and/or ATAC-seq
2
gravatar for Aaron Lun
16 months ago by
Aaron Lun23k
Cambridge, United Kingdom
Aaron Lun23k wrote:

I'll chip in on the ChIP-seq part of it (ho, ho, ho).

I use the background level of enrichment to guide the filter threshold for the average count. There's a number of ways to do this, but the easiest is to cut your genome into large bins, count the number of reads in each bin, calculate the median of the average counts across all bins and then scale that average count based on the ratio of the bin width to your median peak width. This gives you a reasonable indication of what average count you might expect in the absence of binding, assuming most genomic locations are not bound.

There are some more sophisticated strategies using local or control-based background estimates, but I find that a global estimate works pretty well. You can increase the stringency of the filter by asking for all regions that are some X-fold change above the expected background coverage, where a "good" value of X might be 3-5, depending on the strength of enrichment for your protein target (e.g., higher for TFs, lower for histone marks).

Honestly, though, any half-decent peak caller should not be calling peaks in regions that have very low coverage, so I would be surprised if there is any filtering that needs to be done at this stage. In other words, the peak-caller should have implicitly filtered for you (though there are difficult challenges with making this implicit filtering compatible with differential analysis tools like DESeq(2) and edgeR).

ADD COMMENTlink modified 16 months ago • written 16 months ago by Aaron Lun23k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 148 users visited in the last hour