The support.bioconductor.org editor has been updated to markdown! Please see more info at: Tutorial: Updated Support Site Editor

Question: Opinions on DESeqDataSetFromMatrix rowSums minimum for ChIP and/or ATAC-seq
0
gravatar for rbronste
14 months ago by
rbronste60
rbronste60 wrote:

So Im hoping to get some opinions on what people generally set as the rowSums minimum as follows:

dds <- dds[ rowSums(counts(dds)) > 10, ]

I obtain the count matrices from ChIP-seq and ATAC-seq experiments through DiffBind and then use DESeq2 for the differential binding analysis. I guess its a bit of an arbitrary cutoff, but wondering if there is any additional experimental information people employ to set this boundary? Thanks! 

ADD COMMENTlink modified 14 months ago by Aaron Lun22k • written 14 months ago by rbronste60
Answer: Opinions on DESeqDataSetFromMatrix rowSums minimum for ChIP and/or ATAC-seq
2
gravatar for Michael Love
14 months ago by
Michael Love21k
United States
Michael Love21k wrote:

I don't analyze much ChIP-seq so I'll let others weigh in on that. I don't think you will be losing many informative features with this filter, e.g. if you have 3 vs 3 samples that would be 10 reads distributed over 6 samples, so the best looking feature might be [3 3 4] vs [0 0 0]. It's hard to imagine even with incredibly minimal technical variability (e.g. UMI dedup) that this feature would survive multiple test correction. So it seems a "safe" filter in terms of false negatives.

ADD COMMENTlink written 14 months ago by Michael Love21k

Thanks for the info. Another question I had was how to do it on a cell by cell basis, limiting it to > 10 in all cells of each triplet in the matrix.

ADD REPLYlink written 14 months ago by rbronste60
Answer: Opinions on DESeqDataSetFromMatrix rowSums minimum for ChIP and/or ATAC-seq
2
gravatar for Aaron Lun
14 months ago by
Aaron Lun22k
Cambridge, United Kingdom
Aaron Lun22k wrote:

I'll chip in on the ChIP-seq part of it (ho, ho, ho).

I use the background level of enrichment to guide the filter threshold for the average count. There's a number of ways to do this, but the easiest is to cut your genome into large bins, count the number of reads in each bin, calculate the median of the average counts across all bins and then scale that average count based on the ratio of the bin width to your median peak width. This gives you a reasonable indication of what average count you might expect in the absence of binding, assuming most genomic locations are not bound.

There are some more sophisticated strategies using local or control-based background estimates, but I find that a global estimate works pretty well. You can increase the stringency of the filter by asking for all regions that are some X-fold change above the expected background coverage, where a "good" value of X might be 3-5, depending on the strength of enrichment for your protein target (e.g., higher for TFs, lower for histone marks).

Honestly, though, any half-decent peak caller should not be calling peaks in regions that have very low coverage, so I would be surprised if there is any filtering that needs to be done at this stage. In other words, the peak-caller should have implicitly filtered for you (though there are difficult challenges with making this implicit filtering compatible with differential analysis tools like DESeq(2) and edgeR).

ADD COMMENTlink modified 14 months ago • written 14 months ago by Aaron Lun22k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 251 users visited in the last hour