I've successfully run csaw and identified windows with significantly different ChIP-seq enrichment across my experimental conditions. Instead of producing the genome browser-like plots outlined in the user manual, I would simply like to export the total (or ideally, mean per base) number of reads, scaled by the normalization factors, for each input sample so that I can represent the signal over differential windows as a heatmap. Perhaps unsurprisingly, simply writing out a BED file of the significant window coordinates, then running BEDtools to compute coverage over those windows, then scaling by sequencing depth isn't necessarily producing the same answer. In other words, I had thousands of windows that csaw called as having increased enrichment, but my manual scaling of the read counts by sequencing depth over those windows actually shows the opposite trend. Perhaps instead of manually scaling by sequencing depth like I've been doing, I could simply scale by the csaw scaling factors, but if there were a way to directly export the normalized counts exactly as csaw saw them, that would save a few steps.
Here's the code that I've been running:
# Initialize
bam.files <- c("x1.bam","x2.bam","y1.bam","y2.bam") design <- model.matrix(~factor(c('x', 'x', 'y', 'y'))) colnames(design) <- c("intercept", "cell.type") # Load in data from BAM files. param <- readParam(minq=50) data <- windowCounts(bam.files, ext=250, width=150, param=param) # Filter based on union set of peak calls from MACS2 peaks=import("unionPeaks.bed",format="bed") suppressWarnings(keep <- overlapsAny(rowRanges(data), peaks)) filtered.data <- data[keep,] # Calculate normalization factors. binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param) normfacs <- normOffsets(binned) filtered.data$norm.factors <- normfacs # Identify DB windows. y <- asDGEList(filtered.data, norm.factors=normfacs) y <- estimateDisp(y, design) fit <- glmQLFit(y, design, robust=TRUE) results <- glmQLFTest(fit) rowData(filtered.data) <- cbind(rowData(filtered.data), results$table) # Correct for multiple testing. merged <- mergeWindows(rowRanges(filtered.data), tol=1000L,max.width=2000L) tabcom <- combineTests(merged$id, results$table) # Write out results is.sig <- tabcom$FDR <= 0.05 tabbest <- getBestTest(merged$id, results$table) out.ranges <- merged$region elementMetadata(out.ranges) <- data.frame(tabcom,best.pos=mid(ranges(rowRanges(filtered.data[tabbest$best]))),best.logFC=tabbest$logFC) simplified <- out.ranges[is.sig] simplified$score <- -10*log10(simplified$FDR) export(con="csaw_results.bed", object=simplified) is.sig.pos <- (tabbest$logFC > 0)[is.sig] print(summary(is.sig.pos))
Note that, if you're going to use MACS for peak calling, we don't recommend taking the union of peaks from multiple samples as this biases the differential testing (https://doi.org/10.1093/nar/gku351). Rather, it is better to pool all samples together and run MACS on the resulting file, which ensures that the peak calls are independent from the differential test statistic.
I've been calling peaks relative to input controls, with biological duplicate on everything. So in other words I have
So you would concatenate all four IPs and call peaks relative to the concatenation of the four input controls? While that may control for Type I error rates in csaw, does not not cause upstream issues for the peak calls themselves?
Yes, I would concatenate all IPs, assuming you're not using csaw to compare to the inputs (in which case you'd have to concatenate the inputs as well). What specific issues for the peak calls are you referring to?
In any case, this is the reason why csaw uses windows in the first place - to avoid the difficulties in determining how to perform peak calling in a manner that does not bias the DB analysis. Note that the same considerations are present in any DB analysis; this is not surprising, as the downstream analysis depends heavily on the nature of the provided features.
To illustrate, consider what happens near the peak-calling "boundary", i.e., when a peak is just strong enough to be retained in the final set of features. When you take the union of peak calls, you will increase the proportion of regions that have high signal in only one sample, whereas regions that have consistently low signal will be depleted. In your particular experimental design, this ends up inflating the dispersion estimates as the enriched regions will have stronger differences between replicates. (In fact, this can lead to some rather toxic interactions with the empirical Bayes shrinkage in edgeR, where the dispersion estimates for all other regions are artificially pulled up, thereby reducing power; while the inflated dispersion estimates for these enriched regions are artificially pulled down, thereby increasing the false positive rate.)
One could argue that this is a subtle effect that only affects a few genomic regions. However, in ChIP-seq, few regions are bound at high abundance, and most regions are weakly or not bound. This means that the number of regions at the peak-calling boundary is actually quite high compared to the number of called regions, sufficient to interfere with edgeR.
My gut tells me that combining 4 samples will hinder your power to detect those "boundary" type peaks to begin with. But worse, "boundary" in this instance may mean not just weaker sites, but also those that exhibit meaningful variability across experimental conditions. In my experience, adding more noise to the system (unsurprisingly) has a negative effect on the resulting peaks. In a quick calculation for one of my sample sets, even though ~95% of the peaks called for any one of the 4 samples in the bulleted list above were present in the 4-way-concatenated peak set, that still leaves me with ~15k fewer regions in total that I would have otherwise obtained by taking the union set, which is a lot. I can't yet speak to their homogeneity (or lack thereof), nor do I have summary statistics on these yet.
So the concatenation method you describe may focus the differential analysis on the most robust, ubiquitous, less-variable regions, and thus not overly inflate the edgeR dispersion estimates. If we assume, though, that some non-zero number of those 15k regions exhibit real biological variability (ie across conditions) and thus were missed when everything got thrown together for peak-calling, I'm curious how that doesn't result in an inflation of Type II error, since those variable regions would never even have the chance to be considered in the differential analysis. I don't see a discussion of false negative rates in the paper you linked, but I'd be interested to see that
If you don't want to throw out potential DB regions, reduce the stringency of your peak calls; this is a consideration that is independent of the approach you are using to consolidate the peaks. In fact, the thresholds with the union approach are not comparable to the threshold of the concatenated approach, and need to be adjusted to obtain the same effective stringency if you want to compare the filter strategies. Of course, retention of more regions is not free, as it involves an increase in the severity of the BH correction; this is particularly relevant for ChIP-seq where there are many more unbound regions than bound regions.
All I will say is that the union approach will distort the inferences from edgeR for the regions that are called as peaks - if you just want to call more peaks, there's better ways to do it that allow you to have your cake and eat it too. Or you could just not do peak calling at all; you don't strictly need it for DB.