Question

Exporting normalized counts over final DB windows from csaw

1

Entering edit mode

jms1223 ▴ 10

@jms1223-13630

Last seen 6.7 years ago

I've successfully run csaw and identified windows with significantly different ChIP-seq enrichment across my experimental conditions. Instead of producing the genome browser-like plots outlined in the user manual, I would simply like to export the total (or ideally, mean per base) number of reads, scaled by the normalization factors, for each input sample so that I can represent the signal over differential windows as a heatmap. Perhaps unsurprisingly, simply writing out a BED file of the significant window coordinates, then running BEDtools to compute coverage over those windows, then scaling by sequencing depth isn't necessarily producing the same answer. In other words, I had thousands of windows that csaw called as having increased enrichment, but my manual scaling of the read counts by sequencing depth over those windows actually shows the opposite trend. Perhaps instead of manually scaling by sequencing depth like I've been doing, I could simply scale by the csaw scaling factors, but if there were a way to directly export the normalized counts exactly as csaw saw them, that would save a few steps.

Here's the code that I've been running:

# Initialize

bam.files <- c("x1.bam","x2.bam","y1.bam","y2.bam")
design <- model.matrix(~factor(c('x', 'x', 'y', 'y')))
colnames(design) <- c("intercept", "cell.type")

# Load in data from BAM files.
param <- readParam(minq=50)
data <- windowCounts(bam.files, ext=250, width=150, param=param)

# Filter based on union set of peak calls from MACS2
peaks=import("unionPeaks.bed",format="bed")
suppressWarnings(keep <- overlapsAny(rowRanges(data), peaks))
filtered.data <- data[keep,]

# Calculate normalization factors.
binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
normfacs <- normOffsets(binned)
filtered.data$norm.factors <- normfacs

# Identify DB windows.
y <- asDGEList(filtered.data, norm.factors=normfacs)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
results <- glmQLFTest(fit)
rowData(filtered.data) <- cbind(rowData(filtered.data), results$table)

# Correct for multiple testing.
merged <- mergeWindows(rowRanges(filtered.data), tol=1000L,max.width=2000L)
tabcom <- combineTests(merged$id, results$table)

# Write out results
is.sig <- tabcom$FDR <= 0.05
tabbest <- getBestTest(merged$id, results$table)
out.ranges <- merged$region
elementMetadata(out.ranges) <- data.frame(tabcom,best.pos=mid(ranges(rowRanges(filtered.data[tabbest$best]))),best.logFC=tabbest$logFC)
simplified <- out.ranges[is.sig]
simplified$score <- -10*log10(simplified$FDR)
export(con="csaw_results.bed", object=simplified)
is.sig.pos <- (tabbest$logFC > 0)[is.sig]
print(summary(is.sig.pos))

csaw edger chip-seq • 2.3k views

ADD COMMENT • link updated 6.7 years ago by Ryan C. Thompson ★ 7.9k • written 6.7 years ago by jms1223 ▴ 10

0

Entering edit mode

Note that, if you're going to use MACS for peak calling, we don't recommend taking the union of peaks from multiple samples as this biases the differential testing (https://doi.org/10.1093/nar/gku351). Rather, it is better to pool all samples together and run MACS on the resulting file, which ensures that the peak calls are independent from the differential test statistic.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

I've been calling peaks relative to input controls, with biological duplicate on everything. So in other words I have

Condition1, rep1
Condition1, rep2
Condition2, rep1
Condition2, rep2
Condition1, input rep1
Condition1, input rep2
Condition2, input rep1
Condition2, input rep2

So you would concatenate all four IPs and call peaks relative to the concatenation of the four input controls? While that may control for Type I error rates in csaw, does not not cause upstream issues for the peak calls themselves?

ADD REPLY • link 6.7 years ago jms1223 ▴ 10

0

Entering edit mode

Yes, I would concatenate all IPs, assuming you're not using csaw to compare to the inputs (in which case you'd have to concatenate the inputs as well). What specific issues for the peak calls are you referring to?

In any case, this is the reason why csaw uses windows in the first place - to avoid the difficulties in determining how to perform peak calling in a manner that does not bias the DB analysis. Note that the same considerations are present in any DB analysis; this is not surprising, as the downstream analysis depends heavily on the nature of the provided features.

To illustrate, consider what happens near the peak-calling "boundary", i.e., when a peak is just strong enough to be retained in the final set of features. When you take the union of peak calls, you will increase the proportion of regions that have high signal in only one sample, whereas regions that have consistently low signal will be depleted. In your particular experimental design, this ends up inflating the dispersion estimates as the enriched regions will have stronger differences between replicates. (In fact, this can lead to some rather toxic interactions with the empirical Bayes shrinkage in edgeR, where the dispersion estimates for all other regions are artificially pulled up, thereby reducing power; while the inflated dispersion estimates for these enriched regions are artificially pulled down, thereby increasing the false positive rate.)

One could argue that this is a subtle effect that only affects a few genomic regions. However, in ChIP-seq, few regions are bound at high abundance, and most regions are weakly or not bound. This means that the number of regions at the peak-calling boundary is actually quite high compared to the number of called regions, sufficient to interfere with edgeR.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

My gut tells me that combining 4 samples will hinder your power to detect those "boundary" type peaks to begin with. But worse, "boundary" in this instance may mean not just weaker sites, but also those that exhibit meaningful variability across experimental conditions. In my experience, adding more noise to the system (unsurprisingly) has a negative effect on the resulting peaks. In a quick calculation for one of my sample sets, even though ~95% of the peaks called for any one of the 4 samples in the bulleted list above were present in the 4-way-concatenated peak set, that still leaves me with ~15k fewer regions in total that I would have otherwise obtained by taking the union set, which is a lot. I can't yet speak to their homogeneity (or lack thereof), nor do I have summary statistics on these yet.

So the concatenation method you describe may focus the differential analysis on the most robust, ubiquitous, less-variable regions, and thus not overly inflate the edgeR dispersion estimates. If we assume, though, that some non-zero number of those 15k regions exhibit real biological variability (ie across conditions) and thus were missed when everything got thrown together for peak-calling, I'm curious how that doesn't result in an inflation of Type II error, since those variable regions would never even have the chance to be considered in the differential analysis. I don't see a discussion of false negative rates in the paper you linked, but I'd be interested to see that

ADD REPLY • link 6.7 years ago jms1223 ▴ 10

1

Entering edit mode

If you don't want to throw out potential DB regions, reduce the stringency of your peak calls; this is a consideration that is independent of the approach you are using to consolidate the peaks. In fact, the thresholds with the union approach are not comparable to the threshold of the concatenated approach, and need to be adjusted to obtain the same effective stringency if you want to compare the filter strategies. Of course, retention of more regions is not free, as it involves an increase in the severity of the BH correction; this is particularly relevant for ChIP-seq where there are many more unbound regions than bound regions.

All I will say is that the union approach will distort the inferences from edgeR for the regions that are called as peaks - if you just want to call more peaks, there's better ways to do it that allow you to have your cake and eat it too. Or you could just not do peak calling at all; you don't strictly need it for DB.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

score 1 · Answer 1 · 2017-07-31

1

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

Since csaw creates a DGEList and uses edgeR for the differential binding tests, you can just use the cpm function from edgeR to get the normalized counts-per-million values, which should be comparable between samples and between windows since all windows are the same size. You can attach the CPM values for each sample to the rowRanges of your SimmarizedExperiment and export the result as a BED file using the rtracklayer package.

(As always note that edgeR does not directly use the sample CPM values in its model, but rather fits a model to the raw counts with the normalization factors as offsets.)

ADD COMMENT • link 6.7 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Thanks for the reply. It seems like in this instance, cpm(y) provides those counts for all windows, rather than just the final merged/significant ones...though it's hard to tell since cpm(y) doesn't by default give me any row names. How would I get counts for the final windows, and annotate them with their coordinates? I'm a GRanges novice...

Also just to note, the significant windows are not all the same size, as you stated - in this case they range from ~500bp to ~2kb. So it likely would be best to normalize the window counts by the window size for my subsequent analyses, but that is easier to do post hoc as needed

ADD REPLY • link 6.7 years ago jms1223 ▴ 10

0

Entering edit mode

Note that csaw collects counts for windows but obtains the final statistics for regions. csaw never sees the counts for each region, as the p-values from all windows in a region are combined into a single region-level statistic. The best option would be to use getBestTest to select the top window in each significant region, and visualize that. This is quite easy, as getBestTest returns a table like combineTests, so you can just subset the output of the former by the significance from the latter. (Don't use the getBestTest p-values, though, they do something different.) Then use $best to get the relevant windows in filtered.data and/or cpm(y). The other option would be to recount across each significant region with regionCounts. In most applications, this will usually give you something that looks visually similar enough.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Running regionCounts() seemed like the most direct approach to me, so I went with that. The values seem to be integers though, so I'm guessing these are raw counts based on reads pulled from the BAM files. Do I just scale these values by normfacs, or is that incorrect?

Also if you can just sanity check the following to make sure I'm annotating everything correctly that would be helpful!

regcts=regionCounts(bam.files,regions=simplified,ext=250,param=param)
coords=paste(as.vector(seqnames(regcts)),as.vector(start(regcts)-1),as.vector(end(regcts)),sep="_")
samples=bam.files
rownames(regcts)=coords
colnames(regcts)=samples

ADD REPLY • link 6.7 years ago jms1223 ▴ 10

0

Entering edit mode

Looks fine to me. Note that you can do as.character on a GRanges object to get a string. To get some normalized coverage values, I would suggest running:

cpm(assay(regcts), lib.size=exp(getOffset(y)))

The offsets of your y are the log-transformed effective library sizes, i.e., the product of the library sizes and the normalization factors. The latter don't make sense without the former.