Question

Strategy for choosing window size and read extension length for csaw?

0

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 15 months ago

Icahn School of Medicine at Mount Sinai…

In a previous question, I mentioned that I was using csaw to do read counting for histone modification ChIP-Seq. I'm trying to determine the right combination of the read extension and width parameters. The User's Guide uses width=150 and no ext option for histone data (e.g. in Section 4.3.1). However, this seem like it would be too permissive to measure the reads corresponding to a single nucleosome. For example, if the read length was 100, then this allows a 250 range of starting positions to be counted for each window. The approach that makes the most sense to me is the following:

window.counts <- windowCounts(sample.table$bampath, ext=147, width=1, spacing=73, param=param)

Specifically, this means I'm using windows with only a 1 bp width spaced every half-nucleosome length (73 bp) across the genome, and I'm extending each read out to the length of a nucleosome. This effectively means that I'm positing each 1-bp window as the exact center of a nucleosome's footprint, and then any reads that overlap this window (after extension are reads that could potentially belong to a nucleosome centered at that position. And since the space between windows is only half a nucleosome length, it should be impossible to "skip over" a nucleosome by accident. In fact, if my math is correct, each read should overlap exactly 2 windows. On the other hand, many nucleosomes will probably be covered by 2 windows, which believe is not a major concern since differential binding significance will be aggregated between neighboring windows anyway.

So, I have chosen this scheme because it seems to me that it will generate counts that are as representative as possible of the number of reads derived from each nucleosome. However, I realize that this criterion does not necessarily guarantee the optimal performance of DB test, so I figured I would ask about the rationale for the parameter choices for histone data in the csaw User's Guide, and what advantages it would have over my scheme.

csaw differential binding analysis bin size • 1.9k views

ADD COMMENT • link updated 9.8 years ago by Aaron Lun ★ 29k • written 9.8 years ago by Ryan C. Thompson ★ 7.9k

score 2 · Accepted Answer · 2016-04-25

Well, the simple logic was that the nucleosome would hold onto the 150 bp worth of core DNA wrapped around it. If there was any fragmentation of this core DNA (assuming that the protection afforded by nucleosome binding isn't perfect), and assuming that the nucleosome only needs to hold onto 1 bp of a fragment after cross-linking, then a read corresponding to a fragment bound by the nucleosome could maximally be a fragment length away from the boundaries of the core DNA. Thus, the settings in the user's guide will:

Count all forward and reverse reads mapped within the 150 bp window. This would obviously correspond to fragments that lie within the nucleosome, i.e., fragmentation occurs in the core DNA. (One can debate whether this actually happens when the nucleosome protects the core DNA, but let's just assume it does. Otherwise, we should see an empty space where the nucleosome protects the DNA, and that doesn't seem like it happens in the ChIP-seq data sets I've seen. Sonication's pretty intense, from what I gather.)
Count all forward reads mapped within 100 bp of the 5' end of the window (assuming a fragment length of 100 bp). These reads are generated after sequencing putative fragments where fragmentation of the non-sequenced end occurs inside the nucleosome.
Count all reverse reads mapped within 100 bp of the 3' end of the window, for the same reason described above.

Your approach will only count forward reads that are 5' and within 73 bp of the nucleosome centre (i.e., the 1 bp window), and reverse reads that are 3' of the centre. This doesn't account for the fact that the core DNA of the nucleosome has non-negligible width, such that you want to count both forward and reverse reads within the "footprint" of the nucleosome.

In practical terms, the idea with using a larger width was just to get larger counts, given that histone mark peaks are often broader than their TF counterparts. Larger counts then gives you more power to detect DB, albeit at the cost of reduced spatial resolution (though that doesn't matter much for histone marks, provided you don't go overboard and start using huge bins).