csaw: post clustering description of regions

asif.zubair • 0

@asifzubair-6770

My understanding of csaw's merge operation over windows is that it simply clusters nearby windows into a region. It doesn't pay heed to the direction of the fold change (FC) or to whether the windows show significant differential binding (DB). Is this understanding correct?

However, an obvious issue then is how to define the DB for the region. The csaw vignette suggests looking at the best window in a region and using its FC as the FC of the region. What I am struggling with is that I sometimes have large regions of around 1 kb, and I don't know whether it is justifiable to declare the FC based on a single window (which is 150 bp).

Given this, what would be the best strategy for assigning a region-level FC?

csaw mergeWindows

updated 8.0 years ago by Aaron Lun ★ 28k • written 8.0 years ago by asif.zubair • 0

score 1 · Accepted Answer · 2016-05-04

Yes, that's right. Clustering must be independent of DB in order for the downstream multiplicity corrections in combineTests to work properly. This means that we can't pick and choose our windows based on their DB status. This restriction also applies to the sign of the DB log-fold change. As a result, you can end up with clusters containing windows that go in opposite directions. In general, I find that it's not too much of an issue as my clusters tend to be small. You could try reducing cluster size by setting tol=100 and max.width=5000 in mergeWindows, and/or filtering more aggressively to get rid of windows that might otherwise "chain" clusters together.
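To make the "independent of DB" point concrete, here is a toy Python sketch of coordinate-only clustering. It is not csaw code and does not reproduce mergeWindows: the function name `merge_windows`, the tuple layout, and the gap rule (`start - previous_end - 1 <= tol`) are assumptions made for illustration, and `max.width` splitting is left out entirely. What it shows is that the log-fold changes never enter the clustering, so windows with opposite signs can share a cluster.

```python
from typing import List, Tuple

# Hypothetical layout: (start, end), inclusive, all on one chromosome.
Window = Tuple[int, int]


def merge_windows(windows: List[Window], tol: int = 100) -> List[int]:
    """Return a cluster ID per window, chaining windows whose gap is <= tol.

    Windows are assumed to be sorted by start. Log-fold changes and
    p-values are deliberately not inputs: only coordinates are used.
    """
    ids: List[int] = []
    cluster = 0
    cluster_end = None  # right-most end seen in the current cluster
    for start, end in windows:
        if cluster_end is None or start - cluster_end - 1 > tol:
            # Gap too large (or first window): start a new cluster.
            cluster += 1
            cluster_end = end
        else:
            # Close enough: extend the current cluster ("chaining").
            cluster_end = max(cluster_end, end)
        ids.append(cluster)
    return ids


windows = [(1, 150), (51, 200), (101, 250), (1001, 1150)]
log_fc = [1.8, 0.2, -1.5, 2.1]  # never passed to merge_windows
cluster_ids = merge_windows(windows, tol=100)
print(cluster_ids)  # [1, 1, 1, 2]
```

The first three windows land in cluster 1 even though their log-fold changes are 1.8, 0.2 and -1.5, which is exactly the mixed-direction situation described above. A smaller `tol` or stricter filtering breaks such chains.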

Anyway, if the regions are genuinely large, then there are several things you can do:

Use the best log-fold change in each cluster. This is a less representative summary of the DB direction, but it tends to work reasonably well due to strong correlations between adjacent windows, even in large-ish regions. (I should add that I don't consider 1 kbp as being particularly large.)
Use the numbers of up/down windows reported by combineTests to decide whether a region is generally going up or down. If the majority of windows are going in one direction, then it's probably safe to use that to summarize the direction for the entire region. Of course, complex regions containing intervals with multiple directions of change can't be easily described with a simple up/down - this is an inevitable result of complex biology, so I don't really see that as a problem to be fixed.
Use clusterWindows to identify your clusters (currently in BioC-devel, but it should tick over to release soon enough). This will use the DB direction/status of the windows to assist clustering, in a more rigorous way than just using only the DB windows to cluster. The idea is that, sometimes, enrichment is too weak for filtering to distinguish between enriched and background regions; thus, too many windows are left after filtering, such that chaining prevents the formation of separate clusters. Use of the DB status allows the background (and presumably non-DB) windows to be ignored during clustering. The advantage of this approach is that each cluster will only be formed from DB windows going in the same direction. However, I don't recommend this as the default as it takes some liberties with FDR control, so it isn't quite as statistically rigorous as combineTests.
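The first two options can be sketched in a few lines of Python. This is only an illustration of the logic, not the csaw implementation: `summarize_regions`, its output keys, and the use of a per-window p-value cutoff (`fc_threshold`) to decide which windows count as up or down are all assumptions made for this example.

```python
from collections import defaultdict


def summarize_regions(ids, log_fc, p_values, fc_threshold=0.05):
    """Summarize the direction of change for each cluster of windows.

    ids, log_fc and p_values are parallel per-window lists.
    """
    per_region = defaultdict(list)
    for cid, fc, p in zip(ids, log_fc, p_values):
        per_region[cid].append((p, fc))

    summary = {}
    for cid, tests in per_region.items():
        # Option 1: log-fold change of the window with the smallest p-value.
        _, best_fc = min(tests, key=lambda t: t[0])
        # Option 2: count windows changing in each direction (assumed rule:
        # p-value at or below fc_threshold), then take the majority.
        num_up = sum(1 for p, fc in tests if p <= fc_threshold and fc > 0)
        num_down = sum(1 for p, fc in tests if p <= fc_threshold and fc < 0)
        if num_up > num_down:
            direction = "up"
        elif num_down > num_up:
            direction = "down"
        else:
            direction = "unclear"  # tie, or no window passes the cutoff
        summary[cid] = {
            "best_log_fc": best_fc,
            "num_up": num_up,
            "num_down": num_down,
            "direction": direction,
        }
    return summary


ids = [1, 1, 1, 2]
log_fc = [1.8, 1.2, -0.4, -2.1]
p_values = [0.001, 0.02, 0.7, 0.0005]
summary = summarize_regions(ids, log_fc, p_values)
print(summary[1]["direction"], summary[1]["best_log_fc"])
```

In this made-up data, region 1 has two windows going up and none going down at the cutoff, so the majority call agrees with the best-window log-fold change. A region with equal up and down counts is reported as "unclear" rather than forced into one direction, in line with the point about complex regions.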