csaw : effect of library size on filtering of uninteresting windows
Vivek.b ▴ 100
@vivekb-7661
Last seen 4.5 years ago
Germany

Hi there,

I am using the csaw package to filter out the background from my data with the "local enrichment" method. I first tested it on a dataset with low input material (resulting in small library sizes) and found that the method works nicely when I keep all regions with 2-fold enrichment over the local background. But when I tested it on a dataset with more input material (resulting in library sizes about 10x larger), I found that I had to raise the filtering threshold to 6-fold enrichment to keep the bound regions without noise.

I want to automate this process, so I am wondering what would be an appropriate way to choose a filtering cutoff from the filter.stats that works for all library sizes.

I managed to use CPM values instead of the plain windowCounts and regionCounts output to get the filter.stats. But the distribution of filter.stats is still not similar between the two kinds of samples, so I won't be able to use a single cutoff for both. Any ideas?

 

Thanks

Vivek

csaw • 1.4k views
Aaron Lun ★ 28k
@alun
Last seen 14 hours ago
The city by the bay

The only component of filterWindows that does not cancel out with library size is the pseudo-count used in aveLogCPM. This shrinks the filter statistics (i.e., the filter element of the output) towards zero. The smaller the counts, the stronger the shrinkage, and with good reason: otherwise the function would happily report "large" enrichments for regions with only a handful of reads. In effect, a small threshold at a low library size is as stringent (in terms of the number of windows retained) as a larger threshold at a higher library size.
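
To make the shrinkage concrete, here is a minimal Python sketch. It is not csaw code: it assumes a simplified statistic, namely the log2 ratio of window to background counts with the same pseudo-count added to both, and it assumes the background count has already been scaled to the window width. The function name `shrunken_log2_enrichment` and the prior of 2 are illustrative choices, and the real aveLogCPM calculation differs in detail. Because both counts come from the same library, the library size cancels in the ratio and only the pseudo-count remains.

```python
import math


def shrunken_log2_enrichment(window_count, background_count, prior_count=2.0):
    """Log2 ratio of window to background counts, each offset by a pseudo-count.

    Simplified model (an assumption, not the csaw implementation): the
    library size cancels in the ratio, so only the pseudo-count is left.
    """
    return math.log2((window_count + prior_count) / (background_count + prior_count))


# Same true 4-fold enrichment, observed at increasing sequencing depth.
true_fold = 4
for depth_scale in (1, 10, 100):
    background = 5 * depth_scale
    window = true_fold * background
    stat = shrunken_log2_enrichment(window, background)
    print(f"depth x{depth_scale:<3} window={window:<5} background={background:<4} "
          f"log2 enrichment={stat:.3f}")
```

The unshrunken value would be log2(4) = 2 in every row. Under this model the reported statistic stays below 2 at low depth and moves up towards 2 as the counts grow, which is why a fixed cutoff retains more windows in the deeper library.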

Now, if all of your libraries are large, the behaviour of the pseudo-count will not make a difference. This is because the amount of shrinkage will approach zero as your counts increase, such that you should get similar results with the same threshold for different (but still large) library sizes. However, for very small libraries, it will affect the results - after all, that's why we use it - so you'll just have to pay attention to those cases.
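
A rough way to see this, using a simplified model rather than csaw's actual code: suppose the filter statistic is log2((w + p) / (b + p)) for a window count w, a width-adjusted background count b, and a pseudo-count p. Solving for the window count that just reaches a log2 threshold t gives w = 2^t (b + p) - p, so the raw fold change w / b needed to pass is 2^t + p (2^t - 1) / b. The helper `raw_fold_needed` and the prior of 2 below are hypothetical, chosen only for illustration.

```python
def raw_fold_needed(threshold_log2, background_count, prior_count=2.0):
    """Raw window/background ratio needed to reach a threshold on the
    pseudo-count-shrunken log2 enrichment (simplified model, not csaw code)."""
    fold = 2.0 ** threshold_log2
    # Window count at which log2((w + p) / (b + p)) equals the threshold.
    window_needed = fold * (background_count + prior_count) - prior_count
    return window_needed / background_count


# Fixed threshold of log2(2) = 1, with background counts growing with depth.
for background in (2, 20, 200, 2000):
    print(background, round(raw_fold_needed(1.0, background), 3))
# 2 3.0
# 20 2.1
# 200 2.01
# 2000 2.001
```

With only a couple of background reads, a nominal 2-fold cutoff behaves like a 3-fold cutoff on the raw counts. Once the background count is in the hundreds, the extra term is negligible and the same threshold means nearly the same thing across libraries.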

Full automation of these analyses would be nice. But then I wouldn't have a job.


Got it, thanks Aaron. I want you to keep doing your great job :)
