Question

csaw: input coverage RleLists instead of bam files

0

Entering edit mode

Arne Muller ▴ 20

@arne-muller-8308

Last seen 7.9 years ago

United States

Hello,

would it be possible+reasonable to enter csaw peak calling with RleLists
objects such as those generated by the GenomicAlignments' coverage method?
I guess it'd be more performant to store and load the RleList objects compared to bam files. As I understand the coverage vector + the available views contain the count data per position, too.

regards,

Arne

csaw • 952 views

ADD COMMENT • link updated 8.8 years ago by Aaron Lun ★ 28k • written 8.8 years ago by Arne Muller ▴ 20

score 1 · Answer 1 · 2015-07-07

First, a clarification; csaw doesn't do peak calling in any formal sense. Rather, it slides windows across the genome and counts the number of (possibly extended) reads overlapping each window. To do so, it needs access to the position, strand and length of each mapped read. Such information is readily available from the BAM file, which is why this is the default (and only) input format. I toyed around with other input formats, but I didn't feel they were worth supporting directly, given that users could simply convert their files to BAM (or get the original BAM files fresh out of the aligner) prior to entry into csaw.

Anyway, if windowCounts were to accept RleList objects as input, it'd have to figure out how to extract alignment information from the coverage vector. This is impossible for strandedness and length, given that the same coverage profile can be obtained with any combination of forward/reverse and short/long reads. As for the read positions; in theory, if all reads were known to be of the same length and strand (e.g., by supplying different coverage vectors for each combination of length and strand), one could derive the positions from the coverage profile. However, I suspect that the complexity of this procedure would outweigh any memory benefits of using RleList objects.

Also, as you've noted, the coverage vectors contain the count data for each position. If you cbind the coverage vectors for all libraries together, you get a count matrix that you can use for statistical analyses. This is equivalent to counting reads in windowCounts with a window size of 1, a spacing of 1 (i.e., each base is its own window) and no read extension. Obviously, a single-base-resolution analysis will be very memory- and time-intensive, with little gain in information given that the tests for adjacent windows are so highly correlated. We find that a spacing of ~50 bp provides a reasonable compromise between resolution and convenience.