Hello,
I need to bin my data into 200b bins and calculate the mean coverage for each bin. I found "How to compute binned averages along a genome" in HowTo document of Genomic Ranges and used a very simple dataset to test the function. Here is my test dataset:
chr1 156 355
chr1 241 440
chr1 311 510
and this is the function from document:
binnedAverage <- function(bins, numvar, mcolname)
+ {
+ stopifnot(is(bins, "GRanges"))
+ stopifnot(is(numvar, "RleList"))
+ stopifnot(identical(seqlevels(bins), names(numvar)))
+ bins_per_chrom <- split(ranges(bins), seqnames(bins))
+ means_list <- lapply(names(numvar),
+ function(seqname) {
+ views <- Views(numvar[[seqname]],
+ bins_per_chrom[[seqname]])
+ viewMeans(views)
+ })
+ new_mcol <- unsplit(means_list, as.factor(seqnames(bins)))
+ mcols(bins)[[mcolname]] <- new_mcol
+ bins
+ }
For my dataset I expect to get the following bins and the mean coverage:
bin mean
1-200 0.225
201-400 2.025
401-600 0.75
But, I receive these results:
[1] chr1 [ 1, 200] * | 0.225
[2] chr1 [201, 400] * | 2.025
[3] chr1 [401, 600] * | 1.36363636363636
The first 2 bins have the expected mean coverage, but the last one has a coverage higher than what expected. I found out when calculating mean coverage of the last bin, the function ignores the zero coverage of 510-600. If it includes 510-600 range, we should see a mean coverage of 0.75:
400-440 --> 40 * 2 (this range falls in two of our initial ranges in the bed file (chr1 241 440 and chr1 311 510)) =80
440-510 --> 70 * 1 (this range only falls in one of the initial ranges (only chr1 311 510)) = 70
510-600 --> 90 * 0 (we don't have this range in the initial bed file) = 0
70 + 80 + 0 = 150 / (40+70+90) = 0.75
But, instead the last range (510-600) with zero coverage is ignored and the mean coverage is calculated like this:
400-440 --> 40 * 2 = 80
440-510 --> 70 * 1 = 70
70 + 80 = 150 / (40+70) = 1.363636
that is higher than what we expect. I think 510-600 range shouldn't be excluded from data because here the genome is not finished. It continues to 249250621 nucleotides. We just don't have any reads after 510, so we should keep the zero coverage parts in the mean coverage calculation the same way that function adds 1-156 range of zero coverage for the first bin.
Can you please let me know how this can be fixed?
Thanks,
Fa
In GenomicRanges 1.20.6 (current BioC release) and 1.21.21 (current BioC devel), I've changed the definition of
binnedAverage()
to handle properly the "missing trailing zeros" situation. The new definition of the function is:This new definition is in the example section of
?tileGenome
and in the GenomicRanges HOWTOs.Cheers,
H.