Question

Problem with Finding Mean Coverage using binnedAverage function

0

Entering edit mode

keshavarz.f10 • 0

@keshavarzf10-8719

Last seen 4.4 years ago

Canada

Hello,

I need to bin my data into 200b bins and calculate the mean coverage for each bin. I found "How to compute binned averages along a genome" in HowTo document of Genomic Ranges and used a very simple dataset to test the function. Here is my test dataset:

chr1 156 355

chr1 241 440

chr1 311 510

and this is the function from document:

binnedAverage <- function(bins, numvar, mcolname)

+ {

+ stopifnot(is(bins, "GRanges"))

+ stopifnot(is(numvar, "RleList"))

+ stopifnot(identical(seqlevels(bins), names(numvar)))

+ bins_per_chrom <- split(ranges(bins), seqnames(bins))

+ means_list <- lapply(names(numvar),

+ function(seqname) {

+ views <- Views(numvar[[seqname]],

+ bins_per_chrom[[seqname]])

+ viewMeans(views)

+ })

+ new_mcol <- unsplit(means_list, as.factor(seqnames(bins)))

+ mcols(bins)[[mcolname]] <- new_mcol

+ bins

+ }

For my dataset I expect to get the following bins and the mean coverage:

bin mean

1-200 0.225

201-400 2.025

401-600 0.75

But, I receive these results:

[1] chr1 [ 1, 200] * | 0.225
[2] chr1 [201, 400] * | 2.025
[3] chr1 [401, 600] * | 1.36363636363636

The first 2 bins have the expected mean coverage, but the last one has a coverage higher than what expected. I found out when calculating mean coverage of the last bin, the function ignores the zero coverage of 510-600. If it includes 510-600 range, we should see a mean coverage of 0.75:

400-440 --> 40 * 2 (this range falls in two of our initial ranges in the bed file (chr1 241 440 and chr1 311 510)) =80

440-510 --> 70 * 1 (this range only falls in one of the initial ranges (only chr1 311 510)) = 70

510-600 --> 90 * 0 (we don't have this range in the initial bed file) = 0

70 + 80 + 0 = 150 / (40+70+90) = 0.75

But, instead the last range (510-600) with zero coverage is ignored and the mean coverage is calculated like this:

400-440 --> 40 * 2 = 80

440-510 --> 70 * 1 = 70

70 + 80 = 150 / (40+70) = 1.363636

that is higher than what we expect. I think 510-600 range shouldn't be excluded from data because here the genome is not finished. It continues to 249250621 nucleotides. We just don't have any reads after 510, so we should keep the zero coverage parts in the mean coverage calculation the same way that function adds 1-156 range of zero coverage for the first bin.

Can you please let me know how this can be fixed?

Thanks,

Fa

GenomicRanges • 1.9k views

ADD COMMENT • link updated 8.7 years ago by Hervé Pagès 16k • written 8.7 years ago by keshavarz.f10 • 0

score 0 · Answer 1 · 2015-08-28

Hi Fa,

Please always show the exact code you used and your sessionInfo().

I think the problem is that binnedAverage() expects to see the trailing zeros in the coverage vector:

bins <- tileGenome(c(chr1=600), tilewidth=200,
                   cut.last.tile.in.chrom=TRUE)
reads <- GRanges("chr1", IRanges(c(156, 241, 311),
                                 c(355, 440, 510)))
cvg <- coverage(reads)
cvg
# RleList of length 1
# $chr1
# integer-Rle of length 510 with 6 runs
#   Lengths: 155  85  70  45  85  70
#   Values :   0   1   2   3   2   1  # <--- NO trailing zeros!

Then:

binnedAverage(bins, cvg, "mean_cvg")
# GRanges object with 3 ranges and 1 metadata column:
#       seqnames     ranges strand |         mean_cvg
#          <Rle>  <IRanges>  <Rle> |        <numeric>
#   [1]     chr1 [  1, 200]      * |            0.225
#   [2]     chr1 [201, 400]      * |            2.025
#   [3]     chr1 [401, 600]      * | 1.36363636363636
#   -------
#   seqinfo: 1 sequence from an unspecified genome

In order to have the coverage() function add the traling zero's, it needs to know the length of the chromosome:

seqinfo(reads)
# Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
#   seqnames seqlengths isCircular genome
#   chr1             NA         NA   <NA>

seqlengths(reads) <- seqlengths(bins)
seqinfo(reads)
# Seqinfo object with 1 sequence from an unspecified genome:
#   seqnames seqlengths isCircular genome
#   chr1            600         NA   <NA>

cvg <- coverage(reads)
cvg
# RleList of length 1
# $chr1
# integer-Rle of length 600 with 7 runs
#   Lengths: 155  85  70  45  85  70  90
#   Values :   0   1   2   3   2   1   0  # <--- trailing zeros!

Then:

binnedAverage(bins, cvg, "mean_cvg")
# GRanges object with 3 ranges and 1 metadata column:
#       seqnames     ranges strand |  mean_cvg
#          <Rle>  <IRanges>  <Rle> | <numeric>
#   [1]     chr1 [  1, 200]      * |     0.225
#   [2]     chr1 [201, 400]      * |     2.025
#   [3]     chr1 [401, 600]      * |      0.75
#   -------
#   seqinfo: 1 sequence from an unspecified genome

Note that in a typical "real-world" workflow, this problem is unlikely to arise because the reads usually contain the chromosome lengths (if they were loaded from a BAM file with the readGAlignements() function from the GenomicAlignments package, they do). Then the bins are typically generated with something like bins <- tileGenome(seqlengths(reads), tilewidth=200, cut.last.tile.in.chrom=TRUE) so the sequence lengths are propagated from reads to bins.

Anyway, I'll modify binnedAverage() in the doc to handle the situation where the trailing zeros are missing.

Cheers,

H.