Problem with Finding Mean Coverage using binnedAverage function
1
0
Entering edit mode
@keshavarzf10-8719
Last seen 4.4 years ago
Canada

Hello,

I need to bin my data into 200b bins and calculate the mean coverage for each bin. I found "How to compute binned averages along a genome" in HowTo document of Genomic Ranges and used a very simple dataset to test the function. Here is my test dataset:

chr1     156      355

chr1     241      440

chr1     311      510

and this is the function from document:

binnedAverage <- function(bins, numvar, mcolname)

+ {

+ stopifnot(is(bins, "GRanges"))

+ stopifnot(is(numvar, "RleList"))

+ stopifnot(identical(seqlevels(bins), names(numvar)))

+ bins_per_chrom <- split(ranges(bins), seqnames(bins))

+ means_list <- lapply(names(numvar),

+ function(seqname) {

+ views <- Views(numvar[[seqname]],

+ bins_per_chrom[[seqname]])

+ viewMeans(views)

+ })

+ new_mcol <- unsplit(means_list, as.factor(seqnames(bins)))

+ mcols(bins)[[mcolname]] <- new_mcol

+ bins

+ }

For my dataset I expect to get the following bins and the mean coverage:

bin             mean

1-200        0.225

201-400    2.025

401-600    0.75

But, I receive these results:

[1]     chr1            [  1,  200]      *   |            0.225
[2]     chr1            [201,  400]      *   |            2.025
[3]     chr1            [401,  600]      *   | 1.36363636363636

The first 2 bins have the expected mean coverage, but the last one has a coverage higher than what expected. I found out when calculating mean coverage of the last bin, the function ignores the zero coverage of 510-600. If it includes 510-600 range, we should see a mean coverage of 0.75:

400-440 --> 40 * 2 (this range falls in two of our initial ranges in the bed file (chr1     241      440 and chr1     311      510)) =80

440-510 --> 70 * 1 (this range only falls in one of the initial ranges (only chr1     311     510)) = 70

510-600 --> 90 * 0 (we don't have this range in the initial bed file) = 0

 

70 + 80 + 0 = 150 / (40+70+90) = 0.75

 

But, instead the last range (510-600) with zero coverage is ignored and the mean coverage is calculated like this:

400-440 --> 40 * 2 = 80

440-510 --> 70 * 1 = 70

 

70 + 80 = 150 / (40+70) = 1.363636

that is higher than what we expect. I think 510-600 range shouldn't be excluded from data because here the genome is not finished. It continues to 249250621 nucleotides. We just don't have any reads after 510, so we should keep the zero coverage parts in the mean coverage calculation the same way that function adds 1-156 range of zero coverage for the first bin.

Can you please let me know how this can be fixed?

Thanks,

Fa

GenomicRanges • 1.9k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 13 hours ago
Seattle, WA, United States

Hi Fa,

Please always show the exact code you used and your sessionInfo().

I think the problem is that binnedAverage() expects to see the trailing zeros in the coverage vector:

bins <- tileGenome(c(chr1=600), tilewidth=200,
                   cut.last.tile.in.chrom=TRUE)
reads <- GRanges("chr1", IRanges(c(156, 241, 311),
                                 c(355, 440, 510)))
cvg <- coverage(reads)
cvg
# RleList of length 1
# $chr1
# integer-Rle of length 510 with 6 runs
#   Lengths: 155  85  70  45  85  70
#   Values :   0   1   2   3   2   1  # <--- NO trailing zeros!

Then:

binnedAverage(bins, cvg, "mean_cvg")
# GRanges object with 3 ranges and 1 metadata column:
#       seqnames     ranges strand |         mean_cvg
#          <Rle>  <IRanges>  <Rle> |        <numeric>
#   [1]     chr1 [  1, 200]      * |            0.225
#   [2]     chr1 [201, 400]      * |            2.025
#   [3]     chr1 [401, 600]      * | 1.36363636363636
#   -------
#   seqinfo: 1 sequence from an unspecified genome

In order to have the coverage() function add the traling zero's, it needs to know the length of the chromosome:

seqinfo(reads)
# Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
#   seqnames seqlengths isCircular genome
#   chr1             NA         NA   <NA>

seqlengths(reads) <- seqlengths(bins)
seqinfo(reads)
# Seqinfo object with 1 sequence from an unspecified genome:
#   seqnames seqlengths isCircular genome
#   chr1            600         NA   <NA>

cvg <- coverage(reads)
cvg
# RleList of length 1
# $chr1
# integer-Rle of length 600 with 7 runs
#   Lengths: 155  85  70  45  85  70  90
#   Values :   0   1   2   3   2   1   0  # <--- trailing zeros!

Then:

binnedAverage(bins, cvg, "mean_cvg")
# GRanges object with 3 ranges and 1 metadata column:
#       seqnames     ranges strand |  mean_cvg
#          <Rle>  <IRanges>  <Rle> | <numeric>
#   [1]     chr1 [  1, 200]      * |     0.225
#   [2]     chr1 [201, 400]      * |     2.025
#   [3]     chr1 [401, 600]      * |      0.75
#   -------
#   seqinfo: 1 sequence from an unspecified genome

Note that in a typical "real-world" workflow, this problem is unlikely to arise because the reads usually contain the chromosome lengths (if they were loaded from a BAM file with the readGAlignements() function from the GenomicAlignments package, they do). Then the bins are typically generated with something like bins <- tileGenome(seqlengths(reads), tilewidth=200, cut.last.tile.in.chrom=TRUE) so the sequence lengths are propagated from reads to bins.

Anyway, I'll modify binnedAverage() in the doc to handle the situation where the trailing zeros are missing.

Cheers,

H.

ADD COMMENT
0
Entering edit mode

In GenomicRanges 1.20.6 (current BioC release) and 1.21.21 (current BioC devel), I've changed the definition of binnedAverage() to handle properly the "missing trailing zeros" situation. The new definition of the function is:

binnedAverage <- function(bins, numvar, mcolname)
{
    stopifnot(is(bins, "GRanges"))
    stopifnot(is(numvar, "RleList"))
    stopifnot(identical(seqlevels(bins), names(numvar)))

    ## A version of viewMeans() that pads "out of limits" views with zeros.
    viewMeans2 <- function(v) {
        means <- viewMeans(v)
        w0 <- width(v)
        w1 <- width(trim(v))
        means <- means * w1 / w0
        means[w0 != 0L & w1 == 0L] <- 0
        means
    }

    bins_per_chrom <- split(ranges(bins), seqnames(bins))
    means_list <- lapply(names(numvar),
        function(seqname) {
            v <- Views(numvar[[seqname]], bins_per_chrom[[seqname]])
            viewMeans2(v)
        })
    new_mcol <- unsplit(means_list, as.factor(seqnames(bins)))
    mcols(bins)[[mcolname]] <- new_mcol
    bins
}

This new definition is in the example section of ?tileGenome and in the GenomicRanges HOWTOs.

Cheers,

H.

ADD REPLY

Login before adding your answer.

Traffic: 747 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6