Coverage function assigns wrong weights for gapped alignments
2
1
Entering edit mode
rasi1983 ▴ 10
@rasi1983-7454
Last seen 6.4 years ago
United States

I am calculating weighted coverage for two reads one of which has a gap. I find that the coverage weights are assigned wrong after the gapped region due to the GAlignment being converted to GRangesList with multiple GRanges. Is there a way to calculate the right weighted coverage?

I can correct this manually by expanding the assigned weights by hand as shown below, but the lapply step seems to be very slow when working with millions of reads.

> library(GenomicAlignments)

seqnames=Rle(rep("chr1",2)),
pos=as.integer(c(2,40)),
strand=strand(rep("+",2)),
cigar=c("8M2N12M", "10M"),
weight=c(0.1, 0.2))

GAlignments object with 2 alignments and 1 metadata column:
seqnames strand       cigar    qwidth     start       end     width
<Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
[1]     chr1      +     8M2N12M        20         2        23        22
[2]     chr1      +         10M        10        40        49        10
njunc |    weight
<integer> | <numeric>
[1]         1 |       0.1
[2]         0 |       0.2
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

RleList of length 1
$chr1 integer-Rle of length 49 with 6 runs Lengths: 1 8 2 12 16 10 Values : 0 1 0 1 0 1 > coverage(reads, weight=mcols(reads)$weight)
RleList of length 1
$chr1 numeric-Rle of length 49 with 6 runs Lengths: 1 8 2 12 16 10 Values : 0 0.1 0 0.2 0 0.1 Warning message: In .recycle(arg, length(x), arg.label, "x") : 'weight' length is not a divisor of 'x' length #### correct above problem by manually assigning weights > shape <- lapply(grlist(reads), function(x) rep(1, length(x))) > weights <- unlist(mapply("*", shape, mcols(reads)$weight)
RleList of length 1
chr1 numeric-Rle of length 49 with 6 runs Lengths: 1 8 2 12 16 10 Values : 0 0.1 0 0.1 0 0.2 > sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS locale: [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 [7] LC_PAPER=en_US.utf8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices datasets utils [8] methods base other attached packages: [1] GenomicAlignments_1.10.0 Rsamtools_1.26.1 [3] Biostrings_2.42.1 XVector_0.14.0 [5] SummarizedExperiment_1.4.0 Biobase_2.34.0 [7] GenomicRanges_1.26.3 GenomeInfoDb_1.10.3 [9] IRanges_2.8.1 S4Vectors_0.12.1 [11] BiocGenerics_0.20.0 loaded via a namespace (and not attached): [1] lattice_0.20-34 bitops_1.0-6 grid_3.3.2 zlibbioc_1.20.0 [5] Matrix_1.2-8 BiocParallel_1.8.1 tools_3.3.2 RCurl_1.95-4.8 [9] compiler_3.3.2 genomicalignments splicing coverage • 1.6k views ADD COMMENT 0 Entering edit mode @haroldpimentel-9160 Last seen 4.8 years ago I came across the same issue. It still appears to not have been fixed. Here is a adjustment to your solution that should be significantly faster:  cig = cigar(reads) n_matches = lengths(regmatches(cig, gregexpr("M", cig))) shape = lapply(n_matches, function(y) rep(1, y))  ADD COMMENT 0 Entering edit mode @michael-lawrence-3846 Last seen 2.6 years ago United States Seems like a bug in coverage,GRangesList(), because it should automatically expand the weights. For a relatively simple speed up, consider: grl <- grglist(reads) weight <- mcols(reads)weight[togroup(PartitioningByEnd(grl))]
coverage(grl, weight)