I am calculating weighted coverage for two reads one of which has a gap. I find that the coverage weights are assigned wrong after the gapped region due to the GAlignment being converted to GRangesList with multiple GRanges. Is there a way to calculate the right weighted coverage?
I can correct this manually by expanding the assigned weights by hand as shown below, but the lapply step seems to be very slow when working with millions of reads.
> library(GenomicAlignments)
> reads <- GAlignments(
seqnames=Rle(rep("chr1",2)),
pos=as.integer(c(2,40)),
strand=strand(rep("+",2)),
cigar=c("8M2N12M", "10M"),
weight=c(0.1, 0.2))
> reads
GAlignments object with 2 alignments and 1 metadata column:
seqnames strand cigar qwidth start end width
<Rle> <Rle> <character> <integer> <integer> <integer> <integer>
[1] chr1 + 8M2N12M 20 2 23 22
[2] chr1 + 10M 10 40 49 10
njunc | weight
<integer> | <numeric>
[1] 1 | 0.1
[2] 0 | 0.2
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> coverage(reads)
RleList of length 1
$chr1
integer-Rle of length 49 with 6 runs
Lengths: 1 8 2 12 16 10
Values : 0 1 0 1 0 1
> coverage(reads, weight=mcols(reads)$weight)
RleList of length 1
$chr1
numeric-Rle of length 49 with 6 runs
Lengths: 1 8 2 12 16 10
Values : 0 0.1 0 0.2 0 0.1
Warning message:
In .recycle(arg, length(x), arg.label, "x") :
'weight' length is not a divisor of 'x' length
#### correct above problem by manually assigning weights
> shape <- lapply(grlist(reads), function(x) rep(1, length(x)))
> weights <- unlist(mapply("*", shape, mcols(reads)$weight)
> coverage(reads, weight=weights)
RleList of length 1
$chr1
numeric-Rle of length 49 with 6 runs
Lengths: 1 8 2 12 16 10
Values : 0 0.1 0 0.1 0 0.2
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices datasets utils
[8] methods base
other attached packages:
[1] GenomicAlignments_1.10.0 Rsamtools_1.26.1
[3] Biostrings_2.42.1 XVector_0.14.0
[5] SummarizedExperiment_1.4.0 Biobase_2.34.0
[7] GenomicRanges_1.26.3 GenomeInfoDb_1.10.3
[9] IRanges_2.8.1 S4Vectors_0.12.1
[11] BiocGenerics_0.20.0
loaded via a namespace (and not attached):
[1] lattice_0.20-34 bitops_1.0-6 grid_3.3.2 zlibbioc_1.20.0
[5] Matrix_1.2-8 BiocParallel_1.8.1 tools_3.3.2 RCurl_1.95-4.8
[9] compiler_3.3.2
