Question

Coverage function assigns wrong weights for gapped alignments

1

Entering edit mode

rasi1983 ▴ 10

@rasi1983-7454

Last seen 6.2 years ago

United States

I am calculating weighted coverage for two reads one of which has a gap. I find that the coverage weights are assigned wrong after the gapped region due to the GAlignment being converted to GRangesList with multiple GRanges. Is there a way to calculate the right weighted coverage?

I can correct this manually by expanding the assigned weights by hand as shown below, but the lapply step seems to be very slow when working with millions of reads.

> library(GenomicAlignments)

> reads <- GAlignments(
seqnames=Rle(rep("chr1",2)),
pos=as.integer(c(2,40)),
strand=strand(rep("+",2)),
cigar=c("8M2N12M", "10M"),
weight=c(0.1, 0.2))

> reads
GAlignments object with 2 alignments and 1 metadata column:
      seqnames strand       cigar    qwidth     start       end     width
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
  [1]     chr1      +     8M2N12M        20         2        23        22
  [2]     chr1      +         10M        10        40        49        10
          njunc |    weight
      <integer> | <numeric>
  [1]         1 |       0.1
  [2]         0 |       0.2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> coverage(reads)
RleList of length 1
$chr1
integer-Rle of length 49 with 6 runs
  Lengths:  1  8  2 12 16 10
  Values :  0  1  0  1  0  1


> coverage(reads, weight=mcols(reads)$weight)
RleList of length 1
$chr1
numeric-Rle of length 49 with 6 runs
  Lengths:   1   8   2  12  16  10
  Values :   0 0.1   0 0.2   0 0.1

Warning message:
In .recycle(arg, length(x), arg.label, "x") :
  'weight' length is not a divisor of 'x' length

#### correct above problem by manually assigning weights
> shape <- lapply(grlist(reads), function(x) rep(1, length(x)))
> weights <- unlist(mapply("*", shape, mcols(reads)$weight)
> coverage(reads, weight=weights)
RleList of length 1
$chr1
numeric-Rle of length 49 with 6 runs
  Lengths:   1   8   2  12  16  10
  Values :   0 0.1   0 0.1   0 0.2

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8   
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils    
[8] methods   base     

other attached packages:
 [1] GenomicAlignments_1.10.0   Rsamtools_1.26.1          
 [3] Biostrings_2.42.1          XVector_0.14.0            
 [5] SummarizedExperiment_1.4.0 Biobase_2.34.0            
 [7] GenomicRanges_1.26.3       GenomeInfoDb_1.10.3       
 [9] IRanges_2.8.1              S4Vectors_0.12.1          
[11] BiocGenerics_0.20.0       

loaded via a namespace (and not attached):
[1] lattice_0.20-34    bitops_1.0-6       grid_3.3.2         zlibbioc_1.20.0   
[5] Matrix_1.2-8       BiocParallel_1.8.1 tools_3.3.2        RCurl_1.95-4.8    
[9] compiler_3.3.2

genomicalignments splicing coverage • 1.5k views

ADD COMMENT • link updated 4.5 years ago by Michael Lawrence ★ 11k • written 7.1 years ago by rasi1983 ▴ 10

score 0 · Answer 1 · 2019-10-11

0

Entering edit mode

haroldpimentel • 0

@haroldpimentel-9160

Last seen 4.5 years ago

I came across the same issue. It still appears to not have been fixed.

Here is a adjustment to your solution that should be significantly faster:

    cig = cigar(reads)
    n_matches = lengths(regmatches(cig, gregexpr("M", cig)))
    shape = lapply(n_matches, function(y) rep(1, y))

ADD COMMENT • link 4.5 years ago haroldpimentel • 0

score 0 · Answer 2 · 2019-10-11

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

Seems like a bug in coverage,GRangesList(), because it should automatically expand the weights. For a relatively simple speed up, consider:

grl <- grglist(reads)
weight <- mcols(reads)$weight[togroup(PartitioningByEnd(grl))]
coverage(grl, weight)

ADD COMMENT • link 4.5 years ago Michael Lawrence ★ 11k