Question: Calling coverage on GRanges object yields small negative values
1
gravatar for Anya
4.0 years ago by
Anya10
United Kingdom
Anya10 wrote:

I am working on genomic data stored as normalised signal in Metadata column named 'score'. The signal, being a normalised count, can have a value of 0, or increments of some minimal value (here 0.0287). 

> minus.gr[seqnamesminus.gr)=="chr10"]
GRanges object with 61520 ranges and 1 metadata column:
          seqnames               ranges strand   |              score
             <Rle>            <IRanges>  <Rle>   |          <numeric>
      [1]    chr10         [ 699,  699]      -   |                  0
      [2]    chr10         [ 761,  761]      -   |                  0
      [3]    chr10         [1932, 1932]      -   | 0.0286903229090105
      [4]    chr10         [1964, 1964]      -   | 0.0286903229090105
      [5]    chr10         [1987, 1987]      -   | 0.0573806458180211
      ...      ...                  ...    ... ...                ...
  [61516]    chr10 [46588348, 46588348]      -   | 0.0286903229090105
  [61517]    chr10 [46588463, 46588463]      -   |  0.143451614545053
  [61518]    chr10 [46589107, 46589107]      -   | 0.0286903229090105
  [61519]    chr10 [46589451, 46589451]      -   | 0.0286903229090105
  [61520]    chr10 [46590345, 46590345]      -   |                  0
  -------
  seqinfo: 26 sequences from an unspecified genome; no seqlengths

> summaryminus.gr$score)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    0.000     0.029     0.029     0.281     0.057 11460.000 

 

Calling coverage on this yields unexpected behaviour: some stretches on e.g. chr10 with no signal have very small negative values. Surely, coverage sure have minimal value of zero? Please see example below:

 

> minus.cov <- coverageminus.gr, weight='score')
> minus.cov$chr10
numeric-Rle of length 46590345 with 97680 runs
  Lengths:                  1931                     1 ...                   894
  Values :                     0    0.0286903229090105 ... -2.08721928629529e-14

> head(table(runValue(minus.cov$chr10)))

-3.68594044175552e-14 -3.22242232897452e-14 -3.22172843958413e-14 -3.21964677141295e-14 -3.21409565628983e-14 -3.21132009872827e-14 
                    1                     1                     1                    18                     2                     3 

 

Could anyone provide an explanation as to what is happening here? 

Cheers,

 Anya

 

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.4 (Mavericks)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] CAGEr_1.10.3         BSgenome_1.34.1      rtracklayer_1.26.3   Biostrings_2.34.1    XVector_0.6.0        GenomicRanges_1.18.4 GenomeInfoDb_1.2.5  
 [8] IRanges_2.0.1        S4Vectors_0.4.0      BiocGenerics_0.12.1  BiocInstaller_1.16.2

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.6           BBmisc_1.9              beanplot_1.2            BiocParallel_1.0.3      bitops_1.0-6           
 [7] brew_1.0-6              checkmate_1.5.2         chron_2.3-45            codetools_0.2-11        data.table_1.9.4        DBI_0.3.1              
[13] digest_0.6.8            fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.2 iterators_1.0.7         plyr_1.8.1             
[19] Rcpp_0.11.5             RCurl_1.95-4.5          reshape2_1.4.1          Rsamtools_1.18.3        RSQLite_1.0.0           sendmailR_1.2-1        
[25] som_0.3-5               splines_3.1.3           stringr_0.6.2           tools_3.1.3             VGAM_0.9-7              XML_3.98-1.1           
[31] zlibbioc_1.12.0        

 

coverage granges • 628 views
ADD COMMENTlink modified 4.0 years ago by Hervé Pagès ♦♦ 13k • written 4.0 years ago by Anya10
Answer: Calling coverage on GRanges object yields small negative values
3
gravatar for Hervé Pagès
4.0 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Anya,

This is life with floating point arithmetic. For example:

s1 <- 0.9587693
s2 <- 10000.0
s1 + s2 - s2 - s1
# [1] -5.967449e-13

The same sequence of operations is performed when computing the weighted coverage of the following GRanges object:

library(GenomicRanges)
gr <- GRanges("chr10", IRanges(c(4, 10), c(18, 15)), score=c(s1, s2))
seqlengths(gr) <- 30
cov <- coverage(gr, weight="score")
runValue(cov$chr10)
# [1]  0.000000e+00  9.587693e-01  1.000096e+04  9.587693e-01 -5.967449e-13

The algorithm used by coverage() performed s1 + s2 - s2 - s1 (in that order) to obtain the last run value.

It's not the language fault that we don't get 0 (coverage() is implemented in C), it's just how floating point arithmetic works in general.

H.

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Hervé Pagès ♦♦ 13k

Sorry for reviving an old thread, but I am having the same issue as described in this question. What's a good way to solve this issue if your downstream analysis depends on not having negative values? cov[cov < 0] <- 0 is pretty slow. Perhaps running round(cov) with some number of digits?

ADD REPLYlink written 6 months ago by maltethodberg110
1

round(cov) with some number of digits is a good option.

FWIW cov[cov < 0] <- 0 is slow because of the way x[i] <- value is currently implemented on Rle objects. The [<- method for Rle objects needs to be optimized. However, note that in the particular case of  cov[cov < 0] <- 0, a much faster way is to operate directly on the run values of the Rle object e.g. with something like runValue(x) <- ifelse(runValue(x) < 0, 0L, runValue(x)).

Anyway, it's important to realize that for the same reason that you can get small negative values in cov, you can also get small positive values, and there is no reason to not replace them with zeroes too. Otherwise, code that relies on things like cov != 0 to find regions with coverage won't work properly. So it's probably cleaner/safer to round(cov) with some number of digits. It will also have the interesting side effect to reduce the memory footprint  of the Rle object (because some runs will be merged into a single run as a consequence of the rounding).

ADD REPLYlink modified 6 months ago • written 6 months ago by Hervé Pagès ♦♦ 13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 142 users visited in the last hour