I am working with genomic data stored as a normalised signal in a metadata column named 'score'. The signal, being a normalised count, can either be 0 or come in increments of some minimal value (here 0.0287).
> minus.gr[seqnames(minus.gr) == "chr10"]
GRanges object with 61520 ranges and 1 metadata column:
          seqnames               ranges strand |              score
             <Rle>            <IRanges>  <Rle> |          <numeric>
      [1]    chr10           [699, 699]      - |                  0
      [2]    chr10           [761, 761]      - |                  0
      [3]    chr10         [1932, 1932]      - | 0.0286903229090105
      [4]    chr10         [1964, 1964]      - | 0.0286903229090105
      [5]    chr10         [1987, 1987]      - | 0.0573806458180211
      ...      ...                  ...    ... |                ...
  [61516]    chr10 [46588348, 46588348]      - | 0.0286903229090105
  [61517]    chr10 [46588463, 46588463]      - |  0.143451614545053
  [61518]    chr10 [46589107, 46589107]      - | 0.0286903229090105
  [61519]    chr10 [46589451, 46589451]      - | 0.0286903229090105
  [61520]    chr10 [46590345, 46590345]      - |                  0
  -------
seqinfo: 26 sequences from an unspecified genome; no seqlengths
> summary(minus.gr$score)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    0.000     0.029     0.029     0.281     0.057 11460.000
Calling coverage on this yields unexpected behaviour: some stretches on e.g. chr10 with no signal have very small negative values. Surely coverage should have a minimal value of zero? Please see the example below:
> minus.cov <- coverage(minus.gr, weight='score')
> minus.cov$chr10
numeric-Rle of length 46590345 with 97680 runs
  Lengths:                 1931                     1 ...                   894
  Values :                    0    0.0286903229090105 ... -2.08721928629529e-14
> head(table(runValue(minus.cov$chr10)))
-3.68594044175552e-14 -3.22242232897452e-14 -3.22172843958413e-14 -3.21964677141295e-14 -3.21409565628983e-14 -3.21132009872827e-14
                    1                     1                     1                    18                     2                     3
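These residues are all on the scale of double-precision rounding error, so I assume the weights are being added and subtracted internally as doubles; for comparison, plain floating-point cancellation in R leaves the same kind of tiny non-zero leftovers where an exact 0 is expected (toy example unrelated to my data):

> (0.1 + 0.2) - 0.3
[1] 5.551115e-17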
Could anyone provide an explanation as to what is happening here?
Cheers,
Anya
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.4 (Mavericks)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] CAGEr_1.10.3         BSgenome_1.34.1      rtracklayer_1.26.3   Biostrings_2.34.1    XVector_0.6.0        GenomicRanges_1.18.4 GenomeInfoDb_1.2.5
 [8] IRanges_2.0.1        S4Vectors_0.4.0      BiocGenerics_0.12.1  BiocInstaller_1.16.2

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.6           BBmisc_1.9              beanplot_1.2            BiocParallel_1.0.3      bitops_1.0-6
 [7] brew_1.0-6              checkmate_1.5.2         chron_2.3-45            codetools_0.2-11        data.table_1.9.4        DBI_0.3.1
[13] digest_0.6.8            fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.2 iterators_1.0.7         plyr_1.8.1
[19] Rcpp_0.11.5             RCurl_1.95-4.5          reshape2_1.4.1          Rsamtools_1.18.3        RSQLite_1.0.0           sendmailR_1.2-1
[25] som_0.3-5               splines_3.1.3           stringr_0.6.2           tools_3.1.3             VGAM_0.9-7              XML_3.98-1.1
[31] zlibbioc_1.12.0
Sorry for reviving an old thread, but I am having the same issue as described in this question. What's a good way to solve it if your downstream analysis depends on not having negative values? cov[cov < 0] <- 0 is pretty slow. Perhaps running round(cov) with some number of digits?

round(cov) with some number of digits is a good option. FWIW, cov[cov < 0] <- 0 is slow because of the way x[i] <- value is currently implemented on Rle objects. The [<- method for Rle objects needs to be optimized. However, note that in the particular case of cov[cov < 0] <- 0, a much faster way is to operate directly on the run values of the Rle object, e.g. with something like runValue(x) <- ifelse(runValue(x) < 0, 0L, runValue(x)).
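Since the coverage object is an RleList with one Rle per chromosome, here is a minimal sketch of how that per-run clamping could be applied to every chromosome (the toy GRanges below is invented for illustration; endoapply() simply applies the function per chromosome and keeps the result an RleList):

library(GenomicRanges)

## Toy single-base GRanges with a floating-point 'score' weight,
## standing in for the real data (values invented for illustration).
gr <- GRanges("chr10",
              IRanges(c(100L, 150L, 150L), width = 1L),
              score = c(0.0286903229090105, 0.0286903229090105, 0.0573806458180211))
cov <- coverage(gr, weight = "score")   # RleList with one numeric-Rle per seqlevel

## Clamp negative run values to zero on every chromosome, touching only
## the run values instead of going through the slow `[<-` on Rle.
cov <- endoapply(cov, function(x) {
    runValue(x) <- ifelse(runValue(x) < 0, 0, runValue(x))
    x
})

On the real object this visits only the ~97680 run values on chr10 rather than all 46590345 positions, which is why it is fast.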
Anyway, it's important to realize that, for the same reason that you can get small negative values in cov, you can also get small positive values, and there is no reason not to replace those with zeroes too. Otherwise, code that relies on things like cov != 0 to find regions with coverage won't work properly. So it's probably cleaner/safer to round(cov) with some number of digits. This will also have the interesting side effect of reducing the memory footprint of the Rle object, because some runs will be merged into a single run as a consequence of the rounding.
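Applied to the coverage object from the original question, a quick sketch of what that could look like (6 digits is an arbitrary pick, comfortably above the ~1e-14 residues and below the 0.0287 score resolution; round() dispatches on Rle/RleList objects):

minus.cov.rounded <- round(minus.cov, digits = 6)

nrun(minus.cov$chr10)                        # 97680 runs before rounding
nrun(minus.cov.rounded$chr10)                # typically fewer afterwards: equal neighbouring runs merge
any(runValue(minus.cov.rounded$chr10) < 0)   # FALSE once the residues have been rounded away
object.size(minus.cov.rounded$chr10)         # smaller than object.size(minus.cov$chr10) when runs merge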