Entering edit mode
Dear BioC list,
I have some data that is in the form of a list where each element is a
Rle
object (IRanges package). These Rle's are very long and each
represents
the data from a chromosome. I want to combine them and then calculate
some
quantiles. However, what I would naively do doesn't work due to the
2^31
ceiling on the length of Rle's. I have an example with made-up data
and a
possible solution here http://rpubs.com/lcollado/10279 (Rmd file:
Is there something in Rle-world that I am missing on how to do this?
Do you
have ideas on how to do this more efficiently? After all, I am
guessing
that the 2^31 length limit on a Rle cannot be changed, even on 64-bit
environments.
Code used for the Rpubs link above:
-----------------------------------------------------
Quantiles from Rle's with long lengths
======================================
The goal is to calculate quantiles on aggregated long Rle's. The
information is split by chr, and combining it naively fails.
```{r}
suppressMessages(library("IRanges"))
## Create dummy data
lengths <- c(249250621, 243199373, 198022430, 191154276, 180915260,
171115067, 159138663, 146364022, 141213431, 135534747, 135006516,
133851895, 115169878, 107349540, 102531392, 90354753, 81195210,
78077248,
59128983, 63025520, 48129895, 51304566, 155270560, 59373566)
chrdata <- lapply(lengths, function(x) {
Rle(rep(0, x))
})
## Just to see visualize explore it (doesn't matter if the type is
numeric-Rle or integer-Rle)
head(chrdata)
## Trying to combine
unlist(RleList(chrdata), use.names=FALSE)
traceback()
## Works manually up to the first 12
tmp <- c(chrdata[[1]], chrdata[[2]], chrdata[[3]], chrdata[[4]],
chrdata[[5]], chrdata[[6]], chrdata[[7]], chrdata[[8]], chrdata[[9]],
chrdata[[10]], chrdata[[11]], chrdata[[12]])
## For the first 13 it breaks
tmp <- c(chrdata[[1]], chrdata[[2]], chrdata[[3]], chrdata[[4]],
chrdata[[5]], chrdata[[6]], chrdata[[7]], chrdata[[8]], chrdata[[9]],
chrdata[[10]], chrdata[[11]], chrdata[[12]], chrdata[[13]])
traceback()
## Note the main issue is related to the 2^31 ceiling
plot(cumsum(lengths), main="Length of Rle exceeding 2^31")
abline(v=13, col="red")
abline(h=2^31, col="orange")
```
One idea: leave the Rle-world and use some kind of weighted quantile.
```{r}
## Manually combine in two parts
part1 <- unlist(RleList(chrdata[1:12]), use.names=FALSE)
part2 <- unlist(RleList(chrdata[13:length(chrdata)]), use.names=FALSE)
parts <- list(part1, part2)
## Reduce number of runs in the Rle's by sorting (kind of doing
table())
sortedParts <- lapply(parts, function(x) sort(x))
## Leave Rle-world
info <- lapply(sortedParts, function(y) {
data.frame(value=runValue(y), length=runLength(y))
})
info <- do.call(rbind, info)
## Format
info$length <- as.numeric(info$length)
## Collapse
collapsed <- tapply(info$length, info$value, sum)
weights <- as.integer(names(collapsed))
## Calculate weighted quantile of interest
library("Hmisc")
wtd.quantile(collapsed, weights=weights, probs = 0.5)
```
Is there a way to calculate the quantiles of the aggregated Rle's
without
all this machinery? Are there other possibly more efficients ways to
do so?
```{r}
## Reproducibility
sessionInfo()
```
-------------- End of Rmd file ---------------
> sessionInfo() ## for cluster environment since there is more memory
there
R version 3.0.2 Patched (2013-10-17 r64066)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
[3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
[5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
[7] LC_PAPER=en_US.iso885915 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
attached base packages:
[1] parallel splines stats graphics grDevices datasets utils
[8] methods base
other attached packages:
[1] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4
[4] BiocGenerics_0.8.0 Hmisc_3.12-2 Formula_1.1-1
[7] survival_2.37-4
loaded via a namespace (and not attached):
[1] cluster_1.14.4 grid_3.0.2 lattice_0.20-24 rpart_4.1-3
[5] stats4_3.0.2 tools_3.0.2
> sessionInfo() ## for laptop environment, although both are 64-bit
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] splines parallel stats graphics grDevices utils
datasets
methods base
other attached packages:
[1] Hmisc_3.12-2 Formula_1.1-1 survival_2.37-4
IRanges_1.20.4
BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] cluster_1.14.4 grid_3.0.2 lattice_0.20-24 rpart_4.1-3
stats4_3.0.2
Thanks,
Leonardo
Leonardo Collado Torres, PhD student
Department of Biostatistics
Johns Hopkins University
Bloomberg School of Public Health
Website: http://www.biostat.jhsph.edu/~lcollado/<http: www.biostat.jh="" sph.edu="" ~lcollado="" #.uc10mnce5ls="">
Blog: http://lcolladotor.github.io/
[[alternative HTML version deleted]]