Calculating quantiles from very long Rle's that need to be aggregated

0

Entering edit mode

Leonardo Collado Torres ★ 1.1k

@lcolladotor

Last seen 10 days ago

United States

Dear BioC list, I have some data that is in the form of a list where each element is a Rle object (IRanges package). These Rle's are very long and each represents the data from a chromosome. I want to combine them and then calculate some quantiles. However, what I would naively do doesn't work due to the 2^31 ceiling on the length of Rle's. I have an example with made-up data and a possible solution here http://rpubs.com/lcollado/10279 (Rmd file: Is there something in Rle-world that I am missing on how to do this? Do you have ideas on how to do this more efficiently? After all, I am guessing that the 2^31 length limit on a Rle cannot be changed, even on 64-bit environments. Code used for the Rpubs link above: ----------------------------------------------------- Quantiles from Rle's with long lengths ====================================== The goal is to calculate quantiles on aggregated long Rle's. The information is split by chr, and combining it naively fails. ```{r} suppressMessages(library("IRanges")) ## Create dummy data lengths <- c(249250621, 243199373, 198022430, 191154276, 180915260, 171115067, 159138663, 146364022, 141213431, 135534747, 135006516, 133851895, 115169878, 107349540, 102531392, 90354753, 81195210, 78077248, 59128983, 63025520, 48129895, 51304566, 155270560, 59373566) chrdata <- lapply(lengths, function(x) { Rle(rep(0, x)) }) ## Just to see visualize explore it (doesn't matter if the type is numeric-Rle or integer-Rle) head(chrdata) ## Trying to combine unlist(RleList(chrdata), use.names=FALSE) traceback() ## Works manually up to the first 12 tmp <- c(chrdata[[1]], chrdata[[2]], chrdata[[3]], chrdata[[4]], chrdata[[5]], chrdata[[6]], chrdata[[7]], chrdata[[8]], chrdata[[9]], chrdata[[10]], chrdata[[11]], chrdata[[12]]) ## For the first 13 it breaks tmp <- c(chrdata[[1]], chrdata[[2]], chrdata[[3]], chrdata[[4]], chrdata[[5]], chrdata[[6]], chrdata[[7]], chrdata[[8]], chrdata[[9]], chrdata[[10]], chrdata[[11]], chrdata[[12]], chrdata[[13]]) traceback() ## Note the main issue is related to the 2^31 ceiling plot(cumsum(lengths), main="Length of Rle exceeding 2^31") abline(v=13, col="red") abline(h=2^31, col="orange") ``` One idea: leave the Rle-world and use some kind of weighted quantile. ```{r} ## Manually combine in two parts part1 <- unlist(RleList(chrdata[1:12]), use.names=FALSE) part2 <- unlist(RleList(chrdata[13:length(chrdata)]), use.names=FALSE) parts <- list(part1, part2) ## Reduce number of runs in the Rle's by sorting (kind of doing table()) sortedParts <- lapply(parts, function(x) sort(x)) ## Leave Rle-world info <- lapply(sortedParts, function(y) { data.frame(value=runValue(y), length=runLength(y)) }) info <- do.call(rbind, info) ## Format info$length <- as.numeric(info$length) ## Collapse collapsed <- tapply(info$length, info$value, sum) weights <- as.integer(names(collapsed)) ## Calculate weighted quantile of interest library("Hmisc") wtd.quantile(collapsed, weights=weights, probs = 0.5) ``` Is there a way to calculate the quantiles of the aggregated Rle's without all this machinery? Are there other possibly more efficients ways to do so? ```{r} ## Reproducibility sessionInfo() ``` -------------- End of Rmd file --------------- > sessionInfo() ## for cluster environment since there is more memory there R version 3.0.2 Patched (2013-10-17 r64066) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 [7] LC_PAPER=en_US.iso885915 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C attached base packages: [1] parallel splines stats graphics grDevices datasets utils [8] methods base other attached packages: [1] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4 [4] BiocGenerics_0.8.0 Hmisc_3.12-2 Formula_1.1-1 [7] survival_2.37-4 loaded via a namespace (and not attached): [1] cluster_1.14.4 grid_3.0.2 lattice_0.20-24 rpart_4.1-3 [5] stats4_3.0.2 tools_3.0.2 > sessionInfo() ## for laptop environment, although both are 64-bit R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] splines parallel stats graphics grDevices utils datasets methods base other attached packages: [1] Hmisc_3.12-2 Formula_1.1-1 survival_2.37-4 IRanges_1.20.4 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] cluster_1.14.4 grid_3.0.2 lattice_0.20-24 rpart_4.1-3 stats4_3.0.2 Thanks, Leonardo Leonardo Collado Torres, PhD student Department of Biostatistics Johns Hopkins University Bloomberg School of Public Health Website: http://www.biostat.jhsph.edu/~lcollado/<http: www.biostat.jh="" sph.edu="" ~lcollado="" #.uc10mnce5ls=""> Blog: http://lcolladotor.github.io/ [[alternative HTML version deleted]]

• 1.6k views

ADD COMMENT • link updated 11.2 years ago by Michael Lawrence ★ 11k • written 11.2 years ago by Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.1 years ago

United States

We're aiming to eventually address these long lengths, but for now, why not just use wtd.quantile() with the run values and lengths concatenated across the Rles. That is, concatenate the run values and lengths separately instead of the Rles. Michael On Mon, Nov 4, 2013 at 3:17 PM, Leonardo Collado Torres <lcollado@jhsph.edu>wrote: > Dear BioC list, > > I have some data that is in the form of a list where each element is a Rle > object (IRanges package). These Rle's are very long and each represents > the data from a chromosome. I want to combine them and then calculate some > quantiles. However, what I would naively do doesn't work due to the 2^31 > ceiling on the length of Rle's. I have an example with made-up data and a > possible solution here http://rpubs.com/lcollado/10279 (Rmd file: > > > Is there something in Rle-world that I am missing on how to do this? Do you > have ideas on how to do this more efficiently? After all, I am guessing > that the 2^31 length limit on a Rle cannot be changed, even on 64-bit > environments. > > > Code used for the Rpubs link above: > ----------------------------------------------------- > > > > Quantiles from Rle's with long lengths > ====================================== > > > > The goal is to calculate quantiles on aggregated long Rle's. The > information is split by chr, and combining it naively fails. > > ```{r} > suppressMessages(library("IRanges")) > > ## Create dummy data > lengths <- c(249250621, 243199373, 198022430, 191154276, 180915260, > 171115067, 159138663, 146364022, 141213431, 135534747, 135006516, > 133851895, 115169878, 107349540, 102531392, 90354753, 81195210, 78077248, > 59128983, 63025520, 48129895, 51304566, 155270560, 59373566) > chrdata <- lapply(lengths, function(x) { > Rle(rep(0, x)) > }) > > ## Just to see visualize explore it (doesn't matter if the type is > numeric-Rle or integer-Rle) > head(chrdata) > > ## Trying to combine > unlist(RleList(chrdata), use.names=FALSE) > traceback() > > ## Works manually up to the first 12 > tmp <- c(chrdata[[1]], chrdata[[2]], chrdata[[3]], chrdata[[4]], > chrdata[[5]], chrdata[[6]], chrdata[[7]], chrdata[[8]], chrdata[[9]], > chrdata[[10]], chrdata[[11]], chrdata[[12]]) > > ## For the first 13 it breaks > tmp <- c(chrdata[[1]], chrdata[[2]], chrdata[[3]], chrdata[[4]], > chrdata[[5]], chrdata[[6]], chrdata[[7]], chrdata[[8]], chrdata[[9]], > chrdata[[10]], chrdata[[11]], chrdata[[12]], chrdata[[13]]) > traceback() > > ## Note the main issue is related to the 2^31 ceiling > plot(cumsum(lengths), main="Length of Rle exceeding 2^31") > abline(v=13, col="red") > abline(h=2^31, col="orange") > ``` > > One idea: leave the Rle-world and use some kind of weighted quantile. > > ```{r} > ## Manually combine in two parts > part1 <- unlist(RleList(chrdata[1:12]), use.names=FALSE) > part2 <- unlist(RleList(chrdata[13:length(chrdata)]), use.names=FALSE) > parts <- list(part1, part2) > > ## Reduce number of runs in the Rle's by sorting (kind of doing table()) > sortedParts <- lapply(parts, function(x) sort(x)) > > ## Leave Rle-world > info <- lapply(sortedParts, function(y) { > data.frame(value=runValue(y), length=runLength(y)) > }) > info <- do.call(rbind, info) > > ## Format > info$length <- as.numeric(info$length) > > ## Collapse > collapsed <- tapply(info$length, info$value, sum) > weights <- as.integer(names(collapsed)) > > ## Calculate weighted quantile of interest > library("Hmisc") > wtd.quantile(collapsed, weights=weights, probs = 0.5) > ``` > > Is there a way to calculate the quantiles of the aggregated Rle's without > all this machinery? Are there other possibly more efficients ways to do so? > > ```{r} > ## Reproducibility > sessionInfo() > ``` > > > -------------- End of Rmd file --------------- > > > sessionInfo() ## for cluster environment since there is more memory there > R version 3.0.2 Patched (2013-10-17 r64066) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > [7] LC_PAPER=en_US.iso885915 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel splines stats graphics grDevices datasets utils > [8] methods base > > other attached packages: > [1] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4 > [4] BiocGenerics_0.8.0 Hmisc_3.12-2 Formula_1.1-1 > [7] survival_2.37-4 > > loaded via a namespace (and not attached): > [1] cluster_1.14.4 grid_3.0.2 lattice_0.20-24 rpart_4.1-3 > [5] stats4_3.0.2 tools_3.0.2 > > > sessionInfo() ## for laptop environment, although both are 64-bit > R version 3.0.2 (2013-09-25) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] splines parallel stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] Hmisc_3.12-2 Formula_1.1-1 survival_2.37-4 IRanges_1.20.4 > BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] cluster_1.14.4 grid_3.0.2 lattice_0.20-24 rpart_4.1-3 > stats4_3.0.2 > > > Thanks, > Leonardo > > Leonardo Collado Torres, PhD student > Department of Biostatistics > Johns Hopkins University > Bloomberg School of Public Health > Website: http://www.biostat.jhsph.edu/~lcollado/< > http://www.biostat.jhsph.edu/~lcollado/#.UC10MNCe5Ls> > Blog: http://lcolladotor.github.io/ > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 11.2 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thank you Michael, your idea works very well too. I just sorted the Rle's before following your tip to reduce the number of runs per Rle and thus reduce the length of the regular vectors that I feed into wtd.quantile(). Depending on how much memory you have, this might not be needed. On Mon, Nov 4, 2013 at 7:17 PM, Michael Lawrence <lawrence.michael@gene.com>wrote: > We're aiming to eventually address these long lengths, but for now, why > not just use wtd.quantile() with the run values and lengths concatenated > across the Rles. That is, concatenate the run values and lengths separately > instead of the Rles. > > Michael > > > > On Mon, Nov 4, 2013 at 3:17 PM, Leonardo Collado Torres < > lcollado@jhsph.edu> wrote: > >> Dear BioC list, >> >> I have some data that is in the form of a list where each element is a Rle >> object (IRanges package). These Rle's are very long and each represents >> the data from a chromosome. I want to combine them and then calculate some >> quantiles. However, what I would naively do doesn't work due to the 2^31 >> ceiling on the length of Rle's. I have an example with made-up data and a >> possible solution here http://rpubs.com/lcollado/10279 (Rmd file: >> >> >> Is there something in Rle-world that I am missing on how to do this? Do >> you >> have ideas on how to do this more efficiently? After all, I am guessing >> that the 2^31 length limit on a Rle cannot be changed, even on 64-bit >> environments. >> >> >> Code used for the Rpubs link above: >> ----------------------------------------------------- >> >> >> >> Quantiles from Rle's with long lengths >> ====================================== >> >> >> >> The goal is to calculate quantiles on aggregated long Rle's. The >> information is split by chr, and combining it naively fails. >> >> ```{r} >> suppressMessages(library("IRanges")) >> >> ## Create dummy data >> lengths <- c(249250621, 243199373, 198022430, 191154276, 180915260, >> 171115067, 159138663, 146364022, 141213431, 135534747, 135006516, >> 133851895, 115169878, 107349540, 102531392, 90354753, 81195210, 78077248, >> 59128983, 63025520, 48129895, 51304566, 155270560, 59373566) >> chrdata <- lapply(lengths, function(x) { >> Rle(rep(0, x)) >> }) >> >> ## Just to see visualize explore it (doesn't matter if the type is >> numeric-Rle or integer-Rle) >> head(chrdata) >> >> ## Trying to combine >> unlist(RleList(chrdata), use.names=FALSE) >> traceback() >> >> ## Works manually up to the first 12 >> tmp <- c(chrdata[[1]], chrdata[[2]], chrdata[[3]], chrdata[[4]], >> chrdata[[5]], chrdata[[6]], chrdata[[7]], chrdata[[8]], chrdata[[9]], >> chrdata[[10]], chrdata[[11]], chrdata[[12]]) >> >> ## For the first 13 it breaks >> tmp <- c(chrdata[[1]], chrdata[[2]], chrdata[[3]], chrdata[[4]], >> chrdata[[5]], chrdata[[6]], chrdata[[7]], chrdata[[8]], chrdata[[9]], >> chrdata[[10]], chrdata[[11]], chrdata[[12]], chrdata[[13]]) >> traceback() >> >> ## Note the main issue is related to the 2^31 ceiling >> plot(cumsum(lengths), main="Length of Rle exceeding 2^31") >> abline(v=13, col="red") >> abline(h=2^31, col="orange") >> ``` >> >> One idea: leave the Rle-world and use some kind of weighted quantile. >> >> ```{r} >> ## Manually combine in two parts >> part1 <- unlist(RleList(chrdata[1:12]), use.names=FALSE) >> part2 <- unlist(RleList(chrdata[13:length(chrdata)]), use.names=FALSE) >> parts <- list(part1, part2) >> >> ## Reduce number of runs in the Rle's by sorting (kind of doing table()) >> sortedParts <- lapply(parts, function(x) sort(x)) >> >> ## Leave Rle-world >> info <- lapply(sortedParts, function(y) { >> data.frame(value=runValue(y), length=runLength(y)) >> }) >> info <- do.call(rbind, info) >> >> ## Format >> info$length <- as.numeric(info$length) >> >> ## Collapse >> collapsed <- tapply(info$length, info$value, sum) >> weights <- as.integer(names(collapsed)) >> >> ## Calculate weighted quantile of interest >> library("Hmisc") >> wtd.quantile(collapsed, weights=weights, probs = 0.5) >> ``` >> >> Is there a way to calculate the quantiles of the aggregated Rle's without >> all this machinery? Are there other possibly more efficients ways to do >> so? >> >> ```{r} >> ## Reproducibility >> sessionInfo() >> ``` >> >> >> -------------- End of Rmd file --------------- >> >> > sessionInfo() ## for cluster environment since there is more memory >> there >> R version 3.0.2 Patched (2013-10-17 r64066) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >> [7] LC_PAPER=en_US.iso885915 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel splines stats graphics grDevices datasets utils >> [8] methods base >> >> other attached packages: >> [1] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4 >> [4] BiocGenerics_0.8.0 Hmisc_3.12-2 Formula_1.1-1 >> [7] survival_2.37-4 >> >> loaded via a namespace (and not attached): >> [1] cluster_1.14.4 grid_3.0.2 lattice_0.20-24 rpart_4.1-3 >> [5] stats4_3.0.2 tools_3.0.2 >> >> > sessionInfo() ## for laptop environment, although both are 64-bit >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] splines parallel stats graphics grDevices utils datasets >> methods base >> >> other attached packages: >> [1] Hmisc_3.12-2 Formula_1.1-1 survival_2.37-4 >> IRanges_1.20.4 >> BiocGenerics_0.8.0 >> >> loaded via a namespace (and not attached): >> [1] cluster_1.14.4 grid_3.0.2 lattice_0.20-24 rpart_4.1-3 >> stats4_3.0.2 >> >> >> Thanks, >> Leonardo >> >> Leonardo Collado Torres, PhD student >> Department of Biostatistics >> Johns Hopkins University >> Bloomberg School of Public Health >> Website: http://www.biostat.jhsph.edu/~lcollado/< >> http://www.biostat.jhsph.edu/~lcollado/#.UC10MNCe5Ls> >> Blog: http://lcolladotor.github.io/ >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]]

ADD REPLY • link 11.2 years ago Leonardo Collado Torres ★ 1.1k

Login before adding your answer.