tapply for enormous (>2^31 row) matrices
@matthew-keller-2483
Last seen 9.6 years ago
Hello all,

I just sent this to the main R forum, but realized this audience might have more familiarity with this type of problem...

---------- Forwarded message ----------
From: Matthew Keller <mckellercran@gmail.com>
Date: Tue, Feb 21, 2012 at 4:04 PM
Subject: tapply for enormous (>2^31 row) matrices
To: r-help <r-help@r-project.org>

Hi all,

SETUP:
I have pairwise data on 22 chromosomes. The data matrix X for a given chromosome looks like this:

     1   13  58  1.12
     6  142  56  1.11
    18  307  64  3.13
    22  320  58  0.72

Column 1 is the ID of person 1, column 2 is the ID of person 2, column 3 can be ignored, and column 4 is how much chromosomal sharing those two individuals have in some small portion of the chromosome. There are 9000 individuals, and therefore ~(9000^2)/2 pairwise matches at each small location on the chromosome, so across an entire chromosome these matrices are VERY large (e.g., 3 billion rows, which is more than the 2^31 vector size limitation in R). I have access to a server with 64-bit R, 1 TB of RAM and 80 processors.

PROBLEM:
A pair of individuals (e.g., persons 1 and 13 from the first row above) will show up multiple times in a given file. I want to sum column 4 over each pair of individuals. If I could bring the matrix into R, I could use tapply() to accomplish this by indexing on paste(X[,1], X[,2]), but the matrix doesn't fit into R. I have been trying the bigmemory and bigtabulate packages, but when I try to use the bigsplit function, R never completes the operation (after a day, I killed the process). In particular, I did this:

    X <- read.big.matrix("file.loc.X", sep=" ", type="double")
    hap.indices <- bigsplit(X, 1:2)  # this runs for too long to be useful on these matrices
    # I was then going to use a foreach loop to sum across the splits identified by bigsplit

SO - does anyone have ideas on how to deal with this problem, i.e., how to use a tapply()-like function on an enormous matrix? This isn't necessarily a bigtabulate question (although if I screwed up using bigsplit, let me know). If another package (e.g., an SQL package) can do something like this efficiently, I'd like to hear about it and your experiences using it.

Thank you in advance,

Matt

--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com
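For concreteness, here is a minimal sketch of the in-memory computation on a toy subset that does fit into RAM (the column names ID1, ID2 and sharing are placeholders, not from the original file); this is the operation that needs to scale to ~3 billion rows:

    ## Toy subset with the same shape as the real data, small enough for memory.
    X <- data.frame(ID1     = c(1L, 6L, 18L, 1L),
                    ID2     = c(13L, 142L, 307L, 13L),
                    ignored = c(58L, 56L, 64L, 58L),
                    sharing = c(1.12, 1.11, 3.13, 0.98))

    ## tapply() version: index on the pasted pair of IDs.
    pair <- paste(X$ID1, X$ID2)
    tapply(X$sharing, pair, sum)

    ## Equivalent aggregate() version, which keeps the IDs as columns.
    aggregate(sharing ~ ID1 + ID2, data = X, FUN = sum)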
@steve-lianoglou-2771
Last seen 13 months ago
United States
Hi,

If you're determined to do this in R, I'd split your file into a few smaller ones (you can even use the *nix `split` command), do your group-by-and-summarize on the smaller files in different R processes, then summarize your summaries (sounds like a job for hadoop, no?).

For your `tapply` functionality, I'd look to the data.table package -- it has super-fast group-by mojo, and tries to be as memory efficient as possible. Assuming you can get your subset of the data into a data.frame `df` and that your column names are something like c("ID1", "ID2", "XX", "score"), you'd then:

    R> library(data.table)
    R> df <- as.data.table(df)       ## makes a copy
    R> setkeyv(df, c("ID1", "ID2"))  ## no copy
    R> ans <- df[, list(shared=sum(score)), by=key(df)]

Summarizing the results from the separate processes will be trivial. Loading your data into a data.frame to start with, however, will likely take painfully long.

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
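The "summarize your summaries" step might look something like the sketch below (the chunk_files names and the shared column are assumptions: each file is assumed to hold one process's per-pair sums as ID1/ID2/shared). Summing the partial sums gives the same answer as summing the raw rows, since sum() is associative.

    library(data.table)

    ## Hypothetical per-chunk result files, each written by a separate R process.
    chunk_files <- sprintf("chunk_%02d_sums.txt", 1:10)

    ## Read the (much smaller) per-chunk summaries and stack them.
    partial <- rbindlist(lapply(chunk_files, fread))

    ## Second-pass group-by over the partial sums.
    setkeyv(partial, c("ID1", "ID2"))
    ans <- partial[, list(shared = sum(shared)), by = key(partial)]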
One alternative to Steve's suggestion is to dump the file into an SQL database (say, via RSQLite) and summarize through that. Below is the naive solution (with a comparison against an "all in R" alternative); the extension to parallelising the task is left as an exercise.

--benilton

    set.seed(1)
    n <- 1e6
    tmp <- data.frame(V1=sample(1000, n, rep=TRUE),
                      V2=sample(1000, n, rep=TRUE),
                      V3=sample(1000, n, rep=TRUE),
                      V4=runif(n))

    ## Reference answer computed entirely in R.
    ref <- with(tmp, aggregate(list(SV4=V4), by=list(V1=V1, V2=V2), sum))

    write.table(tmp, file='tmp.loc', sep=' ', col.names=TRUE, row.names=FALSE)
    rm(tmp)
    gc()

    library(sqldf)
    cat(file='tmploc.db')  # create an empty file to hold the SQLite db
    conn <- read.csv.sql('tmp.loc', sep=' ', dbname='tmploc.db',
                         sql="CREATE TABLE main AS SELECT * FROM file")

    sqldf('SELECT * FROM main LIMIT 10', dbname='tmploc.db')

    test <- sqldf('SELECT V1, V2, sum(V4) as SV4 FROM main GROUP BY V2, V1',
                  dbname='tmploc.db')

    all.equal(ref, test)
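For the real file, which is too large to read in one go, roughly the same dump-into-SQLite idea could be sketched with DBI/RSQLite directly, loading the table by chunks and letting SQLite do the group-by on disk (the database name, column names and chunk size below are assumptions):

    library(DBI)
    library(RSQLite)

    con  <- dbConnect(RSQLite::SQLite(), "pairs.db")  # on-disk database
    fcon <- file("file.loc.X", open = "r")            # big whitespace-delimited file

    chunk_size <- 5e6
    repeat {
      chunk <- tryCatch(
        read.table(fcon, nrows = chunk_size,
                   col.names  = c("ID1", "ID2", "V3", "sharing"),
                   colClasses = c("integer", "integer", "integer", "numeric")),
        error = function(e) NULL)                      # read.table errors once the file is exhausted
      if (is.null(chunk) || nrow(chunk) == 0L) break
      dbWriteTable(con, "pairs", chunk, append = TRUE) # creates the table on the first call
    }
    close(fcon)

    ## One pass over the on-disk table; the group-by happens outside R's memory.
    res <- dbGetQuery(con,
      "SELECT ID1, ID2, SUM(sharing) AS shared FROM pairs GROUP BY ID1, ID2")
    dbDisconnect(con)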
@herve-pages-1542
Last seen 23 hours ago
Seattle, WA, United States
Hi Matthew,

You don't need a supercomputer for that. If you know in advance the number of individuals, you can preallocate the final result. Use a square numeric matrix for the final result, where the rows and columns correspond to the individuals. Conceptually the matrix is symmetric, so you only need to fill half of it. Then just read your big file by chunks and skip the columns you don't need.

    computeAllPairSums <- function(filename, nbindiv)
    {
        con <- file(filename, open="r")
        on.exit(close(con))
        ans <- matrix(numeric(nbindiv * nbindiv), nrow=nbindiv)
        CHUNK_SIZE <- 5000000L
        chunk <- 0L
        while (TRUE) {
            df0 <- read.table(con,
                              col.names=c("ID1", "ID2", "ignored", "sharing"),
                              colClasses=c("integer", "integer", "NULL", "numeric"),
                              nrows=CHUNK_SIZE)
            if (nrow(df0) == 0L)
                break
            chunk <- chunk + 1L
            cat("Processing chunk", chunk, "... ")
            ## Swap ids just in case but maybe you can assume
            ## ID1 is always <= ID2 in your file.
            swap_idx <- df0[["ID2"]] < df0[["ID1"]]
            tmp <- df0[["ID1"]][swap_idx]
            df0[["ID1"]][swap_idx] <- df0[["ID2"]][swap_idx]
            df0[["ID2"]][swap_idx] <- tmp
            df <- aggregate(df0[["sharing"]], df0[c("ID1", "ID2")], sum)
            idx <- as.matrix(df[1:2])
            ans[idx] <- ans[idx] + df[[3L]]
            cat("OK\n")
        }
        ans
    }

It takes 4 min 11 s and 1.5 GB of RAM on my laptop to process an 80-million-line file (with nbindiv=9000L). Having the result in this square matrix might or might not be what you wanted, but note that it allows fast lookup of any given pair (or set of pairs), which is not the case with a 3-column data.frame.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
E-mail: hpages at fhcrc.org
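For illustration, a hypothetical call to the function above, and one way to get the pair sums back out of the square matrix (the file name is a placeholder, and IDs are assumed to be integers in 1..9000):

    ## Assumes IDs are integers in 1..9000, as in computeAllPairSums() above.
    res <- computeAllPairSums("file.loc.X", nbindiv=9000L)

    ## Sharing summed over all occurrences of the pair (1, 13).
    ## After the swap step, the filled half is the one with row index <= column index.
    res[1, 13]

    ## If a 3-column (ID1, ID2, shared) data.frame is preferred:
    ## (pairs whose sum is exactly zero are dropped here)
    nz <- which(res != 0, arr.ind=TRUE)
    pair_sums <- data.frame(ID1=nz[, "row"], ID2=nz[, "col"], shared=res[nz])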