[R] tapply for enormous (>2^31 row) matrices
1
0
Entering edit mode
@matthew-keller-2483
Last seen 10.2 years ago
Thank you all very much for your help (on both the r-help and the bioconductor listserves). Benilton - I couldn't get sqldf to install on the server I'm using (error is: Error : package 'gsubfn' does not have a name space). I think this was a problem for R 2.13, and I'm trying to get the admin's to install a more up-to-date version. I know that I need to probably learn a modicum of SQL given the sizes of datasets I'm using now. I ended up using a modified version of Hervé Pagès' excellent code (thank you!). I got a huge (40-fold) speed bump by using the data.table package for indexing/aggregate steps, making an hours long job a minutes long job. SO - read.table is hugely useful if you're dealing with indexing/apply-family functions on huge datasets. By the way, I'm not sure why, but read.table was a bit faster than scan for this problem... Here is the code for others: require(data.table) computeAllPairSums <- function(filename, nbindiv,nrows.to.read) { con <- file(filename, open="r") on.exit(close(con)) ans <- matrix(numeric(nbindiv * nbindiv), nrow=nbindiv) chunk <- 0L while (TRUE) { #read.table faster than scan df0 <- read.table(con,col.names=c("ID1", "ID2", "ignored", "sharing"), colClasses=c("integer", "integer", "NULL", "numeric"),nrows=nrows.to.read,comment.char="") DT <- data.table(df0) setkey(DT,ID1,ID2) ss <- DT[,sum(sharing),by="ID1,ID2"] if (nrow(df0) == 0L) break chunk <- chunk + 1L cat("Processing chunk", chunk, "... ") idd <- as.matrix(subset(ss,select=1:2)) newvec <- as.vector(as.matrix(subset(ss,select=3))) ans[idd] <- ans[idd] + newvec cat("OK\n") } ans } On Wed, Feb 22, 2012 at 3:20 PM, ilai <keren at="" math.montana.edu=""> wrote: > On Tue, Feb 21, 2012 at 4:04 PM, Matthew Keller <mckellercran at="" gmail.com=""> wrote: > >> X <- read.big.matrix("file.loc.X",sep=" ",type="double") >> hap.indices <- bigsplit(X,1:2) #this runs for too long to be useful on >> these matrices >> #I was then going to use foreach loop to sum across the splits >> identified by bigsplit > > How about just using foreach earlier in the process ? e.g. split > file.loc.X to (80) sub files and then run > read.big.matrix/bigsplit/sum inside %dopar% > > If splitting X beforehand is a problem, you could also use ?scan to > read in different chunks of the file, something like (untested > obviously): > # for X a matrix 800x4 > lineind<- seq(1,800,100) ?# create an index vec for the lines to read > ReducedX<- foreach(i = 1:8) %dopar%{ > ?x <- scan('file.loc.X',list(double(0),double(0),double(0),double(0) ),skip=lineind[i],nlines=100) > ... do your thing on x (aggregate/tapply etc.) > ?} > > Hope this helped > Elai. > > > >> >> SO - does anyone have ideas on how to deal with this problem - i.e., >> how to use a tapply() like function on an enormous matrix? This isn't >> necessarily a bigtabulate question (although if I screwed up using >> bigsplit, let me know). If another package (e.g., an SQL package) can >> do something like this efficiently, I'd like to hear about it and your >> experiences using it. >> >> Thank you in advance, >> >> Matt >> >> >> >> -- >> Matthew C Keller >> Asst. Professor of Psychology >> University of Colorado at Boulder >> www.matthewckeller.com >> >> ______________________________________________ >> R-help at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting- guide.html >> and provide commented, minimal, self-contained, reproducible code. -- Matthew C Keller Asst. Professor of Psychology University of Colorado at Boulder www.matthewckeller.com
PROcess PROcess • 1.0k views
ADD COMMENT
0
Entering edit mode
@gabor-grothendieck-2332
Last seen 10.2 years ago
On Thu, Feb 23, 2012 at 11:39 AM, Matthew Keller <mckellercran at="" gmail.com=""> wrote: > Thank you all very much for your help (on both the r-help and the > bioconductor listserves). > > Benilton - I couldn't get sqldf to install on the server I'm using > (error is: Error : package 'gsubfn' does not have a name space). I > think this was a problem for R 2.13, and I'm trying to get the admin's > to install a more up-to-date version. I know that I need to probably > learn a modicum of SQL given the sizes of datasets I'm using now. Right. See the troubleshooting section of the sqldf home page: http://code.google.com/p/sqldf/#Troubleshooting -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
ADD COMMENT

Login before adding your answer.

Traffic: 545 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6