Greetings,

I was attempting to use HOPACH to cluster the rows of a 610758 x 9 matrix of floating-point values, but the distancematrix function gave the following error:

Error in .Call("R_disscosangle", as.vector(X), as.numeric(dim(X)[1]), : negative length vectors are not allowed

I can recreate the error as follows (session info included):

```r
> library("hopach")
> test <- matrix(runif(1000*10), 1000, 10)
> my.dist <- distancematrix(test, "cosangle")  # works
> dim(my.dist)
[1] 1000 1000
> test <- matrix(runif(610758*9), 610758, 9)
> my.dist <- distancematrix(test, "cosangle")  # error message shows up immediately
Error in .Call("R_disscosangle", as.vector(X), as.numeric(dim(X)[1]),  :
  negative length vectors are not allowed
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats  graphics  grDevices  utils  datasets  methods
[8] base

other attached packages:
[1] hopach_2.26.0       Biobase_2.26.0      BiocGenerics_0.12.1
[4] cluster_2.0.1

> test <- matrix(runif(100000*10), 100000, 10)
> my.dist <- distancematrix(test, "cosangle")  # after a while we get a segfault

 *** caught segfault ***
address 0x7f60e326b000, cause 'invalid permissions'

Traceback:
 1: .Call("R_disscosangle", as.vector(X), as.numeric(dim(X)[1]),
        as.numeric(dim(X)[2]), as.logical(na.rm))
 2: disscosangle(X, na.rm)
 3: distancematrix(test, "cosangle")
```

Am I running out of memory? (https://www.google.com/webhp?q=negative+length+vectors+are+not+allowed+r)
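For what it's worth, the arithmetic alone suggests the request can never succeed: a full n x n distance matrix for n = 610758 has far more entries than the 2^31 - 1 limit of a 32-bit length, and the "negative length" message is consistent with a signed 32-bit length computation overflowing somewhere in the C code (the internals of `R_disscosangle` are an assumption on my part; the arithmetic below is not):

```r
n <- 610758

n^2                             # ~3.73e11 entries in a full n x n distance matrix
n^2 > 2^31 - 1                  # TRUE: exceeds the 32-bit vector-length limit
as.integer(n) * as.integer(n)   # NA, with an integer-overflow warning

n^2 * 8 / 2^40                  # ~2.7 TiB just to store the matrix as doubles
```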

Cheers,

Eric

Hi Katie,

Would you have any suggestions for this situation? If the Internet is correct in saying that R (even 3.1.2) cannot allocate vectors longer than 2^31 - 1 elements, is there a package somewhere that has circumvented this?

Thanks

R can allocate larger vectors (try `integer(2^31)`, for instance, if your computer has enough memory!), but packages with C code have to be written to work with long vectors, and packages developed before R supported long vectors (like hopach) are unlikely to support them.

It seems like the reasonable statistical thing to do is to pre-process your data in some way to reduce its volume, e.g., by filtering on variability, or by k-means clustering followed by use of the centroids (but these are naive suggestions; maybe Katie can provide something more substantive).

Thanks Martin,

Indeed, I've found that any function requiring a full distance matrix, like the built-in hclust(), cannot handle that many rows.

To see any pattern, I've been using k-means with a large k (e.g., 80), then running hclust() on the 80 cluster centroids, and finding an optimal ordering for the centroid tree (with the 'cba' package) to reorder my original matrix. Is this what you meant by using k-means clustering to pre-process the data?

Yes, that sounds approximately like what I was thinking.
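For readers landing here later, the workflow described in the thread (k-means, hclust on the centroids, optimal leaf ordering via cba, then reordering the original rows) can be sketched roughly as follows. This is a sketch, not Eric's actual code: it assumes the cba package is installed, uses a smaller stand-in matrix, and the linkage method is an arbitrary choice.

```r
library(cba)   # for order.optimal(); assumed to be installed

set.seed(1)
X <- matrix(runif(20000 * 9), 20000, 9)   # stand-in for the full 610758 x 9 matrix

# 1. k-means with a large k to summarize the rows
km <- kmeans(X, centers = 80, iter.max = 50)

# 2. hierarchical clustering of the 80 centroids only
d  <- dist(km$centers)
hc <- hclust(d, method = "average")

# 3. optimal leaf ordering of the centroid tree
co <- order.optimal(d, hc$merge)
hc$merge <- co$merge
hc$order <- co$order

# 4. reorder the original rows: group by cluster, clusters in leaf order
row.order <- order(match(km$cluster, hc$order))
X.ordered <- X[row.order, , drop = FALSE]
```

The reordered matrix groups rows by k-means cluster, with the clusters arranged in the optimally ordered dendrogram's leaf order, which is what makes a pattern visible in a heatmap of the full data.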