Correlation works, but dist() runs out of memory
Daniel Brewer ★ 1.9k
@daniel-brewer-1791
I am attempting to plot a hierarchical clustering dendrogram of a modestly sized gene expression matrix of 22011 x 16.

If I use a correlation measure it works fine:

    c2 <- cor(ExonExpr)
    d2 <- as.dist(1-c2)
    hier2 <- hclust(d2,method="average")

If I try to create a Euclidean distance object with dist(), it fails with a memory error:

    Error in vector("double", length) : vector size specified is too large

This seems strange as I have 3 GB of RAM, which I would think is plenty. Any ideas what is going wrong, or how to get round this?

Thanks

Dan

PS Running R 2.4.1, Bioconductor 0.9 on SUSE 10.2 Linux.
@sean-davis-490
On Tuesday 13 March 2007 11:34, Daniel Brewer wrote:
> I am attempting to do plot a hierarchical clustering dendogram of a
> reasonable modestly sized gene expression matrix of 22011 x 16.
>
> If I choose to use a correlation measure it works fine (
> c2 <- cor(ExonExpr)
> d2 <- as.dist(1-c2)
> hier2 <- hclust(d2,method="average")
> ). If I try to create a Euclidean distance object it crashes out with a
> memory error (
>
> > Error in vector("double", length) : vector size specified is too large
>
> ).
>
> This seems strange as I have 3GB ram, which I would think is plenty. Any
> ideas what is going wrong or how to get round this.

Hi, Dan. You probably want to do the dist() on the transposed matrix.

    > a <- matrix(rnorm(20000),nc=10) # a 2000 x 10 matrix
    > b <- dist(a)
    > dim(as.matrix(b))
    [1] 2000 2000
    > d <- cor(a)
    > dim(d)
    [1] 10 10

Note the difference in sizes of the matrices.

Sean
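Applied to a matrix of the size in the question, the suggestion might look like the sketch below. The matrix `expr` is simulated and only stands in for the poster's ExonExpr, and the names `d_samples` and `hier_euclid` are made up for illustration.

```r
# Simulated stand-in for ExonExpr: 22011 genes (rows) x 16 samples (columns).
set.seed(1)
expr <- matrix(rnorm(22011 * 16), nrow = 22011, ncol = 16,
               dimnames = list(NULL, paste0("sample", 1:16)))

# dist() measures distances between rows, so transpose to put the
# samples in rows before computing Euclidean distances.
d_samples <- dist(t(expr), method = "euclidean")

# 16 samples give choose(16, 2) = 120 pairwise distances.
length(d_samples)

# Same linkage as in the correlation-based call from the question.
hier_euclid <- hclust(d_samples, method = "average")
```

Calling plot(hier_euclid) afterwards should draw the sample dendrogram, just as it would for the correlation-based hier2.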
@wolfgang-huber-3550
Dear Daniel,

Please read the posting guide, which recommends that you give a reproducible example and the output of sessionInfo(). Also, there is no such thing as Bioconductor 0.9.

1) Are you sure you are giving it "only" a 22011 x 16 matrix? I get

    > a=numeric(2^31-1)
    Error in vector("double", length) : cannot allocate vector of length 2147483647
    > a=numeric(2^31)
    Error in vector("double", length) : vector size specified is too large

and of course 2^31 >> choose(22011,2).

2) choose(22011,2)*8/1e6 = 1937.85, i.e. one copy of your distance matrix would need about 2 GB of RAM, and if you have other large objects around or if it needs to be copied, your 3 GB of RAM may not be enough. Rather than using brute force, it might help to reduce the set of genes to an interesting subset before doing the clustering.

    > sessionInfo()
    R version 2.5.0 Under development (unstable) (2007-03-13 r40832)
    i686-pc-linux-gnu

    locale:
    LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C

    attached base packages:
    [1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
    [7] "base"

Best wishes
Wolfgang

------------------------------------------------------------------
Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber
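The arithmetic in point 2 can be wrapped in a small helper for other matrix sizes. This is only a rough sketch: `dist_size_mb` is a made-up name, and it counts just the 8-byte doubles of a single "dist" object, ignoring attributes and any temporary copies R may make.

```r
# Approximate size in MB (10^6 bytes) of a "dist" object for n items:
# choose(n, 2) pairwise distances, each stored as an 8-byte double.
dist_size_mb <- function(n) choose(n, 2) * 8 / 1e6

dist_size_mb(22011)  # distances between the 22011 rows (genes): about 1938 MB
dist_size_mb(16)     # distances between the 16 columns (samples): under 0.001 MB

# The number of gene-gene distances is still well below 2^31.
choose(22011, 2) < 2^31
```

The contrast between the two calls shows why the orientation of the matrix matters so much for memory here.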
Apologies for the post. The Bioconductor version was just a typo; I meant 1.9.

I have found my error, though: I was trying to cluster on samples, and I had not transposed the matrix before calculating the distance matrix. It is strange that cor() calculates between columns and dist() calculates between rows.

Thanks for the input anyway.

Daniel
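The row/column convention is easy to check on a toy matrix. The sketch below uses a made-up 5 x 3 matrix `x` (think 5 "genes" by 3 "samples") and verifies one transposed distance by hand; none of these names come from the thread.

```r
set.seed(42)
x <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)  # 5 "genes" x 3 "samples"

dim(cor(x))                  # 3 x 3: cor() compares columns
dim(as.matrix(dist(x)))      # 5 x 5: dist() compares rows
dim(as.matrix(dist(t(x))))   # 3 x 3: after t(), dist() compares the columns of x

# Check one entry by hand: Euclidean distance between columns 1 and 2 of x.
manual <- sqrt(sum((x[, 1] - x[, 2])^2))
all.equal(as.matrix(dist(t(x)))[1, 2], manual)
```

By the same logic, a correlation-based distance between rows would need cor(t(x)) rather than cor(x).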