Question

Error and questions about clustering & ploting functions in r packages cluster and mclust regarding kmeans method

0

Entering edit mode

svlachavas ▴ 840

@svlachavas-7225

Last seen 12 days ago

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

after performing preprocessing and statistical analysis in an Illumina dataset with Limma, i have acquired a specific DEG list, which i would like to use it afterwards to subset my dataset, and then performed some clustering analysis and subsequent functional enrichment to see if any interesting pathways can found perturbed in any cluster. As i wanted to have an insight and able to get the genes to each cluster, firstly i used the R package mclust to compute the optimal number of clusters depending my selected deg genes:

class(filtered.2)
[1] "EList"
attr(,"package")
[1] "limma"

significant # the data.frame from topTable after extracting the DEG genes(1272 DEG genes-probeIDs)

filtered.3 <- filtered.2[rownames(significant),] # where rownames are the probeIDs

jason.mclust=function(data,g1,g2){
d_clust <- Mclust(as.matrix(data), G=g1:g2)
m.best <- dim(d_clust$z)[2]
cat("model-based optimal number of clusters:", m.best, "\n")
return(m.best)
}

It returned my 12 as the optimal number of clusters

then i used the Kmeans function from package cluster in one other function i implemented:

get_clusters=function(data,nclusters){
fit <- kmeans(data, nclusters,iter.max=50)
aggregate(data, by=list(fit$cluster), FUN=mean)
clust.out <- fit$cluster
kclust <- as.matrix(clust.out)
clusplot(data, fit$cluster, shade=F,lines=0, color=T, lty=4, main='PC of K-means clusters')
return(kclust)
}

Then when i firstly used :

kclust=get_clusters(filtered.3,12)

Error in clusplot.default(data, fit$cluster, shade = F, lines = 0, color = T, :
x is not numeric

On the other hand, when i used : kclust=get_clusters(filtered.3$E,12)

Error in clusplot.default(data, fit$cluster, shade = F, lines = 0, color = T, :
4 arguments passed to .Internal(nchar) which requires 3

Except these errors, i would like to ask one more naive(silly) but also important question

if i use just the function kmeans, as : fit <- kmeans(data, nclusters,iter.max=50),

should i use data=filtered.3 or data in a form of a matrix,i.e., data=filtered.3$E ?

i know that with the iterations the results change a bit but which is more appropriate ? Or it doesnt matter and it is the same ??

Thank you in advance for your time !!

cluster mclust limma clusplot kmeans • 5.0k views

ADD COMMENT • link 9.9 years ago svlachavas ▴ 840

score 1 · Answer 1 · 2015-07-27

Put a minimal working example that allows us to reproduce the error. You've got a lot of code, and your post rambles on a bit, so it's hard to tell where the problem is.
This seems to be primarily a problem with the cluster package, which isn't on Bioconductor. If the documentation doesn't help, you'd be better off contacting their package maintainer to deal with bugs.
It shouldn't matter whether you use filtered.3 or filtered.3$E, as as.matrix is called within kmeans to coerce the EList (former) into a matrix of expression values (latter).

score 0 · Answer 2 · 2015-07-27

0

Entering edit mode

svlachavas ▴ 840

@svlachavas-7225

Last seen 12 days ago

Germany/Heidelberg/German Cancer Resear…

Dear Aaron,

thank you for your responce. Because im still a freshman in R and programming generally, thats why i posted the above code, specifically the two functions in case they have some problem. When you refer to a minimal working example, i will try the following(if im not mistaken and you mean something else) :

library(cluster)

library(mclust)

m <- matrix(sample(1:400,20,replace=TRUE), nrow=50,ncol=4)

Genes=sample(letters,50,replace=T)

rownames(m) <- Genes

optimal_Nclusters=jason.mclust(m,2,10) # from the function above named jason.mclust

model-based optimal number of clusters: 7

kclust=get_clusters(m,7) # from the defined function above:

Error in clusplot.default(data, fit$cluster, shade = F, lines = 0, color = T, :
4 arguments passed to .Internal(nchar) which requires 3

When i dont use the second function, and just:

fit <- kmeans(m, centers=7,iter.max=50)

clusplot(m, fit$cluster)
Error in clusplot.default(m, fit$cluster) :
4 arguments passed to .Internal(nchar) which requires 3 # again the same error !!

-Maybe it is probably something with the function clusplot- so i could continue with the implementation of the specific plot, as far as i can isolate the genes belonging to each cluster ? Or perhaps as you propose i should contact the package maintainers

[*Regarding my last question, because also due to my luck of programming experience, i wasnt sure that regarding the coersion of the filtered.3 to a matrix it will take into consideration also other mumeric values of the Elist object, like the detection pvalues]

ADD COMMENT • link 9.9 years ago svlachavas ▴ 840

1

Entering edit mode

The code runs fine on my machine. You'll need to chase up the package maintainers to get a better idea as to what's happening. As an additional note, make sure you put set.seed(0) or the like at the start of your code, to ensure that the random number generation is reproducible.

ADD REPLY • link 9.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

thank you for your verification. So there is something wrong happening with my machine.

Also heres my sessionInfo()

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Greek_Greece.1253 LC_CTYPE=Greek_Greece.1253
[3] LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C
[5] LC_TIME=Greek_Greece.1253

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base

other attached packages:
[1] cluster_2.0.3 mclust_5.0.2 illuminaHumanv4.db_1.26.0
[4] org.Hs.eg.db_3.1.2 RSQLite_1.0.0 DBI_0.3.1
[7] AnnotationDbi_1.30.1 GenomeInfoDb_1.4.1 IRanges_2.2.5
[10] S4Vectors_0.6.2 limma_3.24.14 BiocInstaller_1.18.4
[13] simpleaffy_2.44.0 gcrma_2.40.0 genefilter_1.50.0
[16] affy_1.46.0 Biobase_2.28.0 BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] XVector_0.8.0 splines_3.2.0 zlibbioc_1.14.0
[4] affyPLM_1.44.0 xtable_1.7-4 tools_3.2.0
[7] survival_2.38-3 preprocessCore_1.30.0 affyio_1.36.0
[10] Biostrings_2.36.1 XML_3.98-1.3 annotate_1.46.1

I would get in touch with the maintainers to see what is happening with the specific function-on the other hand, i know that is not of your consern to reply or suggest, but as the other part of the code works fine-should i continue with my initial implementation ?

Best,

Efstathios

ADD REPLY • link 9.9 years ago svlachavas ▴ 840