Error and questions about clustering & ploting functions in r packages cluster and mclust regarding kmeans method
2
0
Entering edit mode
svlachavas ▴ 780
@svlachavas-7225
Last seen 6 days ago
Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

after performing preprocessing and statistical analysis in an Illumina dataset with Limma, i have acquired a specific DEG list, which i would like to use it afterwards to subset my dataset, and then performed some clustering analysis and subsequent functional enrichment to see if any interesting pathways can found perturbed in any cluster. As i wanted to have an insight and able to get the genes to each cluster, firstly i used the R package mclust to compute the optimal number of clusters depending my selected deg genes:

class(filtered.2)
[1] "EList"
attr(,"package")
[1] "limma"

significant # the data.frame from topTable after extracting the DEG genes(1272 DEG genes-probeIDs)

filtered.3 <- filtered.2[rownames(significant),] # where rownames are the probeIDs

jason.mclust=function(data,g1,g2){
d_clust <- Mclust(as.matrix(data), G=g1:g2)
m.best <- dim(d_clust$z)[2] cat("model-based optimal number of clusters:", m.best, "\n") return(m.best) } It returned my 12 as the optimal number of clusters then i used the Kmeans function from package cluster in one other function i implemented: get_clusters=function(data,nclusters){ fit <- kmeans(data, nclusters,iter.max=50) aggregate(data, by=list(fit$cluster), FUN=mean)
clust.out <- fit$cluster kclust <- as.matrix(clust.out) clusplot(data, fit$cluster, shade=F,lines=0, color=T, lty=4, main='PC of K-means clusters')
return(kclust)
}

Then when i firstly used :

kclust=get_clusters(filtered.3,12)

Error in clusplot.default(data, fit$cluster, shade = F, lines = 0, color = T, : x is not numeric On the other hand, when i used : kclust=get_clusters(filtered.3$E,12)

Error in clusplot.default(data, fit$cluster, shade = F, lines = 0, color = T, : 4 arguments passed to .Internal(nchar) which requires 3 Except these errors, i would like to ask one more naive(silly) but also important question if i use just the function kmeans, as : fit <- kmeans(data, nclusters,iter.max=50), should i use data=filtered.3 or data in a form of a matrix,i.e., data=filtered.3$E ?

i know that with the iterations the results change a bit but which is more appropriate ? Or it doesnt matter and it is the same ??

cluster mclust limma clusplot kmeans • 3.3k views
1
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 5 hours ago
The city by the bay
1. Put a minimal working example that allows us to reproduce the error. You've got a lot of code, and your post rambles on a bit, so it's hard to tell where the problem is.
2. This seems to be primarily a problem with the cluster package, which isn't on Bioconductor. If the documentation doesn't help, you'd be better off contacting their package maintainer to deal with bugs.
3. It shouldn't matter whether you use filtered.3 or filtered.3$E, as as.matrix is called within kmeans to coerce the EList (former) into a matrix of expression values (latter). ADD COMMENT 0 Entering edit mode svlachavas ▴ 780 @svlachavas-7225 Last seen 6 days ago Germany/Heidelberg/German Cancer Resear… Dear Aaron, thank you for your responce. Because im still a freshman in R and programming generally, thats why i posted the above code, specifically the two functions in case they have some problem. When you refer to a minimal working example, i will try the following(if im not mistaken and you mean something else) : library(cluster) library(mclust) m <- matrix(sample(1:400,20,replace=TRUE), nrow=50,ncol=4) Genes=sample(letters,50,replace=T) rownames(m) <- Genes optimal_Nclusters=jason.mclust(m,2,10) # from the function above named jason.mclust model-based optimal number of clusters: 7 kclust=get_clusters(m,7) # from the defined function above: Error in clusplot.default(data, fit$cluster, shade = F, lines = 0, color = T,  :
4 arguments passed to .Internal(nchar) which requires 3

When i dont use the second function, and just:

fit <- kmeans(m, centers=7,iter.max=50)

clusplot(m, fit$cluster) Error in clusplot.default(m, fit$cluster) :
4 arguments passed to .Internal(nchar) which requires 3
# again the same error !!

-Maybe it is probably something with the function clusplot- so i could continue with the implementation of the specific plot, as far as i can isolate the genes belonging to each cluster ? Or perhaps as you propose i should contact the package maintainers

[*Regarding my last question, because also due to my luck of programming experience, i wasnt sure that regarding the coersion of the filtered.3 to a matrix it will take into consideration also other mumeric values of the Elist object, like the detection pvalues]

1
Entering edit mode

The code runs fine on my machine. You'll need to chase up the package maintainers to get a better idea as to what's happening. As an additional note, make sure you put set.seed(0) or the like at the start of your code, to ensure that the random number generation is reproducible.

0
Entering edit mode

Dear Aaron,

thank you for your verification. So there is something wrong happening with my machine.

Also heres my sessionInfo()

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Greek_Greece.1253  LC_CTYPE=Greek_Greece.1253
[3] LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C
[5] LC_TIME=Greek_Greece.1253

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods
[9] base

other attached packages:
[1] cluster_2.0.3             mclust_5.0.2              illuminaHumanv4.db_1.26.0
[4] org.Hs.eg.db_3.1.2        RSQLite_1.0.0             DBI_0.3.1
[7] AnnotationDbi_1.30.1      GenomeInfoDb_1.4.1        IRanges_2.2.5
[10] S4Vectors_0.6.2           limma_3.24.14             BiocInstaller_1.18.4
[13] simpleaffy_2.44.0         gcrma_2.40.0              genefilter_1.50.0
[16] affy_1.46.0               Biobase_2.28.0            BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] XVector_0.8.0         splines_3.2.0         zlibbioc_1.14.0
[4] affyPLM_1.44.0        xtable_1.7-4          tools_3.2.0
[7] survival_2.38-3       preprocessCore_1.30.0 affyio_1.36.0
[10] Biostrings_2.36.1     XML_3.98-1.3          annotate_1.46.1

I would get in touch with the maintainers to see what is happening with the specific function-on the other hand, i know that is not of your consern to reply or suggest, but as the other part of the code works fine-should i continue with my initial implementation ?

Best,

Efstathios