Question

Consensus Cluster Plus pre-computed distance matrix

neuro3030 • 0

@neuro3030-15768

Last seen 6.8 years ago

The ConsensusClusterPlus package has an option to input a pre-computed distance matrix to reduce computation time. The reference manual states that this is because ConsensusClusterPlus otherwise re-calculates a distance matrix for each iteration.

So I pre-computed the distance matrix for a very large dataset (~700 samples with ~50,000 rows). However, when I pass this distance object to ConsensusClusterPlus, the computation time INCREASES dramatically and it struggles to get past the first iteration. Of note, the "dist" object is very large for this dataset (approx. 4-6 GB). Still, given that the distances are pre-calculated, shouldn't this save time during consensus clustering?

Any ideas would be great. Thanks.

consensusclusterplus distance clustering • 1.9k views

updated 6.8 years ago by chris86 ▴ 420 • written 6.8 years ago by neuro3030 • 0

score 0 · Answer 1 · 2019-03-28

It may be better to pre-filter your dataset's features based on variance. You're unlikely to need 50,000 features.
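A minimal sketch of that kind of variance filter in base R, assuming the usual layout of rows as features and columns as samples. The toy matrix and the cutoff `n_keep` are made up for illustration. It also builds the distance object over samples: `dist()` works on rows, so the matrix is transposed first. For 700 samples such an object would hold 700 * 699 / 2 = 244,650 values (about 2 MB as doubles), so a multi-GB "dist" object may be worth checking in case it was computed between the ~50,000 rows instead.

```r
set.seed(1)

# Toy stand-in for an expression matrix: rows = features, columns = samples.
expr <- matrix(rnorm(2000 * 50), nrow = 2000, ncol = 50)

# Keep the n_keep most variable features (n_keep is an arbitrary, illustrative cutoff).
n_keep <- 500
feature_var <- apply(expr, 1, var)
keep <- order(feature_var, decreasing = TRUE)[seq_len(n_keep)]
expr_filtered <- expr[keep, , drop = FALSE]

# dist() computes distances between ROWS, so transpose to compare samples.
sample_dist <- dist(t(expr_filtered))

# A dist object stores n * (n - 1) / 2 values for n items.
n_samples <- ncol(expr_filtered)
stopifnot(length(sample_dist) == n_samples * (n_samples - 1) / 2)

# Rough memory footprint of the distance object.
print(object.size(sample_dist), units = "Kb")
```

Either `expr_filtered` or `sample_dist` could then be handed to the clustering step; which is faster in practice will depend on the settings used.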

I also find the delta K plot from that method subjective when deciding the best number of clusters, and it can't handle higher numbers of clusters. An alternative is M3C, which uses the PAC score and various derivatives of it. If time is an issue, it has a fast mode, or you can lower the iterations parameter; I find it works well with PAM (https://www.bioconductor.org/packages/devel/bioc/html/M3C.html). Another good alternative, which I have tested quite extensively, is CLEST (https://rdrr.io/cran/RSKC/man/Clest.html).
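For readers unfamiliar with the PAC score (proportion of ambiguous clustering), here is a rough sketch of the idea in base R: it is the fraction of sample pairs whose consensus value falls between a lower and an upper threshold. The function name `pac_score`, the toy matrices, and the thresholds 0.1 and 0.9 are assumptions for illustration; M3C's own implementation may differ in its details.

```r
# Hypothetical helper: PAC as CDF(upper) - CDF(lower) over the pairwise consensus values.
pac_score <- function(consensus, lower = 0.1, upper = 0.9) {
  vals <- consensus[lower.tri(consensus)]  # each sample pair counted once
  mean(vals <= upper) - mean(vals <= lower)
}

# Toy consensus matrix with two clean clusters: pairs always or never co-cluster.
clean <- matrix(0, 4, 4)
clean[1:2, 1:2] <- 1
clean[3:4, 3:4] <- 1

# Toy consensus matrix where every pair co-clusters half of the time.
ambiguous <- matrix(0.5, 4, 4)
diag(ambiguous) <- 1

pac_clean <- pac_score(clean)          # 0: no ambiguous pairs
pac_ambiguous <- pac_score(ambiguous)  # 1: every pair is ambiguous
```

A lower PAC indicates a more stable clustering, so candidate values of K can be compared on this number rather than by eye.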