This doesn't seem like a particularly useful evaluation strategy. By definition, every method will report DEGs that differ in expression between conditions, so I suspect that all methods would form "correct" clusters (i.e., samples would cluster together according to their conditions) based on the reported DEGs, unless something goes catastrophically wrong. The main differences between methods are more subtle, lying in how they model the variability and how they control the error rate; neither of these can be easily assessed with clustering - or at all, in fact, unless you have a data set with known truths, e.g., simulations.
Anyway, when I want to cluster RNA-seq data, I usually apply a log-transformation:
library(edgeR)
expr.vals <- cpm(y, log=TRUE, prior.count=2) # where 'y' is a DGEList.
... and then do my clustering on the matrix of expression values. The log-transformation provides some level of variance stabilization, especially at large counts (see the voom paper for some discussion of this). The large prior count just protects against poorly defined log-fold changes at low counts. It's better to use this for clustering instead of the E-values reported by voom, as voom only adds a continuity correction of 0.5. (Note that this small addition is fine within limma, as observations are precision-weighted anyway, but if you're doing unweighted analyses outside of limma with those E-values, then you could run into problems.)
I suppose that if you wanted to make comparisons between methods easier, you could use the same variance-stabilized expression values (either with cpm as above, or with rlog in DESeq2) to build the clusters for all methods. I don't think this would pose any problems, but then again, I don't think this is a worthwhile analysis in the first place, so make of that what you will.
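If you insist on doing it anyway, a sketch of what I mean by using the same values for all methods (here 'deg.list' is a hypothetical named list containing one character vector of DEG identifiers per method, matching the row names of 'expr.vals'):

# Same variance-stabilized matrix for every method; only the gene subset changes.
clusterings <- lapply(deg.list, function(degs) {
    sub <- expr.vals[rownames(expr.vals) %in% degs, , drop=FALSE]
    hclust(dist(t(sub)))
})

# For example, cut each tree into two clusters and compare to the known conditions.
assignments <- lapply(clusterings, cutree, k=2)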