Dear all,
I am running a differential gene expression analysis between 2 groups and got 124 differentially expressed genes using the limma package.
When I run hierarchical clustering on the dataset using the top 30 genes, I get a pretty clear separation between the 2 groups. When I increase the number of genes to 50, the separation is not so clear, and with all 124 genes I don't see any separation between the 2 groups on the heatmap.
Has anyone come across a similar situation, where choosing different numbers of differentially expressed genes (all with adjp < 0.05) gives very different clustering of the samples?
Is there a way to choose the set of DE genes (from the ones I get from limma) that best separates the two sample groups? Thanks a lot,
Liron
Thank you very much!
Liron
I will try using coolmap(). Thanks!
I don't know whether this also helps explain why I don't see clear clustering, but the groups I am comparing are very unequal in size (one has 6 samples and the other has 100). Could that be contributing to the problem?
Thanks!
Possibly, depending on the idiosyncrasies of the clustering algorithm. However, all of these things depend on the strength of separation between groups in the first place. If the separation is strong, technical issues such as the number of samples, choice of clustering algorithm, number of genes, etc. should have little effect.
Going back to your original post: the lower-ranked DE genes (i.e., those with larger p-values) will, by definition, have smaller log-fold changes between groups relative to the within-group variability. For these genes, if you plot a histogram of expression values for each group, you can imagine that the two distributions overlap substantially even though their centres are still distinct. In comparison, the top 30 DE genes probably don't have much overlap in the distributions between groups.
This means that you get good separation with the very best DE genes, more intermingling as you include weaker DE genes, and eventually no separation at all once enough weak genes are included. The sensitivity of your clustering to the number of genes reflects the weakness of the underlying separation between groups, as I said above.
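To make the overlap argument concrete, here is a minimal Python sketch (not limma code). It assumes expression values within each group are roughly normal with the same standard deviation, and the effect sizes are hypothetical numbers chosen for illustration, not values from your data. The helper name `group_overlap` is made up for this example.

```python
from statistics import NormalDist


def group_overlap(effect_size, sd=1.0):
    """Overlap (between 0 and 1) of two normal expression distributions.

    effect_size is the difference in group means divided by the
    within-group standard deviation (an illustrative assumption).
    """
    group_a = NormalDist(mu=0.0, sigma=sd)
    group_b = NormalDist(mu=effect_size * sd, sigma=sd)
    return group_a.overlap(group_b)


# Hypothetical standardized effect sizes, from a strong DE gene to a weak one.
for d in (4.0, 2.0, 1.0, 0.5):
    print(f"effect size {d:>3}: overlap = {group_overlap(d):.2f}")
```

The overlap grows steadily as the effect size shrinks, so each weaker gene added to the heatmap contributes more noise than signal to the distances that the clustering works from.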
I must say that I don't really know what you want to show by clustering. If you're trying to show that there are differences between groups, you've already done that with the DE analysis. If you want to visualize the DE genes, just make a heatmap with a fixed column order; there's no need to empirically cluster the samples if you already know their identities.
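As a rough sketch of what "fixed column order" means, the Python snippet below sorts the columns of a small expression matrix by known group label instead of clustering them. The toy matrix, the labels and both helper names are hypothetical; the reordered matrix would then go to whatever heatmap function you use, with column clustering switched off.

```python
def order_columns_by_group(sample_groups, group_order):
    """Return column indices so that samples of the same known group sit together."""
    rank = {group: i for i, group in enumerate(group_order)}
    # sorted() is stable, so samples keep their original order within a group.
    return sorted(range(len(sample_groups)), key=lambda i: rank[sample_groups[i]])


def reorder_columns(matrix, column_order):
    """Apply a column order to a genes-by-samples matrix stored as a list of rows."""
    return [[row[i] for i in column_order] for row in matrix]


# Toy data (hypothetical): 2 genes x 4 samples, with known group labels.
sample_groups = ["B", "A", "B", "A"]
expression = [
    [5.1, 1.0, 4.8, 1.2],
    [0.9, 3.9, 1.1, 4.2],
]

order = order_columns_by_group(sample_groups, ["A", "B"])
fixed = reorder_columns(expression, order)
print(order)
print(fixed)
```

With the columns grouped this way, the heatmap shows how each DE gene behaves in each group without relying on the samples clustering cleanly.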