Question: Choosing genes for clustering
2.2 years ago, lirongrossmann wrote:

Dear all, 

I am running a differential gene expression analysis between 2 groups and got 124 differentially expressed genes using the limma package.

When I run hierarchical clustering on the dataset using the top 30 genes, I get a pretty clear separation between the 2 groups. When I increase the number of genes to 50, the separation is not so clear, and with all 124 genes I don't see any separation between the 2 groups on the heatmap.

Has anyone come across a similar situation, where choosing different numbers of differentially expressed genes (all with adjusted p-value < 0.05) gives very different clustering of the samples?

Is there a way to choose the set of DE genes (from the ones I get from limma) that best separates the two sample groups? Thanks a lot.



Thank you very much!


Reply written 2.2 years ago by lirongrossmann

I will try using coolmap(). Thanks!

I don't know whether this could also help explain why I don't see clear clustering, but the groups I am comparing are very unequal in size (one has 6 samples and the other has 100 samples). Could that be contributing to the problem?




Reply written 2.2 years ago by lirongrossmann

Possibly, depending on the idiosyncrasies of the clustering algorithm. However, all of these things depend on the strength of separation between groups in the first place. If the separation is strong, technical issues such as the number of samples, choice of clustering algorithm, number of genes, etc. should have little effect.

Going back to your original post: the lower-ranked DE genes (i.e., those with larger p-values) will, by definition, have smaller log-fold changes between groups relative to the within-group variability. For these genes, if you plot a histogram of expression values for each group, you can imagine the distributions intermingling even though their centres are still distinct. In comparison, the top 30 DE genes probably don't have much overlap in their distributions between groups.

This means that you get good separation with the very best DE genes; more intermingling as you include weaker DE genes; and eventually, no separation at all when you include more genes. The sensitivity of your clustering to the number of genes reflects the weakness of the underlying separation between groups, as I said above.
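A minimal sketch of the overlap argument, assuming each group's expression values for one gene are roughly normal with the same within-group standard deviation. The effect sizes are hypothetical and are not taken from the poster's data; the function name `group_overlap` is made up for illustration.

```python
from statistics import NormalDist

def group_overlap(log_fold_change, within_group_sd=1.0):
    """Shared area (0 to 1) between the two groups' expression
    distributions for a single gene, under a normal assumption."""
    group_a = NormalDist(mu=0.0, sigma=within_group_sd)
    group_b = NormalDist(mu=log_fold_change, sigma=within_group_sd)
    return group_a.overlap(group_b)

# Hypothetical effect sizes, ordered from a strong top-ranked gene
# down to a weak gene near the significance threshold.
effect_sizes = [4.0, 2.0, 1.0, 0.5]
overlaps = [group_overlap(lfc) for lfc in effect_sizes]

for lfc, ov in zip(effect_sizes, overlaps):
    print(f"log-fold change {lfc}: overlap {ov:.2f}")
```

The overlap grows as the log-fold change shrinks relative to the within-group spread, which is why adding weaker DE genes dilutes the separation seen with the strongest ones.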

I must say that I don't really know what you want to show by clustering. If you're trying to show that there are differences between groups, you've already done that with the DE analysis. If you want to visualize the DE genes, just make a heatmap with a fixed column order; there's no need to empirically cluster the samples if you already know their identities.
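One way to get a fixed column order is to sort the sample columns by their known group before plotting, as sketched below. The sample names, group labels and expression values are hypothetical, and `order_columns_by_group` is an illustrative helper, not a function from limma or any plotting package.

```python
def order_columns_by_group(expr, samples, groups):
    """Reorder sample columns so that samples from the same known group
    sit next to each other. `expr` is a list of gene rows, each holding
    one value per sample."""
    # sorted() is stable, so the original order is kept within each group.
    order = sorted(range(len(samples)), key=lambda i: groups[i])
    ordered_expr = [[row[i] for i in order] for row in expr]
    ordered_samples = [samples[i] for i in order]
    ordered_groups = [groups[i] for i in order]
    return ordered_expr, ordered_samples, ordered_groups

# Hypothetical toy data: 2 genes measured in 5 samples.
samples = ["s1", "s2", "s3", "s4", "s5"]
groups = ["B", "A", "B", "A", "A"]
expr = [
    [5.0, 1.0, 5.2, 0.9, 1.1],
    [2.0, 3.0, 2.1, 3.2, 2.9],
]

ordered_expr, ordered_samples, ordered_groups = order_columns_by_group(
    expr, samples, groups
)
print(ordered_samples)
print(ordered_groups)
```

The reordered matrix can then be passed to whatever heatmap function you use, with column clustering switched off so that the known grouping determines the layout.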

Reply written 2.2 years ago by Aaron Lun
Answer: Choosing genes for clustering
2.2 years ago, chris86 (UCL, United Kingdom) wrote:

This is common. There are ways of calculating cluster strength and stability, which you could apply to different numbers of DE genes. However, there is a danger of being a bit too selective here. I would just test a few different numbers (30, 50, 100, etc.) and then choose what you need; it depends on what you want to do with the data.
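As a rough illustration of scoring separation for different numbers of genes, the sketch below computes the mean silhouette width of the known group labels using the top k genes. The silhouette is a stand-in chosen here for illustration, not necessarily the measure meant above. The toy matrix is hypothetical, with rows assumed to be already ranked from strongest to weakest DE gene.

```python
from math import dist
from statistics import mean

def mean_silhouette(points, labels):
    """Mean silhouette width of samples given known labels (two groups)."""
    widths = []
    for i, p in enumerate(points):
        # Mean distance to the other samples in the same group.
        same = [dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        # Mean distance to the samples in the other group.
        other = [dist(p, q) for j, q in enumerate(points)
                 if labels[j] != labels[i]]
        a, b = mean(same), mean(other)
        widths.append((b - a) / max(a, b))
    return mean(widths)

# Hypothetical data: 4 genes (rows, ranked by DE strength) x 6 samples.
labels = ["A", "A", "A", "B", "B", "B"]
ranked_expr = [
    [0.0, 0.2, -0.1, 4.0, 4.1, 3.9],
    [0.1, -0.2, 0.0, 3.0, 3.2, 2.9],
    [1.0, -1.0, 0.5, 1.5, 0.0, 2.0],
    [-2.0, 2.0, 0.0, 2.5, -1.5, 1.0],
]

scores = {}
for k in (2, 4):
    # Each sample becomes a point whose coordinates are its top-k genes.
    points = [tuple(row[s] for row in ranked_expr[:k])
              for s in range(len(labels))]
    scores[k] = mean_silhouette(points, labels)
    print(f"top {k} genes: mean silhouette {scores[k]:.2f}")
```

In this toy example the score drops once the noisier lower-ranked genes are included. Choosing k to maximise such a score on the same data that produced the DE list is the kind of over-selection warned about above, so it is better used as a description than as a tuning criterion.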

Answer: Choosing genes for clustering
2.2 years ago, Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote:

Your question assumes that there is some trick to choosing DE genes, but really there isn't. By definition, the DE analysis is choosing the genes that best separate the two groups. So the best set of DE genes to separate the two sample groups are just the top DE genes. The top gene separates the groups best of all, the second gene 2nd best, the third gene 3rd best, and so on.
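To make the ranking idea concrete, here is a minimal sketch that orders genes by the absolute value of an ordinary Welch t-statistic. This is a simplification: limma ranks genes using moderated statistics from its own linear model fit, which this sketch does not reproduce. The gene names and values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Ordinary Welch t-statistic for one gene (not limma's moderated t)."""
    se = sqrt(variance(group_a) / len(group_a)
              + variance(group_b) / len(group_b))
    return (mean(group_b) - mean(group_a)) / se

# Hypothetical expression values: gene -> (group A values, group B values).
genes = {
    "geneA": ([0.0, 0.2, -0.1], [4.0, 4.1, 3.9]),
    "geneB": ([1.0, -1.0, 0.5], [1.5, 0.0, 2.0]),
    "geneC": ([0.1, -0.2, 0.0], [3.0, 3.2, 2.9]),
}

t_stats = {name: welch_t(a, b) for name, (a, b) in genes.items()}

# Strongest separation first: largest |t| at the top of the list.
ranked = sorted(t_stats, key=lambda name: abs(t_stats[name]), reverse=True)
print(ranked)
```

Taking the first k names from `ranked` gives the k genes that individually separate the two groups best under this statistic, which mirrors the "top DE genes" described above.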

I find it very surprising that you could cluster on significantly DE genes but not separate the groups. Did you do a simple DE analysis between two groups or did you include any batch effects or factors other than the groups in the design matrix?

There are dozens of ways to run hierarchical clustering, and I wonder whether you are choosing a good way. Have you tried coolmap() in the limma package?
