Question

Choosing genes for clustering

lirongrossmann ▴ 80

Dear all,

I am running a differential gene expression analysis between 2 groups and got 124 differentially expressed genes using the limma package.

When I run hierarchical clustering on the dataset using the top 30 genes, I get a pretty clear separation between the 2 groups. When I increase the number of genes to 50, the separation is not so clear, and with all 124 genes I don't see any separation between the 2 groups on the heatmap.

Has anyone come across a similar situation, where choosing different numbers of differentially expressed genes (all with adjusted p-value < 0.05) gives very different clustering of the samples?

Is there a way to choose the best set of DE genes (within the ones I get from limma) that separates the two sample groups the best? Thanks a lot,

Liron

limma clustering differential gene expression hierarchical clustering feature selection • 4.0k views

8.3 years ago · lirongrossmann ▴ 80

Thank you very much!

Liron

8.3 years ago · lirongrossmann ▴ 80

I will try using coolmap(). Thanks!

I don't know whether this also helps explain why I don't see clear clustering, but the groups I am comparing are very unequal in size (one has 6 samples and the other has 100). Could that be contributing to the problem?

Thanks!

8.3 years ago · lirongrossmann ▴ 80

Possibly, depending on the idiosyncrasies of the clustering algorithm. However, all of these things depend on the strength of separation between groups in the first place. If the separation is strong, technical issues such as the number of samples, choice of clustering algorithm, number of genes, etc. should have little effect.

Going back to your original post: the later DE genes (i.e., those with larger p-values) will, by definition, have lower log-fold changes between groups relative to the within-group variability. For these genes, if you plot a histogram of expression values for each group, you can imagine that the distributions are intermingling even though the centres of the distributions are still distinct. In comparison, the top 30 DE genes probably don't have much overlap in the distributions between groups.
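The overlap idea can be put into numbers with a minimal sketch. It assumes that each group's expression for a gene is normally distributed with the same within-group standard deviation; the mean differences below are made-up values for illustration, not taken from the dataset in question, and `group_overlap` is a hypothetical helper name.

```python
from statistics import NormalDist

def group_overlap(effect_size, sd=1.0):
    """Overlap (0 to 1) between two groups' expression distributions for one gene.

    effect_size is the difference between the group means; both groups are
    ASSUMED to be normal with the same within-group standard deviation sd.
    """
    group_a = NormalDist(mu=0.0, sigma=sd)
    group_b = NormalDist(mu=effect_size, sigma=sd)
    # NormalDist.overlap returns the shared area under the two density curves.
    return group_a.overlap(group_b)

# Hypothetical mean differences, from a strong top-ranked gene to a weak one.
for effect in (4.0, 2.0, 1.0, 0.5):
    print(f"mean difference {effect} sd -> overlap {group_overlap(effect):.2f}")
```

The smaller the mean difference relative to the within-group spread, the closer the overlap gets to 1, which is the intermingling described above.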

This means that you get good separation with the very best DE genes; more intermingling as you include weaker DE genes; and eventually, no separation at all when you include more genes. The sensitivity of your clustering to the number of genes reflects the weakness of the underlying separation between groups, as I said above.

I must say that I don't really know what you want to show by clustering. If you're trying to show that there are differences between groups, you've already done that with a DE analysis. If you want to visualize the DE genes, just make a heatmap with a fixed column order; there's no need to empirically cluster the samples if you already know their identities.
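A fixed column order just means sorting the samples by their known group before plotting. A minimal sketch, assuming the expression matrix is a list of gene rows with one value per sample; the sample IDs, group labels, values, and the helper name `order_columns_by_group` are all hypothetical.

```python
def order_columns_by_group(expr_rows, sample_ids, groups):
    """Reorder heatmap columns so samples from the same known group sit together."""
    # sorted() is stable, so samples keep their original order within a group.
    order = sorted(range(len(sample_ids)), key=lambda i: groups[i])
    ordered_ids = [sample_ids[i] for i in order]
    ordered_rows = [[row[i] for i in order] for row in expr_rows]
    return ordered_ids, ordered_rows

# Toy data: two genes measured in four samples from groups "A" and "B".
sample_ids = ["s1", "s2", "s3", "s4"]
groups = ["B", "A", "B", "A"]
expr_rows = [
    [8.1, 2.0, 7.9, 2.2],
    [1.0, 5.5, 1.2, 5.1],
]
ids, rows = order_columns_by_group(expr_rows, sample_ids, groups)
print(ids)
print(rows)
```

The reordered matrix can then be passed to whichever heatmap function you use, with column clustering switched off so the order is preserved.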

8.3 years ago · Aaron Lun ★ 29k

score 0 · Answer 1 · 2017-09-13

This is common. There are ways of calculating cluster strength and stability that you could apply to different numbers of DE genes; however, there is a danger of being a bit too selective here. I would just test a few different numbers (30, 50, 100, etc.) and then choose what you need. It depends on what you want to do with the data.
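One common measure of cluster strength is the mean silhouette width of the known groups. The sketch below is only an illustration: it assumes exactly two groups, Euclidean distance, and gene rows already ordered by DE rank (best first). The function names and the toy matrix are hypothetical and are not limma output.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two samples' expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette(samples, labels):
    """Mean silhouette width of the known group labels (two groups assumed)."""
    widths = []
    for i, sample in enumerate(samples):
        same, other = [], []
        for j, neighbour in enumerate(samples):
            if j == i:
                continue
            dist = euclidean(sample, neighbour)
            (same if labels[j] == labels[i] else other).append(dist)
        if not same or not other:
            continue  # silhouette is undefined for a singleton or missing group
        a = sum(same) / len(same)    # mean distance to the sample's own group
        b = sum(other) / len(other)  # mean distance to the other group
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)

def silhouette_by_top_k(ranked_gene_rows, labels, ks):
    """Score the group separation using only the top k ranked genes, for each k."""
    scores = {}
    for k in ks:
        top = ranked_gene_rows[:k]
        samples = [list(col) for col in zip(*top)]  # genes x samples -> samples x genes
        scores[k] = mean_silhouette(samples, labels)
    return scores

# Toy data: gene 1 separates the groups perfectly, gene 2 is pure noise.
labels = ["A", "A", "A", "B", "B", "B"]
ranked_gene_rows = [
    [0, 0, 0, 10, 10, 10],
    [5, 1, 9, 2, 8, 4],
]
scores = silhouette_by_top_k(ranked_gene_rows, labels, ks=(1, 2))
for k, score in scores.items():
    print(f"top {k} gene(s): mean silhouette {score:.2f}")
```

Picking the number of genes that maximizes such a score is exactly the kind of over-selection warned about above, so it is safer to treat the scores as a description of how sensitive the separation is than as a tuning target.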

score 0 · Answer 2 · 2017-09-18

Your question assumes that there is some trick to choosing DE genes, but really there isn't. By definition, the DE analysis is choosing the genes that best separate the two groups. So the best set of DE genes to separate the two sample groups is just the top DE genes. The top gene separates the groups best of all, the second gene second best, the third gene third best, and so on.

I find it very surprising that you could cluster on significantly DE genes but not separate the groups. Did you do a simple DE analysis between two groups or did you include any batch effects or factors other than the groups in the design matrix?

There are dozens of ways to run hierarchical clustering, and I wonder whether you are choosing a good way. Have you tried coolmap() in the limma package?