The comment you responded to comes from a bot (which you can always tell because it has an embedded link that leads to a game site or whatever). Luckily we have AI these days, so we can have credible-looking responses like that...
Anyway, what you are doing may not make sense. Clustering the genes and then testing for enrichment of GO terms in the clusters is probably less informative than simply testing for significant GO terms in your set of significant genes and then maybe generating a heatmap of the genes that are in one or more interesting GO terms (there's a sketch of that at the end of this answer). But anyway,
1.) This will automatically occur if you are using reasonable software to do the GO enrichment (GOstats, topGO, goana from the limma package, etc.).
2.) Who knows why people do what they do? I would not base what you do on what you saw other people doing, particularly if you don't understand their rationale. The usual goal for a GO test is to identify processes that might be perturbed in your experimental system. If the top 10 GO terms are not interesting to you, but 11-15 are, then probably you should use the interesting terms. Or maybe research the terms you find uninteresting to see if they are actually interesting. You know, do some science.
When you do a GO overrepresentation analysis, you are comparing the proportion of genes with a given GO term in your entire set of genes to the proportion of those genes in your set of significant genes. As an example, let's say a given term is appended to 3% of all the genes you measured, but it's higher (7%) in your significant genes. That could be due to A) the process underlying that GO term being perturbed in your experiment, so more of those genes come up significant, or B) random sampling of the genes that happened to grab more of those genes this time. The p-value estimates the proportion of the time you expect B to occur when the GO term isn't actually perturbed (one way to simulate that would be to randomly select 1765 genes many, many times and perform the GO hypergeometric test each time; since it's a random selection of genes, it approximates the null distribution; there's a short sketch of this after point 3). A p < 0.05 indicates that under the null distribution you expect to see a result like that about 5% of the time. It's not an enrichment score, which is usually computed in the context of a gene set test rather than a hypergeometric test.
3.) That doesn't make sense, really. What if the underlying GO term is represented in 12% of all genes? There is an argument for filtering out small GO terms (say, a term with < 10 genes), because it's easy to get significance with just one or two genes, but you should rely on the p-value and your biological knowledge to decide which GO terms are the important ones.
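To put rough numbers on the p-value explanation in point 2, here is a minimal sketch of that simulation next to the analytical hypergeometric test. The 1765 comes from your post; the 20,000-gene universe and the 3%/7% proportions are just the made-up example numbers from above.

```r
set.seed(1)
N <- 20000                 # all measured genes (made-up universe size)
m <- round(0.03 * N)       # genes carrying the GO term (3% of the universe)
k <- 1765                  # size of the significant gene set
x <- round(0.07 * k)       # observed GO-term genes among the significant ones (~7%)

# null simulation: draw 1765 genes at random many, many times and count GO-term genes
urn <- c(rep(1, m), rep(0, N - m))
null_counts <- replicate(10000, sum(sample(urn, k)))
mean(null_counts >= x)     # empirical p-value (essentially 0 here; 7% vs 3% is a big shift)

# the hypergeometric test gives the same answer analytically
phyper(x - 1, m, N - m, k, lower.tail = FALSE)
```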
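As for the simpler route suggested at the top (test the significant genes directly, then heatmap the genes in a term you care about), here is a hedged sketch using `clusterProfiler`, assuming human Entrez IDs; `sig_genes`, `all_genes` and `expr_mat` are placeholder objects, not things from your post.

```r
library(clusterProfiler)
library(org.Hs.eg.db)

ego <- enrichGO(gene          = sig_genes,     # your significant genes
                universe      = all_genes,     # everything you measured
                OrgDb         = org.Hs.eg.db,
                ont           = "BP",
                pAdjustMethod = "BH",
                minGSSize     = 10,            # drops the tiny GO terms mentioned in point 3
                readable      = TRUE)

# genes in the top term (or whichever term is biologically interesting)
res  <- as.data.frame(ego)
hits <- strsplit(res$geneID[1], "/")[[1]]

# heatmap of those genes; expr_mat rows assumed to be named the same way as the enrichGO output
heatmap(expr_mat[hits, ], scale = "row")
```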
It all makes sense now: I was being superficially AI-splained. I was really wondering what the deal was with that link. I guess the answer made me feel validated enough that I stopped overthinking it. Thank you! None of the GO terms are really linked to our initial research question, actually, so I would not mind being more exploratory at the beginning, keeping all the GO terms, and then later focusing on a set of specific genes we were interested in to see how their expression varies across the different levels of the condition, if that makes sense.
The result was different depending on whether the uncharacterized proteins were still in or not. I ran the `enrichGO` function of `clusterProfiler`. Sorry, I realize I was not being clear: I already filtered out non-significant GO terms. So far I have plotted, per cluster, the GO terms that were supported by at least 10% of the genes in the cluster. I thought it would be good to account for cluster size in some way, since p-values will be smaller in bigger clusters (a higher number of genes means a larger sample size). So I am filtering out the GO terms that are supported by only 1 or 2 genes; however, I do have a small cluster of 15 genes, for which I plotted the GO terms supported by at least 2 genes. There is only a small number of significant GO terms, since the p-value reflects the number of genes in the cluster.
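To be concrete, the filter I mean is roughly the following (only a sketch; `ego` and `cluster_genes` are placeholders, not my actual objects):

```r
# ego: the enrichGO() result for one cluster; cluster_genes: the genes in that cluster
res <- as.data.frame(ego)

# keep GO terms supported by at least 10% of the cluster's genes,
# with a floor of 2 genes for very small clusters
min_genes <- max(2, ceiling(0.10 * length(cluster_genes)))
res_kept  <- res[res$Count >= min_genes, ]
```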
1.) That's one way you can do the test. The heuristic way the hypergeometric test is explained in basic stats is to envision an urn that contains a mixture of white and black balls. You reach in without looking and grab a handful. Under the null, you expect a relatively consistent mixture of black and white balls in your hand as compared to what was in the urn. If it's a 50-50 mix, and you pull out 10 balls, the expectation is 5 of each. But you could end up with 6-4 or 4-6, or even 7-3 or 3-7. If you did the test a bazillion times, you are even likely to have some 2-8 or 1-9 or even 0-10 pulls. But those would be super rare. If you do one draw and you get a more rare combination, you either had a rare thing occur, or the draw was biased (in this case by biology intervening because the genes in the GO term are affected by the experiment). The p-value tells you how rare your observed result would be under the null, and if it's super rare you are likely to conclude that the draw was biased by biology rather than being simply a rare occurrence.
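Just to put numbers on the urn analogy (this is only an illustration, not your data), base R's hypergeometric functions show how quickly the lopsided draws become rare:

```r
# Urn with 50 black and 50 white balls; draw 10 without replacement.
# dhyper(x, m, n, k) = probability of exactly x black balls when the urn
# holds m black and n white balls and you draw k of them.
round(dhyper(0:10, m = 50, n = 50, k = 10), 4)
# the 5-5 split is the most likely pull; 9-1 or 10-0 draws are vanishingly rare

# probability of pulling 8 or more black balls (the upper-tail p-value):
phyper(7, m = 50, n = 50, k = 10, lower.tail = FALSE)
```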
This brings me to my point, which is how you code the genes. You could either say that any gene in the GO term is 'black' and all other genes are 'white', OR you could say that any gene in the term is black and any gene that is not in the term but does have some GO term appended is white. You can make philosophical arguments about how one should deal with the unannotated genes. Since they can never be black, regardless of the GO term, they could bias towards the null, and they are certainly uninformative. I would have to check the code to make sure, but IIRC `GOstats` and `topGO` and `goana`, and for sure `goseq`, take the latter stance by ignoring/removing unannotated genes. If that makes sense to you, then you can always exclude them by hand. But since this is RNA-Seq data, you might want to use `goseq` anyway, as gene length introduces a bias that a conventional GO hypergeometric test does not account for.
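If it helps, here is a minimal sketch of the `goseq` route, assuming human Ensembl gene IDs ("hg19"/"ensGene", and the `all_genes`/`sig_genes` vectors, are placeholders for whatever your data actually uses):

```r
library(goseq)

# named 0/1 vector over ALL measured genes: 1 = significant, 0 = not
de_vector <- as.integer(all_genes %in% sig_genes)
names(de_vector) <- all_genes

# nullp() fits the weighting function that accounts for gene length bias;
# swap "hg19"/"ensGene" for the genome/ID type that matches your data
pwf    <- nullp(de_vector, "hg19", "ensGene")
go_res <- goseq(pwf, "hg19", "ensGene", test.cats = "GO:BP")

head(go_res[go_res$over_represented_pvalue < 0.05, ])
```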