Analysis of clusters of a heatmap with enrichGO of clusterProfiler: choice of the categories to visualize
@3f9f9566

Hello! This community has been extremely helpful before, so I am back with some questions :)

I generated a VERY large heatmap to represent RNAseq data with 1765 differentially expressed genes between 2 conditions applied to individual Drosophila flies.

The genes of this heatmap are split into 5 clusters (row_km=5). I am performing a GO enrichment analysis on each cluster separately, hoping to see whether some processes are over-represented in some clusters. I chose to represent the categories as a barplot (y = GO term, x = number of genes). I was wondering:

  1. There are a lot of differentially regulated genes whose gene product is an uncharacterized protein. Should I filter these out before the GO enrichment analysis?

  2. Several people seem to plot the 10 most significant terms. I wonder why? I have a lot of terms for which the adjusted p-value is very low. How relevant is it that one very low p-value is much lower than another very low p-value? What does this p-value actually represent? I have seen it being called an "enrichment score".

  3. I was thinking of representing terms in which more than 10% of the genes are represented instead. What are your thoughts on this?

Thank you!

clusterProfiler enrichGO ComplexHeatmap

@james-w-macdonald-5106

The comment you responded to comes from a bot (which you can always tell because it has a link embedded that leads to a game site or whatever). Luckily we have AI these days so we can have credible looking responses like that...

Anyway, what you are doing may not make sense. Clustering the genes and then testing for enrichment of GO terms in the clusters is probably less informative than simply testing for significant GO terms in your set of significant genes and then maybe generating a heatmap using genes that are in one or more interesting GO terms. But anyway,

1.) This will automatically occur if you are using reasonable software to do the GO enrichment (GOstats, topGO, goana from the limma package, etc.).

2.) Who knows why people do what they do? I would not base what you do on what you saw other people doing, particularly if you don't understand their rationale. The usual goal for a GO test is to identify processes that might be perturbed in your experimental system. If the top 10 GO terms are not interesting to you, but 11-15 are, then probably you should use the interesting terms. Or maybe research the terms you find uninteresting to see if they are actually interesting. You know, do some science.

When you do a GO overrepresentation analysis, you are comparing the proportion of genes with a given GO term in your entire set of genes to the proportion of genes with that term in your set of significant genes. As an example, let's say a given term is appended to 3% of all genes that you measured, but it's higher (7%) in your significant genes. That could be due to A) the process underlying that GO term being perturbed in your experiment, so you get more of those genes that are significant, or B) random sampling of the genes that happened to grab more of those genes this time. The p-value estimates the proportion of the time you expect B to produce a result at least that extreme when the GO term isn't actually perturbed (one way to simulate that would be to randomly select 1765 genes many many times and perform the GO hypergeometric test each time. Since it's a random selection of genes, it will approximate the null distribution). A p<0.05 indicates that under the null distribution you expect to see a result at least that extreme less than 5% of the time. It's not an enrichment score, which is usually computed in the context of a gene set test rather than a hypergeometric test.
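
To make the 3% versus 7% example concrete, here is a minimal sketch of the one-sided hypergeometric calculation in plain Python (standard library only, not any of the R packages mentioned in this thread). The universe of 12000 measured genes is an assumption made purely for illustration; only the 1765 significant genes come from the question. The function name `overrep_pvalue` is likewise made up for this sketch.

```python
from math import comb

def overrep_pvalue(universe, term_in_universe, selected, term_in_selected):
    """One-sided hypergeometric p-value: the probability of seeing at least
    `term_in_selected` genes from the term when `selected` genes are drawn
    at random, without replacement, from `universe` genes."""
    upper = min(term_in_universe, selected)
    favourable = sum(
        comb(term_in_universe, i) * comb(universe - term_in_universe, selected - i)
        for i in range(term_in_selected, upper + 1)
    )
    return favourable / comb(universe, selected)

# Hypothetical numbers (only 1765 is taken from the thread):
universe = 12000          # assumed number of measured genes
term_in_universe = 360    # 3% of all measured genes carry the term
selected = 1765           # significant genes
term_in_selected = 124    # roughly 7% of the significant genes carry the term

p = overrep_pvalue(universe, term_in_universe, selected, term_in_selected)
print(f"p-value for the 3% vs 7% example: {p:.3g}")
```

With these assumed counts the expected number of term genes among the significant genes is about 53, so observing 124 is far out in the tail and the p-value is very small. Real enrichment tools also apply a multiple-testing adjustment across terms, which this sketch leaves out.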

3.) That doesn't make sense, really. What if the underlying GO term is represented at 12% among all genes? Then seeing it in just over 10% of a cluster would not indicate any enrichment at all. There is an argument for filtering out small GO terms (say a term with < 10 genes), because it's easy to get significance with just one or two genes, but you should rely on the p-value and your biological knowledge to decide what the important GO terms are.
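
The point about small GO terms can be checked with a short calculation. The sketch below is plain Python, and every number in it is hypothetical: a universe of 12000 genes, a term annotated to only 3 of them, and a cluster of 15 genes containing 2 of those 3. `overrep_pvalue` is an illustrative helper name, not a function from any package.

```python
from math import comb

def overrep_pvalue(universe, term_in_universe, selected, term_in_selected):
    """One-sided hypergeometric p-value, P(X >= term_in_selected)."""
    upper = min(term_in_universe, selected)
    favourable = sum(
        comb(term_in_universe, i) * comb(universe - term_in_universe, selected - i)
        for i in range(term_in_selected, upper + 1)
    )
    return favourable / comb(universe, selected)

# Hypothetical tiny term: 3 annotated genes in an assumed universe of 12000,
# and a 15-gene cluster that happens to contain 2 of them.
p_small_term = overrep_pvalue(universe=12000, term_in_universe=3,
                              selected=15, term_in_selected=2)
print(f"p-value with only two supporting genes: {p_small_term:.3g}")
```

Two genes are enough to push the p-value far below 0.05 here, which is why a minimum term size is a more defensible filter than a percentage cutoff.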


It all makes sense now: I was being superficially AI-splained. I was really wondering what the deal was with that link. I guess I felt validated enough by the answer that I stopped overthinking. Thank you! None of the GO terms are really linked to our initial research question, so I would not mind being more exploratory in the beginning by keeping all the GO terms, and then later focusing on a set of specific genes we were interested in to see how their expression varies according to the different levels of the condition, if that makes sense.

  1. The result differed depending on whether the uncharacterized proteins were still in or not. I ran the `enrichGO` function of `clusterProfiler`.

  2. Sorry, I realize I was not being clear: I already filtered out non-significant GO terms. So far I plotted, per cluster, the GO terms which were supported by at least 10% of the genes of the cluster. I thought it would be good to account for cluster size in some way, since p-values will be smaller in bigger clusters (higher number of genes = higher sample size). So I am filtering out the GO terms which are supported by only 1 or 2 genes. However, I do have a small cluster that contains 15 genes, for which I plotted the GO terms supported by at least 2 genes. There is only a small number of significant GO terms, since the p-value is a reflection of the number of genes that are in the cluster.


1.) That's one way you can do the test. The heuristic way the hypergeometric test is explained in basic stats is to envision an urn that contains a mixture of white and black balls. You reach in without looking and grab a handful. Under the null, you expect a relatively consistent mixture of black and white balls in your hand as compared to what was in the urn. If it's a 50-50 mix, and you pull out 10 balls, the expectation is 5 of each. But you could end up with 6-4 or 4-6, or even 7-3 or 3-7. If you did the test a bazillion times, you are even likely to have some 2-8 or 1-9 or even 0-10 pulls. But those would be super rare. If you do one draw and you get a rarer combination, you either had a rare thing occur, or the draw was biased (in this case by biology intervening because the genes in the GO term are affected by the experiment). The p-value tells you how rare a result at least as extreme as yours would be under the null, and if it's super rare you are likely to conclude that the draw was biased by biology rather than being simply a rare occurrence.
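
The urn picture can be tabulated directly. This is a small Python sketch that assumes an urn of 50 black and 50 white balls (the urn size is not given above, so it is an assumption) and a handful of 10; `draw_probability` is a name made up for the illustration.

```python
from math import comb

def draw_probability(black_in_urn, white_in_urn, handful, black_drawn):
    """Hypergeometric probability of drawing exactly `black_drawn` black
    balls in a handful taken without replacement."""
    total = black_in_urn + white_in_urn
    ways = comb(black_in_urn, black_drawn) * comb(white_in_urn, handful - black_drawn)
    return ways / comb(total, handful)

# Assumed urn: 50 black and 50 white balls, handful of 10.
probs = [draw_probability(50, 50, 10, k) for k in range(11)]
for k, pr in enumerate(probs):
    print(f"{k} black / {10 - k} white: {pr:.4f}")

# One-sided p-value for a draw with 8 or more black balls.
p_8_or_more = sum(probs[8:])
print(f"P(8 or more black) = {p_8_or_more:.4f}")
```

The 5-5 split is the most likely outcome, the probabilities fall off symmetrically on either side, and the 0-10 and 10-0 draws are the rarest, which matches the description above.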

This brings me to my point, which is how you code the genes. You could either say that any gene in the GO term is 'black', and all other genes are 'white', OR you could say that any gene in the term is black, and any gene that is not in the term and also has a GO term appended is white. You can make philosophical arguments about how one should deal with the unannotated genes. Since they can never be black, regardless of the GO term, they could bias towards the null, and certainly are uninformative. I would have to check the code to make sure, but IIRC GOstats and topGO and goana and for sure goseq take the latter stance by ignoring/removing unannotated genes.
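
A rough way to see why the choice of universe matters is to run the same test twice with different 'white ball' definitions. In the Python sketch below all counts except 1765 are hypothetical: 12000 measured genes of which 3000 have no GO annotation, and 400 unannotated genes among the 1765 significant ones. It does not reproduce what any particular package does internally.

```python
from math import comb

def overrep_pvalue(universe, term_in_universe, selected, term_in_selected):
    """One-sided hypergeometric p-value, P(X >= term_in_selected)."""
    upper = min(term_in_universe, selected)
    favourable = sum(
        comb(term_in_universe, i) * comb(universe - term_in_universe, selected - i)
        for i in range(term_in_selected, upper + 1)
    )
    return favourable / comb(universe, selected)

# Hypothetical counts (only 1765 comes from the thread).
measured, unannotated = 12000, 3000
significant, significant_unannotated = 1765, 400
term_genes, term_genes_significant = 360, 124

# Coding 1: every gene not in the term is 'white', annotated or not.
p_all = overrep_pvalue(measured, term_genes, significant, term_genes_significant)

# Coding 2: unannotated genes are dropped from both the universe and the gene set.
p_annotated = overrep_pvalue(measured - unannotated, term_genes,
                             significant - significant_unannotated,
                             term_genes_significant)

print(f"all measured genes as universe: {p_all:.3g}")
print(f"annotated genes only:           {p_annotated:.3g}")
```

The two p-values differ even though the term counts are identical, which is consistent with results changing when uncharacterized genes are removed. How large the change is, and in which direction, depends on how many unannotated genes sit in the significant set relative to the universe.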

If that makes sense to you, then you can always exclude them by hand. But since this is RNA-Seq data, you might want to use goseq anyway, as gene length introduces a bias that a conventional GO hypergeometric test does not account for.
