The comment you responded to comes from a bot (which you can always tell because it has an embedded link that leads to a game site or whatever). Luckily we have AI these days, so we can have credible-looking responses like that...
Anyway, what you are doing may not make sense. Clustering the genes and then testing for enrichment of GO terms in the clusters is probably less informative than simply testing for significant GO terms in your set of significant genes and then maybe generating a heatmap of the genes that are in one or more interesting GO terms (there's a sketch of that at the end of this answer). But anyway,
1.) This will automatically occur if you are using reasonable software to do the GO enrichment (GOstats, topGO, goana from the limma package, etc.).
2.) Who knows why people do what they do? I would not base what you do on what you saw other people doing, particularly if you don't understand their rationale. The usual goal for a GO test is to identify processes that might be perturbed in your experimental system. If the top 10 GO terms are not interesting to you, but 11-15 are, then probably you should use the interesting terms. Or maybe research the terms you find uninteresting to see if they are actually interesting. You know, do some science.
When you do a GO overrepresentation analysis, you are comparing the proportion of genes with a given GO term in your entire set of genes to the proportion of those genes in your set of significant genes. As an example, let's say a given term is appended to 3% of all the genes you measured, but it's higher (7%) in your significant genes. That could be due to A) the process underlying that GO term being perturbed in your experiment, so more of those genes come up significant, or B) random sampling of the genes that happened to grab more of those genes this time. The p-value estimates the proportion of the time you expect B to occur when the GO term isn't actually perturbed (one way to simulate that would be to randomly select 1765 genes many, many times and perform the GO hypergeometric test each time; since it's a random selection of genes, it approximates the null distribution; there's a short sketch of this after point 3). A p < 0.05 indicates that under the null distribution you expect to see a result like that about 5% of the time. It's not an enrichment score, which is usually computed in the context of a gene set test rather than a hypergeometric test.
3.) That doesn't make sense, really. What if the underlying GO term is represented in 12% of all genes? There is an argument for filtering out small GO terms (say, a term with < 10 genes), because it's easy to get significance with just one or two genes, but you should rely on the p-value and your biological knowledge to decide which GO terms are the important ones.
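To put rough numbers on the p-value explanation in point 2, here is a minimal sketch of that simulation next to the analytical hypergeometric test. The 1765 comes from your post; the 20,000-gene universe and the 3%/7% proportions are just the made-up example numbers from above.

```r
set.seed(1)
N <- 20000                 # all measured genes (made-up universe size)
m <- round(0.03 * N)       # genes carrying the GO term (3% of the universe)
k <- 1765                  # size of the significant gene set
x <- round(0.07 * k)       # observed GO-term genes among the significant ones (~7%)

# null simulation: draw 1765 genes at random many, many times and count GO-term genes
urn <- c(rep(1, m), rep(0, N - m))
null_counts <- replicate(10000, sum(sample(urn, k)))
mean(null_counts >= x)     # empirical p-value (essentially 0 here; 7% vs 3% is a big shift)

# the hypergeometric test gives the same answer analytically
phyper(x - 1, m, N - m, k, lower.tail = FALSE)
```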
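As for the simpler route suggested at the top (test the significant genes directly, then heatmap the genes in a term you care about), here is a hedged sketch using `clusterProfiler`, assuming human Entrez IDs; `sig_genes`, `all_genes` and `expr_mat` are placeholder objects, not things from your post.

```r
library(clusterProfiler)
library(org.Hs.eg.db)

ego <- enrichGO(gene          = sig_genes,     # your significant genes
                universe      = all_genes,     # everything you measured
                OrgDb         = org.Hs.eg.db,
                ont           = "BP",
                pAdjustMethod = "BH",
                minGSSize     = 10,            # drops the tiny GO terms mentioned in point 3
                readable      = TRUE)

# genes in the top term (or whichever term is biologically interesting)
res  <- as.data.frame(ego)
hits <- strsplit(res$geneID[1], "/")[[1]]

# heatmap of those genes; expr_mat rows assumed to be named the same way as the enrichGO output
heatmap(expr_mat[hits, ], scale = "row")
```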
It all makes sense now: I was being superficially AI-splained. I was really wondering what the deal was with that link. I guess the answer made me feel validated enough that I stopped overthinking it. Thank you! None of the GO terms are really linked to our initial research question, actually, so I would not mind being more exploratory at the beginning, keeping all the GO terms, and then later focusing on a set of specific genes we were interested in to see how their expression varies across the different levels of the condition, if that makes sense.
The result was different depending on whether the uncharacterized proteins were still in or not. I ran the `enrichGO` function of `clusterProfiler`. Sorry, I realize I was not being clear: I already filtered out non-significant GO terms. So far I have plotted, per cluster, the GO terms that were supported by at least 10% of the genes in the cluster. I thought it would be good to account for cluster size in some way, since p-values will be smaller in bigger clusters (a higher number of genes means a larger sample size). So I am filtering out the GO terms that are supported by only 1 or 2 genes; however, I do have a small cluster of 15 genes, for which I plotted the GO terms supported by at least 2 genes. There is only a small number of significant GO terms, since the p-value reflects the number of genes in the cluster.
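To be concrete, the filter I mean is roughly the following (only a sketch; `ego` and `cluster_genes` are placeholders, not my actual objects):

```r
# ego: the enrichGO() result for one cluster; cluster_genes: the genes in that cluster
res <- as.data.frame(ego)

# keep GO terms supported by at least 10% of the cluster's genes,
# with a floor of 2 genes for very small clusters
min_genes <- max(2, ceiling(0.10 * length(cluster_genes)))
res_kept  <- res[res$Count >= min_genes, ]
```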
1.) That's one way you can do the test. The heuristic way the hypergeometric test is explained in basic stats is to envision an urn that contains a mixture of white and black balls. You reach in without looking and grab a handful. Under the null, you expect a relatively consistent mixture of black and white balls in your hand as compared to what was in the urn. If it's a 50-50 mix, and you pull out 10 balls, the expectation is 5 of each. But you could end up with 6-4 or 4-6, or even 7-3 or 3-7. If you did the test a bazillion times, you are even likely to have some 2-8 or 1-9 or even 0-10 pulls. But those would be super rare. If you do one draw and you get a more rare combination, you either had a rare thing occur, or the draw was biased (in this case by biology intervening because the genes in the GO term are affected by the experiment). The p-value tells you how rare your observed result would be under the null, and if it's super rare you are likely to conclude that the draw was biased by biology rather than being simply a rare occurrence.
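Just to put numbers on the urn analogy (this is only an illustration, not your data), base R's hypergeometric functions show how quickly the lopsided draws become rare:

```r
# Urn with 50 black and 50 white balls; draw 10 without replacement.
# dhyper(x, m, n, k) = probability of exactly x black balls when the urn
# holds m black and n white balls and you draw k of them.
round(dhyper(0:10, m = 50, n = 50, k = 10), 4)
# the 5-5 split is the most likely pull; 9-1 or 10-0 draws are vanishingly rare

# probability of pulling 8 or more black balls (the upper-tail p-value):
phyper(7, m = 50, n = 50, k = 10, lower.tail = FALSE)
```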
This brings me to my point, which is how you code the genes. You could either say that any gene in the GO term is 'black' and all other genes are 'white', OR you could say that any gene in the term is black and any gene that is not in the term but does have some GO term appended is white. You can make philosophical arguments about how one should deal with the unannotated genes. Since they can never be black, regardless of the GO term, they could bias towards the null, and they are certainly uninformative. I would have to check the code to make sure, but IIRC `GOstats` and `topGO` and `goana`, and for sure `goseq`, take the latter stance by ignoring/removing unannotated genes. If that makes sense to you, then you can always exclude them by hand. But since this is RNA-Seq data, you might want to use `goseq` anyway, as gene length introduces a bias that a conventional GO hypergeometric test does not account for.
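If it helps, here is a minimal sketch of the `goseq` route, assuming human Ensembl gene IDs ("hg19"/"ensGene", and the `all_genes`/`sig_genes` vectors, are placeholders for whatever your data actually uses):

```r
library(goseq)

# named 0/1 vector over ALL measured genes: 1 = significant, 0 = not
de_vector <- as.integer(all_genes %in% sig_genes)
names(de_vector) <- all_genes

# nullp() fits the weighting function that accounts for gene length bias;
# swap "hg19"/"ensGene" for the genome/ID type that matches your data
pwf    <- nullp(de_vector, "hg19", "ensGene")
go_res <- goseq(pwf, "hg19", "ensGene", test.cats = "GO:BP")

head(go_res[go_res$over_represented_pvalue < 0.05, ])
```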