Hi there!
I have a differential expression data set, consisting of a list of gene handles, their log2-fold expression change between two conditions, a p-value for that change, and annotated GO terms. It looks basically like this: https://www.dropbox.com/s/bswzujqx07t4lr2/VirtualBox_ubuntubox_11_11_2020_12_44_16.png?dl=0
I am currently working in R, and would like to do a GO term enrichment analysis to get a ranking of GO terms associated with genes that are most up/downregulated. However, I cant find any software/package that is doing that. I have started to get into the "topGO" package, but it seems there is no way to include fold-changes with that one.
Does anybody know how I can achieve my goal or can point me towards something to help me along the way?
Thanks so much!
This I don't understand. Why does a fold change not matter? I.e. in my data set I have both significantly upregulated and downregulated genes. Would I have to divide my dataset to determine which GO terms are enriched in the upregulated genes, and which in the downregulated ones?
Also, having spent the day reading through package descriptions, there are software solutions that do use the fold change, e.g. gage and foldGO. What is the difference to topGO? Sorry if these are stupid questions, I am completely overwhelmed by this topic..
Fold change doesn't matter for a hypergeometric because what you are testing for is an enrichment of a particular set of genes in your significant results. In other words, say the proteins appended to GO term XXX represent 1% of the proteins that you measured. And let's assume that 5% of the proteins that you say are significant are appended to that GO term.
In that scenario, you have some evidence that GO term XXX is enriched in your set of significant proteins (e.g., there are proportionally more of those proteins in your set of significant proteins than in the 'universe' of proteins you measured).
You could argue (I wouldn't, but you could) that it's only interesting if all of the proteins in that GO term are differentially expressed in the same direction, in which case something like foldGO might be of interest. But why would you think that? There is not always a concept of directionality for GO terms.
You could have a term like 'apoptotic process' that includes two child terms, 'positive regulation of apoptotic process', and 'negative regulation of apoptotic process'. And even within those two child terms there might be some proteins that are negative regulators of the process! So you might be able to argue (again, I wouldn't) that all the proteins in 'positive regulation of apoptotic process' have to be up-regulated, and all the proteins in 'negative regulation of apoptotic process' have to be down-regulated, but what do you argue for the parent term? Why would you expect any directionality at all?
In the context of a KeGG pathway, where there is a defined relationship between all the constituents of the pathway you can make the argument (like the SPIA package does) that you should use the directionality of the changes to infer something larger about the pathway itself. But in the context of GO terms, where the underlying proteins in a term are simply thought to be involved with a particular process, I think it is a very strong assumption to think that directionality is particularly useful.
removed
That does make some sense to me, thank you. I very much appreciate you taking time out of your day for this.
One more thought though: I could imagine a gene set that is equally distributed in the differentially and not differentially expressed genes but the genes that ARE significantly differentially expressed have a HUGE fold change. The GO terms associated with these genes would be ignored by this kind of analysis, would they not?
You are confusing concepts a bit. What you are describing is more akin to a gene set test, particularly a self-contained gene set test, which is intended to say if there is evidence that a particular set of genes is differentially expressed.
GO hypergeometric analyses are a different concept, where you are taking a set of genes that you think are differentially expressed and then testing to see if any particular gene set (e.g. GO term) is enriched in that set of significant genes. The heuristic that they tend to use when explaining the hypergeometric is to imagine a jar filled with black and white marbles (say 50/50 mix) if you reach in and grab 20 marbles, your expectation should be that you have about 50% of each (10 black, 10 white), and increasing numbers of either color get less and less likely. So it would be super rare to be able to get 5 and 15 and almost impossible to get 20 of one color.
The GO test is analogous. You have N proteins (sorry I keep vacillating between genes and proteins) that you measured, and M of them are in a particular GO term. Your set of significant proteins represents the handful of marbles, and you look to see if the proportion of proteins in your 'handful' that are appended to a given GO term are at a significantly different proportion than you would expect, given the proportions for all the proteins you measured. So if you have 5% of the proteins that you measured in a particular GO term, and 5% of the proteins in your set of significant proteins are also in that GO terms AND they are super differentially expressed, it still doesn't matter, because under the null you expect 5% to be from that GO term.
James's point that GO terms are usually not directional is correct. But note that the goana() and kegga() functions in limma and edgeR routinely separate the up-regulated and down-regulated genes when a fit object is supplied and return separate GO analyses for the two sets. There are many workflows illustrating this, for example:
RnaSeqGeneEdgeRQL workflow
If you wanted to give more weight to genes that are very strongly DE then you would use a gene set test like camera(), see for example:
RnaSeq123 workflow