I have been doing some hypergeometric tests for Oryza sativa japonica MSU7, following Marc Carlson's vignette "How to use GOstats and Category to do hypergeometric testing with unsupported organisms", dated October 13, 2014.
Now I would like to find the genes corresponding to the significant GO terms. I have found previous help for supported model organisms, but is there a way to do this for unsupported organisms?
How can I see GO terms that are not in my goslim? When GOstats finds a significant GO term, does it also test all direct descendants and antecedents? If so, could I find additional annotation by doing what you suggest in your reply to my first message?
I was expecting to find the same number of genes as in the 'Count' column. There is only one gene in this example, but in other examples I find more. In one case the 'Count' was 10 and I found 9 genes with that GO term in my goslim. The number I find still seems to be fewer than the count. Can you explain why this is?
So it sounds like you want to get genes mapped to GO terms. We used to get those from blast2GO. But with my most recent attempt to make new annotations, it appears that they may have gone commercial on us. :( So how we (as a project) will get those terms mapped in the future is currently unknown. But right now we still have some reasonably current mappings from back when they still were sharing them.
And you can get organism annotations for a whole range of things by using the development version of AnnotationHub like this (please note that for this to work you have to be using the devel branch as AnnotationHub has changed DRASTICALLY). Anyhow here goes:
library(AnnotationHub)
## Connect to the hub and see which kinds of objects it offers:
ah = AnnotationHub()
unique(ah$rdataclass)
## Keep only the OrgDb records:
ahs = subset(ah, ah$rdataclass=="OrgDb")
## Then look at the available species:
availSpecies = unique(ahs$species)
## Then choose the one you want (hopefully it's in there) and do this:
finalAh = subset(ahs, ahs$species=="Pseudomonas mendocina_NK-01")
org = finalAh[[1]]
## Then you can get data from this object in the usual way (like so):
columns(org)
keytypes(org)
k = head(keys(org, keytype='ENTREZID'))
head(select(org, k, 'GO', 'ENTREZID'))
Anyhow this is currently the widest range of pre-made OrgDb objects that we provide access to. But if someone could point me to a more complete resource for GO to gene mappings we could probably do even better.
Hi Marc
Thank you for your speedy reply.
I am realising just how confused I am by all this GO stuff, but it was late on this side of the Atlantic when I sent my message last night.
I had already got a goslim from here:
ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/
and followed the instructions on page 2 of your vignette to create a GeneSetCollection called gsc, which I included as a parameter for GSEAGOHyperGParams, and ran hypergeometric tests.
I then looked up the significant GO terms in my goslim and wrote out the summary of my object and the corresponding genes from my goslim. Here is one example for a conditional test for CC:
  GOPID       Pvalue       OddsRatio    ExpCount     Count  Size  Term                        Genes
1 GO:0005622  0.011265055  0.405231784  16.90556104  10     9633  intracellular               LOC_Os11g47620.1//ZOS11-09 - C2H2 zinc finger protein, expressed
2 GO:0043229  0.030091569  0.447231871  13.65536909  8      7781  intracellular organelle
3 GO:0043227  0.042152233  0.472956811  13.2201382   8      7533  membrane-bounded organelle
4 GO:0005634  0.047673968  0.195677219  4.522540309  1      2577  nucleus                     LOC_Os07g31750.1//chalcone synthase, putative, expressed
I was puzzled because GO:0043229 and GO:0043227 are not in my goslim and so I thought I needed some additional annotation. Now I am wondering if it is my lack of understanding of GOstats. So I have two questions:
Thank you for your help.
Regards
Krys
Hi Krys,
So I think that for both of these questions you need to remember that GO is a directed acyclic graph. So for any point in the graph, there could be more genes to consider than the ones that exactly match the specific node in question... For example, there could be more specifically labeled genes that would also match the term (even though they are labeled with a different but more specific term than the one you are asking about). This is going to be especially true for a very general term like 'intracellular'.
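To make the graph point concrete, here is a small base-R sketch. Everything in it is made up for illustration: the parent links are a simplified chain loosely modelled on the CC terms in your table, and the gene IDs are placeholders. It shows how a gene labeled only with a specific term still counts toward every more general term above it, which is (as I understand it) the kind of expansion the GOAllFrame step in the vignette performs.

```r
# Toy GO-like DAG: each term lists its direct parents.
# Simplified, hypothetical layout (not the real GO structure).
parents <- list(
  "nucleus"                    = "membrane-bounded organelle",
  "membrane-bounded organelle" = "intracellular organelle",
  "intracellular organelle"    = "intracellular",
  "intracellular"              = character(0)
)

# Direct annotations, as a goslim-style table would give them (made-up gene IDs).
direct <- list(
  "nucleus"       = c("geneA", "geneB"),
  "intracellular" = c("geneC")
)

# Collect a term plus all of its ancestors by walking up the parent links.
ancestors_and_self <- function(term) {
  seen <- character(0)
  todo <- term
  while (length(todo) > 0) {
    current <- todo[1]
    todo <- todo[-1]
    if (!(current %in% seen)) {
      seen <- c(seen, current)
      todo <- c(todo, parents[[current]])
    }
  }
  seen
}

# Propagate: a gene annotated to a term also counts for every ancestor term.
all_annot <- list()
for (term in names(direct)) {
  for (anc in ancestors_and_self(term)) {
    all_annot[[anc]] <- union(all_annot[[anc]], direct[[term]])
  }
}

# Compare genes labeled exactly with a term against genes counted after propagation.
direct_counts <- sapply(names(parents), function(nm) length(direct[[nm]]))
all_counts    <- sapply(names(parents), function(nm) length(all_annot[[nm]]))
print(cbind(direct = direct_counts, propagated = all_counts))
```

In this toy case 'intracellular organelle' and 'membrane-bounded organelle' have no directly labeled genes at all, yet each picks up the two genes labeled 'nucleus', and 'intracellular' ends up with more genes than the single one labeled with it directly.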
And just in case it helps, you can also see more about this by looking at the GOstats vignette here:
http://bioconductor.org/packages/devel/bioc/vignettes/GOstats/inst/doc/GOstatsHyperG.pdf
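If what you want is the list of genes behind each significant term, you may not need to go back to the goslim at all. The sketch below assumes you still have the object returned by hyperGTest() and uses the sigCategories() and geneIdsByCategory() accessors from the Category package; the wrapper name genes_by_significant_term and the variable names in the usage comments are just placeholders. I would check how this behaves for conditional tests against the GOstats documentation before relying on it.

```r
# Sketch only: 'hg_result' is assumed to be the object returned by hyperGTest()
# (with GOstats/Category already loaded to create it).
genes_by_significant_term <- function(hg_result, p = 0.05) {
  # IDs of the categories (GO terms) with a p-value below the cutoff
  sig_ids <- Category::sigCategories(hg_result, p)
  # For each of those terms, the selected genes annotated at or below it
  Category::geneIdsByCategory(hg_result, catids = sig_ids)
}

# Hypothetical usage, where 'Over' is your hyperGTest() result:
# sig_genes <- genes_by_significant_term(Over)
# sapply(sig_genes, length)   # compare with the 'Count' column of summary(Over)
```

The result should be a list with one vector of gene IDs per significant GO term, which you can then join back to your own gene descriptions.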
Marc