Question: GO Stats results contain GO terms NOT present in the gene universe
1
8 months ago by
sven.schenk10
Vienna
sven.schenk10 wrote:

hi,

i used GOstats to check for over and under represented GO terms in my gene lists using a conditional test.

i then combined the three output files, i.e. the files for MF, BP and CC, into one list containing 175 under represented terms:

> head(GO.Mat,2)
GOBPID   Pvalue OddsRatio  ExpCount Count Size
1 GO:0006260 2.78e-21 0.1447017  74.52345    14  314
2 GO:0034641 3.38e-20 0.6109688 713.01916   539 3019
Term      test type subset
1                              DNA replication under    P allMat
2 cellular nitrogen compound metabolic process  under    P allMat

> nrow(GO.Mat)
[1] 175

then i pulled out all unique GO terms contained in the universe (16921 genes) resulting in 2472 terms:

> allGOterms<-subset(gU_frame.new, !(duplicated(gU_frame.new$frame.go_id)), select="frame.go_id") > nrow(allGOterms) [1] 2472 i now merged both lists, i.e. the list containing the 175 under represented terms (GO.Mat) and the unique GO terms (allGOterms) from my universe (the idea was to get an overview of the under represented terms in form of a heatmap). as result i was expecting to get 2472 entries in this merged list, however, this is not the case: > nrow(allGOterms) #these are the unique GO terms from the universe [1] 2472 > nrow(GO.Mat) #this is the resluts list with the under rep. GOterms [1] 175 > allGOa.b<-merge(allGOterms, GO.Mat, by=1, all=T) > nrow(allGOa.b) [1] 2551 this leaves me with 79 GO terms in my result list (GO.Mat), which are not present in the universe. unless i fundamentally misunderstood the principle of GOstats, this doesnt seem logic. does anybody know, why this happens? and how to deal with it? To test for under and over representation i followed the guideline for unsupported organisms (https://bioconductor.statistik.tu-dortmund.de/packages/3.4/bioc/vignettes/GOstats/inst/doc/GOstatsForUnsupportedOrganisms.pdf). thanks a lot sven > sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 locale: [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C [5] LC_TIME=English_United Kingdom.1252 attached base packages: [1] grid stats4 parallel stats graphics grDevices datasets utils [9] methods base other attached packages: [1] GSEABase_1.34.1 Rgraphviz_2.16.0 xtable_1.8-2 RColorBrewer_1.1-2 [5] genefilter_1.54.2 annotate_1.50.1 XML_3.98-1.9 GO.db_3.3.0 [9] hgu95av2.db_3.2.3 org.Hs.eg.db_3.3.0 ALL_1.14.0 GOstats_2.40.0 [13] graph_1.50.0 Category_2.38.0 Matrix_1.2-12 AnnotationDbi_1.34.4 [17] IRanges_2.6.1 S4Vectors_0.10.3 Biobase_2.32.0 BiocGenerics_0.18.0 [21] installr_0.18.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.14 bitops_1.0-6 digest_0.6.12 [4] bit_1.1-12 RSQLite_2.0 memoise_1.1.0 [7] tibble_1.3.4 lattice_0.20-35 pkgconfig_2.0.1 [10] rlang_0.1.4 DBI_0.7 bit64_0.9-7 [13] RBGL_1.48.1 survival_2.41-3 blob_1.1.0 [16] splines_3.3.2 AnnotationForge_1.14.2 RCurl_1.95-4.8 > gostats • 238 views ADD COMMENTlink modified 8 months ago by emmascottgm0 • written 8 months ago by sven.schenk10 addendum: my GOStats code would look something like this: >goFrame_gU.new.H = GOFrame(gU_frame.new, organism="Homo sapiens") >goALLFrame.gU.new.H = GOAllFrame(goFrame_gU.new.H) >ids <-c("comp2258926_c0_seq3","comp2258476_c0_seq2","comp2246178_c0_seq1","comp2268190_c1_seq4") >universe<-as.character(gU_frame.new$frame.gene_id)

>genes<-ids

>params.BP_under <- GSEAGOHyperGParams(name="Under_Represented BP_Conditional Params",
geneSetCollection=gsc.gU.H,
geneIds = genes,
universeGeneIds = universe,
ontology = "BP",
pvalueCutoff = 0.05,
conditional = TRUE,
testDirection = "under")

>genes.BP_under <- hyperGTest(params.BP_under)
genes.BP_under
>x<-summary(genes.BP_under)

cheers

Sven

Answer: A: GO Stats results contain GO terms NOT present in the gene universe
1
8 months ago by
United States
James W. MacDonald48k wrote:

You are using a very outdated version of R/Bioconductor. Do note that every release we update GO.db to contain the current set of IDs. If you have a temporal mismatch between what is in the GO.db you have installed and the source of GO IDs, then it's to be expected that things will be different. These things change, and more rapidly than you might imagine. The first step is to install a current version of R and Bioconductor, and see if that helps.

Answer: A: GO Stats results contain GO terms NOT present in the gene universe
0
8 months ago by
sven.schenk10
Vienna
sven.schenk10 wrote:

hi,

thanks for your answer. so i updated my GO.db from 3.3.0 to 3.6.0 and rerun the analysis, however the problem persits (it even gets more pronounced). so i think the problem is the time of annotaion of my transcriptome which was in July 16 and my old GO.db (3.3.0) was form October 16 wich led to the differences and mow with the new db version it gets even "worse". however, what i dont understand is why this happens. i thought the "to-test-gene list" was tested against the universe, and if a GO term is not present in the universe, then it cannot appear in the result. i would understand if this to happe if i would provide sequences but that i don't do; i just provide sequence identifiers from a de nove transcriptome assembly.

so i guess i would have to check for an even older db than 3.3.0 to resolve the issue (or re-annotate the transcriptome).

cheers

sven

We just (like days ago) changed the link below to say this:

If you want to provide a solution to the original question, click Add Answer .

Now to your question. I previously said that GO is a moving target and terms come in and come out with some regularity, and said you probably needed to update your installation. Implicit in that (I thought) is the idea that this included updating the mappings you have between whatever IDs you started with, and the GO terms appended to those IDs. If you update the GO.db package yet continue to use old mappings for your transcripts are you really surprised that it gets even worse? That would be my expectation!

I should also note that what you wrote implies to me that you simply updated the GO.db package, rather than the entirety of your R/Bioc installation. Do note that the whole premise of the Bioconductor project is to be able to supply a set of coherent packages that are all guaranteed to work together, and to that end we have an installer called biocLite that ensures that all the packages installed come from the same development cycle.

You are free to mix and match things as you wish, but in so doing you are signaling to us that you think you know what you are doing. That's perfectly fine, but it's hard enough for people to help those with conventional installations, and we cannot provide support for those who have unconventional installations (of which there are nearly a limitless number).

hi again,

sorry for not using the COMMENT/REPLY button.

no, i did not just update GO.db, i updated the whole installation from 3.3.2 to 3.5.0 it wouldn't work otherwise. yes, you're right that it would be expected that i find more non-matching GO terms if i use a newer DB, however, even if i use a matching DB (e.g. October or April 2016) i.e. when i annotated my sequences the problem that i find GO terms in my result that werent in the universe are in the results persits.

i still think that i should only be able to find in the results what was also annotated in the universe, right?

cheers

sven

1

OK, I think I get what you are asking. You have this data.frame with GO terms that you are starting with, and when you do the hypergeometric test you end up with GO terms that weren't in that original data.frame, right? This is because you aren't using the data in your original data.frame any more. When you use GOFrame and GOAllFrame, you are making queries to the GO.db package to recreate the GO directed acyclic graph. This will add GO terms that might not have been in your original data.frame. As an example, we can use the example from ?GOFrame:

> example(GOFrame)

GOFram>   ## Make up an example
GOFram>   genes = c(1,10,100)

GOFram>   evi = c("ND","IEA","IDA")

GOFram>   GOIds = c("GO:0008150","GO:0008152","GO:0001666")

GOFram>   frameData = data.frame(cbind(GOIds,evi,genes))

GOFram>   library(AnnotationDbi)

GOFram>   frame=GOFrame(frameData,organism="Homo sapiens")

GOFram>   allFrame=GOAllFrame(frame)

GOFram>   getGOFrameData(allFrame)
go_id evidence gene_id
1  GO:0001666      IDA     100
2  GO:0006950      IDA     100
3  GO:0008150      IDA     100
4  GO:0008150      IEA      10
5  GO:0008150       ND       1
6  GO:0008152      IEA      10
7  GO:0009628      IDA     100
8  GO:0036293      IDA     100
9  GO:0050896      IDA     100
10 GO:0070482      IDA     100

So we started with three GO IDs and gained five more.

And note that the universe here isn't a set of GO IDs, but instead some other IDs. So talking about GO terms that aren't in the universe doesn't really make sense.

hi,

thanks for your answer. i came to the same conclusion, when checking whats happening during building the GOallframe - in this I have something like 4700 Go terms insteadt of 2472 in my input list. so it was indeed a misunderstanding how GOstats works.