I want to use the groupGO function from cluster profiler, but do not exactly understand the level argument. The help page (hence ?groupGO) says: "Specific GO level." When I play around with it (giving it different numbers), I still don't seem to understand what and how this argument works.
> ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 20, readable = F) > dim(as.data.frame(ggo_all)) [1] 2 5 > ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 2, readable = F) > dim(as.data.frame(ggo_all)) [1] 33 5 > ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 3, readable = F) > dim(as.data.frame(ggo_all)) [1] 585 5 > ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 10, readable = F) > dim(as.data.frame(ggo_all)) [1] 13782 5 > ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 15, readable = F) > dim(as.data.frame(ggo_all)) [1] 1046 5
What I would like to do is, to get all GO terms for my genes. But I don't understand what the level argument should be set on to achieve this.
> sessionInfo() R version 3.4.3 (2017-11-30) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] grid parallel stats4 stats graphics grDevices utils [8] datasets methods base other attached packages: [1] Rgraphviz_2.22.0 topGO_2.30.1 SparseM_1.77 [4] GO.db_3.5.0 graph_1.56.0 BiocInstaller_1.28.0 [7] org.Hs.eg.db_3.5.0 AnnotationDbi_1.40.0 IRanges_2.12.0 [10] S4Vectors_0.16.0 Biobase_2.38.0 BiocGenerics_0.24.0 [13] clusterProfiler_3.6.0 DOSE_3.4.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.15 pillar_1.2.1 compiler_3.4.3 [4] plyr_1.8.4 tools_3.4.3 digest_0.6.15 [7] bit_1.1-12 lattice_0.20-35 RSQLite_2.0 [10] memoise_1.1.0 tibble_1.4.2 gtable_0.2.0 [13] pkgconfig_2.0.1 rlang_0.2.0 fastmatch_1.1-0 [16] igraph_1.1.2 DBI_0.8 rvcheck_0.0.9 [19] fgsea_1.4.1 gridExtra_2.3 stringr_1.3.0 [22] bit64_0.9-7 glue_1.2.0 qvalue_2.10.0 [25] data.table_1.10.4-3 BiocParallel_1.12.0 GOSemSim_2.4.1 [28] purrr_0.2.4 tidyr_0.8.0 DO.db_2.9 [31] ggplot2_2.2.1 reshape2_1.4.3 blob_1.1.0 [34] magrittr_1.5 matrixStats_0.53.1 splines_3.4.3 [37] scales_0.5.0 colorspace_1.3-2 labeling_0.3 [40] stringi_1.1.6 lazyeval_0.2.1 munsell_0.4.3

Thanks for the explanation and example. So how do I get all GO terms for all levels? If you look at my examples, you'll see that with
level = 15(and20)the number of terms seem to decrease compared withlevel = 10. Why does this decrease?Have you read the vignette? It seems pretty straightforward. The goal for this support site is to help those who are truly stuck. While R and Bioconductor are free to download and install, they are not truly free. You will have to be willing to invest the necessary time and effort to figure out how things work and what you should be doing.
One thing this support site is not intended to be is a conventional support service like what you get with paid software. The vast majority of the helpers on this site are giving their time and experience to you for free, and if it seems like you are taking advantage and not doing your own homework, you run the very real risk of being considered 'one of those people' who end up being routinely ignored.
Thanks for your comment James, and yes I have read the vignette but still can't figure out how level should be set to get all the terms, or how it is possible that the number of terms decreases with very high level.
I don't completely understand your irritation, because I don't understand why I give the impression that I take advantage of this support site, or that I don't do my homework. But if I gave you that impression I am sorry for the inconvenience.
I'm not irritated. I'm trying to explain something to you. Evidently not well.
So let me try to explain. I know nothing about clusterProfiler. But if I go to the vignette and click on the link for GO Analysis, I see this:
In clusterProfiler,
groupGOis designed for gene classification based on GO distribution at a specific level.And immediately following that, in the next paragraph, I see
GO over-representation test
Over-representation test3 were implemented in clusterProfiler. For calculation details and explanation of paramters, please refer to the vignette of DOSE.
ego <- enrichGO(gene = gene, universe = names(geneList), OrgDb = org.Hs.eg.db, ont = "CC", pAdjustMethod = "BH", pvalueCutoff = 0.01, qvalueCutoff = 0.05, readable = TRUE) head(ego)So without knowing anything about this package, and taking like two minutes of my time, I now know that the function
groupGOis for doing a hypergeometric test at a specific level, and thatenrichGOis for doing a hypergeometric using all levels.The expectation is that you could have done that yourself, rather than having somebody read the section and parse it for you. So to reiterate my earlier point, if people start to think that you are unwilling to read and understand the documentation provided to you, particularly when it is laid out in clear and understandable text, they may themselves become unwilling to offer help when you ask for it. That is not intended to imply some irritation with you on my part, but to tell you how things work, and to offer some constructive criticism.
Thanks again, I understand what you mean. Sorry I gave you that impression. I was actually not asking about a simple hypergeometric test, I understand how to do that with this package. I thought I was clear about that in my question, but probably I was not. I was interested in getting all GO terms, for all my genes (which is a list of all ensemble genes). I would like to get this gene 2 GO term data frame for further analysis (see how many genes overlap per term). Thanks for your time, I think I will stick to GOseq, where I do understand the vignette better.
So you buried your actual question in a bunch of other stuff. If you want to know all the GO IDs for each ensembl gene ID, you should just use biomaRt. Trying to use a package for some side effect is not really the way to go.
> getBM(c("ensembl_gene_id","go_id"), "ensembl_gene_id", "ENSG00000139618", mart) ensembl_gene_id go_id 1 ENSG00000139618 GO:0006281 2 ENSG00000139618 GO:0000724 3 ENSG00000139618 GO:0005634 4 ENSG00000139618 GO:0005737 5 ENSG00000139618 GO:0005829 6 ENSG00000139618 GO:0005654 7 ENSG00000139618 GO:0003677 8 ENSG00000139618 GO:0005856 9 ENSG00000139618 GO:0005515 10 ENSG00000139618 GO:0006310 11 ENSG00000139618 GO:0006974 12 ENSG00000139618 GO:0007049 13 ENSG00000139618 GO:0045893 14 ENSG00000139618 GO:0005813 15 ENSG00000139618 GO:0005815 16 ENSG00000139618 GO:0042802 17 ENSG00000139618 GO:0032991 18 ENSG00000139618 GO:0000784 19 ENSG00000139618 GO:0043967 20 ENSG00000139618 GO:0002020 21 ENSG00000139618 GO:0000281 22 ENSG00000139618 GO:0030141 23 ENSG00000139618 GO:0070200 24 ENSG00000139618 GO:0010485 25 ENSG00000139618 GO:0003697 26 ENSG00000139618 GO:0000731 27 ENSG00000139618 GO:0000732 28 ENSG00000139618 GO:0043966 29 ENSG00000139618 GO:0051298 30 ENSG00000139618 GO:0006302 31 ENSG00000139618 GO:0043015 32 ENSG00000139618 GO:0008022 33 ENSG00000139618 GO:0006289 34 ENSG00000139618 GO:0033600 35 ENSG00000139618 GO:0000800 36 ENSG00000139618 GO:0010484 37 ENSG00000139618 GO:0000722 38 ENSG00000139618 GO:0001556 39 ENSG00000139618 GO:0001833 40 ENSG00000139618 GO:0006978 41 ENSG00000139618 GO:0007141 42 ENSG00000139618 GO:0007283 43 ENSG00000139618 GO:0007420 44 ENSG00000139618 GO:0007569 45 ENSG00000139618 GO:0008283 46 ENSG00000139618 GO:0008585 47 ENSG00000139618 GO:0008630 48 ENSG00000139618 GO:0010165 49 ENSG00000139618 GO:0010225 50 ENSG00000139618 GO:0010332 51 ENSG00000139618 GO:0030097 52 ENSG00000139618 GO:0032465 53 ENSG00000139618 GO:0042771 54 ENSG00000139618 GO:0043009 55 ENSG00000139618 GO:0045931 56 ENSG00000139618 GO:0048478 57 ENSG00000139618 GO:0051276 58 ENSG00000139618 GO:1990426 59 ENSG00000139618 GO:0033593 60 ENSG00000139618From the vignette here:
In clusterProfiler, groupGO is designed for gene classification based on GO distribution at a specific level.
Over-representation test were implemented in clusterProfiler. For calculation details and explanation of parameters, please refer to the vignette of DOSE. enrichGO test the whole GO corpus, ...
The decrease of GO terms when increasing the levels is (AFAIK) due the fact that not all trees have an equal 'depth'. In other words, some trees stop at e.g. level 15, others already at level 10.
Moreover, the nature of the GO classification also makes things complex. As James indicated, the GO classification is a directed acyclic graph. GO terms do not occupy strict fixed levels in the hierarchy. Because GO is structured as a graph, terms would appear at different 'levels' if different paths were followed through the graph. This is especially true if one mixes the different relations used to connect terms. In other words, a problem is that GO hierarchy is set-up in a way where you can take multiple paths to reach the same term. Let's say a gene has the term, neuron differentiation. You can reach that term through neuron development or cell differentiation. Depending on the path, the level (depth) could be different. (Modified from this post here at Biostars)
Thanks for your comment. So due to the complexity of the GO graph structure, it is probably not possible to find a level where all GO terms are given. I didn't realize it, thanks for your time.