Question: What does "level" argument mean in groupGO function (cluster profiler)?
0
19 months ago by
b.nota340
Netherlands
b.nota340 wrote:

I want to use the groupGO function from cluster profiler, but do not exactly understand the level argument. The help page (hence ?groupGO) says: "Specific GO level." When I play around with it (giving it different numbers), I still don't seem to understand what and how this argument works.

> ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 20, readable = F)
> dim(as.data.frame(ggo_all))
[1] 2 5
> ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 2, readable = F)
> dim(as.data.frame(ggo_all))
[1] 33  5
> ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 3, readable = F)
> dim(as.data.frame(ggo_all))
[1] 585   5
> ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 10, readable = F)
> dim(as.data.frame(ggo_all))
[1] 13782     5
> ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 15, readable = F)
> dim(as.data.frame(ggo_all))
[1] 1046    5

What I would like to do is, to get all GO terms for my genes. But I don't understand what the level argument should be set on to achieve this.

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] grid      parallel  stats4    stats     graphics  grDevices utils
[8] datasets  methods   base

other attached packages:
[1] Rgraphviz_2.22.0      topGO_2.30.1          SparseM_1.77
[4] GO.db_3.5.0           graph_1.56.0          BiocInstaller_1.28.0
[7] org.Hs.eg.db_3.5.0    AnnotationDbi_1.40.0  IRanges_2.12.0
[10] S4Vectors_0.16.0      Biobase_2.38.0        BiocGenerics_0.24.0
[13] clusterProfiler_3.6.0 DOSE_3.4.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.15        pillar_1.2.1        compiler_3.4.3
[4] plyr_1.8.4          tools_3.4.3         digest_0.6.15
[7] bit_1.1-12          lattice_0.20-35     RSQLite_2.0
[10] memoise_1.1.0       tibble_1.4.2        gtable_0.2.0
[13] pkgconfig_2.0.1     rlang_0.2.0         fastmatch_1.1-0
[16] igraph_1.1.2        DBI_0.8             rvcheck_0.0.9
[19] fgsea_1.4.1         gridExtra_2.3       stringr_1.3.0
[22] bit64_0.9-7         glue_1.2.0          qvalue_2.10.0
[25] data.table_1.10.4-3 BiocParallel_1.12.0 GOSemSim_2.4.1
[28] purrr_0.2.4         tidyr_0.8.0         DO.db_2.9
[31] ggplot2_2.2.1       reshape2_1.4.3      blob_1.1.0
[34] magrittr_1.5        matrixStats_0.53.1  splines_3.4.3
[37] scales_0.5.0        colorspace_1.3-2    labeling_0.3
[40] stringi_1.1.6       lazyeval_0.2.1      munsell_0.4.3
go clusterprofiler groupgo • 542 views
modified 19 months ago by James W. MacDonald51k • written 19 months ago by b.nota340
Answer: What does "level" argument mean in groupGO function (cluster profiler)?
0
19 months ago by
United States
James W. MacDonald51k wrote:

The Gene Ontology is a directed acyclic graph, so there are levels to the graph. At level 1, for e.g., the BP ontology, there is only one term, biological process. At the next level, there are something like 33 terms, that are more specific than biological process, but less specific than the next level. The lower the level, all things equal, the more specific the term becomes, and the more terms there are, and the fewer genes there are for a term. Using data from example(groupGO),

> as.data.frame(groupGO(gcSample[[1]], "org.Hs.eg.db", ont = "BP", level = 1))[,1:2]
ID        Description
GO:0008150 biological_process

And if we choose level 2

> z <- as.data.frame(groupGO(gcSample[[1]], "org.Hs.eg.db", ont = "BP", level = 2))
> dim(z)
[1] 33  5

> z[1:10,2]
[1] reproduction
[2] metabolic process
[3] cell killing
[4] immune system process
[5] sulfur utilization
[6] phosphorus utilization
[7] growth
[8] behavior
[9] cell proliferation
[10] carbohydrate utilization


And if we choose level 3

> zz <- as.data.frame(groupGO(gcSample[[1]], "org.Hs.eg.db", ont = "BP", level = 3))
> dim(zz)
[1] 585   5
> zz[1:10,2]
[1] sexual reproduction
[2] asexual reproduction
[3] reproductive process
[4] multicellular organism reproduction
[5] reproduction of a single-celled organism
[6] reproduction of symbiont in host
[7] oxidation-reduction process
[8] nitrogen compound metabolic process
[9] catabolic process
[10] biosynthetic process          

And note that the first 6 terms here are children of 'reproduction', which itself is a biological process.

Thanks for the explanation and example. So how do I get all GO terms for all levels? If you look at my examples, you'll see that with level = 15 (and 20) the number of terms seem to decrease compared with level = 10. Why does this decrease?

Have you read the vignette? It seems pretty straightforward. The goal for this support site is to help those who are truly stuck. While R and Bioconductor are free to download and install, they are not truly free. You will have to be willing to invest the necessary time and effort to figure out how things work and what you should be doing.

One thing this support site is not intended to be is a conventional support service like what you get with paid software. The vast majority of the helpers on this site are giving their time and experience to you for free, and if it seems like you are taking advantage and not doing your own homework, you run the very real risk of being considered 'one of those people' who end up being routinely ignored.

Thanks for your comment James, and yes I have read the vignette but still can't figure out how level should be set to get all the terms, or how it is possible that the number of terms decreases with very high level.

I don't completely understand your irritation, because I don't understand why I give the impression that I take advantage of this support site, or that I don't do my homework. But if I gave you that impression I am sorry for the inconvenience.

I'm not irritated. I'm trying to explain something to you. Evidently not well.

So let me try to explain. I know nothing about clusterProfiler. But if I go to the vignette and click on the link for GO Analysis, I see this:

In clusterProfiler, groupGO is designed for gene classification based on GO distribution at a specific level.

And immediately following that, in the next paragraph, I see

## GO over-representation test

Over-representation test3 were implemented in clusterProfiler. For calculation details and explanation of paramters, please refer to the vignette of DOSE.

ego <- enrichGO(gene          = gene,
universe      = names(geneList),
OrgDb         = org.Hs.eg.db,
ont           = "CC",
pvalueCutoff  = 0.01,
qvalueCutoff  = 0.05,
head(ego)

So without knowing anything about this package, and taking like two minutes of my time, I now know that the function groupGO is for doing a hypergeometric test at a specific level, and that enrichGO is for doing a hypergeometric using all levels.

The expectation is that you could have done that yourself, rather than having somebody read the section and parse it for you. So to reiterate my earlier point, if people start to think that you are unwilling to read and understand the documentation provided to you, particularly when it is laid out in clear and understandable text, they may themselves become unwilling to offer help when you ask for it. That is not intended to imply some irritation with you on my part, but to tell you how things work, and to offer some constructive criticism.

Thanks again, I understand what you mean. Sorry I gave you that impression. I was actually not asking about a simple hypergeometric test, I understand how to do that with this package. I thought I was clear about that in my question, but probably I was not. I was interested in getting all GO terms, for all my genes (which is a list of all ensemble genes). I would like to get this gene 2 GO term data frame for further analysis (see how many genes overlap per term). Thanks for your time, I think I will stick to GOseq, where I do understand the vignette better.

So you buried your actual question in a bunch of other stuff. If you want to know all the GO IDs for each ensembl gene ID, you should just use biomaRt. Trying to use a package for some side effect is not really the way to go.

> getBM(c("ensembl_gene_id","go_id"), "ensembl_gene_id", "ENSG00000139618", mart)
ensembl_gene_id      go_id
1  ENSG00000139618 GO:0006281
2  ENSG00000139618 GO:0000724
3  ENSG00000139618 GO:0005634
4  ENSG00000139618 GO:0005737
5  ENSG00000139618 GO:0005829
6  ENSG00000139618 GO:0005654
7  ENSG00000139618 GO:0003677
8  ENSG00000139618 GO:0005856
9  ENSG00000139618 GO:0005515
10 ENSG00000139618 GO:0006310
11 ENSG00000139618 GO:0006974
12 ENSG00000139618 GO:0007049
13 ENSG00000139618 GO:0045893
14 ENSG00000139618 GO:0005813
15 ENSG00000139618 GO:0005815
16 ENSG00000139618 GO:0042802
17 ENSG00000139618 GO:0032991
18 ENSG00000139618 GO:0000784
19 ENSG00000139618 GO:0043967
20 ENSG00000139618 GO:0002020
21 ENSG00000139618 GO:0000281
22 ENSG00000139618 GO:0030141
23 ENSG00000139618 GO:0070200
24 ENSG00000139618 GO:0010485
25 ENSG00000139618 GO:0003697
26 ENSG00000139618 GO:0000731
27 ENSG00000139618 GO:0000732
28 ENSG00000139618 GO:0043966
29 ENSG00000139618 GO:0051298
30 ENSG00000139618 GO:0006302
31 ENSG00000139618 GO:0043015
32 ENSG00000139618 GO:0008022
33 ENSG00000139618 GO:0006289
34 ENSG00000139618 GO:0033600
35 ENSG00000139618 GO:0000800
36 ENSG00000139618 GO:0010484
37 ENSG00000139618 GO:0000722
38 ENSG00000139618 GO:0001556
39 ENSG00000139618 GO:0001833
40 ENSG00000139618 GO:0006978
41 ENSG00000139618 GO:0007141
42 ENSG00000139618 GO:0007283
43 ENSG00000139618 GO:0007420
44 ENSG00000139618 GO:0007569
45 ENSG00000139618 GO:0008283
46 ENSG00000139618 GO:0008585
47 ENSG00000139618 GO:0008630
48 ENSG00000139618 GO:0010165
49 ENSG00000139618 GO:0010225
50 ENSG00000139618 GO:0010332
51 ENSG00000139618 GO:0030097
52 ENSG00000139618 GO:0032465
53 ENSG00000139618 GO:0042771
54 ENSG00000139618 GO:0043009
55 ENSG00000139618 GO:0045931
56 ENSG00000139618 GO:0048478
57 ENSG00000139618 GO:0051276
58 ENSG00000139618 GO:1990426
59 ENSG00000139618 GO:0033593
60 ENSG00000139618          

From the vignette here:

In clusterProfiler, groupGO is designed for gene classification based on GO distribution at a specific level.

Over-representation test were implemented in clusterProfiler. For calculation details and explanation of parameters, please refer to the vignette of DOSE. enrichGO test the whole GO corpus, ...

The decrease of GO terms when increasing the levels is (AFAIK) due the fact that not all trees have an equal 'depth'. In other words, some trees stop at e.g. level 15, others already at level 10.

Moreover, the nature of the GO classification also makes things complex. As James indicated, the GO classification is a directed acyclic graph. GO terms do not occupy strict fixed levels in the hierarchy. Because GO is structured as a graph, terms would appear at different 'levels' if different paths were followed through the graph. This is especially true if one mixes the different relations used to connect terms. In other words, a problem is that GO hierarchy is set-up in a way where you can take multiple paths to reach the same term. Let's say a gene has the term, neuron differentiation. You can reach that term through neuron development or cell differentiation. Depending on the path, the level (depth) could be different. (Modified from this post here at Biostars)