clusterProfiler: BgRatio N value does not match the size of the gene universe that has GO annotations.
2
0
Entering edit mode
Fatima • 0
@c51160bd
Last seen 19 hours ago
France

I am performing a GO enrichment analysis using genes from the transcriptome as the reference universe, consisting of 23,761 genes. I then conducted an enrichGO analysis on my differentially expressed genes (DEGs).

However, I noticed that only 17,800 genes from my universe are annotated in the org.Mm.eg.db database, as shown in the results below:

ID  Description GeneRatio   BgRatio pvalue  p.adjust    qvalue
GO:0001505 GO:0001505   regulation of neurotransmitter levels   25/207  208/17800   2.039483e-18    4.729560e-15    3.896485e-15
GO:0006836 GO:0006836   neurotransmitter transport  25/207  218/17800   6.415275e-18    7.438512e-15    6.128276e-15
GO:0007269 GO:0007269   neurotransmitter secretion  21/207  166/17800   4.292311e-16    2.488467e-13    2.050143e-13

To understand this difference, I checked how many genes from my universe are properly annotated in the org.Mm.eg.db database using the following code:

# list of gene symbols from the dataframe, representing the universe of genes.
universe <- df$mgi_symbol

# list of all mouse gene symbols from the org.Mm.eg.db database.
gene_symbols <- keys(org.Mm.eg.db, keytype = "SYMBOL")

# Retrieve the associated GO terms for each of the mouse gene symbols.
go_terms <- mapIds(org.Mm.eg.db,
                   keys = gene_symbols,
                   column = "GO",
                   keytype = "SYMBOL",
                   multiVals = "list")

# Filter genes that do not have valid or non-empty GO terms.
empty_or_na_go_terms <- go_terms[sapply(go_terms, is_empty_or_na)]

# Function to check if a vector of GO terms is valid (not empty, non-null, and non-NA).
has_valid_go_terms <- function(go_terms_vector) {
  !is.null(go_terms_vector) && !all(is.na(go_terms_vector)) && length(go_terms_vector) > 0
}

# Filter genes with valid GO terms by applying the function to the list of GO terms.
valid_go_terms <- go_terms[sapply(go_terms, has_valid_go_terms)]

# Extract the names of genes that have valid GO terms from the db.
genes_with_go_terms <- names(valid_go_terms)
print(length(genes_with_go_terms))  #29,891.

# Find the intersection between the genes in the universe and the genes that have valid GO terms in db.
common_genes <- intersect(universe, genes_with_go_terms)
print(length(common_genes))  #18,359.

I found 18,359 genes from my universe that are properly annotated in the org.Mm.eg.db database. However, the enrichGO results indicate only 17,800 annotated genes in the analysis. Where could this difference come from?

```

GO clusterProfiler enr • 1.0k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 9 hours ago
United States

The way you are doing it includes all three ontologies. Unless you used "ALL" as the ontology, the number of genes annotated should be less (I am assuming you used BP). Also, enrichGO uses the GOALL table, which maps genes to the direct GO term as well as any child terms. I don't have your data to check, but here is how you could be getting the list of genes:

library(org.Mm.eg.db)
library(RSQLite)
con <- org.Mm.eg_dbconn()
ids <- dbGetQuery(con, "select distinct(_id) from go_bp_all;")[,1]
sum(ids %in% universe)

Which is likely to return the 17800 you see.

0
Entering edit mode
Fatima • 0
@c51160bd
Last seen 19 hours ago
France

In fact, I used the BP ontology and I am using gene symbols rather than identifiers (IDs). Is there a way to verify how many genes from the universe are annotated in the GOALL table?

> universe
   [1] "Gnai3"         "Cdc45"         "H19"           "Scml2"         "Apoh"          "Narf"          "Cav2"          "Klf6"         
   [9] "Scmh1"         "Cox5a"         "Tbx2"          "Tbx4"          "Ngfr"          "Wnt3"          "Wnt9a"         "Fer"          
  [17] "Xpo6"          "Tfe3"          "Axin2"         "Brat1"         "Gna12"         "Slc22a18"      "Itgb2l"        "Igsf5"
ADD COMMENT
0
Entering edit mode

If you want to comment on an answer, please choose ADD COMMENT rather than ADD ANSWER, since you are not in fact adding an answer.

universe <- mapIds(org.Mm.eg.db, universe, "ENTREZID","ALIAS")
sum(ids %in% universe)

Ideally you wouldn't be using gene symbols because they are not unique.

ADD REPLY
0
Entering edit mode

I identified 3804 genes with GO annotations. I put my liste of genes in this link enter link description here

universe <- mapIds(org.Mm.eg.db, df$mgi_symbol, "ENTREZID","ALIAS")
df_universe <- data.frame(Gene = names(universe), ID = as.character(universe), stringsAsFactors = FALSE)
con <- org.Mm.eg_dbconn()
ids <- dbGetQuery(con, "select distinct(_id) from go_bp_all;")[,1]
print(sum(ids %in% df_universe$ID))
#3804
ADD REPLY
1
Entering edit mode

By using a slightly different way I get much larger numbers; both for the total number of input that could be annotated (= mapped to IDs), and the number of unique genes with GO-BP annotation.

This partly may be due to the fact that ALIAS is used as keytype (input); see for example below that H19 maps to 2 genes (ID 14955 and 50884). This demonstrates what James said: SYMBOL let alone ALIAS are not unique... and thus not preferred as input.

Step-by-step:

> library(org.Mm.eg.db)
> 
> ## read list of provided symbols
> universe.input <- read.csv("universe_genes.csv")
> dim(universe.input)
[1] 24464     1
> head(universe.input)
  mgi_symbol
1      Gnai3
2      Cdc45
3        H19
4      Scml2
5       Apoh
6       Narf
> 
> ## remove NAs from input
> ## note: input contained 680 NAs (= 24464-23784)
> universe.input <- universe.input[!is.na(universe.input$mgi_symbol), , drop = FALSE]
> dim(universe.input)
[1] 23784     1
> head(universe.input)
  mgi_symbol
1      Gnai3
2      Cdc45
3        H19
4      Scml2
5       Apoh
6       Narf
> 
> 
> ## map to GOALL
> results <- AnnotationDbi::select(org.Mm.eg.db,
+                                  keys=universe.input$mgi_symbol,
+                                  columns = c('ENTREZID','SYMBOL','GOALL'),
+                                  keytype = "ALIAS")
'select()' returned many:many mapping between keys and columns
> 
> ## total number of GOALL annotations: 3,295,564!
> ## how many unique genes have (at least one!) GOALL annotation?
> ## A: 20932 (thus 23784-20932 = 2852 of input has no GOALL annotation
> dim(results)
[1] 3295564       6
> dim(results[!duplicated(results$ENTREZID),])
[1] 20932     6
> head(results)
  ALIAS ENTREZID SYMBOL      GOALL EVIDENCEALL ONTOLOGYALL
1 Gnai3    14679  Gnai3 GO:0000139         ISO          CC
2 Gnai3    14679  Gnai3 GO:0000166         IEA          MF
3 Gnai3    14679  Gnai3 GO:0000166         ISO          MF
4 Gnai3    14679  Gnai3 GO:0001664         IBA          MF
5 Gnai3    14679  Gnai3 GO:0001664         ISO          MF
6 Gnai3    14679  Gnai3 GO:0003674         IBA          MF
> head(results[!duplicated(results$ENTREZID),])
     ALIAS ENTREZID SYMBOL      GOALL EVIDENCEALL ONTOLOGYALL
1    Gnai3    14679  Gnai3 GO:0000139         ISO          CC
363  Cdc45    12544  Cdc45 GO:0000228         ISO          CC
520    H19    14955    H19 GO:0000003         IMP          BP
660    H19    50884 Nckap1 GO:0000578         IMP          BP
990  Scml2   107815  Scml2 GO:0000785         IDA          CC
1106  Apoh    11818   Apoh GO:0001525         ISO          BP
> 
> 
> ## how many unique genes have (at least one!) GOALL-BP annotation?
> ## A: 9577 genes
> dim(results[!duplicated(results$ENTREZID) & results$ONTOLOGYALL == 'BP',])
[1] 9577    6
> 
> 
> packageVersion('org.Mm.eg.db')
[1] '3.19.1'
> 
> 
> 

Note: edited post to better show difference between total number of GOALL annotations, and number of unique genes that in the end were annotated.

ADD REPLY
1
Entering edit mode

I gave you busted SQL code that returned the central key for the DB rather than the gene IDs. Here's the corrected version.

> mgis <- scan("clipboard","c")[-1]
Read 24465 items
> head(mgis)
[1] "Gnai3" "Cdc45" "H19"   "Scml2"
[5] "Apoh"  "Narf" 
> ncbi <- mapIds(org.Mm.eg.db, mgis, "ENTREZID", "ALIAS")
'select()' returned 1:many
mapping between keys and columns

> ncbi <- ncbi[!is.na(ncbi)]
> ids <- dbGetQuery(con, "select distinct(gene_id) from go_bp_all inner join genes using(_id);")[,1]

> length(intersect(ncbi, ids))
[1] 17809
ADD REPLY
1
Entering edit mode

Hi James,

I am likely missing something obvious, but why is the code that I used returning a different number of genes having a GOALL-BP annotation than yours? Your result: 17809 unique genes, my result 9576 genes.

Thanks!

> library(org.Mm.eg.db)
> 
> universe.input <- read.csv("universe_genes.csv")
> dim(universe.input)
[1] 24464     1
> 
> ncbi <- AnnotationDbi::select(org.Mm.eg.db,
+                               keys=universe.input$mgi_symbol,
+                               columns = c('ENTREZID'),
+                               keytype = 'ALIAS')
'select()' returned many:many mapping between keys and columns
> dim(ncbi)
[1] 24546     2
> head(ncbi)
  ALIAS ENTREZID
1 Gnai3    14679
2 Cdc45    12544
3   H19    14955
4   H19    50884
5 Scml2   107815
6  Apoh    11818
>  
> ncbi <- ncbi[!is.na(ncbi$ENTREZID),]
> dim(ncbi)
[1] 21631     2
> 
> 
> 
> res <- AnnotationDbi::select(org.Mm.eg.db,
+                              keys=ncbi$ENTREZID,
+                              columns = c('GOALL'),
+                              keytype = "ENTREZID")
'select()' returned many:many mapping between keys and columns
> 
> head(res)
  ENTREZID      GOALL EVIDENCEALL ONTOLOGYALL
1    14679 GO:0000139         ISO          CC
2    14679 GO:0000166         IEA          MF
3    14679 GO:0000166         ISO          MF
4    14679 GO:0001664         IBA          MF
5    14679 GO:0001664         ISO          MF
6    14679 GO:0003674         IBA          MF
> 
> dim(res[!duplicated(res$ENTREZID) & res$ONTOLOGYALL == 'BP',])
[1] 9576    4
> 
> 

EDIT: Note that the number of unique genes is not correctly determined in the last line of code above!

Number of unique genes is identical to the number calculated by James through his direct query of the database.

See post below: clusterProfiler: BgRatio N value does not match the size of the gene universe that has GO annotations.

ADD REPLY
2
Entering edit mode

Hi Guido,

It's that last filtering step. You are saying the NCBI Gene ID has to be unique AND the ontology has to be BP, which isn't what you want, because !duplicated is the same thing as 'return only the first instance of', and you can see that in your res object the first instance of that NCBI Gene ID is a CC ontology, so you exclude that one (and any other Gene ID where the first instance of that ID isn't also simultaneously a BP ontology).

> res2 <- subset(res, ONTOLOGYALL == "BP")
> length(unique(res2$ENTREZID))
[1] 17809
ADD REPLY
1
Entering edit mode

Thanks James! I indeed missed the obvious... all clear now!

ADD REPLY

Login before adding your answer.

Traffic: 717 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6