Question

AnnotationForge::makeOrgPackage GO Ids mistake

0

Entering edit mode

Saber_J ▴ 10

@e5bd6104

Last seen 23 months ago

China

Hello everyone, I'm trying to create a custom annotation for Locusta migratoria and then run an enrichment analysis with clusterProfiler. However, I met an error when running AnnotationForge::makeorgPackage(), which returns the following error:

Populating genes table:
genes table filled
Populating gene_info table:
gene_info table filled
Populating go table:
go table filled
table metadata filled
Error in makeOrgDbFromDataFrames(data, tax_id, genus, species, dbFileName,  : 
  'goTable' GO Ids must be formatted like 'GO:XXXXXXX'

My code is :

AnnotationForge::makeOrgPackage(gene_info=gene_info,
                                go=gene2go,
                                maintainer='liXY <XYZ_zz@163.com>',
                                author='liXY',
                                outputDir="E:/reKEGG/LocalOrgdb/20220416",
                                tax_id=7004,
                                genus='Locusta',
                                species='migratoria',
                                goTable="go",
                                version="1.0")

2 files I used:

> head(gene2go)
         GID         GO EVIDENCE
1 LOCMI17615 GO:0000003      IEA
2 LOCMI17615 GO:0003674      IEA
3 LOCMI17615 GO:0003824      IEA
4 LOCMI17615 GO:0005575      IEA
5 LOCMI17615 GO:0005622      IEA
6 LOCMI17615 GO:0005623      IEA

> head(gene_info)
         GID                                                                       Gene_Name
1 LOCMI17615 Methyltransf_11,Methyltransf_12,Methyltransf_23,Methyltransf_25,Methyltransf_31
2 LOCMI17599                                                        Lig_chan,Lig_chan-Glu_bd
3 LOCMI02434 Methyltransf_11,Methyltransf_12,Methyltransf_23,Methyltransf_25,Methyltransf_31
4 LOCMI05868                                                       N-SET,RRM_1,SET,SET_assoc
5 LOCMI15917                                                                       UCR_hinge
6 LOCMI08045                                                             EZH2_WD-Binding,SET

I don't understand why this error occurs, because my GO id matches the prompt "GO: XXXXXXXX"

AnnotationHubData AnnotationForge • 2.5k views

ADD COMMENT • link 24 months ago Saber_J ▴ 10

1

Entering edit mode

Have you checked the entire column of GO ids?

goData = gene2go
any(!grepl("^GO:", as.character(goData$GO)))

ADD REPLY • link 24 months ago shepherl 3.8k

1

Entering edit mode

Hi， I checked the entire column with your code. Here is the result:

> goData = gene2go
> any(!grepl("^GO:", as.character(goData$GO)))
[1] TRUE

ADD REPLY • link 24 months ago Saber_J ▴ 10

1

Entering edit mode

That means you have GO terms that don't start with GO: Which is why you get the error.

ADD REPLY • link 24 months ago James W. MacDonald 65k

0

Entering edit mode

After I got the annotation file from EGGNOG, I used the following code to extract the columns of GO annotations and then split the columns by using separate_rows() to get the previously mentioned gene2go

> emapper
# A tibble: 10,767 x 7
   GID        Gene_Symbol GO                           KO       Pathway                        OG    Gene_Name             
   <chr>      <chr>       <chr>                        <chr>    <chr>                          <chr> <chr>                 
 1 LOCMI17615 jhamt       GO:0000003,GO:0003674,GO:00~ ko:K107~ ko00981,map00981               S     Methyltransf_11,Methy~
 2 LOCMI17599 -           GO:0001101,GO:0003008,GO:00~ ko:K053~ -                              EPT   Lig_chan,Lig_chan-Glu~
 3 LOCMI02434 jhamt       GO:0000003,GO:0003674,GO:00~ ko:K107~ ko00981,map00981               S     Methyltransf_11,Methy~
 4 LOCMI05868 SETD1B      GO:0000228,GO:0000785,GO:00~ ko:K114~ ko00310,map00310               BK    N-SET,RRM_1,SET       
 5 LOCMI15917 UQCRH       GO:0003674,GO:0003824,GO:00~ ko:K004~ ko00190,ko01100,ko04260,ko047~ C     UCR_hinge             
 6 LOCMI08045 EZH1        GO:0000003,GO:0000122,GO:00~ ko:K114~ ko00310,ko05206,map00310,map0~ K     EZH2_WD-Binding,SET   
 7 LOCMI09053 ASH1L       GO:0000003,GO:0000785,GO:00~ ko:K061~ ko00310,map00310               K     BAH,Bromodomain,PHD,S~
 8 LOCMI09054 -           -                            -        -                              S     cNMP_binding          
 9 LOCMI09055 IKBKAP      GO:0000003,GO:0000049,GO:00~ ko:K113~ -                              K     IKI3                  
10 LOCMI11490 -           GO:0000902,GO:0000904,GO:00~ ko:K184~ -                              L     AAA_11,AAA_12         
# ... with 10,757 more rows

gene2go <- dplyr::select(emapper, GID, GO) %>%
  separate_rows(GO, sep = ',', convert = F) %>%
  filter(!is.na(GO)) %>%
  mutate(EVIDENCE = 'IEA') 

> gene2go
# A tibble: 1,207,964 x 3
   GID        GO         EVIDENCE
   <chr>      <chr>      <chr>   
 1 LOCMI17615 GO:0000003 IEA     
 2 LOCMI17615 GO:0003674 IEA     
 3 LOCMI17615 GO:0003824 IEA     
 4 LOCMI17615 GO:0005575 IEA     
 5 LOCMI17615 GO:0005622 IEA     
 6 LOCMI17615 GO:0005623 IEA     
 7 LOCMI17615 GO:0005737 IEA     
 8 LOCMI17615 GO:0006629 IEA     
 9 LOCMI17615 GO:0006714 IEA     
10 LOCMI17615 GO:0006716 IEA     
# ... with 1,207,954 more rows

Please help me to see if there is something wrong with this way of handling it? Not getting GO terms that start with GO:

ADD REPLY • link 24 months ago Saber_J ▴ 10

0

Entering edit mode

filter(!is.na(GO)) will not filter out something like row 8 of emapper that is not NA as the "-" is treated as a value. Filter out any GO that does not start with GO

gene2go[grepl("^GO:", as.character(gene2go$GO)),]

ADD REPLY • link 24 months ago shepherl 3.8k

0

Entering edit mode

Thank you so much. “-” is the reason for the error. I didn't double-check the filtered data, which was obviously a stupid mistake. (X﹏X) This problem has been bothering me for days. Thank you again.

Populating genes table:
genes table filled
Populating gene_info table:
gene_info table filled
Populating go table:
go table filled
table metadata filled
'select()' returned many:1 mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
Populating go_bp table:
go_bp table filled
Populating go_cc table:
go_cc table filled
Populating go_mf table:
go_mf table filled
'select()' returned many:1 mapping between keys and columns
Populating go_bp_all table:
go_bp_all table filled
Populating go_cc_all table:
go_cc_all table filled
Populating go_mf_all table:
go_mf_all table filled
Populating go_all table:
go_all table filled
Creating package in E:/reKEGG/LocalOrgdb/org.Lmigratoria.eg.db

ADD REPLY • link 24 months ago Saber_J ▴ 10

0

Entering edit mode

Not an answer to your question, but since you indicated that your ultimate goal is to use the OrgDb with clusterProfiler it is good to know that 'self-made' OrgDbs are (apparently) not (yet?) compatible with clusterProfiler. This is due to lack of the GOALL column in the OrgDbs. See here: Use of clusterProfiler : Error in testForValidKeytype(x, keytype) [Also note that you may want to check whether a 'prefab' OrgDb is available through the AnnotationHub!]

Having said this, making use of your GO-to-gene mapping is certainly possible with clusterProfiler, but then you will need to make use of the generic enricher() function and the argument TERM2GENE. See for example here: clusterProfiler-GO enrichment Error (option 2).

ADD REPLY • link 24 months ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

Hi Guido, I have used clusterProfiler and my custom OrgDb to complete GO enrichment analysis, and due to the recent update of the EGGNOG database (to V2), I would like to rebuild a new Orgdb for this species. The former OrgDb I built has a GOALL column.

> keytypes(org.Lmigratoria.eg.db)
[1] "EVIDENCE"    "EVIDENCEALL" "Gene_Name"   "GID"         "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL"

The code used this time is still the same as before, but I don't know why the above error appears. Regarding your mention of the argument TERM2GENE， I used the following code to finish the GO and KEGG enrichment analysis. Plots with barplot().

GO_result <- enrichGO(gene = gene,
                   OrgDb = org.Lmigratoria.eg.db,
                   keyType = 'GID',
                   ont = 'ALL',
                   qvalueCutoff = 0.05,
                   pvalueCutoff = 0.05)

    pathway2gene <- dplyr::select(emapper, Pathway, GID) %>%
      separate_rows(Pathway, sep = ',', convert = F) %>%
      filter(str_detect(Pathway, 'ko')) %>%
      mutate(Pathway = str_remove(Pathway, 'ko'))


   library(magrittr)
    get_path2name <- function(){
      keggpathid2name.df <- clusterProfiler:::kegg_list("pathway")
      keggpathid2name.df[,1] %<>% gsub("path:map", "", .)
      colnames(keggpathid2name.df) <- c("path_id","path_name")
      return(keggpathid2name.df)
    }

pathway2name <- get_path2name()

KEGG_result <- enricher(gene,
                   TERM2GENE = pathway2gene,
                   TERM2NAME = pathway2name,
                   pvalueCutoff = 0.05,
                   qvalueCutoff = 0.05)

ADD REPLY • link 24 months ago Saber_J ▴ 10