EnrichGO not working with db created from annotationhub
3
0
Entering edit mode
Lucía ▴ 30
@16997962
Last seen 2.5 years ago
Canada

Hi, I am running clusterprofiler using my differentially expressed genes from DESeq2. I am working with Cannabis sativa. I created the db using annotation hub. Then I changed the symbols from my gene matrix to entrez id. I did that for both my genes of interest and my gene universe. And then I attempted to run enrichGO, but for some reason I get the error message below, and I have no idea why. Please help

library(AnnotationDbi)
library(clusterProfiler)
library(ReactomePA)

ah <- AnnotationHub()

AnnotationHub::query(ah, c("Cannabis", "Sativa"))
Csativa <- ah[["AH101262"]]
columns(Csativa)
keytypes(Csativa)
select(Csativa, head(keys(Csativa)), c("SYMBOL", "GENENAME", "GO")) 

annotated_significant_res_aga <- significant_res_aga

annotated_significant_res_aga$symbol <- mapIds(
  Csativa,
  keys = rownames(annotated_significant_res_aga),
  keytype = "ALIAS",
  column = "SYMBOL",
  multiVals = "first"
)

annotated_significant_res_aga$entrez <- mapIds(
  Csativa,
  keys = rownames(annotated_significant_res_aga),
  keytype = "ALIAS",
  column = "ENTREZID",
  multiVals = "first"
)

annotated_res_aga <- res_aga

annotated_res_aga$symbol <- mapIds(
  Csativa,
  keys = rownames(annotated_res_aga),
  keytype = "ALIAS",
  column = "SYMBOL",
  multiVals = "first"
)

annotated_res_aga$entrez <- mapIds(
  Csativa,
  keys = rownames(annotated_res_aga),
  keytype = "ALIAS",
  column = "ENTREZID",
  multiVals = "first"
)


ego_bp_aga <- enrichGO(gene          = as.character(unique(annotated_significant_res_aga$entrez)),
                       universe      = as.character(unique(annotated_res_aga$entrez)),
                       OrgDb         = Csativa,
                       keyType       = "ENTREZID",
                       ont           = "BP",
                       pAdjustMethod = "BH",
                       pvalueCutoff  = 0.01,
                       qvalueCutoff  = 0.05,
                       readable      = TRUE) 

--> No gene can be mapped....
--> Expected input gene ID: 23630686,27215454,27215500,24573811,27215495,27215452
--> return NULL...

sessionInfo( )

```R version 4.2.0 (2022-04-22 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] GOstats_2.62.0 graph_1.74.0 Category_2.62.0 Matrix_1.4-1
[5] GO.db_3.15.0 clusterProfiler_4.4.1 AnnotationDbi_1.58.0 RColorBrewer_1.1-3
[9] pheatmap_1.0.12 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9
[13] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.7
[17] ggplot2_3.3.6 tidyverse_1.3.1 DESeq2_1.36.0 SummarizedExperiment_1.26.1 [21] Biobase_2.56.0 MatrixGenerics_1.8.0 matrixStats_0.62.0 GenomicRanges_1.48.0
[25] GenomeInfoDb_1.32.1 IRanges_2.30.0 S4Vectors_0.34.0 AnnotationHub_3.4.0
[29] BiocFileCache_2.4.0 dbplyr_2.1.1 BiocGenerics_0.42.0 BiocManager_1.30.17

loaded via a namespace (and not attached): [1] shadowtext_0.1.2 readxl_1.4.0 backports_1.4.1
[4] fastmatch_1.1-3 plyr_1.8.7 igraph_1.3.1
[7] lazyeval_0.2.2 GSEABase_1.58.0 splines_4.2.0
[10] BiocParallel_1.30.0 digest_0.6.29 yulab.utils_0.0.4
[13] htmltools_0.5.2 GOSemSim_2.22.0 viridis_0.6.2
[16] fansi_1.0.3 magrittr_2.0.3 memoise_2.0.1
[19] tzdb_0.3.0 Biostrings_2.64.0 annotate_1.74.0
[22] graphlayouts_0.8.0 modelr_0.1.8 vroom_1.5.7
[25] enrichplot_1.16.0 colorspace_2.0-3 blob_1.2.3
[28] rvest_1.0.2 rappdirs_0.3.3 ggrepel_0.9.1
[31] haven_2.5.0 crayon_1.5.1 RCurl_1.98-1.6
[34] jsonlite_1.8.0 scatterpie_0.1.7 genefilter_1.78.0
[37] ape_5.6-2 survival_3.3-1 glue_1.6.2
[40] polyclip_1.10-0 gtable_0.3.0 zlibbioc_1.42.0
[43] XVector_0.36.0 DelayedArray_0.22.0 Rgraphviz_2.40.0
[46] scales_1.2.0 DOSE_3.22.0 DBI_1.1.2
[49] Rcpp_1.0.8.3 viridisLite_0.4.0 xtable_1.8-4
[52] tidytree_0.3.9 gridGraphics_0.5-1 bit_4.0.4
[55] AnnotationForge_1.38.0 httr_1.4.3 fgsea_1.22.0
[58] ellipsis_0.3.2 pkgconfig_2.0.3 XML_3.99-0.9
[61] farver_2.1.0 locfit_1.5-9.5 utf8_1.2.2
[64] ggplotify_0.1.0 tidyselect_1.1.2 labeling_0.4.2
[67] rlang_1.0.2 reshape2_1.4.4 later_1.3.0
[70] munsell_0.5.0 BiocVersion_3.15.2 cellranger_1.1.0
[73] tools_4.2.0 cachem_1.0.6 downloader_0.4
[76] cli_3.3.0 generics_0.1.2 RSQLite_2.2.14
[79] broom_0.8.0 fastmap_1.1.0 yaml_2.3.5
[82] ggtree_3.4.0 bit64_4.0.5 fs_1.5.2
[85] tidygraph_1.2.1 KEGGREST_1.36.0 ggraph_2.0.5
[88] RBGL_1.72.0 nlme_3.1-157 mime_0.12
[91] aplot_0.1.4 DO.db_2.9 xml2_1.3.3
[94] compiler_4.2.0 rstudioapi_0.13 filelock_1.0.2
[97] curl_4.3.2 png_0.1-7 interactiveDisplayBase_1.34.0 [100] treeio_1.20.0 reprex_2.0.1 tweenr_1.0.2
[103] geneplotter_1.74.0 stringi_1.7.6 lattice_0.20-45
[106] vctrs_0.4.1 pillar_1.7.0 lifecycle_1.0.1
[109] data.table_1.14.2 bitops_1.0-7 patchwork_1.1.1
[112] httpuv_1.6.5 qvalue_2.28.0 R6_2.5.1
[115] promises_1.2.0.1 gridExtra_2.3 MASS_7.3-57
[118] assertthat_0.2.1 withr_2.5.0 GenomeInfoDbData_1.2.8
[121] parallel_4.2.0 hms_1.1.1 grid_4.2.0
[124] ggfun_0.0.6 ggforce_0.3.3 shiny_1.7.1
[127] lubridate_1.8.0 ```

clusterProfiler AnnotationHubData GO • 3.6k views
ADD COMMENT
0
Entering edit mode
Guido Hooiveld ★ 4.1k
@guido-hooiveld-2020
Last seen 25 days ago
Wageningen University, Wageningen, the …

Did you notice the first message that was returned after you ran enrichGO()? Thus: --> No gene can be mapped....

This basically means that, ehh, none of your input genes are valid entrez ids... so you should check whether this is the case. So, what is the output of head( as.character(unique(annotated_significant_res_aga$entrez)) ) and head( as.character(unique(annotated_res_aga$entrez)) )?

Also, just to show that 'it' works:

> library(AnnotationHub)
> library(clusterProfiler)

> ah <- AnnotationHub()
snapshotDate(): 2022-04-21
> AnnotationHub::query(ah, c("Cannabis", "Sativa"))
AnnotationHub with 1 record
# snapshotDate(): 2022-04-21
# names(): AH101262
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Cannabis sativa
# $rdataclass: OrgDb
# $rdatadateadded: 2022-04-21
# $title: org.Cannabis_sativa.eg.sqlite
# $description: NCBI gene ID based annotations about Cannabis sativa
# $taxonomyid: 3483
# $genome: NCBI genomes
# $sourcetype: NCBI/UniProt
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.uniprot.or...
# $sourcesize: NA
# $tags: c("NCBI", "Gene", "Annotation") 
# retrieve record with 'object[["AH101262"]]' 
> Csativa <- ah[["AH101262"]]
>
> # as example, select the first 150 Csativa IDs as input
> # these correspond to your 'annotated_significant_res_aga$entrez'
> foreground.genes <- keys(Csativa) [1:150]
> head(foreground.genes)
[1] "23630667" "23630669" "23630670" "23630672" "23630673" "23630676"
> 
> # as background (universe), use all genes
> # these correspond to your 'annotated_res_aga$entrez
> background.genes <- keys(Csativa)
> head(background.genes)
[1] "23630667" "23630669" "23630670" "23630672" "23630673" "23630676"
> 
> # run enrichGO, but since random foreground genes are used
> # without any significance cuoff!
> ego_bp_aga <- enrichGO(gene          = foreground.genes,
+                        universe      = background.genes,
+                        OrgDb         = Csativa,
+                        keyType       = "ENTREZID",
+                        ont           = "BP",
+                        pAdjustMethod = "BH",
+                        pvalueCutoff  = 1,
+                        qvalueCutoff  = 1,
+                        readable      = TRUE)
> 
> 
> ego_bp_aga
#
# over-representation test
#
#...@organism    Cannabis sativa 
#...@ontology    BP 
#...@keytype     ENTREZID 
#...@gene        chr [1:150] "23630667" "23630669" "23630670" "23630672" "23630673" ...
#...pvalues adjusted by 'BH' with cutoff <1 
#...59 enriched terms found
'data.frame':   59 obs. of  9 variables:
 $ ID         : chr  "GO:0008152" "GO:0044237" "GO:0015979" "GO:0009059" ...
 $ Description: chr  "metabolic process" "cellular metabolic process" "photosynthesis" "macromolecule biosynthetic process" ...
 $ GeneRatio  : chr  "121/127" "121/127" "50/127" "56/127" ...
 $ BgRatio    : chr  "156/171" "156/171" "58/171" "68/171" ...
 $ pvalue     : num  0.00345 0.00345 0.00734 0.03557 0.05349 ...
 $ p.adjust   : num  0.102 0.102 0.144 0.418 0.418 ...
 $ qvalue     : num  0.0999 0.0999 0.1417 0.41 0.41 ...
 $ geneID     : chr  "psbA/matK/rps16/psbK/psbI/atpI/rps2/rpoC2/rpoC1/rpoB/petN/psbM/psbD/psbC/lhbA/rps14/psaB/psaA/ycf3/rps4/rbcL/ac"| __truncated__ "psbA/matK/rps16/psbK/psbI/atpI/rps2/rpoC2/rpoC1/rpoB/petN/psbM/psbD/psbC/lhbA/rps14/psaB/psaA/ycf3/rps4/rbcL/ac"| __truncated__ "psbA/psbK/psbI/petN/psbM/psbD/psbC/lhbA/psaB/psaA/ycf3/rbcL/psaI/ycf4/petA/psbJ/psbL/psbF/psbE/petL/petG/psaJ/p"| __truncated__ "rps16/rps2/rpoC2/rpoC1/rpoB/rps14/rps4/rpl33/rps18/rpl20/rps12/rpoA/rps11/rpl36/rps8/rpl14/rpl16/rps3/rpl22/rps"| __truncated__ ...
 $ Count      : int  121 121 50 56 63 63 61 61 66 57 ...
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

> 
> head( as.data.frame(ego_bp_aga) )
                   ID                            Description GeneRatio
GO:0008152 GO:0008152                      metabolic process   121/127
GO:0044237 GO:0044237             cellular metabolic process   121/127
GO:0015979 GO:0015979                         photosynthesis    50/127
GO:0009059 GO:0009059     macromolecule biosynthetic process    56/127
GO:0009058 GO:0009058                   biosynthetic process    63/127
GO:1901576 GO:1901576 organic substance biosynthetic process    63/127
           BgRatio      pvalue  p.adjust     qvalue
GO:0008152 156/171 0.003449869 0.1017711 0.09986464
GO:0044237 156/171 0.003449869 0.1017711 0.09986464
GO:0015979  58/171 0.007344356 0.1444390 0.14173319
GO:0009059  68/171 0.035570236 0.4177884 0.40996189
GO:0009058  78/171 0.053485337 0.4177884 0.40996189
GO:1901576  78/171 0.053485337 0.4177884 0.40996189
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       geneID
GO:0008152 psbA/matK/rps16/psbK/psbI/atpI/rps2/rpoC2/rpoC1/rpoB/petN/psbM/psbD/psbC/lhbA/rps14/psaB/psaA/ycf3/rps4/rbcL/accD/psaI/ycf4/petA/psbJ/psbL/psbF/psbE/petL/petG/psaJ/rpl33/rps18/rpl20/rps12/psbT/psbN/psbH/petB/petD/rpoA/rps11/rpl36/rps8/rpl14
GO:0044237 psbA/matK/rps16/psbK/psbI/atpI/rps2/rpoC2/rpoC1/rpoB/petN/psbM/psbD/psbC/lhbA/rps14/psaB/psaA/ycf3/rps4/rbcL/accD/psaI/ycf4/petA/psbJ/psbL/psbF/psbE/petL/petG/psaJ/rpl33/rps18/rpl20/rps12/psbT/psbN/psbH/petB/petD/rpoA/rps11/rpl36/rps8/rpl14
GO:0015979                                                                                                                                                                                                                                                                                                                                                                                                          psbA/psbK/psbI/petN/psbM/psbD/psbC/lhbA/psaB/psaA/ycf3/rbcL/psaI/ycf4/petA/psbJ/psbL/psbF/psbE/petL/petG/psaJ/psbT/psbN/psbH/petB/petD/psaC/ycf3/psaA/psbB/psbT/psaC/rbcL/psaI/petD/psbC/psbJ/petA/psaB/petL/petB/psbI/psbH/psbL/psbE/psbF/ycf4/psbN/petG
GO:0009059                                                                                                                                                                                                                                                                                                                                      rps16/rps2/rpoC2/rpoC1/rpoB/rps14/rps4/rpl33/rps18/rpl20/rps12/rpoA/rps11/rpl36/rps8/rpl14/rpl16/rps3/rpl22/rps19/rpl2/rpl23/rps7/rpl32/rps15/rps7/rpl23/rpl2/rps19/rps12/rps12/rpoC1/rpoB/rpl32/rps7/rps16/rpl23/rps19/rps3/rpoA/rps15/rps11/rps2/rpoC2/rpl22/rpl36/rpl14/rps19/rpl16/rps12/rpl20/rpl2/rpl2/rps18/rpl33/rps8
GO:0009058                                                                                                                                                                                                                                                                                                   rps16/atpI/rps2/rpoC2/rpoC1/rpoB/rps14/rps4/rbcL/accD/rpl33/rps18/rpl20/rps12/rpoA/rps11/rpl36/rps8/rpl14/rpl16/rps3/rpl22/rps19/rpl2/rpl23/rps7/rpl32/rps15/rps7/rpl23/rpl2/rps19/rps12/rps12/rpoC1/rpoB/rpl32/rps7/rps16/atpF/rpl23/rps19/rbcL/accD/rps3/rpoA/rps15/rps11
GO:1901576                                                                                                                                                                                                                                                                                                   rps16/atpI/rps2/rpoC2/rpoC1/rpoB/rps14/rps4/rbcL/accD/rpl33/rps18/rpl20/rps12/rpoA/rps11/rpl36/rps8/rpl14/rpl16/rps3/rpl22/rps19/rpl2/rpl23/rps7/rpl32/rps15/rps7/rpl23/rpl2/rps19/rps12/rps12/rpoC1/rpoB/rpl32/rps7/rps16/atpF/rpl23/rps19/rbcL/accD/rps3/rpoA/rps15/rps11
           Count
GO:0008152   121
GO:0044237   121
GO:0015979    50
GO:0009059    56
GO:0009058    63
GO:1901576    63
> 
> 

> sessionInfo()
R version 4.2.0 Patched (2022-05-12 r82348 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] AnnotationDbi_1.58.0  IRanges_2.30.0        S4Vectors_0.34.0     
[4] Biobase_2.56.0        clusterProfiler_4.4.1 AnnotationHub_3.4.0  
[7] BiocFileCache_2.4.0   dbplyr_2.1.1          BiocGenerics_0.42.0  
ADD COMMENT
0
Entering edit mode

Hi, thanks so much for your reply. This is the output of the commands you asked about:

head( as.character(unique(annotated_significant_res_aga$entrez)) )

[1] "115694767" "115694769" "115694771" "115694784" "115694787" "115694843"

and

head( as.character(unique(annotated_res_aga$entrez)) )

[1] NA "115694672" "115694673" "115694674" "115694675" "115694676"

I guess that means the ids are not matching? I don't understand why this could be, since I took the gtf file from nbci and the C. sativa reference is from ncbi?

ADD REPLY
0
Entering edit mode

No, basically what is meant is that your input ids should match with those in the OrgDb. Also your significant genes should be part of all annotated genes. That seems not to be the case now.

Also, I noticed that you have one or more NA in your gene vectors. Try to remove these before running enrichGO().

Lastly, does my example code work for you?

ADD REPLY
0
Entering edit mode

I don't understand why they don't match. I got the gtf and fasta from the ncbi website, the cs10 genome assembly, which is the same as that of annotationhub. I got my significant genes from running DESeq2, so they are definitely a subset of all the genes tested.

ADD REPLY
0
Entering edit mode

How can I find out the annotation file used by annotation hub? Like exactly what gtf/gff file and from where?

ADD REPLY
0
Entering edit mode

AFAIK the annotation file present at the AnnotationHub is not generated from a GTF/GFF file, but from data directly downloaded from the NCBI FTP server. If you look at my first post above, you will notice:

> AnnotationHub::query(ah, c("Cannabis", "Sativa"))
AnnotationHub with 1 record
# snapshotDate(): 2022-04-21
# names(): AH101262
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Cannabis sativa
# $rdataclass: OrgDb
# $rdatadateadded: 2022-04-21
# $title: org.Cannabis_sativa.eg.sqlite
# $description: NCBI gene ID based annotations about Cannabis sativa
# $taxonomyid: 3483
# $genome: NCBI genomes
# $sourcetype: NCBI/UniProt
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.uniprot.or...
# $sourcesize: NA
# $tags: c("NCBI", "Gene", "Annotation") 
# retrieve record with 'object[["AH101262"]]' 

Thus: the annotation data is derived from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/. If you lookup all info NCBI has on the organism with taxonomy id 3483, you will see that the number of genes currently annotated by NCBI is 31,443. Note that this equals the number of genes in background.genes:

> length(background.genes)
[1] 31443

so these numbers match...

To reverse the question: which GFT file did you use? Where did you get it from?

ADD REPLY
0
Entering edit mode

I used the gtf file from ncbi

Here: https://www.ncbi.nlm.nih.gov/assembly/GCF_900626175.2/

Everything was taken from there

ADD REPLY
0
Entering edit mode
Lucía ▴ 30
@16997962
Last seen 2.5 years ago
Canada

I've gone through my code and I'm pretty sure the issue is not the code, but that some gene IDs (all the ones that start with LOC) are just not being recognized. Actually, all my DEGs start with LOC... Does this mean I can't do GO for them?

ADD COMMENT
0
Entering edit mode

I had a closer look at this, and the error you got is actually somewhat misleading; it is not that you are using the wrong ids (ids that are not recognized by the OrgDb from the AnnotationHub), but rather that only a very limited number of genes do have a GO annotation... only 35 (based on gene symbols) or 53 (based on entrez ids) genes of the 29,813 genes present in the GFF file...

BTW: if you would use all 31,443 genes present in the OrgDb, then (only) 171 genes do have a GO annotation... See column BgRatio in my first post above.

So when all input genes (foreground genes) have no GO annotation, it will result in an error. This is the case for all LOCs,hence the error. Having at least 2 genes with a GO annotation is needed to prevent the error (although one may wonder about the relevance of the results).

Because of the lack of GO annotation, you cannot make use of the OrgDb and the function enrichGO(). However, if you do have such GO annotation info yourselves, or are able to retrieve it from another database than NCBI, than you can use the generic function enricher() with the arguments/mapping file TERM2GENE and TERM2NAME to perform GO over-representation analysis. See this post for some more info on that: AnnotationForge::makeOrgPackage GO Ids mistake. That whole thread may be relevant to you as well!

For completeness below the code that shows that the error is due to lack of GO annotations in the foreground genes, and that the OrgDb contains a very limited number of GO annotations.:

Part 1.

> library("rtracklayer")
> library("GenomicFeatures")
>
> # downloaded GFF from NCBI through link above.
> cs10.gff <- import.gff3("GCF_900626175.2_cs10_genomic.gff")
> # extract all genes
> my_genes <- cs10.gff[cs10.gff$type == "gene"]
> 
> # extract GFF-based annotation info for all genes
> gene.info <- mcols(my_genes)[c("ID","Dbxref", "Name","gene", "gene_biotype")]
> 
> dim(gene.info)  #29813 genes are present in GFF
[1] 29813     5
>
> head(gene.info)
DataFrame with 6 rows and 5 columns
                 ID           Dbxref         Name         gene
        <character>  <CharacterList>  <character>  <character>
1 gene-LOC115705401 GeneID:115705401 LOC115705401 LOC115705401
2 gene-LOC115705987 GeneID:115705987 LOC115705987 LOC115705987
3 gene-LOC115706290 GeneID:115706290 LOC115706290 LOC115706290
4 gene-LOC115707251 GeneID:115707251 LOC115707251 LOC115707251
5 gene-LOC115705027 GeneID:115705027 LOC115705027 LOC115705027
6 gene-LOC115705026 GeneID:115705026 LOC115705026 LOC115705026
    gene_biotype
     <character>
1         lncRNA
2 protein_coding
3 protein_coding
4 protein_coding
5 protein_coding
6 protein_coding
>
> tail(gene.info)
DataFrame with 6 rows and 5 columns
                      ID          Dbxref        Name        gene
             <character> <CharacterList> <character> <character>
[29808,] gene-A5N79_gr03 GeneID:27215487        rrnL        rrnL
[29809,] gene-A5N79_gr02 GeneID:27215494        rrn5        rrn5
[29810,] gene-A5N79_gr01 GeneID:27215493        rrnS        rrnS
[29811,] gene-A5N79_gp02 GeneID:27215495        sdh3        sdh3
[29812,] gene-A5N79_gp28 GeneID:27215502        nad2        nad2
[29813,] gene-A5N79_gp28 GeneID:27215502        nad2        nad2
           gene_biotype
            <character>
[29808,]           rRNA
[29809,]           rRNA
[29810,]           rRNA
[29811,] protein_coding
[29812,] protein_coding
[29813,] protein_coding
>
> # entries in column name/gene are thus symbols; the same you should have
>
> # define some fore- and background genes (as symbols)
> all.genes.symbol <- gene.info$gene
> foreground.genes.symbol <- all.genes.symbol[1:150]
> 
> # run enrichGO using gene symbols, and use the first 150 genes as foreground.
> # note that these 150 genes are all LOCs....
> head(foreground.genes.symbol)
[1] "LOC115705401" "LOC115705987" "LOC115706290" "LOC115707251"
[5] "LOC115705027" "LOC115705026"
>
> tail(foreground.genes.symbol)
[1] "LOC115707561" "LOC115706584" "LOC115706588" "LOC115706580"
[5] "LOC115707605" "LOC115706586"
>
> ego_bp_aga <- enrichGO(gene           = foreground.genes.symbol,
+                         universe      = all.genes.symbol,
+                         OrgDb         = Csativa,
+                         keyType       = "SYMBOL",
+                         ont           = "BP",
+                         pAdjustMethod = "BH",
+                         pvalueCutoff  = 1,
+                         qvalueCutoff  = 1,
+                         readable      = TRUE)
--> No gene can be mapped....
--> Expected input gene ID: ndhB,atpF,rpoC1,petB,psbB,nad4L
--> return NULL...
>
> # !! got same error as reported !!
>
> # ... but do these 150 LOCs have a GO annotation at all?
> # answer = NO!
ADD REPLY
0
Entering edit mode

Part 2; continuation of Part 1 above...

> # !! got same error as reported !!
>
> # ... but do these 150 LOCs have a GO annotation at all?
> # answer = NO!
> 
> # OrgDb 'Csativa' downloaded from AnnotationHub as per code in 1st post. 
> AnnotationDbi::select(
+   Csativa,
+   keys = foreground.genes.symbol,
+   keytype = "SYMBOL",
+   column = c("SYMBOL", "ENTREZID", "GO")
+ )
'select()' returned 1:1 mapping between keys and columns
          SYMBOL  ENTREZID   GO
1   LOC115705401 115705401 <NA>
2   LOC115705987 115705987 <NA>
3   LOC115706290 115706290 <NA>
4   LOC115707251 115707251 <NA>
5   LOC115705027 115705027 <NA>
<<snip>>
146 LOC115706584 115706584 <NA>
147 LOC115706588 115706588 <NA>
148 LOC115706580 115706580 <NA>
149 LOC115707605 115707605 <NA>
150 LOC115706586 115706586 <NA>
>
>
> # Mmm, but how many genes do then have a GO annotation?
> annotation.all.genes <- AnnotationDbi::select(
+   Csativa,
+   keys = all.genes.symbol,
+   keytype = "SYMBOL",
+   column = c("SYMBOL", "ENTREZID", "GO")
+ )
'select()' returned many:many mapping between keys and columns
>
> dim(annotation.all.genes) # 35796 genes now (because multiple GO annotations per gene)
[1] 35796     3
> 
> # how many unique entrez genes do have a GO annotation? 53 only!!
> dim ( annotation.all.genes[!is.na(annotation.all.genes$GO) & !duplicated(annotation.all.genes$ENTREZID ), ] )
[1] 53  3
> annotation.all.genes[!is.na(annotation.all.genes$GO) & !duplicated(annotation.all.genes$ENTREZID ), ]
      SYMBOL ENTREZID         GO
29755   nad1 27215501 GO:0016021
29761  rpl10 27215482 GO:0005739
29763   nad9 27215469 GO:0005739
<<snip>>
> # when based on symbols it is even a smaller number..! 35 only
> dim ( annotation.all.genes[!is.na(annotation.all.genes$GO) & !duplicated(annotation.all.genes$SYMBOL ), ] )
[1] 35  3
> annotation.all.genes[!is.na(annotation.all.genes$GO) & !duplicated(annotation.all.genes$SYMBOL ), ]
      SYMBOL ENTREZID         GO
29755   nad1 27215501 GO:0016021
29761  rpl10 27215482 GO:0005739
29763   nad9 27215469 GO:0005739
29774   atp1 27215489 GO:0005739
<<snip>>
> 
> # proof-of-concept it works when using genes that have a GO annotation: success!!
> all.genes.symbol.2 <- annotation.all.genes[!is.na(annotation.all.genes$GO) & !duplicated(annotation.all.genes$SYMBOL ), ]$SYMBOL
> foreground.genes.symbol.2 <- all.genes.symbol.2[1:10]
> ego_bp_aga <- enrichGO(gene           = foreground.genes.symbol.2,
+                         universe      = all.genes.symbol.2, #can also be all.genes.symbol (=all genes)
+                         OrgDb         = Csativa,
+                         keyType       = "SYMBOL",
+                         ont           = "BP",
+                         pAdjustMethod = "BH",
+                         pvalueCutoff  = 1,
+                         qvalueCutoff  = 1,
+                         readable      = TRUE)
> # Success!
> ego_bp_aga
#
# over-representation test
#
#...@organism    Cannabis sativa 
#...@ontology    BP 
#...@keytype     SYMBOL 
#...@gene        chr [1:10] "nad1" "rpl10" "nad9" "atp1" "mttB" "ccmB" "nad4L" "atp4" ...
#...pvalues adjusted by 'BH' with cutoff <1 
#...15 enriched terms found
'data.frame':   15 obs. of  9 variables:
 $ ID         : chr  "GO:0046034" "GO:0009058" "GO:0044249" "GO:0044271" ...
 $ Description: chr  "ATP metabolic process" "biosynthetic process" "cellular biosynthetic process" "cellular nitrogen compound biosynthetic process" ...
 $ GeneRatio  : chr  "3/6" "2/6" "2/6" "2/6" ...
 $ BgRatio    : chr  "10/27" "12/27" "12/27" "12/27" ...
 $ pvalue     : num  0.387 0.861 0.861 0.861 0.861 ...
 $ p.adjust   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ qvalue     : num  1 1 1 1 1 1 1 1 1 1 ...
 $ geneID     : chr  "nad4L/atp4/atp6" "atp4/atp6" "atp4/atp6" "atp4/atp6" ...
 $ Count      : int  3 2 2 2 2 2 2 2 2 2 ...
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

 > # 2nd proof-of-concept: if at least 2 genes with GO annotation are added to the LOCs, also succes!
> foreground.genes.symbol.add <- c("nad1", "atp4", foreground.genes.symbol)
>  ego_bp_aga <- enrichGO(gene           = foreground.genes.symbol.add,
+                          universe      = all.genes.symbol,
+                          OrgDb         = Csativa,
+                          keyType       = "SYMBOL",
+                          ont           = "BP",
+                          pAdjustMethod = "BH",
+                          pvalueCutoff  = 1,
+                          qvalueCutoff  = 1,
+                          readable      = TRUE)
> # Success!
> ego_bp_aga
#
# over-representation test
#
#...@organism    Cannabis sativa 
#...@ontology    BP 
#...@keytype     SYMBOL 
#...@gene        chr [1:152] "nad1" "atp4" "LOC115705401" "LOC115705987" "LOC115706290" ...
#...pvalues adjusted by 'BH' with cutoff <1 
#...15 enriched terms found
'data.frame':   15 obs. of  9 variables:
 $ ID         : chr  "GO:0046034" "GO:0009058" "GO:0044249" "GO:0044271" ...
 $ Description: chr  "ATP metabolic process" "biosynthetic process" "cellular biosynthetic process" "cellular nitrogen compound biosynthetic process" ...
 $ GeneRatio  : chr  "1/1" "1/1" "1/1" "1/1" ...
 $ BgRatio    : chr  "10/27" "12/27" "12/27" "12/27" ...
 $ pvalue     : num  0.37 0.444 0.444 0.444 0.444 ...
 $ p.adjust   : num  0.722 0.722 0.722 0.722 0.722 ...
 $ qvalue     : num  0.722 0.722 0.722 0.722 0.722 ...
 $ geneID     : chr  "atp4" "atp4" "atp4" "atp4" ...
 $ Count      : int  1 1 1 1 1 1 1 1 1 1 ...
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

> 
ADD REPLY
0
Entering edit mode

I have this, can this be used for GO in R?

115709282       LOC115709282    phenylalanine ammonia-lyase
115699364       LOC115699364    4-coumarate--CoA ligase 1
115725705       LOC115725705    U4/U6.U5 small nuclear ribonucleoprotein 27 kDa protein
115725515       LOC115725515    uncharacterized LOC115725515
115725459       LOC115725459    histone H2A.6
115725416       LOC115725416    protein-S-isoprenylcysteine O-methyltransferase A
115725321       LOC115725321    ubiquitin carboxyl-terminal hydrolase 23
115725219       LOC115725219    tryptophan decarboxylase TDC2
115725203       LOC115725203    YTH domain-containing protein ECT4
115725202       LOC115725202    cyclic nucleotide-gated ion channel 4
ADD REPLY
0
Entering edit mode
Lucía ▴ 30
@16997962
Last seen 2.5 years ago
Canada

Hi, thanks so much for all your replies and running the code and confirming.

I have not seen any GO term annotations for Cannabis so far. Do you happen to know a good program to get the GO category mappings in the first place? Any advice would be appreciated.

ADD COMMENT
0
Entering edit mode

Since I almost exclusively work with data from model organisms, I never needed to do this. So I don't have any hands-on experience with this. I know from the past (literature) that the tool Blast2Go could do this. However, the use of Blast2Go is not for free anymore.

Recently I heard about the tool TOA (Taxonomy-oriented Annotation). See here for paper, and here for accompanying code. Might be useful. Idem for the web application TRAPID 2.0.

Good luck!

ADD REPLY

Login before adding your answer.

Traffic: 403 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6