Question

It's possible to use clusterProfiler without an annotationDBI package

0

Entering edit mode

nromerov • 0

@3ef12fd6

Last seen 3 days ago

Colombia

Hello Im trying to do an enrichment analisis of a set of differential expressed genes after using DESeq2 my genes are from Hydractinia symbiolongicarpus a non conventional model cnidarian and come from an immujne challegne i did but my species dont have a annotationDBI package associated and i was trying to use clusterProfiler i saw here that is possible to develop an annotation dbi package but im very naive at programming so its possible to use clusterProfiler without the annotationDBI package?

clusterProfiler • 634 views

ADD COMMENT • link 5 months ago nromerov • 0

score 0 · Answer 1 · 2023-11-27

0

Entering edit mode

Guido Hooiveld ★ 3.9k

@guido-hooiveld-2020

Last seen 1 day ago

Wageningen University, Wageningen, the …

Yes, you can use clusterProfiler without having a (formal) annotationDBI package. Having just a table with annotation information is sufficient for clusterProfiler to use as input. This is a table linking gene ids with e.g. Gene Ontology categories. For clusterProfiler this means using the generic enrichment functions enricher or GSEA together with a data.frame that should be provided for the argument TERM2GENE. This post of mine may give you some leads and inspiration: GO enrichment analysis on Solanum lycopersicum proteomics dataset (UniProt IDs)

Obviously, that does not mean you can do any functional enrichment analysis without such annotation information! This information is always required, should be provided, and might be obtained through e.g. databases/sites specific for your species. Yet, having a quick look at NCBI seems to suggest such functional annotation is not available for Hydractinia symbiolongicarpus. (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?lvl=0&id=13093)

However, the KEGG database seems to have annotation (pathway) information available: https://www.genome.jp/kegg-bin/show_organism?menu_type=pathway_maps&org=hsy. To use this in clusterProfiler, check out the functions enrichKEGG and gseKEGG.

ADD COMMENT • link 5 months ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

Hello yes i saw a post someone else send me that you posted before here and i was following it but it retrieves me this error when i use enrichr

> geneList = eggnog_degs[,"log2FoldChange"]
> names(geneList) = as.character(eggnog_degs[,"gene_id"])
> gene <- names(geneList)[abs(geneList) > 2]
> geneList = sort(geneList, decreasing = TRUE)
> x <- enricher(gene, TERM2GENE = GO)
--> No gene can be mapped....
--> Expected input gene ID: GO:0055080,GO:0031267,GO:0044430,GO:0016192,GO:0044265,GO:0009889
--> return NULL...

Gen is like this:

>print(gene)
  [1] "LOC130612230" "LOC130612541" "LOC130613351" "LOC130613446" "LOC130613454"
  [6] "LOC130613474" "LOC130613483" "LOC130613503" "LOC130613561" "LOC130614295"
 [11] "LOC130614308" "LOC130614690" "LOC130621222" "LOC130621262" "LOC130621439"
 [16] "LOC130621674" "LOC130621944" "LOC130622477" "LOC130623434" "LOC130624017"
 [21] "LOC130624087" "LOC130624108" "LOC130624137" "LOC130624243" "LOC130626199"
 [26] "LOC130628649" "LOC130628695" "LOC130628907" "LOC130629018" "LOC130629420"

and GO its like this

> print(GO)
# A tibble: 322,067 × 2
   gene_id      GO        
   <chr>        <chr>     
 1 LOC130612038 GO:0001085
 2 LOC130612038 GO:0001725
 3 LOC130612038 GO:0002026
 4 LOC130612038 GO:0003008
 5 LOC130612038 GO:0003012
 6 LOC130612038 GO:0003013
 7 LOC130612038 GO:0003015
 8 LOC130612038 GO:0003300
 9 LOC130612038 GO:0003674
10 LOC130612038 GO:0003779

And when i seach by KEGG clusterProfiler returns me this:

> search_kegg_organism('hsy', by='kegg_code')
[1] kegg_code       scientific_name common_name    
<0 rows> (or 0-length row.names)

Im not good at coding so im quite desperate rigt now as i dont know how to fix this

ADD REPLY • link 5 months ago nromerov • 0

0

Entering edit mode

You are almost there...

Yet

pay attention to the error message (--> No gene can be mapped.... --> Expected input gene ID: GO:0055080, ....): you should infer that you just will have to swap the order of the columns in your tibble GO.
always read the help pages of each function; then you would have seen that for the function search_kegg_organism the argument use_internal_data by default is set to TRUE. This means it will use the KEGG information provided by the package clusterProfiler, which likely is outdated; you better set it to FALSE (from the help page: use_internal_data: logical, use kegg_species.rda or latest online KEGG data).
Also: you are working with so-called tibbles, but many R/Bioconductor packages are not compatible with it. So you'd better also try/use the 'classical' data.frames as input.

ADD REPLY • link 5 months ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

Code to show it works:

> library(clusterProfiler)
> 
> ## use provided 10 IDs, plus as last one add ID that has a GO annotation.
> gene <- c("LOC130612230","LOC130612541","LOC130613351","LOC130613446","LOC130613454",
+           "LOC130613474","LOC130613483","LOC130613503","LOC130613561","LOC130614295",
+           "LOC130614308","LOC130614690","LOC130621222","LOC130621262","LOC130621439",
+           "LOC130621674","LOC130621944","LOC130622477","LOC130623434","LOC130624017",
+           "LOC130624087","LOC130624108","LOC130624137","LOC130624243","LOC130626199",
+           "LOC130628649","LOC130628695","LOC130628907","LOC130629018","LOC130629420",
+           "LOC130612038") #only last gene has GO annotation in data.frame GO below
> 
> GO <- data.frame("gene_id"=c("LOC130612038","LOC130612038","LOC130612038","LOC130612038","LOC130612038",
+                              "LOC130612038","LOC130612038","LOC130612038","LOC130612038","LOC130612038"),
+                  "GO"=c("GO:0001085","GO:0001725","GO:0002026","GO:0003008","GO:0003012","GO:0003013",
+                         "GO:0003015","GO:0003300","GO:0003674","GO:0003779") )
> 
> ## TERM2GENE: 1st columns shouls be GOID, 2nd column gene ID!
> ## in order to check coding is OK: set minGSSize to 1, and pvalueCutoff to 1, so no
> ## filtering at all is occuring.
> x <- enricher(
+   gene =gene,
+   pvalueCutoff = 1,
+   pAdjustMethod = "BH",
+   universe = NULL,
+   minGSSize = 1,
+   maxGSSize = 500,
+   qvalueCutoff = 1,
+   TERM2GENE=GO[,c("GO","gene_id")] )
> 
> as.data.frame(x)
                   ID Description GeneRatio BgRatio pvalue p.adjust qvalue
GO:0001085 GO:0001085  GO:0001085       1/1     1/1      1        1      1
GO:0001725 GO:0001725  GO:0001725       1/1     1/1      1        1      1
GO:0002026 GO:0002026  GO:0002026       1/1     1/1      1        1      1
GO:0003008 GO:0003008  GO:0003008       1/1     1/1      1        1      1
GO:0003012 GO:0003012  GO:0003012       1/1     1/1      1        1      1
GO:0003013 GO:0003013  GO:0003013       1/1     1/1      1        1      1
GO:0003015 GO:0003015  GO:0003015       1/1     1/1      1        1      1
GO:0003300 GO:0003300  GO:0003300       1/1     1/1      1        1      1
GO:0003674 GO:0003674  GO:0003674       1/1     1/1      1        1      1
GO:0003779 GO:0003779  GO:0003779       1/1     1/1      1        1      1
                 geneID Count
GO:0001085 LOC130612038     1
GO:0001725 LOC130612038     1
GO:0002026 LOC130612038     1
GO:0003008 LOC130612038     1
GO:0003012 LOC130612038     1
GO:0003013 LOC130612038     1
GO:0003015 LOC130612038     1
GO:0003300 LOC130612038     1
GO:0003674 LOC130612038     1
GO:0003779 LOC130612038     1
> 
> ## set argument 'use_internal_data' to FALSE to use the up-to-date
> ## online information; the KEGG data provided with the package is outdated!
> 
> search_kegg_organism('hsy', by='kegg_code', use_internal_data = FALSE)
    kegg_code               scientific_name                   common_name
597       hsy Hydractinia symbiolongicarpus Hydractinia symbiolongicarpus
> 
> ## be sure to strip off "LOC" from the provided ids, since the KEGG ids do not contain this string!
> y <- enrichKEGG(
+   gene= sub("LOC", "", gene ),
+   organism = "hsy",
+   keyType = "kegg",
+   pvalueCutoff = 1,
+   pAdjustMethod = "BH",
+   minGSSize = 1,
+   maxGSSize = 500,
+   qvalueCutoff = 1,
+   use_internal_data = FALSE)
> 
> as.data.frame(y)
           category                        subcategory       ID
hsy00511 Metabolism Glycan biosynthesis and metabolism hsy00511
hsy00513 Metabolism Glycan biosynthesis and metabolism hsy00513
                                                                    Description
hsy00511               Other glycan degradation - Hydractinia symbiolongicarpus
hsy00513 Various types of N-glycan biosynthesis - Hydractinia symbiolongicarpus
         GeneRatio BgRatio      pvalue   p.adjust qvalue    geneID Count
hsy00511       1/1 18/3350 0.005373134 0.01074627     NA 130612230     1
hsy00513       1/1 42/3350 0.012537313 0.01253731     NA 130612230     1

ADD REPLY • link 5 months ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

Also do you know if the gaphical functions as dotplot, enrichmap and cnetplot works on this generic analysis? still sorry for bothering you

ADD REPLY • link 5 months ago nromerov • 0

0

Entering edit mode

Did you try it yourselves? But the answer is yes.

ADD REPLY • link 5 months ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

Yeah i tryed last night and it returned an error, i seach and found somo functions are deprecated, tanks a lot for your help

ADD REPLY • link 5 months ago nromerov • 0

0

Entering edit mode

Thank you very much for your help iu was really deseperate about this that i wasn't thinking too well yesterday im really gratefull with you

And sorry for bothering you again but do you know why this error courred?

>gene <- names(geneList)[abs(geneList) > 2]
>geneList = sort(geneList, decreasing = TRUE)
> y <- GSEA(geneList, TERM2GENE = TERM2GENE)
preparing geneSet collections...
GSEA analysis...
Error in serialize(data, node$con) : error writing to connection
In addition: Warning messages:
1: In serialize(data, node$con) :
  'package:stats' may not be available when loading
2: In serialize(data, node$con) :
  'package:stats' may not be available when loading
3: In serialize(data, node$con) :
  'package:stats' may not be available when loading
4: In serialize(data, node$con) :
  'package:stats' may not be available when loading
Error in serialize(data, node$con) : error writing to connection

ADD REPLY • link 5 months ago nromerov • 0

0

Entering edit mode

No, I don't.

Confirm that geneList has the expected format. What is the outcome of geneList[1:5], str(geneList) and class(geneList)?

What is you sessionInfo()? Please make sure to use the latest versions of R, Bioconductor and (if you are using that) R-studio!

ADD REPLY • link 5 months ago Guido Hooiveld ★ 3.9k