I would like to run an enrichment analysis with clusterProfiler on Candida auris proteomics data. I have looked for a genome-wide annotation for Candida auris but have not managed to find one. Could you point me in the right direction to obtain a Bioconductor genome-wide annotation for Candida auris?
You have tagged AnnotationHub in your question, so I presume you know about that. Is there no C. auris data available? Also, what do you mean by 'genome wide annotation'? Are you looking for genetic position information or gene type annotations?
After importing the *.goa file I was able to perform a GO overrepresentation analysis (using the code I linked to and some random UniProt IDs).
GAF <- readGAF(filename="4144447.[_auris_B8441.goa")
<<snip>>
dotplot(res.GO.ora, showCategory=10)
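For readers who want to see roughly what the import step involves, here is a minimal base-R sketch of turning a GAF file into the two-column term/gene table that clusterProfiler's enricher() accepts through its TERM2GENE argument. This is not the snipped code above. The helper names (read_gaf_basic, gaf_to_term2gene) and the demo accessions are made up for illustration, and the column positions assume the standard GAF 2.x layout (column 2 = object ID, column 5 = GO ID, column 9 = aspect).

```r
# Read a GAF 2.x file: tab-delimited, header/comment lines start with "!".
# Only the three columns needed for an enrichment test are kept.
read_gaf_basic <- function(path) {
  gaf <- read.delim(path, header = FALSE, comment.char = "!", quote = "",
                    colClasses = "character", fill = TRUE)
  data.frame(protein = gaf[[2]], go_id = gaf[[5]], aspect = gaf[[9]],
             stringsAsFactors = FALSE)
}

# Build a TERM2GENE table (term first, gene second) for one GO aspect:
# "P" = biological process, "F" = molecular function, "C" = cellular component.
gaf_to_term2gene <- function(gaf, aspect = "P") {
  t2g <- unique(gaf[gaf$aspect == aspect, c("go_id", "protein")])
  names(t2g) <- c("term", "gene")
  rownames(t2g) <- NULL
  t2g
}

# Tiny made-up GAF file so the sketch runs without downloading anything.
gaf_line <- function(acc, go, aspect, evidence = "IEA") {
  paste("UniProtKB", acc, "SYMBOL", "", go, "GO_REF:0000002", evidence, "",
        aspect, "", "", "protein", "taxon:498019", "20200101", "UniProt",
        "", "", sep = "\t")
}
demo_file <- tempfile(fileext = ".goa")
writeLines(c("!gaf-version: 2.2",
             gaf_line("TEST0001", "GO:0006412", "P"),
             gaf_line("TEST0001", "GO:0006412", "P", evidence = "IBA"),
             gaf_line("TEST0002", "GO:0006412", "P"),
             gaf_line("TEST0001", "GO:0005524", "F")),
           demo_file)
demo_t2g <- gaf_to_term2gene(read_gaf_basic(demo_file), aspect = "P")

# With real data (not run here), something along these lines should work:
# t2g <- gaf_to_term2gene(read_gaf_basic("4144447.[_auris_B8441.goa"))
# res <- clusterProfiler::enricher(gene = my_uniprot_ids, TERM2GENE = t2g)
```

Here my_uniprot_ids stands for your own vector of UniProt accessions. Duplicate annotations that differ only in evidence code are collapsed, so each protein counts once per GO term.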
clusterProfiler also allows you to easily analyze your data using pathway information from the KEGG database. Although your organism is 'known' by KEGG (here; organism ID = caur), somehow C. auris UniProt IDs have not been translated to KEGG identifiers (nor, it seems, are C. auris-specific pathways available). Compare for example this (empty) conversion output for C. auris with the same output for S. cerevisiae. In other words, KEGG pathway analysis for C. auris is not possible.
There isn't an OrgDb for C. auris on the AnnotationHub, so you will need to generate one yourself, using makeOrgPackageFromNCBI in the AnnotationForge package. You can emulate the example in ?makeOrgPackageFromNCBI, substituting Candida and auris for genus and species, and 498019 for the tax_id. You could also use your actual email and name, but it doesn't really matter so long as the maintainer field has a name and email, and the email is bracketed by < and >. If you don't do that the package won't install.
Make sure you have the current version of AnnotationForge (AnnotationForge_1.35.2). I patched a bug recently that would cause it to error out on a species like this. And be prepared to wait for a while - that function downloads and parses a huge amount of data.
Once you have the package built you can then do install.packages("org.Cauris.eg.db", repos = NULL) and if you are on Windows, add a type = "source" to the install.packages arguments.
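Put together, the steps above might look like the sketch below. It follows the argument names from the example in ?makeOrgPackageFromNCBI. The name and email are placeholders, and is_valid_maintainer and build_cauris_orgdb are hypothetical helpers, with the first being only a rough check of the "Name <email>" shape. The build itself is wrapped in a function and not called, because it downloads and parses a lot of data.

```r
# Placeholder maintainer string: a name followed by an email inside < >.
maintainer <- "Some One <so@someplace.org>"

# Rough sanity check (an assumption about what a usable field looks like,
# not the exact rule R applies when installing the package).
is_valid_maintainer <- function(x) {
  grepl("^[^<>]+ <[^<>@ ]+@[^<>@ ]+>$", x)
}
stopifnot(is_valid_maintainer(maintainer))

# Build the OrgDb package for Candida auris (tax ID 498019) and install it.
build_cauris_orgdb <- function(out_dir = ".") {
  AnnotationForge::makeOrgPackageFromNCBI(
    version    = "0.1",
    author     = maintainer,
    maintainer = maintainer,
    outputDir  = out_dir,
    tax_id     = "498019",
    genus      = "Candida",
    species    = "auris"
  )
  # The package directory is created inside out_dir.
  install.packages(file.path(out_dir, "org.Cauris.eg.db"),
                   repos = NULL, type = "source")
}

# Not run automatically; expect a long wait:
# build_cauris_orgdb()
```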
@James this is the clusterProfiler R script I use for enrichment analysis. The data being analyzed are Candida auris proteomics data. The data can be downloaded from differential analysis data. I would appreciate it if you could help me solve this problem.
setwd("C:\\Users\\Javan\\Desktop\\NelsonSoares\\candidaProject\\DifferentialsPx")
library(clusterProfiler)
#library(org.Hs.eg.db)
library(enrichplot)
library(dplyr)
library(pathview)
library(proteus)
library(org.Sc.sgd.db)
#library(org.Mm.eg.db)
keytypes(org.Sc.sgd.db) #Show the database keytypes
data <- read.csv("SA01-SB01.csv",header = T,sep = ',')
colnames(data)
data <- dplyr::select(data, X,EffectSize,pValue) ; dim(data)
data = subset(data,EffectSize >= 1.5 | pValue < 0.05 ) ;dim(data)#| EffectSize <= -1);dim(data)
gene <- data$X# extract Gene names
# this translates the protein IDs to ENTREZID
gene.df <- bitr(gene, fromType = "UNIPROT", toType = "ENTREZID",OrgDb = org.Sc.sgd.db) ; dim(gene.df) # This is the stage which is failing.
# Make a geneList for some future functions
geneList <- gene.df$ENTREZID
names(geneList) <- as.character(gene.df$UNIPROT)
geneList <- sort(geneList, decreasing = TRUE)
# gene enrichment analysis cnplots are commented out as they look crazy with a large number of proteins
## BP
ego_BP2 <- enrichGO(gene = gene.df$ENTREZID,
                    OrgDb = org.Sc.sgd.db,
                    ont = "BP",
                    pAdjustMethod = "BH",
                    readable = TRUE,
                    pvalueCutoff = 0.01,
                    qvalueCutoff = 0.05)
head(ego_BP2,10) #check the first 10 entries from ego_BP
df = as.data.frame(ego_BP2)
write.csv(df,"LTBI_B1vsPPD_GO_BP.csv")
ego2 <- simplify(ego_BP2) ; dim(ego2) # remove redundant GO terms first
dotplot(ego2, showCategory=24,x="count",font.size = 9,title=" ")
#Barplot
barplot(ego2,
        drop = TRUE,
        showCategory = 20,
        title = " ",
        font.size = 9,
        x = "count")
If you have C. auris data, then you need to create an OrgDb, as I have already noted in my previous answer. The org.Sc.sgd.db package contains data for S. cerevisiae. While both are yeasts, I would imagine that the UniProt IDs for S. cerevisiae are not the same as for C. auris. You probably need to follow my existing advice to build an OrgDb package for the actual species you are working with.
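A quick way to check this before calling bitr is to count how many of your IDs are actually valid keys in the annotation package. The sketch below is plain base R. key_overlap is a hypothetical helper and the demo IDs are made up. With a real OrgDb the known keys would come from something like AnnotationDbi::keys(org.Sc.sgd.db, keytype = "UNIPROT").

```r
# Compare a vector of IDs against the keys an annotation package knows about.
key_overlap <- function(ids, known_keys) {
  ids <- unique(as.character(ids))
  hit <- ids %in% known_keys
  list(n_ids   = length(ids),   # distinct IDs supplied
       n_found = sum(hit),      # IDs present among the known keys
       missing = ids[!hit])     # IDs that cannot be mapped
}

# Made-up accessions, only to show the shape of the result.
demo <- key_overlap(ids = c("ID0001", "ID0002", "ID0001"),
                    known_keys = c("ID0002", "ID9999"))

# With real data (not run here):
# known <- AnnotationDbi::keys(org.Sc.sgd.db, keytype = "UNIPROT")
# key_overlap(gene, known)$n_found
```

If n_found is zero, none of the IDs can be mapped with that package, which would be consistent with the bitr step failing.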
@James I did create org.Cauris.eg.db, but the UniProt key type is missing when I run keytypes(org.Cauris.eg.db), as below. Is there a way I can add the UniProt information to the package? FYI, this is my first Bioconductor package and I am happy.
Ah, I get it. No, the UniProt data aren't added to a package generated using makeOrgPackageFromNCBI. That would require an additional download step, and it's already painful enough as it is. Looking at uniprot.org, it doesn't seem like many (if any) of the genes have an NCBI Gene ID, or a GID for that matter.
What do you get from
length(keys(org.Cauris.eg.db))
## and
head(keys(org.Cauris.eg.db))
If you have UniProt KB IDs, you can do a test to see what UniProt has for them, by doing something like
## use a subset of your genes
genesub <- gene[1:500] ## or something smaller
URL <- paste0("https://www.uniprot.org/mapping/?from=ACC&to=P_ENTREZGENEID&format=tab&query=", paste(genesub, collapse = "%20"))
read.table(URL, sep = "\t", fill = TRUE, header = TRUE)
If you just get something like
[1] From To
<0 rows> (or 0-length row.names)
that means UniProt doesn't have the mappings, which makes this tough to do.
@James I am using clusterProfiler (https://learn.gencore.bio.nyu.edu/rna-seq-analysis/gene-set-enrichment-analysis/) for the enrichment analysis, and it requires a genome-wide annotation for the organism from Bioconductor.
Since you would like to use clusterProfiler with proteomics data, I would like to refer you to one of my previous posts: GO enrichment analysis on Solanum lycopersicum proteomics dataset (UniProt IDs). The key is that you can make use of the UniProt-based Gene Ontology annotation information that is compiled by the GOA group. For your organism such GO annotations (in GAF format) are luckily also available! It is the file 4144447.[_auris_B8441.goa, available here (direct link: http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/4144447.[_auris_B8441.goa). Please note that there apparently is a spelling error in the file name... the [ should obviously be a C!