Some time ago I had a similar challenge (i.e. having Uniprot ids as main identifier, and wanting to perform GO analysis with clusterProfiler
), and in the end I got it working.
The key thing is to know about the existence of the Gene Ontology Annotation (GOA) group at the EBI. These are the people that provide the GO annotation to the UniProt database. Very conveniently, they also provide annotation sets for full (inferred) proteomes for download, including tomato!
To find the file for your organism of interest, you should check here (website; https://www.ebi.ac.uk/GOA/proteomes) or here (FTP server; http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ ); file for tomato is called 284179.S_lycopersicum_genome_cv_Heinz_1706.goa
(direct link (http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/284179.S_lycopersicum_genome_cv_Heinz_1706.goa)). Note: ftp links don't work anymore in Chome or Firefox!?
The next challenge could be the import of such GOA file, but luckily you can make use of some infrastructure already existing in Bioconductor, for example the package mgsa
here.
With respect to clusterProfiler
(here), the section on 'Supported organisms' states when there is no suitable OrgDb
object is available, functions (enricher()
and gseGO()
) can also be run when GO annotation is available in data frame format (with 2 columns).
Also notice statement regarding the difference between direct and indirect GO annotation, and in case of the former the need to use the function buildGOmap()
to annotate ancestor GO nodes. Since the function readGAF()
apparently also does this (see ?readGAF
), there was no need to run buildGOmap()
in the code below. Yet, since I am not 100% certain about this, I have included few (commented) lines of code to show how to do this, for when needed...
Below find some code that I used (adapted for tomato). Realize this is one way of doing it, but certainly not the only one. Also check whether you may need to change (default) values of the arguments for each function.
library(mgsa)
GAF <- readGAF(filename="284179.S_lycopersicum_genome_cv_Heinz_1706.goa")
mapping.index <- GAF@itemName2ItemIndex
ID.annotations <- itemAnnotations(GAF)
GO.sets <- GAF@sets
GO.annotation <- setAnnotations(GAF)
GO.df <- data.frame("GOID" = rep(names(GO.sets), sapply(GO.sets, length)),
"ID.index" = unlist(GO.sets), row.names = NULL)
GO.annotation <- GO.annotation[GO.annotation[,"term"] != "all", ]
GO.annotation[,"GOID"] <- rownames(GO.annotation)
GO.df <- GO.df[GO.df[,"GOID"] != "all", ]
GO.df[,"UNIPROTKB"] <- names(mapping.index [GO.df[,"ID.index"] ])
head(GO.annotation)
term
GO:0000002 mitochondrial genome maintenance
GO:0000003 reproduction
GO:0000009 alpha-1,6-mannosyltransferase activity
GO:0000012 single strand break repair
GO:0000014 single-stranded DNA endodeoxyribonuclease activity
GO:0000015 phosphopyruvate hydratase complex
definition
GO:0000002 The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome.
GO:0000003 The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms.
GO:0000009 Catalysis of the transfer of a mannose residue to an oligosaccharide, forming an alpha-(1->6) linkage.
GO:0000012 The repair of single strand breaks in DNA. Repair of such breaks is mediated by the same enzyme systems as are used in base excision repair.
GO:0000014 Catalysis of the hydrolysis of ester linkages within a single-stranded deoxyribonucleic acid molecule by creating internal breaks.
GO:0000015 A multimeric enzyme complex, usually a dimer or an octamer, that catalyzes the conversion of 2-phospho-D-glycerate to phosphoenolpyruvate and water.
GOID
GO:0000002 GO:0000002
GO:0000003 GO:0000003
GO:0000009 GO:0000009
GO:0000012 GO:0000012
GO:0000014 GO:0000014
GO:0000015 GO:0000015
head(GO.df)
GOID ID.index UNIPROTKB
22438 GO:0000002 20346 B1N670
22439 GO:0000003 22 A0A3Q7E6Y9
22440 GO:0000003 81 A0A3Q7E826
22441 GO:0000003 129 A0A3Q7E8L7
22442 GO:0000003 221 A0A3Q7E9I2
22443 GO:0000003 273 A0A3Q7EA62
library(clusterProfiler)
proteins.all <- rownames(ID.annotations)
proteins.random <- base::sample(proteins.all, 1750)
res.GO.ora <- enricher(
gene=proteins.random,
pvalueCutoff = 1,
pAdjustMethod = "none",
universe=proteins.all,
minGSSize = 10,
maxGSSize = 500,
qvalueCutoff = 1,
TERM2GENE = GO.df[ ,c("GOID","UNIPROTKB")],
TERM2NAME = GO.annotation[ ,c("GOID", "term")]
)
as.data.frame(res.GO.ora)[1:15,]
library(enrichplot)
barplot(res.GO.ora, showCategory=10)
dotplot(res.GO.ora, showCategory=10)
cnetplot(res.GO.ora, showCategory=10)
heatplot(res.GO.ora, showCategory=10)
emapplot( pairwise_termsim(res.GO.ora), node_scale=1.5,layout="kk" )
uni2kegg <- bitr_kegg( proteins.all, fromType="uniprot",
toType='kegg', organism='sly')
kk <- enrichKEGG(
gene = proteins.random,
organism = "sly",
keyType = "uniprot",
pvalueCutoff = 1,
pAdjustMethod = "none",
universe = proteins.all,
minGSSize = 10,
maxGSSize = 500,
qvalueCutoff = 1
)
as.data.frame(kk)[1:15,]
protList <- runif(n = 17500, min = -15, max = 15)
names(protList) <- base::sample( proteins.all, 17500)
protList <- sort(protList, decreasing = TRUE)
kk2 <- gseKEGG(geneList = protList,
organism = "sly",
keyType = "uniprot",
eps = 0.0,
minGSSize = 10,
maxGSSize = 500,
pAdjustMethod = "none",
pvalueCutoff = 0.9,
verbose = FALSE)
as.data.frame(kk2)[1:15,]
clusterData <- list(
"Cluster 1" = base::sample(proteins.all, 19500),
"Cluster 2" = base::sample(proteins.all, 18900),
"Cluster 3" = base::sample(proteins.all, 22000),
"Cluster 4" = base::sample(proteins.all, 20500) )
xx <- compareCluster(clusterData, fun="enrichKEGG",
organism="sly", keyType = "uniprot",
pAdjustMethod = "none", pvalueCutoff=1.0)
emapplot( pairwise_termsim(xx) )
Minor comment: the
AnnotationHub
ID fororg.Sl.eg.db
is"AH85737"
(and not"AH80808"
)! But this will unfortunately not change your 'problem'...