GO enrichment analysis on Solanum lycopersicum proteomics dataset (UniProt IDs)
Entering edit mode
Alberto • 0
Last seen 2.4 years ago

Hello! I am analyzing a proteomic dataset of Solanum lycopersicum that is organized as following:

Treated vs Control - in a time series: TP1; TP2; TP3...

On this dataset I need to perform a GeneOntology enrichment analysis in R, and I was planning to do it with clusterProfiler.

I want to perform both ORA, (enrichment analysis on over represented tomato proteins in treated vs control on each time point) and GESA (Gene set enrichment analysis, that in this case would be the whole proteome set for each time point). Now I have two problems that are kind of related to each other.

1) Solanum lycopersicum is not among the org.db model organism supported by clusterprofiler. I tried to import the AnnotationHub Solanum as following but it doesn't seem to work with clusterprofiler.

hub <- AnnotationHub()
query(hub, c("Solanum"))
org.Sl.eg.db <- hub[["AH80808"]]

2) Since I am starting with a proteomics dataset my query IDs are Uniprot-ID and these seem not supported in the org.db of solanum lycopersicum I am trying to make from AnnotationHub

 [1] "ACCNUM"      "ALIAS"       "ENTREZID"    "EVIDENCE"    "EVIDENCEALL" "GENENAME"    "GID"         "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL" "PMID"  "REFSEQ"      "SYMBOL"      "UNIGENE"

I was trying to play around and/or find the way to use clusterProfiler from non model organism, and see if i can start from this kind of table format

uniprotID   GO  ONTOLOGY

Please any suggestion/tip is very welcome.

clusterProfiler AnnotationHub GO uniprot • 4.8k views
Entering edit mode

Minor comment: the AnnotationHub ID for org.Sl.eg.db is "AH85737" (and not "AH80808")! But this will unfortunately not change your 'problem'...

> query(hub, c("Solanum"))
AnnotationHub with 8 records
# snapshotDate(): 2020-10-27
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, Inparanoid8
# $species: Solanum tuberosum, Solanum lycopersicum, Solanum pennellii, Sola...
# $rdataclass: OrgDb, Inparanoid8Db
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH10593"]]' 

  AH10593 | hom.Solanum_lycopersicum.inp8.sqlite              
  AH10606 | hom.Solanum_tuberosum.inp8.sqlite                 
  AH85602 | org.Solanum_pennellii.eg.sqlite                   
  AH85603 | org.Solanum_pennelli.eg.sqlite                    
  AH85660 | org.Solanum_tuberosum.eg.sqlite                   
  AH85736 | org.Solanum_esculentum.eg.sqlite                  
  AH85737 | org.Solanum_lycopersicum.eg.sqlite                
  AH85738 | org.Solanum_lycopersicum_var._humboldtii.eg.sqlite
Entering edit mode
Guido Hooiveld ★ 3.9k
Last seen 8 hours ago
Wageningen University, Wageningen, the …

Some time ago I had a similar challenge (i.e. having Uniprot ids as main identifier, and wanting to perform GO analysis with clusterProfiler), and in the end I got it working. The key thing is to know about the existence of the Gene Ontology Annotation (GOA) group at the EBI. These are the people that provide the GO annotation to the UniProt database. Very conveniently, they also provide annotation sets for full (inferred) proteomes for download, including tomato!

To find the file for your organism of interest, you should check here (website; https://www.ebi.ac.uk/GOA/proteomes) or here (FTP server; http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/ ); file for tomato is called 284179.S_lycopersicum_genome_cv_Heinz_1706.goa (direct link (http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/284179.S_lycopersicum_genome_cv_Heinz_1706.goa)). Note: ftp links don't work anymore in Chome or Firefox!?

The next challenge could be the import of such GOA file, but luckily you can make use of some infrastructure already existing in Bioconductor, for example the package mgsa here.

With respect to clusterProfiler (here), the section on 'Supported organisms' states when there is no suitable OrgDb object is available, functions (enricher() and gseGO()) can also be run when GO annotation is available in data frame format (with 2 columns). Also notice statement regarding the difference between direct and indirect GO annotation, and in case of the former the need to use the function buildGOmap() to annotate ancestor GO nodes. Since the function readGAF() apparently also does this (see ?readGAF), there was no need to run buildGOmap() in the code below. Yet, since I am not 100% certain about this, I have included few (commented) lines of code to show how to do this, for when needed...

Below find some code that I used (adapted for tomato). Realize this is one way of doing it, but certainly not the only one. Also check whether you may need to change (default) values of the arguments for each function.

## Prepare GO annotation

# read GO annotation file
# Note that I manually downloaded the *.goa file, and is in working directory.
GAF <- readGAF(filename="284179.S_lycopersicum_genome_cv_Heinz_1706.goa")

# extract relevant info.
# unfortunately could not find accessor functions for all required
# info, thus sometimes had to utilize object slots directly (with @)
mapping.index <-  GAF@itemName2ItemIndex
ID.annotations <- itemAnnotations(GAF)

GO.sets <- GAF@sets
GO.annotation <- setAnnotations(GAF)

# create a 2-column data frame with GOID and ID index
# after little further processing, this will be used as input for clusterProfiler
GO.df <- data.frame("GOID" = rep(names(GO.sets), sapply(GO.sets, length)),
        "ID.index" = unlist(GO.sets),  row.names = NULL)

# do some processing for objects GO.annotation and GO.df
# in both remove category 'all',
# and to GO.df also add column with Uniprot ids

# GO.annotation
GO.annotation <- GO.annotation[GO.annotation[,"term"] != "all", ]
GO.annotation[,"GOID"] <- rownames(GO.annotation)

# GO.df
GO.df <- GO.df[GO.df[,"GOID"] != "all", ]
GO.df[,"UNIPROTKB"] <- names(mapping.index [GO.df[,"ID.index"] ])

## objects are now ready for use with clusterProfiler!
GO:0000002                   mitochondrial genome maintenance
GO:0000003                                       reproduction
GO:0000009             alpha-1,6-mannosyltransferase activity
GO:0000012                         single strand break repair
GO:0000014 single-stranded DNA endodeoxyribonuclease activity
GO:0000015                  phosphopyruvate hydratase complex
GO:0000002    The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome.
GO:0000003                         The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms.
GO:0000009                                               Catalysis of the transfer of a mannose residue to an oligosaccharide, forming an alpha-(1->6) linkage.
GO:0000012         The repair of single strand breaks in DNA. Repair of such breaks is mediated by the same enzyme systems as are used in base excision repair.
GO:0000014                   Catalysis of the hydrolysis of ester linkages within a single-stranded deoxyribonucleic acid molecule by creating internal breaks.
GO:0000015 A multimeric enzyme complex, usually a dimer or an octamer, that catalyzes the conversion of 2-phospho-D-glycerate to phosphoenolpyruvate and water.
GO:0000002 GO:0000002
GO:0000003 GO:0000003
GO:0000009 GO:0000009
GO:0000012 GO:0000012
GO:0000014 GO:0000014
GO:0000015 GO:0000015

            GOID ID.index  UNIPROTKB
22438 GO:0000002    20346     B1N670
22439 GO:0000003       22 A0A3Q7E6Y9
22440 GO:0000003       81 A0A3Q7E826
22441 GO:0000003      129 A0A3Q7E8L7
22442 GO:0000003      221 A0A3Q7E9I2
22443 GO:0000003      273 A0A3Q7EA62

# if buildGOmap() needs to be applied:
# note that column GOID should be be renamed to GO!
# colnames(GO.df)[1] <- c("GO")
# GO.df2 <- buildGOmap(GO.df[,c("UNIPROTKB", "GO")])

## Functional analysis with clusterProfiler.

# For function analysis some test input is needed. Let's generate that first.
# For ORA analysis, generate the 'background/universe' list (= all
# proteins identified in the tomato proteome), plus some test input
# (='foreground' list), which usually is identified after performing
# statistics on a data set.

# all proteins
proteins.all <- rownames(ID.annotations)
# random selection of 1750 proteins
proteins.random <- base::sample(proteins.all, 1750)

# Perform GO ORA analysis
# Note the use of arguments TERM2GENE and TERM2NAME. Their column
# order is important!
# Also: to show proof-of-principle no cutoff on significance
# is applied! i.e Cutoff=1

res.GO.ora <- enricher(
    pvalueCutoff = 1,
    pAdjustMethod = "none",
    minGSSize = 10,
    maxGSSize = 500,
    qvalueCutoff = 1,
    TERM2GENE = GO.df[ ,c("GOID","UNIPROTKB")],
    TERM2NAME = GO.annotation[ ,c("GOID", "term")]

# check

# generate plots

barplot(res.GO.ora, showCategory=10)
dotplot(res.GO.ora, showCategory=10)
cnetplot(res.GO.ora, showCategory=10)
heatplot(res.GO.ora, showCategory=10)
emapplot( pairwise_termsim(res.GO.ora), node_scale=1.5,layout="kk" )

# Also perform KEGG-based analyses.
# note: only 10841 (~52%) uniprot ids have been annotated by KEGG!
uni2kegg <- bitr_kegg( proteins.all, fromType="uniprot",
    toType='kegg', organism='sly')

kk <- enrichKEGG(
    gene =  proteins.random,
    organism = "sly",
    keyType = "uniprot",
    pvalueCutoff = 1,
    pAdjustMethod = "none",
    universe = proteins.all,
    minGSSize = 10,
    maxGSSize = 500,
    qvalueCutoff = 1

# check

# object kk can be used to generate same plots as above!

# GSEA with KEGG pathways.
# for GSEA a per-gene metric is required that is used for ranking.
# assume dataset consists of 17.5k uniprot ids.
# make up fold changes (= ranking metric) ranging from -15 to 15.

protList <- runif(n = 17500, min = -15, max = 15)
names(protList) <- base::sample( proteins.all, 17500)
# for GSEA, sort on FC
protList <- sort(protList, decreasing = TRUE)

kk2 <- gseKEGG(geneList = protList,
    organism     = "sly",
    keyType      = "uniprot",
    eps          = 0.0,
    minGSSize    = 10,
    maxGSSize    = 500,
    pAdjustMethod = "none",
    pvalueCutoff = 0.9,
    verbose      = FALSE)

# check

# again, object kk2 can be used to generate same plots as above!

# Lastly, compare multiple clusters using KEGG.
# First generate list with multiple clusters
clusterData <- list(
    "Cluster 1" = base::sample(proteins.all, 19500),
    "Cluster 2" = base::sample(proteins.all, 18900),
    "Cluster 3" = base::sample(proteins.all, 22000),
    "Cluster 4" = base::sample(proteins.all, 20500) )

# run compareCluster
xx <- compareCluster(clusterData, fun="enrichKEGG",
                     organism="sly",  keyType = "uniprot",
                     pAdjustMethod = "none", pvalueCutoff=1.0)

# plot
emapplot( pairwise_termsim(xx) )
Entering edit mode

@guido I am having trouble downloading the goa annotation file, I have tried wget ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/4144447 but it not downloading. I just get the html file. Could you help with this

Entering edit mode

Could you also help me with SARS-COV-2 goa ftp link? I tried to get it from https://www.ebi.ac.uk/GOA/index but still I can locate the exact file

Entering edit mode

That should be: http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/4378222.W_seafood_market_pneumonia_virus_1.goa.

Found through the GOA proteome website (https://www.ebi.ac.uk/GOA/proteomes [note: takes few second to load!]), and then searched for the taxonomy id of SARS-CoV-2, which is 2697049.

Entering edit mode

I managed to download the goa file now. I have the following question, I have differentially expressed proteins following DE analysis. I cannot see where to pass this information because ID.annotations information is from the goa database.

Entering edit mode

Have a look at the code after the text ## Functional analysis with clusterProfiler. You basically will need to define two inputs; which I called proteins.all and proteins.random.

proteins.all are all proteins that have a GO annotation (=background), and proteins.random are the proteins you selected/find interesting (e.g. after performing a statistical test) (=foreground). FYI: I selected these 'foreground' proteins randomly, hence its naming.

Entering edit mode

Hello. Is it possible to somehow adapt this for ChIPpeakAnno?

Entering edit mode

how can I use this code to filter my results on the ontology (MF, CC, BP)?

Entering edit mode

To only use a sub-category of the Gene Ontology, I would simply filter the input used for the function enricher, specifically the input used for the argument TERM2GENE.

I would obtain the GO sub-category annotation information (i.e. Biological Process (BP), Cellular Compartment (CC), Molecular Function (MF)) from the Bioconductor Gene Ontology annotation library GO.db (here).

Assuming you generated the 2 objects GO.annotation and GO.df (as well as as proteins.all and proteins.random) per my code in the 2nd post of this thread (GO enrichment analysis on Solanum lycopersicum proteomics dataset (UniProt IDs)), you could, for example, achieve this by running this step-by-step code:

> ## load the Gene Ontology database
> library(GO.db)
> ## extract all GO category IDs
> all.go.ids <- keys(GO.db)
> length(all.go.ids)
[1] 42888
> head(all.go.ids)
[1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000006" "GO:0000007"
[6] "GO:0000009"
> ## extract all information for all GO category IDs
> all.go.info <- AnnotationDbi::select(GO.db, keys=all.go.ids,
+          columns=c("DEFINITION","GOID","ONTOLOGY","TERM"), keytype="GOID")
'select()' returned 1:1 mapping between keys and columns
> ## keep only those categories belonging to Biological Process sub-category
> bp.go.ids <- all.go.info[all.go.info[,"ONTOLOGY"] == "BP", "GOID"]
> ## filter (tomato) GO.df, obtained in 2nd post of this tread, to only include the
> ## IDs of the Biological Process sub-category
> ## post = https://support.bioconductor.org/p/p134523/#p134554
> head(GO.df)
            GOID ID.index  UNIPROTKB
22973 GO:0000002     1430 A0A3Q7ENS0
22974 GO:0000002     7638 A0A3Q7G9F6
22975 GO:0000002    13917 A0A3Q7HWC2
22976 GO:0000002    18173 A0A3Q7J098
22977 GO:0000002    18572 A0A3Q7J3I2
22978 GO:0000002    19062 A0A3Q7J8C8
> GO.df.bp <- GO.df[GO.df[,"GOID"] %in% bp.go.ids, ]
> ## perform GO ORA using only the GO-BP IDs
> ## note that only input for argument TERM2GENE has changed,
> ## TERM2NAME remained the same
> res.GO.bp.ora <- enricher(
+     gene=proteins.random,
+     pvalueCutoff = 1,
+     pAdjustMethod = "none",
+     universe=proteins.all,
+     minGSSize = 10,
+     maxGSSize = 500,
+     qvalueCutoff = 1,
+     TERM2GENE = GO.df.bp[ ,c("GOID","UNIPROTKB")],
+     TERM2NAME = GO.annotation[ ,c("GOID", "term")]
+ )
> res.GO.bp.ora
# over-representation test
#...@organism    UNKNOWN 
#...@ontology    UNKNOWN 
#...@gene        chr [1:1750] "A0A3Q7H4H5" "A0A3Q7EA18" "A0A3Q7GAP0" "A0A3Q7EEW6" ...
#...pvalues adjusted by 'none' with cutoff <1 
#...1346 enriched terms found
'data.frame':   1346 obs. of  9 variables:
 $ ID         : chr  "GO:0120251" "GO:0046148" "GO:0042440" "GO:0043650" ...
 $ Description: chr  "hydrocarbon biosynthetic process" "pigment biosynthetic process" "pigment metabolic process" "dicarboxylic acid biosynthetic process" ...
 $ GeneRatio  : chr  "6/1233" "13/1233" "14/1233" "7/1233" ...
 $ BgRatio    : chr  "20/15948" "76/15948" "87/15948" "30/15948" ...
 $ pvalue     : num  0.00318 0.00519 0.00665 0.00681 0.00681 ...
 $ p.adjust   : num  0.00318 0.00519 0.00665 0.00681 0.00681 ...
 $ qvalue     : num  0.991 0.991 0.991 0.991 0.991 ...
 $ geneID     : chr  "P18485/P10967/Q43503/Q202I0/B2MWN0/A0A3Q7EQP6" "K4DEY3/Q43503/Q08307/A0A3Q7F8H5/A0A3Q7GXR4/A0A3Q7F3D7/A0A3Q7G876/Q202I0/A0A3Q7FY87/A0A3Q7GAE5/A0A3Q7FUB1/A0A3Q7IJM9/A0A3Q7HN24" "K4DEY3/Q94FW7/Q43503/Q08307/A0A3Q7F8H5/A0A3Q7GXR4/A0A3Q7F3D7/A0A3Q7G876/Q202I0/A0A3Q7FY87/A0A3Q7GAE5/A0A3Q7FUB1"| __truncated__ "A0A3Q7HNN9/Q42885/O65917/A0A3Q7FLA3/A0A3Q7HEK7/Q8RU74/Q42884" ...
 $ Count      : int  6 13 14 7 7 37 27 27 30 8 ...
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

> as.data.frame(res.GO.bp.ora)[1:3,]
                   ID                      Description GeneRatio  BgRatio
GO:0120251 GO:0120251 hydrocarbon biosynthetic process    6/1233 20/15948
GO:0046148 GO:0046148     pigment biosynthetic process   13/1233 76/15948
GO:0042440 GO:0042440        pigment metabolic process   14/1233 87/15948
                pvalue    p.adjust    qvalue
GO:0120251 0.003178799 0.003178799 0.9909389
GO:0046148 0.005192627 0.005192627 0.9909389
GO:0042440 0.006651856 0.006651856 0.9909389
GO:0120251                                                                                         P18485/P10967/Q43503/Q202I0/B2MWN0/A0A3Q7EQP6
GO:0046148        K4DEY3/Q43503/Q08307/A0A3Q7F8H5/A0A3Q7GXR4/A0A3Q7F3D7/A0A3Q7G876/Q202I0/A0A3Q7FY87/A0A3Q7GAE5/A0A3Q7FUB1/A0A3Q7IJM9/A0A3Q7HN24
GO:0042440 K4DEY3/Q94FW7/Q43503/Q08307/A0A3Q7F8H5/A0A3Q7GXR4/A0A3Q7F3D7/A0A3Q7G876/Q202I0/A0A3Q7FY87/A0A3Q7GAE5/A0A3Q7FUB1/A0A3Q7IJM9/A0A3Q7HN24
GO:0120251     6
GO:0046148    13
GO:0042440    14

Login before adding your answer.

Traffic: 308 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6