GSEA only returning 1 GO term - unable to map Ensembl IDs with decimals?
1
1
Entering edit mode
@0010df01
Last seen 2.3 years ago
United States

I'm trying to use clusterProfiler to perform gene set enrichment analysis. My differentially-expressed gene dataset has Ensembl IDs in addition to log2foldChange values.

First I ran the following chunk of code, and returned a "no term enriched under specific pvalueCutoff" error:

> gse_hfd <- gseGO(geneList = hfd_gene_list, ont = "ALL", keyType = "ENSEMBL", minGSSize = 3, verbose = TRUE, OrgDb = org.Mm.eg.db, pAdjustMethod = "none")
preparing geneSet collections...
GSEA analysis...
no term enriched under specific pvalueCutoff...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (16.93% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

I re-ran the code with the p-value cutoff set to 1 and only returned 3 GO terms, each associated to one of the sub-ontologies:

> gse_hfd <- gseGO(geneList = hfd_gene_list, ont = "ALL", keyType = "ENSEMBL", minGSSize = 1, pvalueCutoff = 1.00, verbose = TRUE, OrgDb = org.Mm.eg.db, pAdjustMethod = "none")
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (16.93% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
> print(gse_hfd)
#
# Gene Set Enrichment Analysis
#
#...@organism    Mus musculus 
#...@setType     GOALL 
#...@keytype     ENSEMBL 
#...@geneList    Named num [1:34284] 10.05 5.79 5.06 4.84 4.63 ...
 - attr(*, "names")= chr [1:34284] "ENSMUSG00000110631.1" "ENSMUSG00000117907.1" "ENSMUSG00000021803.10" "ENSMUSG00000118251.1" ...
#...nPerm    
#...pvalues adjusted by 'none' with cutoff <1 
#...3 enriched terms found
'data.frame':   3 obs. of  12 variables:
 $ ONTOLOGY       : chr  "MF" "CC" "BP"
 $ ID             : chr  "GO:0003674" "GO:0005575" "GO:0008150"
 $ Description    : chr  "molecular_function" "cellular_component" "biological_process"
 $ setSize        : int  1 1 1
 $ enrichmentScore: num  -0.719 -0.719 -0.719
 $ NES            : num  -0.962 -0.962 -0.962
 $ pvalue         : num  0.551 0.551 0.551
 $ p.adjust       : num  0.551 0.551 0.551
 $ qvalues        : num  0.551 0.551 0.551
 $ rank           : num  9632 9632 9632
 $ leading_edge   : chr  "tags=100%, list=28%, signal=72%" "tags=100%, list=28%, signal=72%" "tags=100%, list=28%, signal=72%"
 $ core_enrichment: chr  " " " " " "

When I tried to plot my results, I got a term similarity error, so I did the following:

> emapplot(gse_hfd)
Error in has_pairsim(x) : 
  Term similarity matrix not available. Please use pairwise_termsim function to deal with the results of enrichment analysis.
> pairwise_termsim(gse_hfd, method = "JC", semData = NULL, showCategory = 300)
#
# Gene Set Enrichment Analysis
#
#...@organism    Mus musculus 
#...@setType     GOALL 
#...@keytype     ENSEMBL 
#...@geneList    Named num [1:34284] 10.05 5.79 5.06 4.84 4.63 ...
 - attr(*, "names")= chr [1:34284] "ENSMUSG00000110631.1" "ENSMUSG00000117907.1" "ENSMUSG00000021803.10" "ENSMUSG00000118251.1" ...
#...nPerm    
#...pvalues adjusted by 'none' with cutoff <1 
#...3 enriched terms found
'data.frame':   3 obs. of  12 variables:
 $ ONTOLOGY       : chr  "MF" "CC" "BP"
 $ ID             : chr  "GO:0003674" "GO:0005575" "GO:0008150"
 $ Description    : chr  "molecular_function" "cellular_component" "biological_process"
 $ setSize        : int  1 1 1
 $ enrichmentScore: num  -0.719 -0.719 -0.719
 $ NES            : num  -0.962 -0.962 -0.962
 $ pvalue         : num  0.551 0.551 0.551
 $ p.adjust       : num  0.551 0.551 0.551
 $ qvalues        : num  0.551 0.551 0.551
 $ rank           : num  9632 9632 9632
 $ leading_edge   : chr  "tags=100%, list=28%, signal=72%" "tags=100%, list=28%, signal=72%" "tags=100%, list=28%, signal=72%"
 $ core_enrichment: chr  " " " " " "

I tried again with only the Biological Processes sub-ontology instead of using ALL as I had been doing earlier, and gain only returned the GO term associated with the sub-ontology:

> gse_hfd_bp <- gseGO(geneList = hfd_gene_list, ont = "BP", keyType = "ENSEMBL", minGSSize = 1, pvalueCutoff = 1.00, verbose = TRUE, OrgDb = org.Mm.eg.db, pAdjustMethod = "none")
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (16.93% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
> print(gse_hfd_bp)
#
# Gene Set Enrichment Analysis
#
#...@organism    Mus musculus 
#...@setType     BP 
#...@keytype     ENSEMBL 
#...@geneList    Named num [1:34284] 10.05 5.79 5.06 4.84 4.63 ...
 - attr(*, "names")= chr [1:34284] "ENSMUSG00000110631.1" "ENSMUSG00000117907.1" "ENSMUSG00000021803.10" "ENSMUSG00000118251.1" ...
#...nPerm    
#...pvalues adjusted by 'none' with cutoff <1 
#...1 enriched terms found
'data.frame':   1 obs. of  11 variables:
 $ ID             : chr "GO:0008150"
 $ Description    : chr "biological_process"
 $ setSize        : int 1
 $ enrichmentScore: num -0.719
 $ NES            : num -0.957
 $ pvalue         : num 0.538
 $ p.adjust       : num 0.538
 $ qvalues        : logi NA
 $ rank           : num 9632
 $ leading_edge   : chr "tags=100%, list=28%, signal=72%"
 $ core_enrichment: chr " "

Now, when I began manually inputting my data into WebGestalt's online API, none of my genes were mapped until I started deleting the decimals from my Ensembl IDs (ex. ENSMUSG00000110631.1 -> ENSMUSG00000110631).

Only then did I start returning more "real" GO terms from my data. I'd really prefer to leave them untouched, but is my Ensembl ID formatting probably to blame for my inability to return more results?

AnnotationDbi GO clusterProfiler enrichplot ensembldb • 3.2k views
ADD COMMENT
1
Entering edit mode

You can get rid of gene number version :

library(stringr)
names(hfd_gene_list) <- str_replace( names(hfd_gene_list),
                        pattern = ".[0-9]+$",
                        replacement = "")

Then just perform your GSEA the same way.

ADD REPLY
0
Entering edit mode

A strict "." token match would be safer, i.e., "\\.[0-9]+$".

ADD REPLY
0
Entering edit mode
merv ▴ 140
@mmfansler-13248
Last seen 3 months ago
MSKCC | New York, NY

Yes, you need to strip the version numbers. Similar to previous comment, I do this with

library(stringr)
library(magrittr)

## strip Ensembl version numbers (if any)
names(hfd_gene_list) %<>% str_remove("\\.[0-9]+$")

Note the regex here strictly matches the "." token, whereas the commented one is unspecific. This makes the command idempotent and therefore safe to run even if there aren't any version numbers.

ADD COMMENT

Login before adding your answer.

Traffic: 717 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6