Question

GSEA only returning 1 GO term - unable to map Ensembl IDs with decimals?

1

Entering edit mode

simone.nevills ▴ 10

@0010df01

Last seen 20 months ago

United States

I'm trying to use clusterProfiler to perform gene set enrichment analysis. My differentially-expressed gene dataset has Ensembl IDs in addition to log2foldChange values.

First I ran the following chunk of code, and returned a "no term enriched under specific pvalueCutoff" error:

> gse_hfd <- gseGO(geneList = hfd_gene_list, ont = "ALL", keyType = "ENSEMBL", minGSSize = 3, verbose = TRUE, OrgDb = org.Mm.eg.db, pAdjustMethod = "none")
preparing geneSet collections...
GSEA analysis...
no term enriched under specific pvalueCutoff...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (16.93% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

I re-ran the code with the p-value cutoff set to 1 and only returned 3 GO terms, each associated to one of the sub-ontologies:

> gse_hfd <- gseGO(geneList = hfd_gene_list, ont = "ALL", keyType = "ENSEMBL", minGSSize = 1, pvalueCutoff = 1.00, verbose = TRUE, OrgDb = org.Mm.eg.db, pAdjustMethod = "none")
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (16.93% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
> print(gse_hfd)
#
# Gene Set Enrichment Analysis
#
#...@organism    Mus musculus 
#...@setType     GOALL 
#...@keytype     ENSEMBL 
#...@geneList    Named num [1:34284] 10.05 5.79 5.06 4.84 4.63 ...
 - attr(*, "names")= chr [1:34284] "ENSMUSG00000110631.1" "ENSMUSG00000117907.1" "ENSMUSG00000021803.10" "ENSMUSG00000118251.1" ...
#...nPerm    
#...pvalues adjusted by 'none' with cutoff <1 
#...3 enriched terms found
'data.frame':   3 obs. of  12 variables:
 $ ONTOLOGY       : chr  "MF" "CC" "BP"
 $ ID             : chr  "GO:0003674" "GO:0005575" "GO:0008150"
 $ Description    : chr  "molecular_function" "cellular_component" "biological_process"
 $ setSize        : int  1 1 1
 $ enrichmentScore: num  -0.719 -0.719 -0.719
 $ NES            : num  -0.962 -0.962 -0.962
 $ pvalue         : num  0.551 0.551 0.551
 $ p.adjust       : num  0.551 0.551 0.551
 $ qvalues        : num  0.551 0.551 0.551
 $ rank           : num  9632 9632 9632
 $ leading_edge   : chr  "tags=100%, list=28%, signal=72%" "tags=100%, list=28%, signal=72%" "tags=100%, list=28%, signal=72%"
 $ core_enrichment: chr  " " " " " "

When I tried to plot my results, I got a term similarity error, so I did the following:

> emapplot(gse_hfd)
Error in has_pairsim(x) : 
  Term similarity matrix not available. Please use pairwise_termsim function to deal with the results of enrichment analysis.
> pairwise_termsim(gse_hfd, method = "JC", semData = NULL, showCategory = 300)
#
# Gene Set Enrichment Analysis
#
#...@organism    Mus musculus 
#...@setType     GOALL 
#...@keytype     ENSEMBL 
#...@geneList    Named num [1:34284] 10.05 5.79 5.06 4.84 4.63 ...
 - attr(*, "names")= chr [1:34284] "ENSMUSG00000110631.1" "ENSMUSG00000117907.1" "ENSMUSG00000021803.10" "ENSMUSG00000118251.1" ...
#...nPerm    
#...pvalues adjusted by 'none' with cutoff <1 
#...3 enriched terms found
'data.frame':   3 obs. of  12 variables:
 $ ONTOLOGY       : chr  "MF" "CC" "BP"
 $ ID             : chr  "GO:0003674" "GO:0005575" "GO:0008150"
 $ Description    : chr  "molecular_function" "cellular_component" "biological_process"
 $ setSize        : int  1 1 1
 $ enrichmentScore: num  -0.719 -0.719 -0.719
 $ NES            : num  -0.962 -0.962 -0.962
 $ pvalue         : num  0.551 0.551 0.551
 $ p.adjust       : num  0.551 0.551 0.551
 $ qvalues        : num  0.551 0.551 0.551
 $ rank           : num  9632 9632 9632
 $ leading_edge   : chr  "tags=100%, list=28%, signal=72%" "tags=100%, list=28%, signal=72%" "tags=100%, list=28%, signal=72%"
 $ core_enrichment: chr  " " " " " "

I tried again with only the Biological Processes sub-ontology instead of using ALL as I had been doing earlier, and gain only returned the GO term associated with the sub-ontology:

> gse_hfd_bp <- gseGO(geneList = hfd_gene_list, ont = "BP", keyType = "ENSEMBL", minGSSize = 1, pvalueCutoff = 1.00, verbose = TRUE, OrgDb = org.Mm.eg.db, pAdjustMethod = "none")
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (16.93% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
> print(gse_hfd_bp)
#
# Gene Set Enrichment Analysis
#
#...@organism    Mus musculus 
#...@setType     BP 
#...@keytype     ENSEMBL 
#...@geneList    Named num [1:34284] 10.05 5.79 5.06 4.84 4.63 ...
 - attr(*, "names")= chr [1:34284] "ENSMUSG00000110631.1" "ENSMUSG00000117907.1" "ENSMUSG00000021803.10" "ENSMUSG00000118251.1" ...
#...nPerm    
#...pvalues adjusted by 'none' with cutoff <1 
#...1 enriched terms found
'data.frame':   1 obs. of  11 variables:
 $ ID             : chr "GO:0008150"
 $ Description    : chr "biological_process"
 $ setSize        : int 1
 $ enrichmentScore: num -0.719
 $ NES            : num -0.957
 $ pvalue         : num 0.538
 $ p.adjust       : num 0.538
 $ qvalues        : logi NA
 $ rank           : num 9632
 $ leading_edge   : chr "tags=100%, list=28%, signal=72%"
 $ core_enrichment: chr " "

Now, when I began manually inputting my data into WebGestalt's online API, none of my genes were mapped until I started deleting the decimals from my Ensembl IDs (ex. ENSMUSG00000110631.1 -> ENSMUSG00000110631).

Only then did I start returning more "real" GO terms from my data. I'd really prefer to leave them untouched, but is my Ensembl ID formatting probably to blame for my inability to return more results?

AnnotationDbi GO clusterProfiler enrichplot ensembldb • 2.3k views

ADD COMMENT • link updated 19 months ago by merv ▴ 120 • written 20 months ago by simone.nevills ▴ 10

1

Entering edit mode

You can get rid of gene number version :

library(stringr)
names(hfd_gene_list) <- str_replace( names(hfd_gene_list),
                        pattern = ".[0-9]+$",
                        replacement = "")

Then just perform your GSEA the same way.

ADD REPLY • link 20 months ago Basti ▴ 780

0

Entering edit mode

A strict "." token match would be safer, i.e., "\\.[0-9]+$".

ADD REPLY • link 19 months ago merv ▴ 120

score 0 · Answer 1 · 2022-09-27

Yes, you need to strip the version numbers. Similar to previous comment, I do this with

library(stringr)
library(magrittr)

## strip Ensembl version numbers (if any)
names(hfd_gene_list) %<>% str_remove("\\.[0-9]+$")

Note the regex here strictly matches the "." token, whereas the commented one is unspecific. This makes the command idempotent and therefore safe to run even if there aren't any version numbers.