Question

clusterProfiler for KEGG enrichment (non-model species) Over-Representation Analysis

0

Entering edit mode

ruiqi.li ▴ 10

@5d6dd1ca

Last seen 13 months ago

United States

Hi there! I would like to perform KEGG enrichment with some differentially expressed gene data from RNAseq data. I am working on a non-model organism.

I have 1) KEGG to GeneName Mapping

head(expr5_FS_final)
    KEGG unigene_FS
1 K02727  FS_gene_1
2 K17277  FS_gene_3
3 K17307 FS_gene_10
4 K14453 FS_gene_11
5 K14700 FS_gene_11
6 K14701 FS_gene_11

2) A list of genes of interest (DEGs)

head(DEG)
  DEG
1 FS_gene_1
2 FS_gene_38
3 FS_gene_101

I used the the list of DEG to grep KEGG to GeneName Mapping file, then got a KEGG to GeneName Mapping file just for DEGs.

head(DEG_KEGG)
  KEGG DEG
1 K02727  FS_gene_1
2 K17277  FS_gene_1
3 K17307 FS_gene_38
...

Then I use enr_results <- enrichKEGG(DEG_KEGG$KEGG, organism='ko', minGSSize = 1, pvalueCutoff = 0.05, qvalueCutoff = 0.05) to do KEGG enrichment.

I also tried enr_results2 <- enricher(DEG$DEG , TERM2GENE=expr5_FS_final, pvalueCutoff = 0.05, pAdjustMethod = "BH", qvalueCutoff = 0.05, minGSSize = 5)

Which one is correct? Could some one help me have a look?

Thanks a lot!

KEGG clusterProfiler • 1.9k views

ADD COMMENT • link updated 3 months ago by Guido Hooiveld ★ 3.9k • written 13 months ago by ruiqi.li ▴ 10

1

Entering edit mode

In essence both functions should should give identical results, provided the input is also the same.

enricher is a universal function for gene set overrepresentation analysis (ORA), and requires all arguments to be provided by the user, including TERM2GENE. TERM2GENE links genes to gene sets, and these sets can be defined by Gene Ontology, KEGG, or whatever source.

enrichKEGG can be considered as a convenience function in the sense that it has been 'configured' to perform an ORA using gene sets defined by KEGG. In other words, there is no need for a user to provide input for the TERM2GENE (and TERM2NAME) arguments, since these will be automagically generated.

So, if the inputs are identical, the the results should be the same. Please note that the values that you used for minGSSize, pAdjustMethod and qvalueCutoff differ between your 2 function calls.

ADD REPLY • link 13 months ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

Thank you so much! I am confused because that I think gene set KEGG overrepresentation analysis (ORA) should take 2 arguments: 1) The KEGG universe, in my case, the KEGG annotation of the whole transcriptome (KEGG to gene mappings) 2) A set of genes of interests, in my case, up-regulated gene list from DE comparison

However, enrichKEGG seems only takes the KEGG to gene mappings of the DEGs. How does it perform ORA without the "The KEGG universe (file 1)". Does it compare to the whole KEGG database?

ADD REPLY • link 13 months ago ruiqi.li ▴ 10

0

Entering edit mode

By default, the argument universe is indeed set to NULL.

If you check the source code of the functions, it becomes clear that in case enrichKEGG is run with universe = NULL, as universe (background) all genes are used that have been annotated to a gene set by the KEGG curators. For example, for human KEGG has annotated 6451 genes to gene sets (pathways). Also, genes that are in your input, but are not present in a gene set, are filtered out.

If you provide your own gene sets using the function enricher (through the argument TERM2GENE), with universe=NULL, then all genes that are present in the data frame TERM2GENE are used as universe (background genes).

Lastly, if you do provide a list of background genes (i.e. universe = names(geneList)), then the gene sets are filtered so that only genes that are present in the list of background genes are kept within each gene set. ORA will then be performed on the filtered gene sets.

See below for some code to illustrate these points.

> library(DOSE)
> library(clusterProfiler)
> 
> ## download all human KEGG gene sets
> kegg.data <- download_KEGG(species="hsa", keggType = "KEGG", keyType = "kegg")
Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway"...
Warning messages:
1: In utils::download.file(url, quiet = TRUE, method = method, ...) :
  the 'wininet' method is deprecated for http:// and https:// URLs
> kegg.gene.sets <- kegg.data$KEGGPATHID2EXTID; colnames(kegg.gene.sets) <- c("PathwayID","GeneID")
> kegg.set.names <- kegg.data$KEGGPATHID2NAME; colnames(kegg.set.names) <- c("PathwayID","Description")
> 
> ## check
> head(kegg.gene.sets)
     PathwayID GeneID
5110  hsa01521  10000
5111  hsa01521  10018
5112  hsa01521   1950
5113  hsa01521   1956
5114  hsa01521   1977
5115  hsa01521   1978
> 
> ## How many *genes* are annotated in KEGG
> length(unique(kegg.gene.sets$GeneID)) ##6451
[1] 6451
> 
> ## How many gene *sets* are annotated in KEGG
> length(unique(kegg.gene.sets$PathwayID)) ##229
[1] 229
> 
> ## perform ORA using gseKEGG ##
> 
> ## Load some example data; this is a list of 12495 genes,
> ## of which some are annotated to a gene set, and others not.
> data(geneList, package="DOSE")
> length(geneList)
[1] 12495
> 
> ## Select the 1st 150 genes as being an example set of upregulated genes
> upreg.genes <- names(geneList)[1:150]
> 
> ## Check which gene sets are overrepresented in this list of 150 genes
> 
> ## First use convenience function gseKEGG(). Note that all genes annotated by KEGG (= 6451)
> ## are automatically used as background
> res1 <- enrichKEGG(gene=upreg.genes, organism = "hsa", keyType = "ncbi-geneid",
+     pvalueCutoff = 0.05, pAdjustMethod = "BH", universe=NULL, minGSSize = 10,
+     maxGSSize = 500, qvalueCutoff = 0.2)
Reading KEGG annotation online: "https://rest.kegg.jp/conv/ncbi-geneid/hsa"...
Warning message:
In utils::download.file(url, quiet = TRUE, method = method, ...) :
  the 'wininet' method is deprecated for http:// and https:// URLs
> 
> as.data.frame(res1)[1:3, ]
               ID                                                   Description
hsa04110 hsa04110                                                    Cell cycle
hsa04657 hsa04657                                       IL-17 signaling pathway
hsa04061 hsa04061 Viral protein interaction with cytokine and cytokine receptor
         GeneRatio  BgRatio       pvalue     p.adjust       qvalue
hsa04110     16/64 127/6451 4.302610e-14 5.464315e-12 5.163132e-12
hsa04657      9/64  94/6451 2.863277e-07 1.818181e-05 1.717966e-05
hsa04061      8/64 100/6451 5.441530e-06 2.303581e-04 2.176612e-04
                                                                           geneID
hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232/4171/993/990/5347/701
hsa04657                             4312/6280/6279/6278/3627/2921/6364/8061/4318
hsa04061                                 3627/10563/6373/4283/6362/6355/2921/6364
         Count
hsa04110    16
hsa04657     9
hsa04061     8
> 
> ## Next provide your own mappings: kegg.gene.sets.
> ## in this case these are identical to the KEGG content, so the results(stats, number of genes, etc) are also identical!
> 
> res2 <- enricher(gene=upreg.genes, pvalueCutoff = 0.05, pAdjustMethod = "BH",
+   universe = NULL, minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2,
+   TERM2GENE = kegg.gene.sets[, c("PathwayID","GeneID") ], #proper order columns
+   TERM2NAME = kegg.set.names[, c("PathwayID","Description") ], 
+   )
> 
> as.data.frame(res2)[1:3, ]
               ID                                                   Description
hsa04110 hsa04110                                                    Cell cycle
hsa04657 hsa04657                                       IL-17 signaling pathway
hsa04061 hsa04061 Viral protein interaction with cytokine and cytokine receptor
         GeneRatio  BgRatio       pvalue     p.adjust       qvalue
hsa04110     16/64 127/6451 4.302610e-14 5.464315e-12 5.163132e-12
hsa04657      9/64  94/6451 2.863277e-07 1.818181e-05 1.717966e-05
hsa04061      8/64 100/6451 5.441530e-06 2.303581e-04 2.176612e-04
                                                                           geneID
hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232/4171/993/990/5347/701
hsa04657                             4312/6280/6279/6278/3627/2921/6364/8061/4318
hsa04061                                 3627/10563/6373/4283/6362/6355/2921/6364
         Count
hsa04110    16
hsa04657     9
hsa04061     8
> 
> ## Another example of providing your own mappings, but now only
> ## 65 of the 229 KEGG gene sets are used.
> ## Note that the number of background genes also changed, because for the ORA analysis
> ## only the genes are included that are represented in the gene sets.
> ## In other words, if a gene is not annotated to a gene set, it is discarded.
> ## This is reflected by 54 genes being annotated to e.g. gene set hsa04110 (res3, below), and not 64 anymore (res2, above) 
> my.gene.sets <- unique(kegg.gene.sets$PathwayID)[1:65]
> my.gene.sets <- kegg.gene.sets[kegg.gene.sets$PathwayID %in% my.gene.sets,]
> 
> ## How many *genes* are annotated in this subset of 65 KEGG gene sets?
> length(unique(my.gene.sets$GeneID)) ##3695
[1] 3695
> 
> res3 <- enricher(gene=upreg.genes, pvalueCutoff = 0.05, pAdjustMethod = "BH",
+   universe = NULL, minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2,
+   TERM2GENE = my.gene.sets[, c("PathwayID","GeneID") ], #proper order columns
+   TERM2NAME = kegg.set.names[, c("PathwayID","Description") ], 
+   )
> 
> as.data.frame(res3)[1:3,]
               ID                                                   Description
hsa04110 hsa04110                                                    Cell cycle
hsa04061 hsa04061 Viral protein interaction with cytokine and cytokine receptor
hsa04218 hsa04218                                           Cellular senescence
         GeneRatio  BgRatio       pvalue     p.adjust       qvalue
hsa04110     16/54 127/3695 1.029010e-11 4.116038e-10 3.682771e-10
hsa04061      8/54 100/3695 8.079150e-05 1.615830e-03 1.445743e-03
hsa04218      9/54 156/3695 3.514153e-04 4.685537e-03 4.192322e-03
                                                                           geneID
hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232/4171/993/990/5347/701
hsa04061                                 3627/10563/6373/4283/6362/6355/2921/6364
hsa04218                                2305/4605/9133/890/983/51806/1111/891/993
         Count
hsa04110    16
hsa04061     8
hsa04218     9
> 
> 
> ## Lastly; if you provide you own list of background genes, these are used.
> ## Genes in gene sets that are not in the list of background genes are filtered out.
> ## Note that the background is reduced to 3068 genes (res4) (from 3695; res3)
> 
> res4 <- enricher(gene=upreg.genes, pvalueCutoff = 0.05, pAdjustMethod = "BH",
+   universe = names(geneList), minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2,
+   TERM2GENE = my.gene.sets[, c("PathwayID","GeneID") ], #proper order columns
+   TERM2NAME = kegg.set.names[, c("PathwayID","Description") ], 
+   )
> 
>  > as.data.frame(res4)[1:3,]


            ID                                                   Description
hsa04110 hsa04110                                                    Cell cycle
hsa04061 hsa04061 Viral protein interaction with cytokine and cytokine receptor
hsa04218 hsa04218                                           Cellular senescence
         GeneRatio  BgRatio       pvalue     p.adjust       qvalue
hsa04110     16/54 116/3068 3.919944e-11 1.567978e-09 1.402927e-09
hsa04061      8/54  86/3068 9.976321e-05 1.995264e-03 1.785236e-03
hsa04218      9/54 141/3068 6.420510e-04 8.560680e-03 7.659556e-03
                                                                           geneID
hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232/4171/993/990/5347/701
hsa04061                                 3627/10563/6373/4283/6362/6355/2921/6364
hsa04218                                2305/4605/9133/890/983/51806/1111/891/993
         Count
hsa04110    16
hsa04061     8
hsa04218     9
>

ADD REPLY • link 13 months ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

This is so helpful! Thank you so much!

Last question, how do you determine those 3 parameters minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2?

I know qvalue is kind of similar to adjusted p value, would 0.2 be too high? What does minGSSize and maxGSSize mean? and are there any "gold standards" for it?

ADD REPLY • link 13 months ago ruiqi.li ▴ 10

0

Entering edit mode

I haven't been able to make a new post on the forum for some reason, but I had a question relating to the custome annotations in enricher. In TERM2GENE, I pass the genes targeted by a TF in column 2, and column 1 is the TF, which is the same for all rows. Input is a list of genes. I wanted to check if the overlap between these two lists is significant.

However, I get a NULL result - No gene sets have size between 0 and 50 .. > return NULL - for all TF target genes and input genes combination.

Could anyone point me in the right direction? Thanks.

ADD REPLY • link 3 months ago Mansi • 0

1

Entering edit mode

It is not appreciated to post the same question multiple times!

Also: what is not clear on the message No gene sets have size between 0 and 50?

ADD REPLY • link 3 months ago Guido Hooiveld ★ 3.9k