Hi there! I would like to perform KEGG enrichment with some differentially expressed gene data from RNAseq data. I am working on a non-model organism.
I have 1) KEGG to GeneName Mapping
head(expr5_FS_final) KEGG unigene_FS 1 K02727 FS_gene_1 2 K17277 FS_gene_3 3 K17307 FS_gene_10 4 K14453 FS_gene_11 5 K14700 FS_gene_11 6 K14701 FS_gene_11
2) A list of genes of interest (DEGs)
head(DEG) DEG 1 FS_gene_1 2 FS_gene_38 3 FS_gene_101
I used the the list of DEG to grep KEGG to GeneName Mapping file, then got a KEGG to GeneName Mapping file just for DEGs.
head(DEG_KEGG) KEGG DEG 1 K02727 FS_gene_1 2 K17277 FS_gene_1 3 K17307 FS_gene_38 ...
Then I use
enr_results <- enrichKEGG(DEG_KEGG$KEGG, organism='ko', minGSSize = 1, pvalueCutoff = 0.05, qvalueCutoff = 0.05) to do KEGG enrichment.
I also tried
enr_results2 <- enricher(DEG$DEG , TERM2GENE=expr5_FS_final, pvalueCutoff = 0.05, pAdjustMethod = "BH", qvalueCutoff = 0.05, minGSSize = 5)
Which one is correct? Could some one help me have a look?
Thanks a lot!
In essence both functions should should give identical results, provided the input is also the same.
enricheris a universal function for gene set overrepresentation analysis (ORA), and requires all arguments to be provided by the user, including
TERM2GENElinks genes to gene sets, and these sets can be defined by Gene Ontology, KEGG, or whatever source.
enrichKEGGcan be considered as a convenience function in the sense that it has been 'configured' to perform an ORA using gene sets defined by KEGG. In other words, there is no need for a user to provide input for the
TERM2NAME) arguments, since these will be automagically generated.
So, if the inputs are identical, the the results should be the same. Please note that the values that you used for
qvalueCutoffdiffer between your 2 function calls.
Thank you so much! I am confused because that I think gene set KEGG overrepresentation analysis (ORA) should take 2 arguments: 1) The KEGG universe, in my case, the KEGG annotation of the whole transcriptome (KEGG to gene mappings) 2) A set of genes of interests, in my case, up-regulated gene list from DE comparison
enrichKEGGseems only takes the KEGG to gene mappings of the DEGs. How does it perform ORA without the "The KEGG universe (file 1)". Does it compare to the whole KEGG database?
By default, the argument
universeis indeed set to
If you check the source code of the functions, it becomes clear that in case
enrichKEGGis run with
universe = NULL, as universe (background) all genes are used that have been annotated to a gene set by the KEGG curators. For example, for human KEGG has annotated 6451 genes to gene sets (pathways). Also, genes that are in your input, but are not present in a gene set, are filtered out.
If you provide your own gene sets using the function
enricher(through the argument
universe=NULL, then all genes that are present in the data frame
TERM2GENEare used as universe (background genes).
Lastly, if you do provide a list of background genes (i.e.
universe = names(geneList)), then the gene sets are filtered so that only genes that are present in the list of background genes are kept within each gene set. ORA will then be performed on the filtered gene sets.
See below for some code to illustrate these points.
This is so helpful! Thank you so much!
Last question, how do you determine those 3 parameters
minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2?
I know qvalue is kind of similar to adjusted p value, would 0.2 be too high? What does
maxGSSizemean? and are there any "gold standards" for it?