I am using goseq for category testing of differentially expressed genes in my RNA-seq data sets and I wanted to extend it to categories other than the GO categories. Specifically, I wanted to make use of the molecular signatures database (MSigDB). To be clear, I am not trying to do a ranking approach like GSEA. I would simply like to take advantage of the extensive list of categories in MSigDB and perform category testing on my differentially expressed genes. I obtained MSigDB lists in R format from: http://bioinf.wehi.edu.au/software/MSigDB/index.html.
This provides lists of categories which in turn contain vectors of Entrez IDs in each category.
> load("~/MSigDB/human_H_v5p2.rdata") #loaded filename is Hs.H.
> head(Hs.H)
$HALLMARK_TNFA_SIGNALING_VIA_NFKB
[1] "3726" "2920" "467" "4792" "7128" "5743" "2919" "8870" "9308" "6364"
[11] "2921" "23764" "4791" "7127" "1839" "1316" "330" "5329" "7538" "3383"
I processed the MSigDB lists into ensembl ID-mapped format instead of entrez ID (RNA-seq data has Ensembl IDs for feature set) as described in the goseq manual in section 6.6 (KEGG pathway analysis):
library(goseq)
library(org.Hs.eg.db)
Ensembl2Entrez <- as.list(org.Hs.egENSEMBL2EG)
mapEnsembl2Cat <- function(id, mapkeys){unique(unlist(mapkeys[id], use.names = FALSE))}
Hs.H <- lapply(Ensembl2Entrez, mapEnsembl2Cat, Hs.H)
Unfortunately this produces pairlists of length 0 (Type: NULL) for each ensembl ID:
> head(Hs.H)
$ENSG00000121410
NULL
$ENSG00000175899
NULL
$ENSG00000256069
NULL
$ENSG00000171428
NULL
$ENSG00000156006
NULL
$ENSG00000196136
NULL
It works just fine when I use the eg2kegg file in the manual example to produce the KEGG file with ensembl IDs:
> Ensembl2KEGG <- as.list(org.Hs.egPATH)
> KEGG <- lapply(Ensembl2Entrez, mapEnsembl2Cat, Ensembl2KEGG)
> head(KEGG)
$ENSG00000121410
[1] NA
$ENSG00000175899
[1] "04610"
$ENSG00000256069
[1] NA
$ENSG00000171428
[1] "00232" "00983" "01100"
$ENSG00000156006
[1] "00232" "00983" "01100"
$ENSG00000196136
[1] NA
And I am able to pass this KEGG file to the goseq() function and perform the KEGG analysis just fine. It is unclear to me what is wrong with the formatting of the MSigDB file as its structure looks the same as the Ensembl2KEGG file. It could also be the mapping function and something quite simple I am missing there but since the Hs.H file structure is similar to the Ensembl2KEGG file structure I didn't see any reason why it would need to be modified.
Any suggestions are appreciated. An alternative source of R-formatted MSigDB files that have a similar format to the Ensembl2KEGG file obtained from org.Hs.egPATH would be equally helpful.
Thank you.
Thank you, this was very helpful. As your example demonstrates, I did not have to convert the ontology/ category lists to ensembl IDs. It was sufficient to just convert the differentially expressed gene set to entrez IDs.
As far as the inconsistency between entrez IDs and ensembl IDs, I had noticed the same thing. In fact, when I attempt to map all genes with an ensembl ID (~ 60,000 genes) to entrez IDs, over 50% do not have an entrez ID, which seems a very large level of disagreement between annotations/ feature sets considering we are talking about the same genome. This suggests that entrez is leaving out whole categories of annotations, e.g. pseudogenes perhaps.
It's not as simple as all that. If we use an EnsDb as the authoritative source of Ensembl IDs and an OrgDb as the authoritative source of Gene IDs (here I got the EnsDb from AnnotationHub. I exclude that code for brevity):
There are any number of reasons why EBI/EMBL might say a particular Ensembl ID has no match in NCBI's Gene IDs, and vice versa. But this is orthogonal to what you are trying to accomplish! You have a bunch of genes that appear to be differentially expressed, and there is no benefit to arbitrarily removing a large portion of those genes because of some technicalities that probably have more to do with the location of the gene in the genome than the underlying biological function.