Question

goseq: Formatting category list for use with goseq

0

Entering edit mode

Mthabisi Moyo ▴ 40

@mthabisi-moyo-9721

Last seen 11 days ago

United States

I am using goseq for category testing of differentially expressed genes in my RNA-seq data sets and I wanted to extend it to categories other than the GO categories. Specifically, I wanted to make use of the molecular signatures database (MSigDB). To be clear, I am not trying to do a ranking approach like GSEA. I would simply like to take advantage of the extensive list of categories in MSigDB and perform category testing on my differentially expressed genes. I obtained MSigDB lists in R format from: http://bioinf.wehi.edu.au/software/MSigDB/index.html.
This provides lists of categories which in turn contain vectors of Entrez IDs in each category.

> load("~/MSigDB/human_H_v5p2.rdata") #loaded filename is Hs.H. > head(Hs.H) $HALLMARK_TNFA_SIGNALING_VIA_NFKB [1] "3726" "2920" "467" "4792" "7128" "5743" "2919" "8870" "9308" "6364" [11] "2921" "23764" "4791" "7127" "1839" "1316" "330" "5329" "7538" "3383"

I processed the MSigDB lists into ensembl ID-mapped format instead of entrez ID (RNA-seq data has Ensembl IDs for feature set) as described in the goseq manual in section 6.6 (KEGG pathway analysis):

library(goseq) library(org.Hs.eg.db) Ensembl2Entrez <- as.list(org.Hs.egENSEMBL2EG) mapEnsembl2Cat <- function(id, mapkeys){unique(unlist(mapkeys[id], use.names = FALSE))} Hs.H <- lapply(Ensembl2Entrez, mapEnsembl2Cat, Hs.H)

Unfortunately this produces pairlists of length 0 (Type: NULL) for each ensembl ID:

> head(Hs.H)

$ENSG00000121410 NULL

$ENSG00000175899 NULL

$ENSG00000256069 NULL

$ENSG00000171428 NULL

$ENSG00000156006 NULL

$ENSG00000196136 NULL

It works just fine when I use the eg2kegg file in the manual example to produce the KEGG file with ensembl IDs:

> Ensembl2KEGG <- as.list(org.Hs.egPATH) > KEGG <- lapply(Ensembl2Entrez, mapEnsembl2Cat, Ensembl2KEGG) > head(KEGG)

$ENSG00000121410 [1] NA

$ENSG00000175899 [1] "04610"

$ENSG00000256069 [1] NA

$ENSG00000171428 [1] "00232" "00983" "01100"

$ENSG00000156006 [1] "00232" "00983" "01100"

$ENSG00000196136 [1] NA

And I am able to pass this KEGG file to the goseq() function and perform the KEGG analysis just fine. It is unclear to me what is wrong with the formatting of the MSigDB file as its structure looks the same as the Ensembl2KEGG file. It could also be the mapping function and something quite simple I am missing there but since the Hs.H file structure is similar to the Ensembl2KEGG file structure I didn't see any reason why it would need to be modified.
Any suggestions are appreciated. An alternative source of R-formatted MSigDB files that have a similar format to the Ensembl2KEGG file obtained from org.Hs.egPATH would be equally helpful.

Thank you.

goseq category test • 2.1k views

ADD COMMENT • link updated 6.2 years ago by James W. MacDonald 67k • written 6.2 years ago by Mthabisi Moyo ▴ 40

score 1 · Accepted Answer · 2018-09-25

I am not sure why that section says that you have to convert to Ensembl IDs, because it's not AFAIK necessary. In fact I would argue (and have done so in the past) that converting from NCBI to EBI/EMBL IDs and vice versa is not so trivial, and should be avoided unless you really need to do so. Anyway, here's an example

> library(goseq)

> data(genes)

## note that the names for this vector are Ensembl IDs. For this example I am going to convert to Gene IDs

> data(genes)
> names(genes) <- mapIds(org.Hs.eg.db, names(genes), "ENTREZID","ENSEMBL")
'select()' returned 1:many mapping between keys and columns
> sum(is.na(names(genes)))
[1] 8208

## so there are over 8000 genes where EBI/EMBL says they are genes, and NCBI says they are not!

> genes <- genes[!is.na(names(genes))]
> sum(duplicated(names(genes)))
[1] 19

## And 19 genes where EBI/EMBL says 'these genes are different things' and NCBI says 'nope, same thing'.

> genes <- genes[!duplicated(names(genes))]

## so, like I said, not trivial - I have done the most naive thing possible here, and you could credibly argue what I did was wrong

> pwf <- nullp(genes, "hg19", "knownGene")

> load(url("http://bioinf.wehi.edu.au/software/MSigDB/human_H_v5p2.rdata"))

> pvals <- goseq(pwf, "hg19", "knownGene", Hs.H)
Using manually entered categories.
For 11028 genes, we could not find any categories. These genes will be excluded.
To force their use, please run with use_genes_without_cat=TRUE (see documentation).
This was the default behavior for version 1.15.1 and earlier.
Calculating the p-values...

## This warning is due to the fact that there are 11028 genes that NCBI says are a thing, and EBI/EMBL says 'nope'

> head(pvals)
                           category over_represented_pvalue
3        HALLMARK_ANDROGEN_RESPONSE            2.186874e-21
13             HALLMARK_E2F_TARGETS            2.132188e-06
45 HALLMARK_TNFA_SIGNALING_VIA_NFKB            1.501282e-03
18          HALLMARK_G2M_CHECKPOINT            2.023453e-03
31        HALLMARK_MTORC1_SIGNALING            3.547345e-03
15 HALLMARK_ESTROGEN_RESPONSE_EARLY            4.621137e-03
   under_represented_pvalue numDEInCat numInCat
3                 1.0000000         72       99
13                0.9999993         82      199
45                0.9991730         56      146
18                0.9987806         72      197
31                0.9978255         68      192
15                0.9971784         64      176