Question: goseq: Formatting category list for use with goseq
0
6 months ago by
Mthabisi Moyo10 wrote:

I am using goseq for category testing of differentially expressed genes in my RNA-seq data sets and I wanted to extend it to categories other than the GO categories. Specifically, I wanted to make use of the molecular signatures database (MSigDB). To be clear, I am not trying to do a ranking approach like GSEA. I would simply like to take advantage of the extensive list of categories in MSigDB and perform category testing on my differentially expressed genes. I obtained MSigDB lists in R format from: http://bioinf.wehi.edu.au/software/MSigDB/index.html.
This provides lists of categories which in turn contain vectors of Entrez IDs in each category.

> load("~/MSigDB/human_H_v5p2.rdata") #loaded filename is Hs.H. > head(Hs.H) $HALLMARK_TNFA_SIGNALING_VIA_NFKB [1] "3726" "2920" "467" "4792" "7128" "5743" "2919" "8870" "9308" "6364" [11] "2921" "23764" "4791" "7127" "1839" "1316" "330" "5329" "7538" "3383"  I processed the MSigDB lists into ensembl ID-mapped format instead of entrez ID (RNA-seq data has Ensembl IDs for feature set) as described in the goseq manual in section 6.6 (KEGG pathway analysis): library(goseq) library(org.Hs.eg.db) Ensembl2Entrez <- as.list(org.Hs.egENSEMBL2EG) mapEnsembl2Cat <- function(id, mapkeys){unique(unlist(mapkeys[id], use.names = FALSE))} Hs.H <- lapply(Ensembl2Entrez, mapEnsembl2Cat, Hs.H) Unfortunately this produces pairlists of length 0 (Type: NULL) for each ensembl ID: > head(Hs.H) $ENSG00000121410 NULL

$ENSG00000175899 NULL $ENSG00000256069 NULL

$ENSG00000171428 NULL $ENSG00000156006 NULL

$ENSG00000196136 NULL It works just fine when I use the eg2kegg file in the manual example to produce the KEGG file with ensembl IDs: > Ensembl2KEGG <- as.list(org.Hs.egPATH) > KEGG <- lapply(Ensembl2Entrez, mapEnsembl2Cat, Ensembl2KEGG) > head(KEGG) $ENSG00000121410 [1] NA

$ENSG00000175899 [1] "04610" $ENSG00000256069 [1] NA

$ENSG00000171428 [1] "00232" "00983" "01100" $ENSG00000156006 [1] "00232" "00983" "01100"

\$ENSG00000196136 [1] NA

And I am able to pass this KEGG file to the goseq() function and perform the KEGG analysis just fine. It is unclear to me what is wrong with the formatting of the MSigDB file as its structure looks the same as the Ensembl2KEGG file. It could also be the mapping function and something quite simple I am missing there but since the Hs.H file structure is similar to the Ensembl2KEGG file structure I didn't see any reason why it would need to be modified.
Any suggestions are appreciated. An alternative source of R-formatted MSigDB files that have a similar format to the Ensembl2KEGG file obtained from org.Hs.egPATH would be equally helpful.

Thank you.

goseq category test • 277 views
modified 6 months ago by James W. MacDonald49k • written 6 months ago by Mthabisi Moyo10
Answer: goseq: Formatting category list for use with goseq
1
6 months ago by
United States
James W. MacDonald49k wrote:

I am not sure why that section says that you have to convert to Ensembl IDs, because it's not AFAIK necessary. In fact I would argue (and have done so in the past) that converting from NCBI to EBI/EMBL IDs and vice versa is not so trivial, and should be avoided unless you really need to do so. Anyway, here's an example

> library(goseq)

> data(genes)

## note that the names for this vector are Ensembl IDs. For this example I am going to convert to Gene IDs

> data(genes)
> names(genes) <- mapIds(org.Hs.eg.db, names(genes), "ENTREZID","ENSEMBL")
'select()' returned 1:many mapping between keys and columns
> sum(is.na(names(genes)))
[1] 8208

## so there are over 8000 genes where EBI/EMBL says they are genes, and NCBI says they are not!

> genes <- genes[!is.na(names(genes))]
> sum(duplicated(names(genes)))
[1] 19

## And 19 genes where EBI/EMBL says 'these genes are different things' and NCBI says 'nope, same thing'.

> genes <- genes[!duplicated(names(genes))]

## so, like I said, not trivial - I have done the most naive thing possible here, and you could credibly argue what I did was wrong

> pwf <- nullp(genes, "hg19", "knownGene")

> pvals <- goseq(pwf, "hg19", "knownGene", Hs.H)
Using manually entered categories.
For 11028 genes, we could not find any categories. These genes will be excluded.
To force their use, please run with use_genes_without_cat=TRUE (see documentation).
This was the default behavior for version 1.15.1 and earlier.
Calculating the p-values...

## This warning is due to the fact that there are 11028 genes that NCBI says are a thing, and EBI/EMBL says 'nope'

category over_represented_pvalue
3        HALLMARK_ANDROGEN_RESPONSE            2.186874e-21
13             HALLMARK_E2F_TARGETS            2.132188e-06
45 HALLMARK_TNFA_SIGNALING_VIA_NFKB            1.501282e-03
18          HALLMARK_G2M_CHECKPOINT            2.023453e-03
31        HALLMARK_MTORC1_SIGNALING            3.547345e-03
15 HALLMARK_ESTROGEN_RESPONSE_EARLY            4.621137e-03
under_represented_pvalue numDEInCat numInCat
3                 1.0000000         72       99
13                0.9999993         82      199
45                0.9991730         56      146
18                0.9987806         72      197
31                0.9978255         68      192
15                0.9971784         64      176

Thank you, this was very helpful. As your example demonstrates, I did not have to convert the ontology/ category lists to ensembl IDs. It was sufficient to just convert the differentially expressed gene set to entrez IDs.

As far as the inconsistency between entrez IDs and ensembl IDs, I had noticed the same thing. In fact, when I attempt to map all genes with an ensembl ID (~ 60,000 genes) to entrez IDs, over 50% do not have an entrez ID, which seems a very large level of disagreement between annotations/ feature sets considering we are talking about the same genome. This suggests that entrez is leaving out whole categories of annotations, e.g. pseudogenes perhaps.

It's not as simple as all that. If we use an EnsDb as the authoritative source of Ensembl IDs and an OrgDb as the authoritative source of Gene IDs (here I got the EnsDb from AnnotationHub. I exclude that code for brevity):

> z
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.1
|Creation time: Sun Apr  8 01:11:55 2018
|ensembl_version: 92
|ensembl_host: localhost
|Organism: homo_sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.1
| No. of genes: 65256.
| No. of transcripts: 225358.
|Protein data available.

> ensgns <- keys(z, "GENEID")
> length(ensgns)
[1] 65256
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
[5] "ENSG00000000460" "ENSG00000000938"

> egFromEns <- select(z, ensgns, "ENTREZID", "GENEID")
> dim(egFromEns)
[1] 65568     2
> sum(!is.na(egFromEns[,2]))
[1] 22884

## so there are over 65k Ensembl IDs, of which ~23k map to Gene IDs

> library(org.Hs.eg.db)
> egs <- keys(org.Hs.eg.db)
> length(egs)
[1] 60140
> ensFromEg <- select(org.Hs.eg.db, egs, "ENSEMBL")
'select()' returned 1:many mapping between keys and columns
> dim(ensFromEg)
[1] 62988     2
> sum(!is.na(ensFromEg[,2]))
[1] 28964
## And similarly, of the 60k Gene IDs, only about 26k map to Ensembl IDs (accounting for duplicate mappings).

There are any number of reasons why EBI/EMBL might say a particular Ensembl ID has no match in NCBI's Gene IDs, and vice versa. But this is orthogonal to what you are trying to accomplish! You have a bunch of genes that appear to be differentially expressed, and there is no benefit to arbitrarily removing a large portion of those genes because of some technicalities that probably have more to do with the location of the gene in the genome than the underlying biological function.