#

Question

Warning message when performing Gene Ontology analysis with goseq using hg38

0

Entering edit mode

Nikolay Ivanov • 0

@nikolay-ivanov-23079

Last seen 3.5 years ago

USA/New York City/Weill Cornell Medicine

I'm getting a warning message when I try to run Gene Ontology analysis with goseq using human genome assembly GRCh38 (hg38). I know that hg38 isn't natively supported by goseq, but goseq will fetch goseq gene lengths from TxDB or UCSC, provided the appropriate package is installed.

Thus, I run the following code:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(goseq)

# load toy dataset
aa = c(1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1)
names(aa) = c('ENSG00000000003', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', 'ENSG00000002079')

# run the probability weighing function
pwf = nullp(aa, "hg38", "ensGene")

The code runs successfully, but I get the following warning message:

Warning messages:
1: In grep(txdbPattern, installedPackages) :
  argument 'pattern' has length > 1 and only the first element will be used

I looked through the source code, but can't seem to find the reason for this warning message. Does anyone know the reason for the warning? Can it affect my results?

Note that I don't get the warning message when I run the code with hg19:

library(goseq)

# load toy dataset
aa = c(1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1)
names(aa) = c('ENSG00000000003', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', 'ENSG00000002079')

# run the probability weighing function
pwf = nullp(aa, "hg19", "ensGene")

My session information is as follows:

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)

Matrix products: default
BLAS/LAPACK: /pbtech_mounts/homes027/nai2008/anaconda3/envs/R_4.0/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] org.Hs.eg.db_3.12.0
 [2] goseq_1.42.0
 [3] geneLenDataBase_1.26.0
 [4] BiasedUrn_1.07
 [5] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0
 [6] GenomicFeatures_1.42.1
 [7] AnnotationDbi_1.52.0
 [8] Biobase_2.50.0
 [9] GenomicRanges_1.42.0
[10] GenomeInfoDb_1.26.0
[11] IRanges_2.24.0
[12] S4Vectors_0.28.0
[13] BiocGenerics_0.36.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6                  lattice_0.20-41
 [3] GO.db_3.12.1                prettyunits_1.1.1
 [5] Rsamtools_2.6.0             Biostrings_2.58.0
 [7] assertthat_0.2.1            BiocFileCache_1.14.0
 [9] R6_2.5.0                    RSQLite_2.2.3
[11] httr_1.4.2                  pillar_1.4.7
[13] zlibbioc_1.36.0             rlang_0.4.10
[15] progress_1.2.2              curl_4.3
[17] rstudioapi_0.13             blob_1.2.1
[19] Matrix_1.3-2                splines_4.0.3
[21] BiocParallel_1.24.0         stringr_1.4.0
[23] RCurl_1.98-1.2              bit_4.0.4
[25] biomaRt_2.46.1              DelayedArray_0.16.0
[27] compiler_4.0.3              rtracklayer_1.50.0
[29] pkgconfig_2.0.3             askpass_1.1
[31] mgcv_1.8-33                 openssl_1.4.3
[33] tidyselect_1.1.0            SummarizedExperiment_1.20.0
[35] tibble_3.0.6                GenomeInfoDbData_1.2.4
[37] matrixStats_0.58.0          XML_3.99-0.5
[39] crayon_1.4.1                dplyr_1.0.4
[41] dbplyr_2.0.0                GenomicAlignments_1.26.0
[43] bitops_1.0-6                rappdirs_0.3.1
[45] grid_4.0.3                  nlme_3.1-151
[47] lifecycle_0.2.0             DBI_1.1.1
[49] magrittr_2.0.1              stringi_1.5.3
[51] cachem_1.0.3                XVector_0.30.0
[53] xml2_1.3.2                  ellipsis_0.3.1
[55] generics_0.1.0              vctrs_0.3.6
[57] tools_4.0.3                 bit64_4.0.5
[59] glue_1.4.2                  purrr_0.3.4
[61] hms_1.0.0                   MatrixGenerics_1.2.0
[63] fastmap_1.1.0               memoise_2.0.0

goseq • 2.3k views

ADD COMMENT • link updated 3.9 years ago by James W. MacDonald 67k • written 3.9 years ago by Nikolay Ivanov • 0

score 2 · Answer 1 · 2021-02-16

2

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 7 days ago

United States

The warning you are getting has to do with the fact that the TxDb package you are referencing doesn't actually exist. There was a TxDb.Hsapiens.UCSC.hg19.ensGene package, but that went away for hg38 because UCSC did this weird thing where they combined Ensembl transcripts and NCBI Gene IDs in the knownGene table, and got rid of the ensGene table. So you can't even make an ensGene TxDb package for hg38.

Anyway, I'm always blathering on about how you shouldn't do cross-annotation mappings like that because it's way harder than you think (as an example, NCBI and Ensembl have a project that they have been working on for three years now to simply find a single transcript for each gene that they can agree on...). I have no idea why UCSC made that change, given that NCBI has their own transcript IDs (the refGene table) that actually do map to NCBI Gene IDs, but whatever.

So back to the main story. The function getlength tries to grep for the TxDb.Hsapiens.UCSC.hg38.ensGene package, which doesn't exist. So as a fallback, it uses the org.Hs.eg.db package to find plausible replacements and comes up with two things to grep on, namely 'TxDb..hg38.knownGene' and 'TxDb..hg38.refGene'. And since it's using a vector of length two as an argument to grep you get the warning that only the first thing is being used, because you can't grep on two patterns simultaneously.

In the end you will use TxDb.Hsapiens.UCSC.hg38.knownGene to estimate the average transcript length for those Ensembl genes. But the table you use to do that is a mapping between the Ensembl transcripts and the NCBI Gene IDs, and since NCBI and Ensembl can't even agree on one transcript per gene, you can guess how many NA values are in there.

That will work, and you will get results (as evidenced by the fact that you got results) but you will get different results if you estimate the average transcript length by hand using one of Johannes Rainier's EnsDb packages, which are pure Ensembl data, and if you then do the gene to GO mapping using Ensembl to GO directly rather than what you are doing currently which is Ensembl to NCBI Gene ID to GO, which results in any number of genes getting dropped.

ADD COMMENT • link 3.9 years ago James W. MacDonald 67k

0

Entering edit mode

Well there you go.

It would be nice to get an update to geneLenDataBase for the "native" gene length for human Ensembl genes (even though genes may come and go and change length slightly). It's nice when teaching to show the simple (natively supported gene length) code.

ADD REPLY • link 3.9 years ago Michael Love 43k

0

Entering edit mode

Thank you for your reply, that makes perfect sense!

So, I take it that to carry out the most accurate GO analysis (and by accurate I mean analysis with the least amount of genes dropped), I should do what you suggest at the end (namely estimate the average transcript length using the EnsDb.Hsapiens.v86 package and then do the gene to GO mapping using Ensembl to GO directly).

Here's a breakdown of the pipeline:

Calculate gene lengths as follows:

library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

# get length of each transcript like so:
lengthOf(edb, of = "tx", filter = TxIdFilter("ENST00000494424"))

Then, average all transcripts corresponding to a given gene and calculate Probability Weighting Function by providing the gene length data calculated above as the bias.data argument:

pwf= nullp(DEgenes, 'hg38', 'ensGene', bias.data = gene_lengths)

Next, map Ensembl gene IDs to GO categories using Biomart (http://www.ensembl.org/biomart). [Is Biomart the best way to do this?]

Lastly, run the goseq function using the Biomart-derived Ensembl gene IDs to GO map (gene2cat argument):

 goseq(pwf, 'hg38', 'ensGene', gene2cat = Biomart_geneID_to_GOterms, test.cats=c("GO:CC", "GO:BP", "GO:MF"))

Is that an appropriate pipeline?

#

I have another follow-up question: The reason why I’m using hg38 for goseq in the first place is because I performed transcript quantification from RNAseq data using an hg38 transcriptome index. If I perform goseq analysis using hg19, would that be wrong? (Yes, I know that average transcript lengths would differ between hg19 and hg38.) If I can use hg19, then all of the issues that I’m running into would be avoided since the TxDb.Hsapiens.UCSC.hg19.ensGene package does exist.

ADD REPLY • link 3.9 years ago Nikolay Ivanov • 0

0

Entering edit mode

I would in general use the biomaRt package to do the mapping rather than the Biomart website directly, but ymmv. Otherwise, that's exactly what I was getting at. Well done figuring it out yourself. And pointing out lengthOf which I was not familiar with.

You could use hg19, but that's sort of an odd thing to explain. I mean how do you explain that? Like 'It is possible to use hg38 data to perform the GO analysis, but it was so much easier to pretend that these data were based on hg19, so I did that instead'?

You're right that it's easier, but even if you use the hg19 ensGene package, you are still doing the Ensembl -> NCBI Gene ID -> GO ID mappings, so you should make the gene2cat data.frame, and at that point is it really that much more work to get the average gene lengths?

ADD REPLY • link 3.9 years ago James W. MacDonald 67k

0

Entering edit mode

Yea, using hg19 would definitely look odd.. Thanks again for your help!

ADD REPLY • link 3.9 years ago Nikolay Ivanov • 0

0

Entering edit mode

I implemented the pipeline you recommended and am posting the code here, in case others may need it. Of note, goseq uses median transcript length for each gene as the bias data, which is what I computed.

I have one remaining question. Seeing as KEGG.db is no longer maintained, and will be removed from Bioconductor soon, is there another way to convert between KEGG pathway numeric IDs and descriptive names? Using biomaRt, I can get the KEGG pathway and enzyme ID associated with each gene: the sole KEGG-related attribute on biomaRt is "kegg_enzyme", which is in the form 00190+7.1.1.2, where the part before the plus sign (00190) is the pathway, and the part after the plus sign (7.1.1.2) is the enzyme. (At least that's how I understand it.)

Now, if I want to examine which KEGG pathways are over-represented in my set of differentially expressed genes, I can create the gene2cat argument required by goseq from from the KEGG biomaRt data as discussed above, and then run goseq. However, the resulting table contains KEGG numeric IDs (eg, 00190), but not descriptive names (eg, "Oxidative phosphorylation"). How can I fetch these descriptive names (without manually googling every numeric code)? I used to utilize KEGG.db, but it's not maintained any longer..

#

My code to manually create gene length bias.data and gene2cat arguments for goseq using pure Ensembl data, for Gene Ontology & KEGG pathway over-representation analyses:

## compute median transcript length for each gene

library(EnsDb.Hsapiens.v86)
edb = EnsDb.Hsapiens.v86
tx = transcripts(edb)
tx_length = lengthOf(edb, of = "tx") # details: https://rdrr.io/bioc/ensembldb/man/EnsDb-lengths.html

# convert transcript name to corresponding gene
gene_tx_map = transcripts(edb, columns = c("tx_id", "gene_id"), return.type="data.frame")
mm=match(names(tx_length), gene_tx_map$tx_id)
names(tx_length) = gene_tx_map$gene_id[mm]

tx_lengths_split = split(tx_length, names(tx_length))
median_tx_lengths = unlist(lapply(tx_lengths_split,median))
median_tx_lengths = data.frame(gene_EnsemblID = names(median_tx_lengths), median_length = as.numeric(median_tx_lengths)) # use this to construct the bias.data argument in the nullp goseq function

## make a list where the names are genes and each entry is a vector containing GO categories associated with that gene

library(biomaRt)
mart = useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

biomart_GO_results = getBM(attributes = c("ensembl_gene_id", "go_id", "name_1006"), mart = mart)

# remove genes with no associated GO term
biomart_GO_results = biomart_GO_results[-which(biomart_GO_results$go_id==""),]

genes_GO_list=split(biomart_GO_results[,1:2],biomart_GO_results$ensembl_gene_id)
gene2cat_GO = lapply(genes_GO_list, '[',,2) # use this as the gene2cat argument in the goseq function (for GO over-representation analysis)

## make a list where the names are genes and each entry is a vector containing KEGG categories associated with that gene
biomart_KEGG_results = getBM(attributes = c("ensembl_gene_id", "kegg_enzyme"), mart = mart)

# remove genes with no associated KEGG term
biomart_KEGG_results = biomart_KEGG_results[-which(biomart_KEGG_results$kegg_enzyme==""),]

# separate KEGG enzyme from KEGG pathway IDs
ss = function(x, pattern, slot=1,...) sapply(strsplit(x,pattern,...), "[", slot)
biomart_KEGG_results$kegg_enzyme = ss(biomart_KEGG_results$kegg_enzyme, '\\+', 1)

genes_KEGG_list=split(biomart_KEGG_results,biomart_KEGG_results$ensembl_gene_id)
gene2cat_KEGG = lapply(genes_KEGG_list, '[',,2) # use this as the gene2cat argument in the goseq function (for KEGG pathway over-representation analysis)

ADD REPLY • link 3.9 years ago Nikolay Ivanov • 0

1

Entering edit mode

If you just want to map between KeGG pathway IDs and names, you can use their REST API:

> mapper <- read.table("http://rest.kegg.jp/list/pathway/hsa", sep = "\t", quote = "\"", fill = TRUE, comment.char = "")
> head(mapper)
             V1                                                              V2
1 path:hsa00010             Glycolysis / Gluconeogenesis - Homo sapiens (human)
2 path:hsa00020                Citrate cycle (TCA cycle) - Homo sapiens (human)
3 path:hsa00030                Pentose phosphate pathway - Homo sapiens (human)
4 path:hsa00040 Pentose and glucuronate interconversions - Homo sapiens (human)
5 path:hsa00051          Fructose and mannose metabolism - Homo sapiens (human)
6 path:hsa00052                     Galactose metabolism - Homo sapiens (human)

ADD REPLY • link 3.9 years ago James W. MacDonald 67k

0

Entering edit mode

My understanding of the way biomaRt's "kegg_enzyme" attribute is structured (i.e. in the entry "00190+7.1.1.2", "00190" is the KEGG pathway and "7.1.1.2" is the enzyme) is correct, right?

Also, since I'm defining my mart with the dataset argument set to hsapiens_gene_ensembl, the KEGG pathway ID that I'm getting is for Homo sapients (e.g. 00190 is equivalent to path:hsa00190), correct?

library(biomaRt)
mart = useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
biomart_KEGG_results = getBM(attributes = c("ensembl_gene_id", "kegg_enzyme"), mart = mart)