I'm getting a warning message when I try to run Gene Ontology analysis with goseq using human genome assembly GRCh38 (hg38). I know that hg38 isn't natively supported by goseq, but goseq will fetch goseq gene lengths from TxDB or UCSC, provided the appropriate package is installed.
Thus, I run the following code:
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(goseq)
# load toy dataset
aa = c(1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1)
names(aa) = c('ENSG00000000003', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', 'ENSG00000002079')
# run the probability weighing function
pwf = nullp(aa, "hg38", "ensGene")
The code runs successfully, but I get the following warning message:
Warning messages:
1: In grep(txdbPattern, installedPackages) :
argument 'pattern' has length > 1 and only the first element will be used
I looked through the source code, but can't seem to find the reason for this warning message. Does anyone know the reason for the warning? Can it affect my results?
Note that I don't get the warning message when I run the code with hg19:
library(goseq)
# load toy dataset
aa = c(1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1)
names(aa) = c('ENSG00000000003', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', 'ENSG00000002079')
# run the probability weighing function
pwf = nullp(aa, "hg19", "ensGene")
My session information is as follows:
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)
Matrix products: default
BLAS/LAPACK: /pbtech_mounts/homes027/nai2008/anaconda3/envs/R_4.0/lib/libopenblasp-r0.3.10.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] org.Hs.eg.db_3.12.0
[2] goseq_1.42.0
[3] geneLenDataBase_1.26.0
[4] BiasedUrn_1.07
[5] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0
[6] GenomicFeatures_1.42.1
[7] AnnotationDbi_1.52.0
[8] Biobase_2.50.0
[9] GenomicRanges_1.42.0
[10] GenomeInfoDb_1.26.0
[11] IRanges_2.24.0
[12] S4Vectors_0.28.0
[13] BiocGenerics_0.36.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 lattice_0.20-41
[3] GO.db_3.12.1 prettyunits_1.1.1
[5] Rsamtools_2.6.0 Biostrings_2.58.0
[7] assertthat_0.2.1 BiocFileCache_1.14.0
[9] R6_2.5.0 RSQLite_2.2.3
[11] httr_1.4.2 pillar_1.4.7
[13] zlibbioc_1.36.0 rlang_0.4.10
[15] progress_1.2.2 curl_4.3
[17] rstudioapi_0.13 blob_1.2.1
[19] Matrix_1.3-2 splines_4.0.3
[21] BiocParallel_1.24.0 stringr_1.4.0
[23] RCurl_1.98-1.2 bit_4.0.4
[25] biomaRt_2.46.1 DelayedArray_0.16.0
[27] compiler_4.0.3 rtracklayer_1.50.0
[29] pkgconfig_2.0.3 askpass_1.1
[31] mgcv_1.8-33 openssl_1.4.3
[33] tidyselect_1.1.0 SummarizedExperiment_1.20.0
[35] tibble_3.0.6 GenomeInfoDbData_1.2.4
[37] matrixStats_0.58.0 XML_3.99-0.5
[39] crayon_1.4.1 dplyr_1.0.4
[41] dbplyr_2.0.0 GenomicAlignments_1.26.0
[43] bitops_1.0-6 rappdirs_0.3.1
[45] grid_4.0.3 nlme_3.1-151
[47] lifecycle_0.2.0 DBI_1.1.1
[49] magrittr_2.0.1 stringi_1.5.3
[51] cachem_1.0.3 XVector_0.30.0
[53] xml2_1.3.2 ellipsis_0.3.1
[55] generics_0.1.0 vctrs_0.3.6
[57] tools_4.0.3 bit64_4.0.5
[59] glue_1.4.2 purrr_0.3.4
[61] hms_1.0.0 MatrixGenerics_1.2.0
[63] fastmap_1.1.0 memoise_2.0.0
Well there you go.
It would be nice to get an update to
geneLenDataBase
for the "native" gene length for human Ensembl genes (even though genes may come and go and changelength
slightly). It's nice when teaching to show the simple (natively supported gene length) code.Thank you for your reply, that makes perfect sense!
So, I take it that to carry out the most accurate GO analysis (and by accurate I mean analysis with the least amount of genes dropped), I should do what you suggest at the end (namely estimate the average transcript length using the
EnsDb.Hsapiens.v86
package and then do the gene to GO mapping using Ensembl to GO directly).Here's a breakdown of the pipeline:
Calculate gene lengths as follows:
Then, average all transcripts corresponding to a given gene and calculate Probability Weighting Function by providing the gene length data calculated above as the
bias.data
argument:Next, map Ensembl gene IDs to GO categories using Biomart (http://www.ensembl.org/biomart). [Is Biomart the best way to do this?]
Lastly, run the goseq function using the Biomart-derived Ensembl gene IDs to GO map (
gene2cat
argument):Is that an appropriate pipeline?
#
I have another follow-up question: The reason why I’m using hg38 for goseq in the first place is because I performed transcript quantification from RNAseq data using an hg38 transcriptome index. If I perform goseq analysis using hg19, would that be wrong? (Yes, I know that average transcript lengths would differ between hg19 and hg38.) If I can use hg19, then all of the issues that I’m running into would be avoided since the
TxDb.Hsapiens.UCSC.hg19.ensGene
package does exist.I would in general use the biomaRt package to do the mapping rather than the Biomart website directly, but ymmv. Otherwise, that's exactly what I was getting at. Well done figuring it out yourself. And pointing out
lengthOf
which I was not familiar with.You could use hg19, but that's sort of an odd thing to explain. I mean how do you explain that? Like 'It is possible to use hg38 data to perform the GO analysis, but it was so much easier to pretend that these data were based on hg19, so I did that instead'?
You're right that it's easier, but even if you use the hg19 ensGene package, you are still doing the Ensembl -> NCBI Gene ID -> GO ID mappings, so you should make the gene2cat
data.frame
, and at that point is it really that much more work to get the average gene lengths?Yea, using hg19 would definitely look odd.. Thanks again for your help!
I implemented the pipeline you recommended and am posting the code here, in case others may need it. Of note, goseq uses median transcript length for each gene as the bias data, which is what I computed.
I have one remaining question. Seeing as KEGG.db is no longer maintained, and will be removed from Bioconductor soon, is there another way to convert between KEGG pathway numeric IDs and descriptive names? Using biomaRt, I can get the KEGG pathway and enzyme ID associated with each gene: the sole KEGG-related attribute on biomaRt is "kegg_enzyme", which is in the form 00190+7.1.1.2, where the part before the plus sign (00190) is the pathway, and the part after the plus sign (7.1.1.2) is the enzyme. (At least that's how I understand it.)
Now, if I want to examine which KEGG pathways are over-represented in my set of differentially expressed genes, I can create the gene2cat argument required by goseq from from the KEGG biomaRt data as discussed above, and then run goseq. However, the resulting table contains KEGG numeric IDs (eg, 00190), but not descriptive names (eg, "Oxidative phosphorylation"). How can I fetch these descriptive names (without manually googling every numeric code)? I used to utilize KEGG.db, but it's not maintained any longer..
#
My code to manually create gene length bias.data and gene2cat arguments for goseq using pure Ensembl data, for Gene Ontology & KEGG pathway over-representation analyses:
If you just want to map between KeGG pathway IDs and names, you can use their REST API:
My understanding of the way biomaRt's "kegg_enzyme" attribute is structured (i.e. in the entry "00190+7.1.1.2", "00190" is the KEGG pathway and "7.1.1.2" is the enzyme) is correct, right?
Also, since I'm defining my
mart
with the dataset argument set tohsapiens_gene_ensembl
, the KEGG pathway ID that I'm getting is for Homo sapients (e.g. 00190 is equivalent to path:hsa00190), correct?That would be my assumption.