Before running GO term enrichment analysis, I've been converting the Ensembl IDs of my differentially expressed genes to Entrez IDs using the R package biomaRt. I've done some spot checking, and most IDs appear to be converted correctly. However, I've found two Ensembl IDs that are converted incorrectly: ENSMUSG00000071633.11 and ENSMUSG00000071646.9, which come back as 208285 and 72465, respectively. After comparing searches on Ensembl and in NCBI's database, I can't find any relationship between these Ensembl IDs and the Entrez IDs I'm getting. This makes me suspect that there are other errors lurking in my dataset. Has anyone else run into this issue, and if so, how did you resolve it? Am I missing something obvious here?
R script used to convert Ensembl IDs to Entrez IDs:
library(biomaRt)
# Read the DE results; the versioned Ensembl gene IDs are in column X
DEListTangerineRed <- read.csv("DEgenesTangerineRed.csv", stringsAsFactors = FALSE)
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
TangerineRedGenes <- DEListTangerineRed$X
# Query BioMart for the Entrez ID and description of each versioned Ensembl ID
ConvertedTangerineRedGenes <- getBM(filters = "ensembl_gene_id_version",
                                    attributes = c("ensembl_gene_id_version", "entrezgene", "description"),
                                    values = TangerineRedGenes, mart = mart)
write.csv(ConvertedTangerineRedGenes, file = "ConvertedTangerineRedGenes.csv")
Session Info:
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2016.3.210/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
 [1] biomaRt_2.38.0 org.Mm.eg.db_3.7.0 topGO_2.34.0 SparseM_1.77 GO.db_3.7.0
 [6] AnnotationDbi_1.44.0 graph_1.60.0 DESeq2_1.22.1 SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[11] BiocParallel_1.16.0 matrixStats_0.54.0 Biobase_2.42.0 GenomicRanges_1.34.0 GenomeInfoDb_1.18.1
[16] IRanges_2.16.0 S4Vectors_0.20.0 BiocGenerics_0.28.0 gplots_3.0.1 reshape2_1.4.3
[21] RColorBrewer_1.1-2 Rsubread_1.30.9 Glimma_1.10.0 edgeR_3.22.5 limma_3.38.2

loaded via a namespace (and not attached):
 [1] bitops_1.0-6 bit64_0.9-7 httr_1.3.1 progress_1.2.0 tools_3.5.0 backports_1.1.2
 [7] R6_2.2.2 rpart_4.1-13 KernSmooth_2.23-15 Hmisc_4.1-1 DBI_1.0.0 lazyeval_0.2.1
[13] colorspace_1.3-2 nnet_7.3-12 prettyunits_1.0.2 gridExtra_2.3 curl_3.2 bit_1.1-14
[19] compiler_3.5.0 htmlTable_1.12 caTools_1.17.1.1 scales_0.5.0 checkmate_1.8.5 genefilter_1.64.0
[25] stringr_1.3.1 digest_0.6.15 foreign_0.8-70 XVector_0.22.0 base64enc_0.1-3 pkgconfig_2.0.1
[31] htmltools_0.3.6 htmlwidgets_1.3 rlang_0.3.0.1 rstudioapi_0.7 RSQLite_2.1.1 bindr_0.1.1
[37] jsonlite_1.5 gtools_3.8.1 acepack_1.4.1 dplyr_0.7.4 RCurl_1.95-4.11 magrittr_1.5
[43] GenomeInfoDbData_1.2.0 Formula_1.2-3 Matrix_1.2-14 Rcpp_0.12.16 munsell_0.4.3 yaml_2.2.0
[49] stringi_1.2.2 zlibbioc_1.28.0 plyr_1.8.4 grid_3.5.0 blob_1.1.1 gdata_2.18.0
[55] crayon_1.3.4 lattice_0.20-35 splines_3.5.0 annotate_1.60.0 hms_0.4.2 locfit_1.5-9.1
[61] knitr_1.20 pillar_1.2.2 geneplotter_1.60.0 XML_3.98-1.16 glue_1.2.0 latticeExtra_0.6-28
[67] data.table_1.11.8 gtable_0.2.0 assertthat_0.2.0 ggplot2_3.1.0 xtable_1.8-2 survival_2.41-3
[73] tibble_1.4.2 memoise_1.1.0 bindrcpp_0.2.2 cluster_2.0.7-1
Hi James,
You're correct about the IDs being unordered. I was re-ordering them in Excel so that the DE information (F value, P-value, etc.) lined up correctly. What I discovered is that, instead of returning an "NA" for Ensembl IDs it doesn't recognize, biomaRt will occasionally just drop those entries from the list altogether! As a result, the list of IDs from my DE analysis and the list of converted IDs had different lengths, so the rows in my compiled dataset were shifted relative to each other. This seems like a really weird error on the package's part as far as I can tell.
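For anyone hitting the same thing, here is a minimal sketch of one way to avoid the shift: join the two tables on the ID column instead of lining them up by position. It uses base R only. The IDs, p-values, and Entrez numbers are made-up placeholders (not real genes), and `converted` is a hypothetical stand-in for what a `getBM()` call might return.

```r
# Toy DE table; IDs and p-values are placeholders, not real genes
de <- data.frame(
  ensembl_gene_id_version = c("ID_A.1", "ID_B.2", "ID_C.3", "ID_D.4"),
  PValue = c(0.001, 0.004, 0.010, 0.020),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for a getBM() result: rows come back in a
# different order, and ID_C.3 is missing entirely rather than NA
converted <- data.frame(
  ensembl_gene_id_version = c("ID_D.4", "ID_A.1", "ID_B.2"),
  entrezgene = c(104L, 101L, 102L),
  stringsAsFactors = FALSE
)

# List the input IDs that were silently dropped from the result
missing_ids <- setdiff(de$ensembl_gene_id_version,
                       converted$ensembl_gene_id_version)

# Look up each DE row's Entrez ID by its Ensembl ID; match() returns NA
# for IDs absent from `converted`, so the DE row order is preserved
idx <- match(de$ensembl_gene_id_version, converted$ensembl_gene_id_version)
de$entrezgene <- converted$entrezgene[idx]

print(missing_ids)
print(de)
```

Note that `match()` keeps only the first hit per ID. If an Ensembl ID can map to more than one Entrez ID, `merge(de, converted, by = "ensembl_gene_id_version", all.x = TRUE)` would keep every mapping and still fill in NA for the unmatched rows.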