biomaRt Ensembl ID query returns NA in 'ensembl_gene_id', 'ensembl_transcript_id', etc.
@yuragrabovska-9835
Last seen 2.5 years ago
United Kingdom

I am feeding biomaRt a list of Ensembl gene IDs (object: `ensemblIDs`) from an RNA-Seq experiment.

I then run the following calls:

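library(biomaRt)

# Connect to the Ensembl BioMart and select the human gene dataset
ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')

# Retrieve annotation for every Ensembl gene ID in ensemblIDs
symbols.a <- getBM(attributes = c('ensembl_gene_id',
                                  'ensembl_transcript_id',
                                  'hgnc_symbol',
                                  'external_gene_name',
                                  'gene_biotype',
                                  'description',
                                  'name_1006',
                                  'definition_1006'),
                   filters = 'ensembl_gene_id',
                   values = ensemblIDs,
                   mart = ensembl)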

After matching, I get back a data.frame of results, but although I am feeding the function Ensembl IDs, it contains a large number of NAs. A few examples of affected IDs:

ENSG00000139131, ENSG00000167157, ENSG00000149547

These all have entries on the Ensembl website, so it seems odd that biomaRt isn't matching them.

 

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8  
[6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doParallel_1.0.10                                  RSQLite_2.0                                      
[3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0 GenomicFeatures_1.28.4                           
[5] Rtsne_0.13                                         plyr_1.8.4                                       
[7] pheatmap_1.0.8                                     NMF_0.20.6                                       
[9] cluster_2.0.6                                      rngtools_1.2.4                                   
[11] pkgmaker_0.22                                      registry_0.3                                     
[13] minfi_1.22.1                                       bumphunter_1.16.0                                
[15] locfit_1.5-9.1                                     iterators_1.0.8                                  
[17] Biostrings_2.44.1                                  XVector_0.16.0                                   
[19] limma_3.32.2                                       igraph_1.0.1                                     
[21] hugene11sttranscriptcluster.db_8.6.0               hugene10sttranscriptcluster.db_8.6.0             
[23] hthgu133a.db_3.2.3                                 hgug4112a.db_3.2.3                               
[25] hgu95av2.db_3.2.3                                  hgu133plus2.db_3.2.3                             
[27] hgu133b.db_3.2.3                                   hgu133a2.db_3.2.3                                
[29] hgu133a.db_3.2.3                                   org.Hs.eg.db_3.4.1                               
[31] gplots_3.0.1                                       GEOquery_2.42.0                                  
[33] genefilter_1.58.1                                  foreach_1.4.3                                    
[35] DESeq2_1.16.1                                      SummarizedExperiment_1.6.3                       
[37] DelayedArray_0.2.7                                 matrixStats_0.52.2                               
[39] GenomicRanges_1.28.3                               GenomeInfoDb_1.12.2                              
[41] biomaRt_2.32.1                                     beadarray_2.26.1                                 
[43] ggplot2_2.2.1                                      annotate_1.54.0                                  
[45] XML_3.98-1.9                                       AnnotationDbi_1.38.1                             
[47] IRanges_2.10.2                                     S4Vectors_0.14.3                                 
[49] affy_1.54.0                                        Biobase_2.36.2                                   
[51] BiocGenerics_0.22.0                               

loaded via a namespace (and not attached):
[1] colorspace_1.3-2         siggenes_1.50.0          mclust_5.3               htmlTable_1.9            base64enc_0.1-3          base64_2.0             
[7] affyio_1.46.0            bit64_0.9-7              codetools_0.2-15         splines_3.4.0            geneplotter_1.54.0       knitr_1.16             
[13] Formula_1.2-2            Rsamtools_1.28.0         gridBase_0.4-7           compiler_3.4.0           httr_1.2.1               backports_1.1.0        
[19] Matrix_1.2-10            lazyeval_0.2.0           BeadDataPackR_1.28.0     acepack_1.4.1            htmltools_0.3.6          tools_3.4.0            
[25] gtable_0.2.0             GenomeInfoDbData_0.99.0  reshape2_1.4.2           doRNG_1.6.6              Rcpp_0.12.11             multtest_2.32.0        
[31] nlme_3.1-131             gdata_2.18.0             preprocessCore_1.38.1    rtracklayer_1.36.3       stringr_1.2.0            gtools_3.5.0           
[37] beanplot_1.2             MASS_7.3-47              zlibbioc_1.22.0          scales_0.4.1             BiocInstaller_1.26.0     RColorBrewer_1.1-2     
[43] memoise_1.1.0            gridExtra_2.2.1          rpart_4.1-11             reshape_0.8.6            latticeExtra_0.6-28      stringi_1.1.5          
[49] checkmate_1.8.3          caTools_1.17.1           BiocParallel_1.10.1      rlang_0.1.1              pkgconfig_2.0.1          bitops_1.0-6           
[55] nor1mix_1.2-2            lattice_0.20-35          GenomicAlignments_1.12.1 htmlwidgets_0.9          bit_1.1-12               magrittr_1.5           
[61] R6_2.2.2                 Hmisc_4.0-3              DBI_0.7                  foreign_0.8-69           survival_2.41-3          RCurl_1.95-4.8         
[67] nnet_7.3-12              tibble_1.3.3             KernSmooth_2.23-15       grid_3.4.0               data.table_1.10.4        blob_1.1.0             
[73] digest_0.6.12            xtable_1.8-2             illuminaio_0.18.0        openssl_0.9.6            munsell_0.4.3            quadprog_1.5-5   

 

biomart ensembl annotation • 2.0k views

Can you provide the output from sessionInfo() so we can see which version of biomaRt you're using?  


I have edited the question above to add it. One more observation:

Of the original list I fed the function, around 20% weren't matched. I subset those ensembl IDs which were not matched and ran just them (around 5000) through the function again. Most were matched but around 20% weren't matched again. I repeated the process but wasn't able to extract the additional missing genes.

 

Mike Smith ★ 6.5k
@mike-smith
Last seen 3 hours ago
EMBL Heidelberg

I think this is probably related to a problem the Ensembl BioMart has when your list of query values is very long.  You'll notice on the web interface it only recommends submitting up to 500 values at a time.  If you submit something longer than this the query may time out, but it does so silently and still returns some results.  The same is true if you do the query with biomaRt, and it's basically impossible to tell from the returned values that this has happened.

I recently patched the development version of biomaRt (https://www.bioconductor.org/packages/devel/bioc/html/biomaRt.html) to break large queries into smaller chunks and submit them independently, so this shouldn't happen in the future. You can install the devel version using the following code, and then see if the problem persists:

library(BiocInstaller)
biocLite("Bioconductor-mirror/biomaRt")
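
If installing the devel version isn't an option, the same idea can be approximated by hand. The sketch below is only an illustration of chunked querying, not the code inside the biomaRt patch: `query_in_chunks` is a hypothetical helper name, and the chunk size of 500 is taken from the web interface's recommendation mentioned above.

```r
# Hypothetical helper (not part of biomaRt): split a long vector of IDs into
# chunks, run the same query function on each chunk, and stack the results.
query_in_chunks <- function(ids, query_fun, chunk_size = 500) {
  ids <- unique(ids)
  stopifnot(length(ids) > 0, chunk_size >= 1)
  # Assign consecutive IDs to groups 1, 2, 3, ... of at most chunk_size each
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
  results <- lapply(chunks, query_fun)
  # Combine the per-chunk data.frames and drop the chunk-derived row names
  out <- do.call(rbind, unname(results))
  rownames(out) <- NULL
  out
}

# Usage with biomaRt (needs network access, so it is not run here).
# ensemblIDs is the vector from the question; the attributes are a subset.
if (FALSE) {
  library(biomaRt)
  ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  annot <- query_in_chunks(ensemblIDs, function(chunk) {
    getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
          filters = "ensembl_gene_id",
          values = chunk,
          mart = ensembl)
  })
  # IDs that are still absent after chunked querying
  missing_ids <- setdiff(ensemblIDs, annot$ensembl_gene_id)
}
```

Any IDs left in `missing_ids` after this are less likely to be the result of a silent time-out, although that alone doesn't prove they are absent from the current Ensembl release.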

 


Thanks Mike. This makes sense. I'll implement your suggestion.

tcalvo ▴ 90
@tcalvo-12466
Last seen 10 months ago
Brazil

Have you checked that the input IDs are correct? I mean, are you telling biomaRt correctly what type of ID they are? If so, try using another annotation.


Of the original list I fed the function, around 20% weren't matched. I subset the Ensembl IDs that were not matched and ran just them (around 5000) through the function again. Around 20% weren't matched again. I repeated the process but wasn't able to retrieve the remaining missing genes. The fact that feeding the function a list of previously missed IDs yields a result makes me think the function is doing something I don't quite understand, but it is still able to match the missing IDs if fed a smaller list.
