Question: biomaRt Ensembl ID query returns NA in 'ensembl_gene_id', 'ensembl_transcript_id', etc.
4 months ago, yura.grabovska wrote:

I am feeding biomaRt a list of Ensembl gene IDs (object: `ensemblIDs`) from an RNA-Seq experiment.

I then run the following calls:

library(biomaRt)

# Connect to the Ensembl BioMart, human gene dataset
ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')

# Look up annotation for every Ensembl gene ID in ensemblIDs
symbols.a <- getBM(attributes = c('ensembl_gene_id',
                                  'ensembl_transcript_id',
                                  'hgnc_symbol',
                                  'external_gene_name',
                                  'gene_biotype',
                                  'description',
                                  'name_1006',
                                  'definition_1006'),
                   filters = 'ensembl_gene_id',
                   values = ensemblIDs,
                   mart = ensembl)

After matching, I get back a set of results, but although I am feeding the function Ensembl IDs, the resulting data.frame contains a large number of NA values. A few examples of affected IDs:

ENSG00000139131, ENSG00000167157, ENSG00000149547

These all have Ensembl gene web entries, so it seems odd that they aren't being matched.

 

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8  
[6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doParallel_1.0.10                                  RSQLite_2.0                                      
[3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0 GenomicFeatures_1.28.4                           
[5] Rtsne_0.13                                         plyr_1.8.4                                       
[7] pheatmap_1.0.8                                     NMF_0.20.6                                       
[9] cluster_2.0.6                                      rngtools_1.2.4                                   
[11] pkgmaker_0.22                                      registry_0.3                                     
[13] minfi_1.22.1                                       bumphunter_1.16.0                                
[15] locfit_1.5-9.1                                     iterators_1.0.8                                  
[17] Biostrings_2.44.1                                  XVector_0.16.0                                   
[19] limma_3.32.2                                       igraph_1.0.1                                     
[21] hugene11sttranscriptcluster.db_8.6.0               hugene10sttranscriptcluster.db_8.6.0             
[23] hthgu133a.db_3.2.3                                 hgug4112a.db_3.2.3                               
[25] hgu95av2.db_3.2.3                                  hgu133plus2.db_3.2.3                             
[27] hgu133b.db_3.2.3                                   hgu133a2.db_3.2.3                                
[29] hgu133a.db_3.2.3                                   org.Hs.eg.db_3.4.1                               
[31] gplots_3.0.1                                       GEOquery_2.42.0                                  
[33] genefilter_1.58.1                                  foreach_1.4.3                                    
[35] DESeq2_1.16.1                                      SummarizedExperiment_1.6.3                       
[37] DelayedArray_0.2.7                                 matrixStats_0.52.2                               
[39] GenomicRanges_1.28.3                               GenomeInfoDb_1.12.2                              
[41] biomaRt_2.32.1                                     beadarray_2.26.1                                 
[43] ggplot2_2.2.1                                      annotate_1.54.0                                  
[45] XML_3.98-1.9                                       AnnotationDbi_1.38.1                             
[47] IRanges_2.10.2                                     S4Vectors_0.14.3                                 
[49] affy_1.54.0                                        Biobase_2.36.2                                   
[51] BiocGenerics_0.22.0                               

loaded via a namespace (and not attached):
[1] colorspace_1.3-2         siggenes_1.50.0          mclust_5.3               htmlTable_1.9            base64enc_0.1-3          base64_2.0             
[7] affyio_1.46.0            bit64_0.9-7              codetools_0.2-15         splines_3.4.0            geneplotter_1.54.0       knitr_1.16             
[13] Formula_1.2-2            Rsamtools_1.28.0         gridBase_0.4-7           compiler_3.4.0           httr_1.2.1               backports_1.1.0        
[19] Matrix_1.2-10            lazyeval_0.2.0           BeadDataPackR_1.28.0     acepack_1.4.1            htmltools_0.3.6          tools_3.4.0            
[25] gtable_0.2.0             GenomeInfoDbData_0.99.0  reshape2_1.4.2           doRNG_1.6.6              Rcpp_0.12.11             multtest_2.32.0        
[31] nlme_3.1-131             gdata_2.18.0             preprocessCore_1.38.1    rtracklayer_1.36.3       stringr_1.2.0            gtools_3.5.0           
[37] beanplot_1.2             MASS_7.3-47              zlibbioc_1.22.0          scales_0.4.1             BiocInstaller_1.26.0     RColorBrewer_1.1-2     
[43] memoise_1.1.0            gridExtra_2.2.1          rpart_4.1-11             reshape_0.8.6            latticeExtra_0.6-28      stringi_1.1.5          
[49] checkmate_1.8.3          caTools_1.17.1           BiocParallel_1.10.1      rlang_0.1.1              pkgconfig_2.0.1          bitops_1.0-6           
[55] nor1mix_1.2-2            lattice_0.20-35          GenomicAlignments_1.12.1 htmlwidgets_0.9          bit_1.1-12               magrittr_1.5           
[61] R6_2.2.2                 Hmisc_4.0-3              DBI_0.7                  foreign_0.8-69           survival_2.41-3          RCurl_1.95-4.8         
[67] nnet_7.3-12              tibble_1.3.3             KernSmooth_2.23-15       grid_3.4.0               data.table_1.10.4        blob_1.1.0             
[73] digest_0.6.12            xtable_1.8-2             illuminaio_0.18.0        openssl_0.9.6            munsell_0.4.3            quadprog_1.5-5   

 


Can you provide the output from sessionInfo() so we can see which version of biomaRt you're using?  

written 4 months ago by Mike Smith

I've edited the question above to add it. One more observation:

Of the original list I fed the function, around 20% weren't matched. I took the subset of Ensembl IDs that were not matched (around 5000) and ran just those through the function again. Most were matched, but around 20% again were not. I repeated the process but wasn't able to recover the remaining missing genes.

 

written 4 months ago by yura.grabovska (modified 4 months ago)
4 months ago, Mike Smith (EMBL Heidelberg / de.NBI) wrote:

I think this is probably related to a problem the Ensembl BioMart has when your list of query values is very long.  You'll notice on the web interface it only recommends submitting up to 500 values at a time.  If you submit something longer than this the query may time out, but it does so silently and still returns some results.  The same is true if you do the query with biomaRt, and it's basically impossible to tell from the returned values that this has happened.

I recently patched the development version of biomaRt (https://www.bioconductor.org/packages/devel/bioc/html/biomaRt.html) to break large queries into smaller chunks and submit them independently, so this shouldn't happen in the future. You can install the devel version using the following code, and then see if the problem persists:

library(BiocInstaller)
biocLite("Bioconductor-mirror/biomaRt")
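
For anyone who can't switch to the devel version, the same idea can be approximated by hand. The following is only a rough sketch, not the code from the biomaRt patch: `chunk_ids` and `query_in_chunks` are hypothetical helper names, and the batch size of 500 is simply the limit the web interface recommends. The query itself is passed in as a function so the helpers can be tried without a network connection.

```r
# Split a vector of IDs into batches of at most `chunk_size` elements.
# (Hypothetical helper, not part of biomaRt.)
chunk_ids <- function(ids, chunk_size = 500) {
  ids <- unique(ids)
  if (length(ids) == 0) return(list())
  unname(split(ids, ceiling(seq_along(ids) / chunk_size)))
}

# Run `query_fun` once per batch and stack the resulting data frames.
# `query_fun` is any function that takes a character vector of IDs and
# returns a data.frame with the same columns for every batch.
query_in_chunks <- function(ids, query_fun, chunk_size = 500) {
  results <- lapply(chunk_ids(ids, chunk_size), query_fun)
  do.call(rbind, results)
}

# Usage with biomaRt (needs network access, so left commented out):
# library(biomaRt)
# ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# symbols.a <- query_in_chunks(ensemblIDs, function(batch) {
#   getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
#         filters = "ensembl_gene_id", values = batch, mart = ensembl)
# })
```

Each batch is an independent request, so a timeout in one batch should affect at most 500 IDs rather than silently truncating the whole result.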

 


Thanks Mike. This makes sense. I'll implement your suggestion.

written 4 months ago by yura.grabovska
4 months ago, thyagoleal wrote:

Have you checked whether the input IDs are correct? I mean, are you correctly specifying what type of ID they are? If so, try using another annotation source.


Of the original list I fed the function, around 20% weren't matched. I took the subset of Ensembl IDs that were not matched (around 5000) and ran just those through the function again. Around 20% again weren't matched. I repeated the process but wasn't able to recover the remaining missing genes. The fact that re-submitting previously missed IDs yields a result makes me think the function is doing something I don't quite understand, but it is still able to match the missing IDs when given a smaller list.

written 4 months ago by yura.grabovska
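
The check described in the reply above (working out which submitted IDs never came back) can be sketched as follows. This assumes the result is a data.frame with an `ensembl_gene_id` column, as in the original `getBM()` call; `missing_ids` is a hypothetical helper name, and the toy result below is made up purely to show the mechanics.

```r
# IDs that were submitted but do not appear in the result's ID column.
# (Hypothetical helper, not part of biomaRt.)
missing_ids <- function(submitted, result, id_col = "ensembl_gene_id") {
  setdiff(unique(submitted), result[[id_col]])
}

# Toy example: pretend only the first of the three IDs from the question
# was returned by the query.
submitted <- c("ENSG00000139131", "ENSG00000167157", "ENSG00000149547")
toy_result <- data.frame(ensembl_gene_id = "ENSG00000139131",
                         stringsAsFactors = FALSE)

not_found <- missing_ids(submitted, toy_result)
print(not_found)
# [1] "ENSG00000167157" "ENSG00000149547"

# Fraction of submitted IDs with no row in the result
length(not_found) / length(unique(submitted))
```

If the set of missing IDs shrinks every time they are re-submitted on their own, that points at the size of the query rather than at the IDs themselves.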