Question: Matching GenBank Accession Number with corresponding Biomart results
gravatar for Morgan Howard
9 months ago by
Morgan Howard0 wrote:


I am taking a large list of accession numbers from NCBI’s EST and GenBank databases and using biomart R package to query the the Ensembl mart and the human gene Ensembl dataset in order to retrieve the attributes entrezgenetransname, entrezgene, description, and transcript_biotype. This all works as desired except for I am unable to tell what results go with which accession numbers. Surely there must be a way to have the NCBI GenBank accession numbers queried against in a column next to its associated result. I’ve searched though the user manual and many forums and I have been unsuccessful in finding a solution. Any help would be very much appreciated! Thank you

Below I've posted the code I am running as well as a dump of my sessionInfo

ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
attributesList = listAttributes(ensembl)
attributes <- c('entrezgene_trans_name','entrezgene','description','transcript_biotype')
information <- getBM(attributes=c('entrezgene_trans_name','entrezgene','description','transcript_biotype'),
      values = ResultsFile$sacc, 
      mart = ensembl)

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.2

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomaRt_2.38.0       plotly_4.8.0         ggplot2_3.1.0        bindrcpp_0.2.2      
 [5] dplyr_0.7.8          plyr_1.8.4           stringr_1.3.1        annotate_1.60.0     
 [9] XML_3.98-1.16        AnnotationDbi_1.44.0 IRanges_2.16.0       S4Vectors_0.20.1    
[13] Biobase_2.42.0       BiocGenerics_0.28.0  data.table_1.12.0    BiocManager_1.30.4  

loaded via a namespace (and not attached):
 [1] progress_1.2.0    tidyselect_0.2.5  purrr_0.2.5       colorspace_1.4-0  htmltools_0.3.6  
 [6] viridisLite_0.3.0 yaml_2.2.0        blob_1.1.1        rlang_0.3.1       later_0.7.5      
[11] pillar_1.3.1      glue_1.3.0        withr_2.1.2       DBI_1.0.0         bit64_0.9-7      
[16] bindr_0.1.1       munsell_0.5.0     gtable_0.2.0      htmlwidgets_1.3   memoise_1.1.0    
[21] httpuv_1.4.5.1    crosstalk_1.0.0   curl_3.3          Rcpp_1.0.0        xtable_1.8-3     
[26] promises_1.0.1    scales_1.0.0      jsonlite_1.6      mime_0.6          bit_1.1-14       
[31] hms_0.4.2         digest_0.6.18     stringi_1.2.4     shiny_1.2.0       grid_3.5.1       
[36] tools_3.5.1       bitops_1.0-6      magrittr_1.5      lazyeval_0.2.1    RCurl_1.95-4.11  
[41] tibble_2.0.1      RSQLite_2.1.1     crayon_1.3.4      tidyr_0.8.2       pkgconfig_2.0.2  
[46] prettyunits_1.0.2 assertthat_0.2.0  httr_1.4.0        rstudioapi_0.9.0  R6_2.3.0         
[51] compiler_3.5.1

I have also included the structure of my ResultsFile dataTable

                                 qseqid     sacc length slen pident nident qstart qend    evalue
1: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA010948    251  294 86.853    218    290  520  4.12e-66
2: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA186504    351  359 85.185    299    195  516  1.11e-86
3: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA187225    420  410 86.905    365     59  455 2.28e-123
4: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA284245    405  418 86.914    352    141  516 2.99e-112
5: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA302401    289  307 86.505    250    202  473  8.74e-78
6: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA306255    342  410 87.135    298     59  382  3.95e-96
annotation biomart R • 279 views
ADD COMMENTlink modified 9 months ago by James W. MacDonald51k • written 9 months ago by Morgan Howard0
Answer: Matching GenBank Accession Number with corresponding Biomart results
gravatar for James W. MacDonald
9 months ago by
United States
James W. MacDonald51k wrote:

When you query the Biomart server, you are actually querying the underlying database, which returns data in random order. So if you want to match up to your existing data you need to include the accession number as part of your query, in which case you can then use match to re-order to match up with your data.

I don't see how your query could be working for you, however, as you are not passing anything for the filters argument. I don't actuallly know what you would use for GenBank IDs on the Biomart server, so can't test. I would also caution you against using an EBI database to map NCBI IDs - there are any number of discrepancies between the two annotation services that will affect your results.

ADD COMMENTlink written 9 months ago by James W. MacDonald51k

So if I am understanding you correctly, one of the attributes would be need to the accession number which the ensembl biomart database does not support. If you take the first result from my results file and enter it into both EMBL-EBI and the NCBI databases, the information is identical. I assume there is a potential for discrepancies but I've yet to encounter one. Here is what my output is looking like from the mart: Image of output from biomart

ADD REPLYlink modified 9 months ago • written 9 months ago by Morgan Howard0

You are doing something that I have not actually ever done, so I missed where you have gone wrong. By omitting the filters argument you are asking the Biomart to give you everything for the four attributes you list. This isn't helpful because you aren't actually mapping the GenBank IDs to anything! Instead you are just telling the Biomart server to give you a bunch of stuff.

So far as I can tell, EBI doesn't have any GenBank IDs in their databases, which sort of makes sense. GenBank is where most submissions to NCBI go first, and is sort of a dumping ground for things that may or may not end up being accepted as a real thing. And EBI and NCBI have a hard enough time agreeing on what is and isn't a gene, and where it might be and how many transcripts it might have, that adding in a bunch of speculative content is probably a fool's errand.

If you want to map GenBank IDs to Gene IDs, then you are better off using NCBI data rather than using EBI as an intermediate step. Something like

mapped <- select(, as.character(ResultsFile$sacc), c("ENTREZID","SYMBOL","GENENAME"), "ACCNUM")

But do note that you are trying to map from really speculative content to more accepted content, so you should expect lots of missing data.

ADD REPLYlink written 9 months ago by James W. MacDonald51k

Thank you immensely, this is a perfect solution to my issue. I am not expecting all accessions to have a hit. This option also is able to be widened to other species with the other databases available.

ADD REPLYlink written 9 months ago by Morgan Howard0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 430 users visited in the last hour