Question: Matching GenBank Accession Number with corresponding Biomart results
0
9 months ago by
IUPUI
Morgan Howard0 wrote:

Hello,

I am taking a large list of accession numbers from NCBI’s EST and GenBank databases and using biomart R package to query the the Ensembl mart and the human gene Ensembl dataset in order to retrieve the attributes entrezgenetransname, entrezgene, description, and transcript_biotype. This all works as desired except for I am unable to tell what results go with which accession numbers. Surely there must be a way to have the NCBI GenBank accession numbers queried against in a column next to its associated result. I’ve searched though the user manual and many forums and I have been unsuccessful in finding a solution. Any help would be very much appreciated! Thank you

Below I've posted the code I am running as well as a dump of my sessionInfo

library(biomaRt)
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
attributesList = listAttributes(ensembl)
attributes <- c('entrezgene_trans_name','entrezgene','description','transcript_biotype')
information <- getBM(attributes=c('entrezgene_trans_name','entrezgene','description','transcript_biotype'),
values = ResultsFile$sacc, mart = ensembl)  R version 3.5.1 (2018-07-02) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS 10.14.2 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] biomaRt_2.38.0 plotly_4.8.0 ggplot2_3.1.0 bindrcpp_0.2.2 [5] dplyr_0.7.8 plyr_1.8.4 stringr_1.3.1 annotate_1.60.0 [9] XML_3.98-1.16 AnnotationDbi_1.44.0 IRanges_2.16.0 S4Vectors_0.20.1 [13] Biobase_2.42.0 BiocGenerics_0.28.0 data.table_1.12.0 BiocManager_1.30.4 loaded via a namespace (and not attached): [1] progress_1.2.0 tidyselect_0.2.5 purrr_0.2.5 colorspace_1.4-0 htmltools_0.3.6 [6] viridisLite_0.3.0 yaml_2.2.0 blob_1.1.1 rlang_0.3.1 later_0.7.5 [11] pillar_1.3.1 glue_1.3.0 withr_2.1.2 DBI_1.0.0 bit64_0.9-7 [16] bindr_0.1.1 munsell_0.5.0 gtable_0.2.0 htmlwidgets_1.3 memoise_1.1.0 [21] httpuv_1.4.5.1 crosstalk_1.0.0 curl_3.3 Rcpp_1.0.0 xtable_1.8-3 [26] promises_1.0.1 scales_1.0.0 jsonlite_1.6 mime_0.6 bit_1.1-14 [31] hms_0.4.2 digest_0.6.18 stringi_1.2.4 shiny_1.2.0 grid_3.5.1 [36] tools_3.5.1 bitops_1.0-6 magrittr_1.5 lazyeval_0.2.1 RCurl_1.95-4.11 [41] tibble_2.0.1 RSQLite_2.1.1 crayon_1.3.4 tidyr_0.8.2 pkgconfig_2.0.2 [46] prettyunits_1.0.2 assertthat_0.2.0 httr_1.4.0 rstudioapi_0.9.0 R6_2.3.0 [51] compiler_3.5.1  I have also included the structure of my ResultsFile dataTable  qseqid sacc length slen pident nident qstart qend evalue 1: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA010948 251 294 86.853 218 290 520 4.12e-66 2: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA186504 351 359 85.185 299 195 516 1.11e-86 3: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA187225 420 410 86.905 365 59 455 2.28e-123 4: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA284245 405 418 86.914 352 141 516 2.99e-112 5: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA302401 289 307 86.505 250 202 473 8.74e-78 6: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA306255 342 410 87.135 298 59 382 3.95e-96  annotation biomart R • 279 views ADD COMMENTlink modified 9 months ago by James W. MacDonald51k • written 9 months ago by Morgan Howard0 Answer: Matching GenBank Accession Number with corresponding Biomart results 2 9 months ago by United States James W. MacDonald51k wrote: When you query the Biomart server, you are actually querying the underlying database, which returns data in random order. So if you want to match up to your existing data you need to include the accession number as part of your query, in which case you can then use match to re-order to match up with your data. I don't see how your query could be working for you, however, as you are not passing anything for the filters argument. I don't actuallly know what you would use for GenBank IDs on the Biomart server, so can't test. I would also caution you against using an EBI database to map NCBI IDs - there are any number of discrepancies between the two annotation services that will affect your results. ADD COMMENTlink written 9 months ago by James W. MacDonald51k So if I am understanding you correctly, one of the attributes would be need to the accession number which the ensembl biomart database does not support. If you take the first result from my results file and enter it into both EMBL-EBI and the NCBI databases, the information is identical. I assume there is a potential for discrepancies but I've yet to encounter one. Here is what my output is looking like from the mart: ADD REPLYlink modified 9 months ago • written 9 months ago by Morgan Howard0 2 You are doing something that I have not actually ever done, so I missed where you have gone wrong. By omitting the filters argument you are asking the Biomart to give you everything for the four attributes you list. This isn't helpful because you aren't actually mapping the GenBank IDs to anything! Instead you are just telling the Biomart server to give you a bunch of stuff. So far as I can tell, EBI doesn't have any GenBank IDs in their databases, which sort of makes sense. GenBank is where most submissions to NCBI go first, and is sort of a dumping ground for things that may or may not end up being accepted as a real thing. And EBI and NCBI have a hard enough time agreeing on what is and isn't a gene, and where it might be and how many transcripts it might have, that adding in a bunch of speculative content is probably a fool's errand. If you want to map GenBank IDs to Gene IDs, then you are better off using NCBI data rather than using EBI as an intermediate step. Something like library(org.Hs.eg.db) mapped <- select(org.Hs.eg.db, as.character(ResultsFile$sacc), c("ENTREZID","SYMBOL","GENENAME"), "ACCNUM")


But do note that you are trying to map from really speculative content to more accepted content, so you should expect lots of missing data.

ADD REPLYlink written 9 months ago by James W. MacDonald51k

Thank you immensely, this is a perfect solution to my issue. I am not expecting all accessions to have a hit. This option also is able to be widened to other species with the other databases available.