Hello,
I am taking a large list of accession numbers from NCBI’s EST and GenBank databases and using biomart R package to query the the Ensembl mart and the human gene Ensembl dataset in order to retrieve the attributes entrezgenetransname, entrezgene, description, and transcript_biotype. This all works as desired except for I am unable to tell what results go with which accession numbers. Surely there must be a way to have the NCBI GenBank accession numbers queried against in a column next to its associated result. I’ve searched though the user manual and many forums and I have been unsuccessful in finding a solution. Any help would be very much appreciated! Thank you
Below I've posted the code I am running as well as a dump of my sessionInfo
library(biomaRt)
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
attributesList = listAttributes(ensembl)
attributes <- c('entrezgene_trans_name','entrezgene','description','transcript_biotype')
information <- getBM(attributes=c('entrezgene_trans_name','entrezgene','description','transcript_biotype'),
values = ResultsFile$sacc,
mart = ensembl)
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.38.0 plotly_4.8.0 ggplot2_3.1.0 bindrcpp_0.2.2
[5] dplyr_0.7.8 plyr_1.8.4 stringr_1.3.1 annotate_1.60.0
[9] XML_3.98-1.16 AnnotationDbi_1.44.0 IRanges_2.16.0 S4Vectors_0.20.1
[13] Biobase_2.42.0 BiocGenerics_0.28.0 data.table_1.12.0 BiocManager_1.30.4
loaded via a namespace (and not attached):
[1] progress_1.2.0 tidyselect_0.2.5 purrr_0.2.5 colorspace_1.4-0 htmltools_0.3.6
[6] viridisLite_0.3.0 yaml_2.2.0 blob_1.1.1 rlang_0.3.1 later_0.7.5
[11] pillar_1.3.1 glue_1.3.0 withr_2.1.2 DBI_1.0.0 bit64_0.9-7
[16] bindr_0.1.1 munsell_0.5.0 gtable_0.2.0 htmlwidgets_1.3 memoise_1.1.0
[21] httpuv_1.4.5.1 crosstalk_1.0.0 curl_3.3 Rcpp_1.0.0 xtable_1.8-3
[26] promises_1.0.1 scales_1.0.0 jsonlite_1.6 mime_0.6 bit_1.1-14
[31] hms_0.4.2 digest_0.6.18 stringi_1.2.4 shiny_1.2.0 grid_3.5.1
[36] tools_3.5.1 bitops_1.0-6 magrittr_1.5 lazyeval_0.2.1 RCurl_1.95-4.11
[41] tibble_2.0.1 RSQLite_2.1.1 crayon_1.3.4 tidyr_0.8.2 pkgconfig_2.0.2
[46] prettyunits_1.0.2 assertthat_0.2.0 httr_1.4.0 rstudioapi_0.9.0 R6_2.3.0
[51] compiler_3.5.1
I have also included the structure of my ResultsFile dataTable
qseqid sacc length slen pident nident qstart qend evalue
1: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA010948 251 294 86.853 218 290 520 4.12e-66
2: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA186504 351 359 85.185 299 195 516 1.11e-86
3: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA187225 420 410 86.905 365 59 455 2.28e-123
4: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA284245 405 418 86.914 352 141 516 2.99e-112
5: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA302401 289 307 86.505 250 202 473 8.74e-78
6: 022f7e4d-c2b0-445d-92be-1d9c751edb51 AA306255 342 410 87.135 298 59 382 3.95e-96
So if I am understanding you correctly, one of the attributes would be need to the accession number which the ensembl biomart database does not support. If you take the first result from my results file and enter it into both EMBL-EBI and the NCBI databases, the information is identical. I assume there is a potential for discrepancies but I've yet to encounter one. Here is what my output is looking like from the mart:
You are doing something that I have not actually ever done, so I missed where you have gone wrong. By omitting the filters argument you are asking the Biomart to give you everything for the four attributes you list. This isn't helpful because you aren't actually mapping the GenBank IDs to anything! Instead you are just telling the Biomart server to give you a bunch of stuff.
So far as I can tell, EBI doesn't have any GenBank IDs in their databases, which sort of makes sense. GenBank is where most submissions to NCBI go first, and is sort of a dumping ground for things that may or may not end up being accepted as a real thing. And EBI and NCBI have a hard enough time agreeing on what is and isn't a gene, and where it might be and how many transcripts it might have, that adding in a bunch of speculative content is probably a fool's errand.
If you want to map GenBank IDs to Gene IDs, then you are better off using NCBI data rather than using EBI as an intermediate step. Something like
But do note that you are trying to map from really speculative content to more accepted content, so you should expect lots of missing data.
Thank you immensely, this is a perfect solution to my issue. I am not expecting all accessions to have a hit. This option also is able to be widened to other species with the other databases available.