Hello,
I've been trying to use biomaRt to query Ensembl's BioMart and extract genomic and functional information for a list of SNPs. Here is an example of the kind of query I'm running, along with the results.
library(biomaRt)
library(dplyr)  # provides %>%, arrange() and relocate()
snpMart <- useMart("ENSEMBL_MART_SNP", dataset = "hsapiens_snp")
snpList <- c("rs7349186", "rs3927683", "rs1697421", "rs12041233", "rs112751018", "rs2819336")
snp_annot <- getBM(attributes = c("refsnp_id", "consequence_type_tv", "chr_name", "chrom_start", "chrom_end"),
                   filters = "snp_filter",
                   values = snpList,
                   mart = snpMart) %>%
  arrange(chr_name, chrom_start) %>%
  relocate(consequence_type_tv, .before = chr_name)
snp_annot
    refsnp_id consequence_type_tv chr_name chrom_start chrom_end
1   rs7349186    missense_variant        1    20644627  20644627
2   rs3927683                            1    20796024  20796024
3   rs1697421                            1    21496799  21496799
4  rs12041233                            1    37287106  37287106
5 rs112751018                            1    39622232  39622232
6   rs2819336      intron_variant        1    43550138  43550138
The issue I'm having is with the consequence_type_tv attribute in the query: many SNPs come back with no consequence_type_tv at all. I assumed this was because they are intergenic variants, which is true of many variants. So my first issue is that these variants are not labelled as "intergenic variants".
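To make the gap concrete, here is a minimal sketch of how the unannotated SNPs can be flagged. It rebuilds the result above by hand instead of querying the server, and it assumes the blanks are empty strings rather than NA, which is what the printout suggests. The name unannotated is only for illustration.

```r
# Rebuild the query result by hand (values copied from the printout above),
# so this runs without a connection to the BioMart server.
snp_annot <- data.frame(
  refsnp_id = c("rs7349186", "rs3927683", "rs1697421",
                "rs12041233", "rs112751018", "rs2819336"),
  consequence_type_tv = c("missense_variant", "", "", "", "", "intron_variant"),
  chr_name = "1",
  chrom_start = c(20644627, 20796024, 21496799, 37287106, 39622232, 43550138),
  stringsAsFactors = FALSE
)

# Assumption: a missing consequence is returned as an empty string.
# Convert those to NA so they are easy to count and filter.
snp_annot$consequence_type_tv[snp_annot$consequence_type_tv == ""] <- NA

# IDs of the SNPs that came back with no consequence at all.
unannotated <- snp_annot$refsnp_id[is.na(snp_annot$consequence_type_tv)]
unannotated
```

These are the IDs worth checking by hand against the variant pages.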
However, when checking a few variants manually on the Ensembl website, I noticed that many of the ones biomaRt leaves unannotated do have consequences other than intergenic. Take rs12041233, for example: my biomaRt query returns no functional consequence for it, whereas the Ensembl website reports it as an intronic variant (in an Ensembl lncRNA).
I also found some inconsistencies between the variant page in Ensembl and the information retrieved through biomaRt. In particular, the allele frequencies reported for minor alleles differ between the two sources, which I thought held the same data.
Am I doing something wrong? Am I missing something?
Thank you very much for your help!
sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RColorBrewer_1.1-3 pheatmap_1.0.12 qqman_0.1.8 gprofiler2_0.2.2 UpSetR_1.4.0 eulerr_7.0.0
[7] DT_0.28 gwasrapidd_0.99.15 rsnps_0.5.0.0 biomaRt_2.54.1 lubridate_1.9.2 forcats_1.0.0
[13] stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[19] ggplot2_3.4.2 tidyverse_2.0.0 data.table_1.14.8
loaded via a namespace (and not attached):
[1] bitops_1.0-7 bit64_4.0.5 filelock_1.0.2 progress_1.2.2 httr_1.4.6
[6] GenomeInfoDb_1.34.9 tools_4.2.2 bslib_0.5.0 utf8_1.2.3 R6_2.5.1
[11] lazyeval_0.2.2 DBI_1.1.3 BiocGenerics_0.44.0 colorspace_2.1-0 withr_2.5.0
[16] tidyselect_1.2.0 gridExtra_2.3 prettyunits_1.1.1 bit_4.0.5 curl_5.0.1
[21] compiler_4.2.2 cli_3.6.1 Biobase_2.58.0 xml2_1.3.5 plotly_4.10.2
[26] labeling_0.4.2 triebeard_0.4.1 sass_0.4.7 scales_1.2.1 rappdirs_0.3.3
[31] digest_0.6.33 rmarkdown_2.23 XVector_0.38.0 pkgconfig_2.0.3 htmltools_0.5.5
[36] dbplyr_2.3.3 fastmap_1.1.1 htmlwidgets_1.6.2 rlang_1.1.1 rstudioapi_0.15.0
[41] httpcode_0.3.0 RSQLite_2.3.1 shiny_1.7.4.1 farver_2.1.1 jquerylib_0.1.4
[46] generics_0.1.3 jsonlite_1.8.7 crosstalk_1.2.0 RCurl_1.98-1.12 magrittr_2.0.3
[51] GenomeInfoDbData_1.2.9 Rcpp_1.0.11 munsell_0.5.0 S4Vectors_0.36.2 fansi_1.0.4
[56] lifecycle_1.0.3 stringi_1.7.12 yaml_2.3.7 MASS_7.3-60 zlibbioc_1.44.0
[61] plyr_1.8.8 BiocFileCache_2.6.1 grid_4.2.2 blob_1.2.4 promises_1.2.0.1
[66] ggrepel_0.9.3 crayon_1.5.2 Biostrings_2.66.0 hms_1.1.3 KEGGREST_1.38.0
[71] polylabelr_0.2.0 knitr_1.43 pillar_1.9.0 stats4_4.2.2 crul_1.4.0
[76] XML_3.99-0.14 glue_1.6.2 evaluate_0.21 calibrate_1.7.7 httpuv_1.6.11
[81] urltools_1.7.3 png_0.1-8 vctrs_0.6.3 tzdb_0.4.0 polyclip_1.10-4
[86] gtable_0.3.3 assertthat_0.2.1 cachem_1.0.8 xfun_0.39 mime_0.12
[91] xtable_1.8-4 later_1.3.1 viridisLite_0.4.2 AnnotationDbi_1.60.2 memoise_2.0.1
[96] IRanges_2.32.0 timechange_0.2.0 ellipsis_0.3.2
Thanks, James. I think you are spot on. I do realise that the BioMart server is an entity within Ensembl, and that the queries are passed on to this server directly. My issue is that I would expect the information on the BioMart server and on the Ensembl website to be identical, and with my current query that is not the case (I've tried lots of different arguments to get the information, with no luck). That is why I posted the question here: to find out whether any users have come across this and can tell me whether my query is missing some important arguments, attributes or filters.
Regarding the folks @ Ensembl... I did contact them multiple times, but got no answer! :(