Hi,
I've to find rsID from chromosome positions and chromosome names of several SNPs. I've found biomaRt package in R to solve my problem.
I've the following .txt file:
CHR chr_start chr_end
8 101592213 101592213
8 106973048 106973048
8 108690829 108690829
8 102569817 102569817
8 108580746 108580746
I've loaded the biomaRt package and then I've created the Mart object with the following command:
snp_mart <- useMart(biomart="ENSEMBL_MART_SNP", host="grch37.ensembl.org",
+ path="/biomart/martservice", dataset="hsapiens_snp")
Then, I created the following function to get rsIDs, alleles, and chromosome positions:
> rescue_rsid <- function(db) {
+
+ temp <- getBM(attributes = c('refsnp_id', 'allele', 'chrom_start'),
+ filters = c('chr_name', 'start', 'end'),
+ values = list(db[,which(colnames(db) == "CHR")], db[,which(colnames(db) == "chr_start")], db[,which(colnames(db) == "chr_end")]),
+ mart = snp_mart)
+ return(temp)
+ }
Then, I want to apply the previous function to all my SNPs with the following command:
apply(ds4, 1, rescue_rsid)
where ds4 is the dataframe with my filters, but I found the following results in Rconsole:
> rescue_rsid(ds4)
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: [grch37.ensembl.org:80] Operation timed out after 300012 milliseconds with 12109 bytes received
Called from: curl::curl_fetch_memory(url, handle = handle)
Browse[1]> rescue_rsid(ds4[1,])
refsnp_id allele chrom_start
1 rs62513865 C/T 101592213
Browse[1]>
However, when I run the following command:
> rescue_rsid(ds4[1,])
Rconsole reported the following result:
refsnp_id allele chrom_start
1 rs62513865 C/T 101592213
So, this would mean that the function searches rsID but I don't understand why it doesn't work for all my SNPs...
Could someone help me? Thank u!
My R session:
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252 LC_MONETARY=Italian_Italy.1252
[4] LC_NUMERIC=C LC_TIME=Italian_Italy.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.44.4
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 pillar_1.4.7 compiler_4.0.3 dbplyr_2.1.0 prettyunits_1.1.1
[6] tools_4.0.3 progress_1.2.2 bit_4.0.4 tibble_3.0.6 RSQLite_2.2.3
[11] memoise_2.0.0 BiocFileCache_1.12.1 lifecycle_1.0.0 pkgconfig_2.0.3 rlang_0.4.10
[16] DBI_1.1.1 curl_4.3 parallel_4.0.3 fastmap_1.1.0 withr_2.4.1
[21] stringr_1.4.0 httr_1.4.2 dplyr_1.0.4 xml2_1.3.2 rappdirs_0.3.3
[26] generics_0.1.0 S4Vectors_0.26.1 vctrs_0.3.6 askpass_1.1 IRanges_2.22.2
[31] hms_1.0.0 tidyselect_1.1.0 stats4_4.0.3 bit64_4.0.5 glue_1.4.2
[36] Biobase_2.48.0 R6_2.5.0 AnnotationDbi_1.50.3 XML_3.99-0.5 purrr_0.3.4
[41] blob_1.2.1 magrittr_2.0.1 ellipsis_0.3.1 BiocGenerics_0.34.0 assertthat_0.2.1
[46] stringi_1.5.3 openssl_1.4.3 cachem_1.0.3 crayon_1.4.1
Hi Mike,
Thanks for a great explanation of how to do this. I am trying to get rsIDs for a list of SNPs for which I have chrom, start and end position and I used your method. I am a bit confused however, because the results I get have SNPs that weren't in my original query. I am inputting lists of 500 chromosomal regions at a time because it times out otherwise, and the results have around 900 observations each, meaning that hundreds of variants are in the results that aren't in the input.
For example, these are the first 20 variants in my ordered list of chromosomal regions:
and here are some of the results from the getBM query:
There is a variant with position 1:801853:801905 which was not in my original list. I don't want these extra variants in the results because it's meaning the query takes much longer. Do you know how to make sure that I don't get these results that I didn't request?
My getBM query was:
Thanks, Anja