Question

How to find rsID with biomaRt in R

1

Entering edit mode

Federica ▴ 10

@federica-24874

Last seen 3.2 years ago

Italy

Hi,

I've to find rsID from chromosome positions and chromosome names of several SNPs. I've found biomaRt package in R to solve my problem.

I've the following .txt file:

CHR chr_start   chr_end
  8 101592213 101592213
  8 106973048 106973048
  8 108690829 108690829
  8 102569817 102569817
  8 108580746 108580746

I've loaded the biomaRt package and then I've created the Mart object with the following command:

snp_mart <- useMart(biomart="ENSEMBL_MART_SNP", host="grch37.ensembl.org", 
+                     path="/biomart/martservice", dataset="hsapiens_snp")

Then, I created the following function to get rsIDs, alleles, and chromosome positions:

> rescue_rsid <- function(db) {
+   
+   temp <- getBM(attributes = c('refsnp_id', 'allele', 'chrom_start'), 
+                 filters = c('chr_name', 'start', 'end'), 
+                 values = list(db[,which(colnames(db) == "CHR")], db[,which(colnames(db) == "chr_start")], db[,which(colnames(db) == "chr_end")]), 
+                 mart = snp_mart)
+   return(temp) 
+ }

Then, I want to apply the previous function to all my SNPs with the following command:

apply(ds4, 1, rescue_rsid)

where ds4 is the dataframe with my filters, but I found the following results in Rconsole:

> rescue_rsid(ds4)
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached: [grch37.ensembl.org:80] Operation timed out after 300012 milliseconds with 12109 bytes received
Called from: curl::curl_fetch_memory(url, handle = handle)
Browse[1]> rescue_rsid(ds4[1,])
   refsnp_id allele chrom_start
1 rs62513865    C/T   101592213
Browse[1]>

However, when I run the following command:

> rescue_rsid(ds4[1,])

Rconsole reported the following result:

  refsnp_id allele chrom_start
1 rs62513865    C/T   101592213

So, this would mean that the function searches rsID but I don't understand why it doesn't work for all my SNPs...

Could someone help me? Thank u!

My R session:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252    LC_MONETARY=Italian_Italy.1252
[4] LC_NUMERIC=C                   LC_TIME=Italian_Italy.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.44.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6           pillar_1.4.7         compiler_4.0.3       dbplyr_2.1.0         prettyunits_1.1.1   
 [6] tools_4.0.3          progress_1.2.2       bit_4.0.4            tibble_3.0.6         RSQLite_2.2.3       
[11] memoise_2.0.0        BiocFileCache_1.12.1 lifecycle_1.0.0      pkgconfig_2.0.3      rlang_0.4.10        
[16] DBI_1.1.1            curl_4.3             parallel_4.0.3       fastmap_1.1.0        withr_2.4.1         
[21] stringr_1.4.0        httr_1.4.2           dplyr_1.0.4          xml2_1.3.2           rappdirs_0.3.3      
[26] generics_0.1.0       S4Vectors_0.26.1     vctrs_0.3.6          askpass_1.1          IRanges_2.22.2      
[31] hms_1.0.0            tidyselect_1.1.0     stats4_4.0.3         bit64_4.0.5          glue_1.4.2          
[36] Biobase_2.48.0       R6_2.5.0             AnnotationDbi_1.50.3 XML_3.99-0.5         purrr_0.3.4         
[41] blob_1.2.1           magrittr_2.0.1       ellipsis_0.3.1       BiocGenerics_0.34.0  assertthat_0.2.1    
[46] stringi_1.5.3        openssl_1.4.3        cachem_1.0.3         crayon_1.4.1

biomaRt • 10k views

ADD COMMENT • link updated 5 months ago by James W. MacDonald 66k • written 3.4 years ago by Federica ▴ 10

0

Entering edit mode

James W. MacDonald 66k

@james-w-macdonald-5106

Last seen 4 hours ago

United States

Does this help explain things?

> library(GenomicRanges)
> library(biomaRt)
Warning message:
package 'biomaRt' was built under R version 4.3.2 
> snp_mart <- useEnsembl(biomart="ENSEMBL_MART_SNP", 
                       host="grch37.ensembl.org", 
                       dataset="hsapiens_snp")
## scan in your positions
> vals <- scan("clipboard","c")
## biomaRt call
> z <- getBM(attributes = c('refsnp_id', 'allele', 'chrom_start','chrom_end','chr_name'), 
      filters = 'chromosomal_region', 
      values = vals, mart = snp_mart)
## convert to GRanges objects
> valgr <- GRanges(sapply(strsplit(vals, ":"), function(x) paste0(x[1], ":", x[2], "-", x[3])))
> valgr
GRanges object with 20 ranges and 0 metadata columns:
       seqnames        ranges strand
          <Rle>     <IRanges>  <Rle>
   [1]        1        738223      *
   [2]        1        768448      *
   [3]        1        777232      *
   [4]        1        791853      *
   [5]        1        794332      *
   ...      ...           ...    ...
  [16]        1        805556      *
  [17]        1        809876      *
  [18]        1 821325-821329      *
  [19]        1        824398      *
  [20]        1        828166      *

> resultgr <- GRanges(z[,5], IRanges(z[,3],z[,4]), rsid = z[,1], allele = z[,2])
> resultgr
GRanges object with 31 ranges and 2 metadata columns:
       seqnames        ranges strand
          <Rle>     <IRanges>  <Rle>
   [1]        1        738223      *
   [2]        1        768448      *
   [3]        1        777232      *
   [4]        1        791853      *
   [5]        1        794332      *
   ...      ...           ...    ...
  [27]        1 821326-821332      *
  [28]        1        821328      *
  [29]        1        821329      *
  [30]        1        824398      *
  [31]        1        828166      *
       |        rsid
       | <character>
   [1] | rs138388092
   [2] |  rs12562034
   [3] | rs112618790
   [4] |   rs6684487
   [5] |  rs12127425
   ... .         ...
  [27] |  rs78941097
  [28] | rs796562407
  [29] | rs202018963
  [30] |   rs7538305
  [31] |  rs13303117
                       allele
                  <character>
   [1]                    T/C
   [2]                    G/A
   [3]                    C/T
   [4]                  G/A/T
   [5]                    G/A
   ...                    ...
  [27] CAGTCAG/CAG/CAGTCAGT..
  [28]                  G/C/T
  [29]                  T/A/G
  [30]                  A/C/T
  [31]                  A/G/T
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Now get overlaps
> fo <- findOverlaps(valgr, resultgr)
> fo
Hits object with 31 hits and 0 metadata columns:
       queryHits subjectHits
       <integer>   <integer>
   [1]         1           1
   [2]         2           2
   [3]         3           3
   [4]         4           4
   [5]         5           5
   ...       ...         ...
  [27]        18          27
  [28]        18          28
  [29]        18          29
  [30]        19          30
  [31]        20          31
  -------
  queryLength: 20 / subjectLength: 31

## why were duplicates returned?
> rsltgrlst <- splitAsList(resultgr, queryHits(fo))
> rsltgrlst[lengths(rsltgrlst) > 1L]
GRangesList object of length 5:
$`11`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 801676-801680      *
  [2]        1        801680      *
      |         rsid      allele
      |  <character> <character>
  [1] | rs1639792397    CACCA/CA
  [2] |   rs12134490         A/C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$`12`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 801853-801905      *
  [2]        1        801858      *
      |         rsid
      |  <character>
  [1] | rs1639795103
  [2] |   rs17276806
                      allele
                 <character>
  [1] TCCTCCCTTTGCACTCGAAT..
  [2]                    C/T
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$`15`
GRanges object with 5 ranges and 2 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 804974-804980      *
  [2]        1 804977-804978      *
  [3]        1        804978      *
  [4]        1        804979      *
  [5]        1 804979-804989      *
      |         rsid
      |  <character>
  [1] | rs1639841302
  [2] | rs1190791124
  [3] | rs1639841366
  [4] |  rs114479279
  [5] |  rs202146812
                      allele
                 <character>
  [1]            TAATTAA/TAA
  [2]                 TT/TTT
  [3]                    T/A
  [4]                    A/T
  [5] AAAAAAAAAAA/AAAAAA/A..
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$`16`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 805548-805576      *
  [2]        1        805556      *
      |         rsid
      |  <character>
  [1] | rs1639852315
  [2] |   rs72631880
                      allele
                 <character>
  [1] GGGCCCCGTGATTATATTTG..
  [2]                  T/A/C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$`18`
GRanges object with 5 ranges and 2 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 821324-821325      *
  [2]        1 821325-821328      *
  [3]        1 821326-821332      *
  [4]        1        821328      *
  [5]        1        821329      *
      |         rsid
      |  <character>
  [1] | rs1640080539
  [2] | rs1640080568
  [3] |   rs78941097
  [4] |  rs796562407
  [5] |  rs202018963
                      allele
                 <character>
  [1]                   AA/A
  [2]                 ACAG/-
  [3] CAGTCAG/CAG/CAGTCAGT..
  [4]                  G/C/T
  [5]                  T/A/G
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

ADD COMMENT • link 5 months ago James W. MacDonald 66k

James W. MacDonald · Accepted Answer · 2021-03-05

Thanks for the nicely structured question! There's a few different things to address here, which are intertwined, but hopefully we can get your query working. Lets start with the error:

Operation timed out after 300012 milliseconds

This happens because there's a 5 minute time limit on queries to the Ensembl BioMart. You can see it mentioned in the disclaimer at the bottom of the BioMart website, and it's enforced inside the biomaRt code. So seeing this error doesn't necessarily mean your code is wrong; it might just be slow. Equally it could be a busy time and server doesn't respond quickly. Since you're submitting lots of queries inside the apply() it may only be one that's failing and stopping the whole thing.

One thing to note about biomaRt is that it will accept vectors to the values argument, so you don't need to loop (or apply) over all you chromosomal positions. You can do a single call to getBM() and pass the three data.frame columns to the values list.

If you're not using the apply() but actually doing rescue_rsid(ds4) then I think you're already taking advantage of this vectorisation - but that might actually be the cause of the timeout. Generally one large query is faster than lots of small queries, but if the server is being slow it might be hitting the 5 minute limit.

However, for this specific type of queries there's a slightly simpler way if you know there's a filter called chromosomal_position. This takes a colon separated character resenting a region of the form chr:start:end:strand. (You can omit strand here).

I think this should be slightly faster than providing the three arguments separately. Here's an example with your data:

library(biomaRt)

## create example data
ds4 <- data.frame(CHR = c("8", "8", "8", "8", "8"),
                 chr_start = c('101592213', '106973048', '108690829', '102569817', '108580746'),
                 chr_end = c('101592213', '106973048', '108690829', '102569817', '108580746'))

snp_mart <- useEnsembl(biomart="ENSEMBL_MART_SNP", 
                       host="grch37.ensembl.org", 
                       dataset="hsapiens_snp")

## combine the positions in to a single vector
position <- apply(ds4, 1, paste, collapse = ":")
position
#> [1] "8:101592213:101592213" "8:106973048:106973048" "8:108690829:108690829"
#> [4] "8:102569817:102569817" "8:108580746:108580746"

getBM(attributes = c('refsnp_id', 'allele', 'chrom_start'), 
      filters = 'chromosomal_region', 
      values = position, 
      mart = snp_mart)
#>     refsnp_id  allele chrom_start
#> 1  rs62513865     C/T   101592213
#> 2   rs6994300     G/A   102569817
#> 3  rs79643588     G/A   106973048
#> 4 rs138449472 G/A/C/T   108580746
#> 5  rs17396518   T/C/G   108690829

It's important to note that when you submit a vector of values to biomaRt the order they are returned in is not guaranteed to be the same as you submitted them, and any value that didn't get a hit in their database will be missing from the results. So while this is a much more efficient way of querying, you need to do some sanity checking on the results if order or completeness is import to whatever you're doing next.