How to solve operation timed out with biomaRt
1
0
Entering edit mode
Federica ▴ 10
@federica-24874
Last seen 3.5 years ago
Italy

Hi,

I have to find the rsIDs of hundrends of thousands of SNPs. Since I am confident with R, I decided to test biomaRt package.

It helped me to find rsIDs of hundreds of SNPs using R locally. I had the following .txt file:

CHR chr_start   chr_end
     8 101592213 101592213
     8 106973048 106973048
     8 108690829 108690829
     8 102569817 102569817
     8 108580746 108580746
     8 108681675 108681675
     8 103044620 103044620
     8 104152280 104152280

Then, I run the following commands:

library(biomaRt)


#creating Mart object
snp_mart <- useMart(biomart="ENSEMBL_MART_SNP", host="grch37.ensembl.org", 
                    path="/biomart/martservice", dataset="hsapiens_snp")

## combine the positions into a single vector

ds4$position <- apply(ds4, 1, paste, collapse = ":")

I run the previous command to obtain the following output:

"8:101592213:101592213" "8:106973048:106973048" "8:108690829:108690829" "8:102569817:102569817"
 "8:108580746:108580746" "8:108681675:108681675" "8:103044620:103044620" "8:104152280:104152280"

In this way, I could use chromosomal_region as filter in the getBM() function. Then, I run the following commands:

 rescue_rsid <- function(db) {

          temp <- getBM(attributes = c('refsnp_id', 'allele', 'chrom_start'), 
                        filters = 'chromosomal_region', 
                        values = db$position, 
                        mart = snp_mart)
          return(temp) 
        }


        temp <- rescue_rsid(ds4)

I obtained rsIDs of these SNPs with my R session (sessionInfo() is the following):

R Under development (unstable) (2021-02-23 r80032)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252    LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C                   LC_TIME=Italian_Italy.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.47.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6           pillar_1.5.0         dbplyr_2.1.0         compiler_4.1.0       XVector_0.31.1       prettyunits_1.1.1    tools_4.1.0         
 [8] zlibbioc_1.37.0      progress_1.2.2       bit_4.0.4            tibble_3.0.6         BiocFileCache_1.15.1 RSQLite_2.2.3        memoise_2.0.0       
[15] lifecycle_1.0.0      pkgconfig_2.0.3      png_0.1-7            rlang_0.4.10         DBI_1.1.1            rstudioapi_0.13      filelock_1.0.2      
[22] curl_4.3             parallel_4.1.0       fastmap_1.1.0        withr_2.4.1          dplyr_1.0.4          httr_1.4.2           stringr_1.4.0       
[29] xml2_1.3.2           rappdirs_0.3.3       generics_0.1.0       askpass_1.1          Biostrings_2.59.2    S4Vectors_0.29.7     vctrs_0.3.6         
[36] IRanges_2.25.6       hms_1.0.0            tidyselect_1.1.0     stats4_4.1.0         bit64_4.0.5          glue_1.4.2           Biobase_2.51.0      
[43] R6_2.5.0             fansi_0.4.2          AnnotationDbi_1.53.1 XML_3.99-0.5         purrr_0.3.4          blob_1.2.1           magrittr_2.0.1      
[50] ellipsis_0.3.1       BiocGenerics_0.37.1  assertthat_0.2.1     KEGGREST_1.31.1      utf8_1.1.4           stringi_1.5.3        openssl_1.4.3       
[57] cachem_1.0.4         crayon_1.4.1   

However, when I try to run the same script on a server to handle much more SNPs, the output on my Putty shell is the following:

Error in curl::curl_fetch_memory(url, handle = handle) :
  Timeout was reached: [grch37.ensembl.org:80] Operation timed out after 300001 milliseconds with 0 bytes received

My uname -a output is:

Linux platonesrv1 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

How could I solve the question of timed out with biomaRt?

Thank you!

biomaRt • 4.7k views
ADD COMMENT
0
Entering edit mode
Mike Smith ★ 6.6k
@mike-smith
Last seen 3 hours ago
EMBL Heidelberg

If you're planning to query hundreds of thousands of locations then I don't think biomaRt is the right tool to use. The Ensembl BioMart server isn't designed to handle really large queries. As I mentioned in https://support.bioconductor.org/p/9135301/ there's a 5 minute time limit, they recommend a maximum of 500 values in any query. biomaRt tries to help with this by automatically splitting large queries into smaller batches, and submitting them for you, but it will still take a really long time if you still have thousands of batches. The latency sending things over the internet, plus the fragility of a single batch failing will make this super frustrating - I guess you've seen that already!

I'd recommend trying to do this with genome and SNP data that you've download to your local machine. This take a bit longer to start with, as they're large download, but each query will then be much faster.

Here's an example that tries to replicate your biomaRt query, but using the GRCh37 genome and dbSNP data from Bioconductor. You'll have to install those two packages.

## example data.frame
ds4 <- data.frame(CHR = c("8", "8", "8", "8", "8"),
                  chr_start = c('101592213', '106973048', '108690829', '102569817', '108580746'),
                  chr_end = c('101592213', '106973048', '108690829', '102569817', '108580746'))

## load Bioconductor genome and SNP annotation for GRCh37
library(BSgenome.Hsapiens.UCSC.hg19)
library(SNPlocs.Hsapiens.dbSNP144.GRCh37)

## this step is optional!
## here we just simplify the names of the objects, making the code neater
genome <- BSgenome.Hsapiens.UCSC.hg19
all_snps <- SNPlocs.Hsapiens.dbSNP144.GRCh37

## By default the genome we're using follows the UCSC convention for
## naming chromosome e.g. "chr8".  This step changes that to match our
## SNP data which uses NCBI naming e.g. "8"
seqlevelsStyle(genome) <- "NCBI"

## construct a GPos object containing all the positions we're interested in
positions <- GPos(seqnames = ds4$CHR, pos = ds4$chr_start)

## query the genome with out positions
my_snps <- snpsByOverlaps(all_snps, positions, genome = genome)

## this gives us a GPos object
my_snps
#> UnstitchedGPos object with 5 positions and 5 metadata columns:
#>       seqnames       pos strand |   RefSNP_id alleles_as_ambig genome_compat  ref_allele     alt_alleles
#>          <Rle> <integer>  <Rle> | <character>      <character>     <logical> <character> <CharacterList>
#>   [1]        8 101592213      * |  rs62513865                Y          TRUE           C               T
#>   [2]        8 102569817      * |   rs6994300                R          TRUE           G               A
#>   [3]        8 106973048      * |  rs79643588                R          TRUE           G               A
#>   [4]        8 108580746      * | rs138449472                R          TRUE           G               A
#>   [5]        8 108690829      * |  rs17396518                K          TRUE           T               G
#>   -------
#>   seqinfo: 25 sequences (1 circular) from GRCh37.p13 genome

## we can construct something similar to the biomaRt output if that's more useful
data.frame(chr_pos = start(my_snps), mcols(my_snps)[,c("RefSNP_id", "ref_allele", "alt_alleles")])
#>     chr_pos   RefSNP_id ref_allele alt_alleles
#> 1 101592213  rs62513865          C           T
#> 2 102569817   rs6994300          G           A
#> 3 106973048  rs79643588          G           A
#> 4 108580746 rs138449472          G           A
#> 5 108690829  rs17396518          T           G

This is not identical to the biomaRt output - we're missing the multiple alternative alleles. If that's a problem you'll need to look into why the discrepancy is there, I'm afraid I don't know the reason.

getBM(attributes = c('refsnp_id', 'allele', 'chrom_start'), 
      filters = 'chromosomal_region', 
      values = position, 
      mart = snp_mart)

#>     refsnp_id  allele chrom_start
#> 1  rs62513865     C/T   101592213
#> 2   rs6994300     G/A   102569817
#> 3  rs79643588     G/A   106973048
#> 4 rs138449472 G/A/C/T   108580746
#> 5  rs17396518   T/C/G   108690829
ADD COMMENT
0
Entering edit mode

Hi, Mike! I came across a mistake when I tried to replicate your example. I was wondering if you like to take a look at it. I followed your instructions and established the example data.frame, loaded the libraries, and set "genome" and "all_snps" correctly. The problem occured when I used

seqlevelsStyle(genome) <- "NCBI"

which returned

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 1 did not have 3 elements

I had no idea what "line 1" referred to as well as how to fix this problem. I'd appreciate it if you could see to this problem. Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 772 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6