Question

Scanf error in biomaRt when field is missing

grealesm • 0

@grealesm-20515

Last seen 6.6 years ago

Dear all,

I've recently run into a problem when querying Ensembl via biomaRt. Put simply, if I ask for fields that include missing values (see example), the server returns a table whose rows have differing numbers of columns, and R throws a scan() error.

In my example, let's say I try to retrieve peptide sequences and IDs for HTR3A (ENSG00000166736). This gene has several transcripts in Ensembl, one of which has no associated protein (see https://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000166736;r=11:113974881-113990313).

If I run the following code, it returns the desired result, with a cell containing "Sequence unavailable", as expected.

 library(biomaRt)
 mart <- useMart("ENSEMBL_MART_ENSEMBL")
 mart <- useDataset("hsapiens_gene_ensembl", mart = mart)
 CurlHandle <- RCurl::getCurlHandle()
 BM <- biomaRt::getBM(c("ensembl_gene_id", "ensembl_transcript_id", "coding"), filters = "ensembl_gene_id" , values = "ENSG00000166736", mart = mart, curl = CurlHandle)

However, if I run the following I get an error, probably because the protein ID is missing for that transcript.

> BM <- biomaRt::getBM(c("ensembl_gene_id", "ensembl_peptide_id", "peptide"), filters = "ensembl_gene_id" , values = "ENSG00000166736", mart = mart, curl = CurlHandle)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : line 3 did not have 3 elements

I tried several hosts, as suggested in other related questions, and tried setting quote = TRUE among other things, to no avail. I can imagine what is happening under the hood (i.e. R not liking the server sending back a table with missing elements), but I can't figure out a fix for this. I'd appreciate some help! :)

My sessionInfo()

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
 [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.8.0.1  biomaRt_2.38.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1           compiler_3.5.3       pillar_1.3.1         prettyunits_1.0.2   
 [5] bitops_1.0-6         tools_3.5.3          progress_1.2.0       digest_0.6.18       
 [9] bit_1.1-14           RSQLite_2.1.1        memoise_1.1.0        tibble_2.1.1        
[13] pkgconfig_2.0.2      rlang_0.3.4          DBI_1.0.0            rstudioapi_0.10     
[17] curl_3.3             parallel_3.5.3       fuzzyjoin_0.1.4      xml2_1.2.0          
[21] stringr_1.4.0        httr_1.4.0           S4Vectors_0.20.1     IRanges_2.16.0      
[25] hms_0.4.2            stats4_3.5.3         bit64_0.9-7          tidyselect_0.2.5    
[29] data.table_1.12.2    glue_1.3.1           Biobase_2.42.0       R6_2.4.0            
[33] AnnotationDbi_1.44.0 XML_3.98-1.19        tidyr_0.8.3          selectr_0.4-1       
[37] purrr_0.3.2          blob_1.1.1           magrittr_1.5         BiocGenerics_0.28.0 
[41] rvest_0.3.3          assertthat_0.2.1     stringi_1.4.3        RCurl_1.95-4.12     
[45] crayon_1.3.4

Cheers!

software error biomaRt • 1.0k views

ADD COMMENT • link updated 6.6 years ago by Mike Smith ★ 6.6k • written 6.6 years ago by grealesm • 0

score 2 · Answer 1 · 2019-04-14

Mike Smith ★ 6.6k

@mike-smith

Last seen 10 weeks ago

EMBL Heidelberg

Short-term hacky fix: try swapping the order of the attributes you want to return, e.g.

BM <- biomaRt::getBM(c("ensembl_peptide_id", "ensembl_gene_id", "peptide"), 
                     filters = "ensembl_gene_id" , 
                     values = "ENSG00000166736", 
                     mart = mart, curl = CurlHandle)

The server-side code that returns a TSV file will insert an empty cell for a missing value unless it's in the last column, in which case you just get a new line and the error you're seeing. I'll have a look at whether there's a more robust reading function than read.table(), but this should work since there will always be a gene ID for every record.
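To illustrate the failure mode described above, here is a toy reproduction. It is only a sketch: the text and identifiers are invented, it does not contact Ensembl, and it is not biomaRt's own parsing code. It shows read.table() stopping on a row that is one field short, and the base R `fill = TRUE` argument padding that row instead.

```r
# Toy tab-separated text (invented IDs): the second record lacks its final
# field and has no trailing tab, like a row whose last attribute is empty.
tsv <- paste(
  "ENSG_TOY\tENSP_TOY1\tMKT",
  "ENSG_TOY\tENSP_TOY2",
  sep = "\n"
)

# Strict parsing: the rows have different field counts, so scan() stops.
# tryCatch() captures the error message instead of aborting the script.
strict <- tryCatch(
  read.table(text = tsv, sep = "\t", header = FALSE,
             stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
print(strict)

# fill = TRUE pads short rows with blank fields instead of stopping.
padded <- read.table(text = tsv, sep = "\t", header = FALSE,
                     fill = TRUE, stringsAsFactors = FALSE)
print(padded)
```

Whether `fill = TRUE` would be a safe general fix inside biomaRt is not something this thread establishes; reordering the attributes as shown above is the workaround suggested here.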

ADD COMMENT • link 6.6 years ago Mike Smith ★ 6.6k

It worked! However, I then tried to include both "coding" and "peptide" sequences in one go, and got the error again (I tried several combinations of attribute order, and none of them worked). Requesting coding and peptide in separate calls works, but requesting them together does not. Is there any limitation on querying both coding and peptide sequences at the same time?

Thanks!

ADD REPLY • link 6.6 years ago grealesm • 0

Ensembl won't let you retrieve more than one type of sequence at a time. If you try the web interface, the different sequence types are selected with radio buttons, so only one selection is allowed. As far as I'm aware there's no way for the biomaRt package to know which attributes are mutually exclusive, so it can't warn you about this.

My recommendation would be to run two separate queries as you've done and then merge the results based on the protein ID.
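A minimal sketch of that merge step, using invented stand-in data frames rather than real getBM() output (the IDs and sequences below are hypothetical, and it assumes both queries also requested "ensembl_peptide_id"):

```r
# Hypothetical stand-ins for the results of two separate getBM() calls;
# the IDs and sequences are invented for illustration only.
coding <- data.frame(
  ensembl_peptide_id = c("ENSP_TOY1", "ENSP_TOY2"),
  coding = c("ATGAAA", "ATGCCC"),
  stringsAsFactors = FALSE
)
peptide <- data.frame(
  ensembl_peptide_id = c("ENSP_TOY1", "ENSP_TOY2"),
  peptide = c("MK", "MP"),
  stringsAsFactors = FALSE
)

# Join the two tables on the shared protein ID column.
both <- merge(coding, peptide, by = "ensembl_peptide_id")
print(both)
```

Rows with an empty protein ID, such as a transcript with no protein, may need to be dropped before merging, or the join could be done on the transcript ID instead if both queries return it.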

ADD REPLY • link 6.6 years ago Mike Smith ★ 6.6k

That was my suspicion. I'll do it that way then.

Thanks so much for your help! Cheers,

ADD REPLY • link 6.6 years ago grealesm • 0