Question: Scanf error in biomaRt when field is missing
9 weeks ago, grealesm wrote:

Dear all,

I've recently been having a problem when querying Ensembl via biomaRt. Put simply, if I ask for fields that include missing values (see example), the returned table has rows with differing numbers of columns, and R throws a scan() error.

In my example, say I want to retrieve peptide sequences and IDs for HTR3A (ENSG00000166736). This gene has several transcripts in Ensembl, one of which has no associated protein (see https://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000166736;r=11:113974881-113990313).

If I run the following code, it returns the desired result, with a cell containing "Sequence unavailable", as expected.

 library(biomaRt)
 mart <- useMart("ENSEMBL_MART_ENSEMBL")
 mart <- useDataset("hsapiens_gene_ensembl", mart = mart)
 CurlHandle <- RCurl::getCurlHandle()
 BM <- biomaRt::getBM(c("ensembl_gene_id", "ensembl_transcript_id", "coding"), filters = "ensembl_gene_id" , values = "ENSG00000166736", mart = mart, curl = CurlHandle)

However, if I run the following I get an error, probably due to that protein ID being missing.

> BM <- biomaRt::getBM(c("ensembl_gene_id", "ensembl_peptide_id", "peptide"), filters = "ensembl_gene_id" , values = "ENSG00000166736", mart = mart, curl = CurlHandle)
    Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : line 3 did not have 3 elements

I tried several hosts, as suggested in other related questions, and setting quote = TRUE, among other things, to no avail. I can imagine what is happening under the hood (i.e. R not liking the server sending back a table with missing elements), but I can't figure out a fix for this. I'd appreciate some help! :)

My sessionInfo()

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
 [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.8.0.1  biomaRt_2.38.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1           compiler_3.5.3       pillar_1.3.1         prettyunits_1.0.2   
 [5] bitops_1.0-6         tools_3.5.3          progress_1.2.0       digest_0.6.18       
 [9] bit_1.1-14           RSQLite_2.1.1        memoise_1.1.0        tibble_2.1.1        
[13] pkgconfig_2.0.2      rlang_0.3.4          DBI_1.0.0            rstudioapi_0.10     
[17] curl_3.3             parallel_3.5.3       fuzzyjoin_0.1.4      xml2_1.2.0          
[21] stringr_1.4.0        httr_1.4.0           S4Vectors_0.20.1     IRanges_2.16.0      
[25] hms_0.4.2            stats4_3.5.3         bit64_0.9-7          tidyselect_0.2.5    
[29] data.table_1.12.2    glue_1.3.1           Biobase_2.42.0       R6_2.4.0            
[33] AnnotationDbi_1.44.0 XML_3.98-1.19        tidyr_0.8.3          selectr_0.4-1       
[37] purrr_0.3.2          blob_1.1.1           magrittr_1.5         BiocGenerics_0.28.0 
[41] rvest_0.3.3          assertthat_0.2.1     stringi_1.4.3        RCurl_1.95-4.12     
[45] crayon_1.3.4  

Cheers!

biomart software error • 126 views
Answer: Scanf error in biomaRt when field is missing
9 weeks ago, Mike Smith (EMBL Heidelberg / de.NBI) wrote:

Short-term hacky fix: try swapping the order of the attributes you want to return, e.g.

BM <- biomaRt::getBM(c("ensembl_peptide_id", "ensembl_gene_id", "peptide"), 
                     filters = "ensembl_gene_id" , 
                     values = "ENSG00000166736", 
                     mart = mart, curl = CurlHandle)

The server-side code that returns the TSV file will insert an empty cell for a missing value unless it's in the last column, in which case you just get a new line, and hence the error you're seeing. I'll have a look at whether there's a more robust reading function than read.table(), but this should work since there will always be a gene ID for every record.
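To illustrate the failure mode described above, here is a minimal sketch that does not contact Ensembl. The TSV text and its values are made up for illustration, and `fill = TRUE` is just one possible way of reading such a file leniently in base R; it is not necessarily what biomaRt does or will do internally.

```r
# Hypothetical three-column TSV (values are placeholders, not real Ensembl output).
# The third line is missing its last field, so it only has two elements.
tsv <- paste(
  "gene1\tprot1\tseq1",
  "gene1\tprot2\tseq2",
  "gene1\tSequence unavailable",
  sep = "\n"
)

# Strict read: read.table() expects three fields on every line and fails.
strict <- tryCatch(
  read.table(text = tsv, sep = "\t", header = FALSE, stringsAsFactors = FALSE),
  error = function(e) e
)
strict_failed <- inherits(strict, "error")

# Lenient read: fill = TRUE pads short rows so all rows have equal length.
lenient <- read.table(text = tsv, sep = "\t", header = FALSE,
                      fill = TRUE, stringsAsFactors = FALSE)
```

Note that the padded row is still misaligned ("Sequence unavailable" lands in the second column), which is why putting an always-present attribute such as the gene ID last is the safer workaround.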

ADD COMMENTlink written 9 weeks ago by Mike Smith3.7k

It worked! However, I then tried to include both "coding" and "peptide" sequences in one go, and got the error again (I tried several different attribute orders, and none of them worked). Requesting coding and peptide in separate calls works fine, but not when they are included together. Is there any limitation on querying both coding and peptide sequences at the same time?

Thanks!

Reply written 9 weeks ago by grealesm

Ensembl won't let you retrieve more than one type of sequence at a time. If you try using the web interface, the different sequence types are selected with a radio button where only one selection is allowed. As far as I'm aware there's no way for the biomaRt package to know which attributes are mutually exclusive, so it can't catch this problem before the query is sent.

My recommendation would be to run two separate queries as you've done and then merge the results based on the protein ID.
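A minimal sketch of that merge step, assuming both queries also requested "ensembl_peptide_id". The two data frames below are made-up stand-ins for the getBM() results so the example runs offline; the object names `coding_df` and `peptide_df` are hypothetical.

```r
# In practice these would come from two separate calls, for example:
#   coding_df  <- getBM(c("ensembl_peptide_id", "ensembl_gene_id", "coding"), ...)
#   peptide_df <- getBM(c("ensembl_peptide_id", "ensembl_gene_id", "peptide"), ...)
# The values below are placeholders, not real Ensembl data.
coding_df <- data.frame(
  ensembl_peptide_id = c("prot1", "prot2"),
  coding = c("ATGAAA", "ATGCCC"),
  stringsAsFactors = FALSE
)
peptide_df <- data.frame(
  ensembl_peptide_id = c("prot2", "prot1"),
  peptide = c("MP", "MK"),
  stringsAsFactors = FALSE
)

# Join the two tables on the shared protein ID column.
merged <- merge(coding_df, peptide_df, by = "ensembl_peptide_id")
```

Transcripts without a protein have no ID to join on, so they may need to be filtered out or handled separately before merging.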

Reply written 9 weeks ago by Mike Smith

That was my suspicion. I'll do it that way then.

Thanks so much for your help! Cheers,

Reply written 9 weeks ago by grealesm