I've recently run into a problem when querying Ensembl via biomaRt. Put simply, if I ask for fields that include missing values (see example), the server seems to return a table whose rows have differing numbers of columns, and R throws a scan() error.
In my example, let's say I try to retrieve peptide sequences and IDs for HTR3A (ENSG00000166736). This gene has several transcripts in Ensembl, one of which has no associated protein (see https://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000166736;r=11:113974881-113990313).
If I run the following code, it returns the desired result, with a cell containing "Sequence unavailable", as expected.
    library(biomaRt)
    mart <- useMart("ENSEMBL_MART_ENSEMBL")
    mart <- useDataset("hsapiens_gene_ensembl", mart = mart)
    CurlHandle <- RCurl::getCurlHandle()
    BM <- biomaRt::getBM(c("ensembl_gene_id", "ensembl_transcript_id", "coding"),
                         filters = "ensembl_gene_id",
                         values = "ENSG00000166736",
                         mart = mart, curl = CurlHandle)
However, if I run the following, I get an error, probably because that protein ID is missing.
    > BM <- biomaRt::getBM(c("ensembl_gene_id", "ensembl_peptide_id", "peptide"),
                           filters = "ensembl_gene_id",
                           values = "ENSG00000166736",
                           mart = mart, curl = CurlHandle)
    Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
      line 3 did not have 3 elements
I tried several hosts, as suggested in other related questions, and setting quote = TRUE, among other things, to no avail. I can imagine what is happening under the hood (i.e. R not liking the server sending back a table with missing elements), but I can't figure out a fix for this.
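To illustrate what I suspect is going on, here is a minimal sketch in base R. The tab-separated text is made up for illustration (it is not the actual BioMart response, and the column names and values are hypothetical); it only shows that a row with a missing field is enough to trigger the same kind of scan() error, and that fill = TRUE makes the short row visible instead.

```r
# Hypothetical tab-separated table: the last row has only two fields.
ragged <- "gene\tpeptide_id\tpeptide\nG1\tP1\tMKV\nG1\tP2\tMLA\nG1\tMQQ\n"

# Strict parsing (fill = FALSE is the read.table default) fails on the short row;
# tryCatch keeps the error message so it can be inspected.
strict <- tryCatch(
  read.table(text = ragged, sep = "\t", header = TRUE, stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
print(strict)

# fill = TRUE pads short rows with blank fields, so the table can be read
# and the shifted row becomes visible.
padded <- read.table(text = ragged, sep = "\t", header = TRUE,
                     fill = TRUE, stringsAsFactors = FALSE)
print(padded)
```

This is only a reproduction of the symptom under my assumption about the cause. It does not fix the query itself, since in the padded table the sequence ends up in the wrong column.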
I'd appreciate some help! :)
    > sessionInfo()
    R version 3.5.3 (2019-03-11)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Linux Mint 18.3

    Matrix products: default
    BLAS: /usr/lib/openblas-base/libblas.so.3
    LAPACK: /usr/lib/libopenblasp-r0.2.18.so

    locale:
     LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8
     LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
     LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C
     LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
     stats graphics grDevices utils datasets methods base

    other attached packages:
     dplyr_0.8.0.1 biomaRt_2.38.0

    loaded via a namespace (and not attached):
     Rcpp_1.0.1           compiler_3.5.3   pillar_1.3.1     prettyunits_1.0.2
     bitops_1.0-6         tools_3.5.3      progress_1.2.0   digest_0.6.18
     bit_1.1-14           RSQLite_2.1.1    memoise_1.1.0    tibble_2.1.1
     pkgconfig_2.0.2      rlang_0.3.4      DBI_1.0.0        rstudioapi_0.10
     curl_3.3             parallel_3.5.3   fuzzyjoin_0.1.4  xml2_1.2.0
     stringr_1.4.0        httr_1.4.0       S4Vectors_0.20.1 IRanges_2.16.0
     hms_0.4.2            stats4_3.5.3     bit64_0.9-7      tidyselect_0.2.5
     data.table_1.12.2    glue_1.3.1       Biobase_2.42.0   R6_2.4.0
     AnnotationDbi_1.44.0 XML_3.98-1.19    tidyr_0.8.3      selectr_0.4-1
     purrr_0.3.2          blob_1.1.1       magrittr_1.5     BiocGenerics_0.28.0
     rvest_0.3.3          assertthat_0.2.1 stringi_1.4.3    RCurl_1.95-4.12
     crayon_1.3.4