Dear all,
I've recently run into a problem when querying Ensembl via biomaRt. Put simply, if I ask for fields that include missing values (see example), the result seems to come back with rows that have different numbers of columns, which makes R throw a scan error.
In my example, let's say I try to retrieve peptide sequences and IDs for HTR3A (ENSG00000166736). This gene has several transcripts in Ensembl, one of which has no associated protein (see https://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000166736;r=11:113974881-113990313).
If I run the following code, it returns the desired result, with a cell containing "Sequence unavailable", as expected.
library(biomaRt)
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart = mart)
CurlHandle <- RCurl::getCurlHandle()
BM <- biomaRt::getBM(c("ensembl_gene_id", "ensembl_transcript_id", "coding"), filters = "ensembl_gene_id" , values = "ENSG00000166736", mart = mart, curl = CurlHandle)
However, if I run the following I get an error, probably because the protein ID is missing for that transcript.
> BM <- biomaRt::getBM(c("ensembl_gene_id", "ensembl_peptide_id", "peptide"), filters = "ensembl_gene_id" , values = "ENSG00000166736", mart = mart, curl = CurlHandle)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 3 did not have 3 elements
I tried several hosts, as suggested in other related questions, and setting quote = TRUE, among other things, to no avail. I can imagine what is happening under the hood (i.e. R not liking the server sending back a table with missing elements), but I can't figure out a fix for this.
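To illustrate the kind of failure described above, here is a minimal sketch in base R that needs no network access. It is only a generic reproduction of the error message, assuming the result is parsed as tab-separated text; it is not a confirmed diagnosis of what biomaRt does internally. The toy table and the names raw_text and read_msg are made up for illustration.

```r
# Toy tab-separated text in which the last row has only 2 of the 3 fields.
# The IDs and sequence here are placeholders, not real Ensembl output.
raw_text <- paste(
  "ensembl_gene_id\tensembl_peptide_id\tpeptide",
  "gene1\tpep1\tMK",
  "gene1\tSequence unavailable",
  sep = "\n"
)

# With the default fill = FALSE, read.table() stops on the short row.
read_msg <- tryCatch(
  {
    read.table(text = raw_text, sep = "\t", header = TRUE,
               stringsAsFactors = FALSE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(read_msg)

# With fill = TRUE the short row is padded instead of raising an error,
# but its values end up in the wrong columns, so this hides the problem
# rather than fixing it.
padded <- read.table(text = raw_text, sep = "\t", header = TRUE,
                     fill = TRUE, stringsAsFactors = FALSE)
print(padded)
```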
I'd appreciate some help! :)
My sessionInfo()
> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
[4] LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.8.0.1 biomaRt_2.38.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 compiler_3.5.3 pillar_1.3.1 prettyunits_1.0.2
[5] bitops_1.0-6 tools_3.5.3 progress_1.2.0 digest_0.6.18
[9] bit_1.1-14 RSQLite_2.1.1 memoise_1.1.0 tibble_2.1.1
[13] pkgconfig_2.0.2 rlang_0.3.4 DBI_1.0.0 rstudioapi_0.10
[17] curl_3.3 parallel_3.5.3 fuzzyjoin_0.1.4 xml2_1.2.0
[21] stringr_1.4.0 httr_1.4.0 S4Vectors_0.20.1 IRanges_2.16.0
[25] hms_0.4.2 stats4_3.5.3 bit64_0.9-7 tidyselect_0.2.5
[29] data.table_1.12.2 glue_1.3.1 Biobase_2.42.0 R6_2.4.0
[33] AnnotationDbi_1.44.0 XML_3.98-1.19 tidyr_0.8.3 selectr_0.4-1
[37] purrr_0.3.2 blob_1.1.1 magrittr_1.5 BiocGenerics_0.28.0
[41] rvest_0.3.3 assertthat_0.2.1 stringi_1.4.3 RCurl_1.95-4.12
[45] crayon_1.3.4
Cheers!
It worked! However, I then tried to request both the "coding" and "peptide" sequences in one go, and got the error again (I tried several attribute orders and none of them worked). Requesting coding and peptide in separate calls gives a good result, but requesting them together does not. Is there any limitation on querying both coding and peptide sequences at the same time?
Thanks!
Ensembl won't let you retrieve more than one type of sequence at a time. If you try using the web interface, the different sequence types are selected with a radio button where only one selection is allowed. As far as I'm aware there's no way for the biomaRt package to know which attributes are mutually exclusive, so it can't warn you before sending a query like this and you only see the failure afterwards.
My recommendation would be to run two separate queries as you've done and then merge the results based on the protein ID (so include ensembl_peptide_id in both queries).
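A minimal sketch of the merge step, using base R only. The two data frames below stand in for the results of the two separate getBM() calls (one requesting "coding", one requesting "peptide", each also requesting ensembl_gene_id and ensembl_peptide_id). The protein IDs and sequences are made-up placeholders, and the names coding_df, peptide_df and merged are just for illustration.

```r
# Stand-in for the result of the query that requested "coding".
# Placeholder IDs and sequences, not real Ensembl data.
coding_df <- data.frame(
  ensembl_gene_id    = c("ENSG00000166736", "ENSG00000166736"),
  ensembl_peptide_id = c("ENSP_PLACEHOLDER_1", "ENSP_PLACEHOLDER_2"),
  coding             = c("ATGAAA", "ATGCCC"),
  stringsAsFactors   = FALSE
)

# Stand-in for the result of the query that requested "peptide".
# Row order differs on purpose: merge() matches on the key columns.
peptide_df <- data.frame(
  ensembl_gene_id    = c("ENSG00000166736", "ENSG00000166736"),
  ensembl_peptide_id = c("ENSP_PLACEHOLDER_2", "ENSP_PLACEHOLDER_1"),
  peptide            = c("MP", "MK"),
  stringsAsFactors   = FALSE
)

# Drop rows without a protein ID before merging, so that transcripts
# with no protein cannot be matched to each other on an empty key.
coding_df  <- coding_df[!is.na(coding_df$ensembl_peptide_id) &
                          coding_df$ensembl_peptide_id != "", ]
peptide_df <- peptide_df[!is.na(peptide_df$ensembl_peptide_id) &
                           peptide_df$ensembl_peptide_id != "", ]

# Inner join on gene ID and protein ID.
merged <- merge(coding_df, peptide_df,
                by = c("ensembl_gene_id", "ensembl_peptide_id"))
print(merged)
```

Transcripts without a protein are dropped by this join; if you need to keep them, merging on ensembl_transcript_id instead (requested in both queries) may be a better fit.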
That was my suspicion. I'll do it that way then.
Thanks so much for your help! Cheers,