When trying to access certain attributes via ‹biomaRt›, I get an error which seems to indicate a faulty Biomart data transfer. Here’s an annotated MWE:
library(biomaRt)
ensembl = useMart('ensembl', 'celegans_gene_ensembl')
attr = c('ensembl_gene_id', 'external_gene_name', 'entrezgene', 'coding')
x = getBM(attributes = attr, mart = ensembl, bmHeader = TRUE,
          filters = 'ensembl_gene_id', values = 'WBGene00007063')
# this works; `x` contains a valid result.

But this fails:

y = getBM(attributes = attr, mart = ensembl, bmHeader = TRUE,
          filters = 'ensembl_gene_id', values = 'WBGene00166185')

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  line 1 did not have 4 elements
version
               _
platform       x86_64-apple-darwin13.4.0
arch           x86_64
os             darwin13.4.0
system         x86_64, darwin13.4.0
status
major          3
minor          3.0
year           2016
month          05
day            03
svn rev        70573
language       R
version.string R version 3.3.0 (2016-05-03)
nickname       Supposedly Educational
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.28.0 setwidth_1.0-4 nvimcom_0.9-15

loaded via a namespace (and not attached):
 [1] IRanges_2.6.0        XML_3.98-1.4         bitops_1.0-6         DBI_0.4-1            stats4_3.3.0         magrittr_1.5         RSQLite_1.0.0
 [8] S4Vectors_0.10.1     tools_3.3.0          Biobase_2.32.0       RCurl_1.95-4.8       parallel_3.3.0       BiocGenerics_0.18.0  AnnotationDbi_1.34.3
In addition, when running the above with attr = c('ensembl_gene_id', 'external_gene_name', 'coding'), i.e. omitting the “entrezgene” attribute, it works as well. Debugging reveals that, indeed, Biomart returns data that has a missing column for the Entrez gene ID; here is the result split by '\t', inside getBM after postForm returned:
strsplit(strsplit(postRes, '\n')[[1]], '\t')
[[1]]
[1] "Coding sequence"      "Ensembl Gene ID"      "Associated Gene Name" "EntrezGene ID"

[[2]]
[1] "Sequence unavailable" "WBGene00166185"       "21ur-6527"
Clearly the last column is missing from the data row, maybe because no sequence information is available. ‹biomaRt› needs to handle this appropriately. In my actual use case, I am downloading all sequences for C. elegans, without a filter on gene ID. As a consequence, this query always fails.
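To illustrate what more tolerant handling might look like, here is a minimal sketch that parses the raw response shown above with `read.table(..., fill = TRUE)`. This is not biomaRt's actual code: `post_res` and `parse_biomart_tsv` are hypothetical names, and the input string is hand-copied from the debugging output rather than fetched from the server. It also assumes the dropped field is always the trailing one, since `fill = TRUE` only pads short rows on the right.

```r
# Raw response as seen in the debugging output: a 4-field header line
# followed by a data row that carries only 3 fields.
post_res <- paste(
  "Coding sequence\tEnsembl Gene ID\tAssociated Gene Name\tEntrezGene ID",
  "Sequence unavailable\tWBGene00166185\t21ur-6527",
  sep = "\n"
)

# Hypothetical helper: parse tab-separated text, padding short rows
# on the right instead of raising a scan() error.
parse_biomart_tsv <- function(txt) {
  read.table(text = txt, sep = "\t", header = TRUE, quote = "",
             comment.char = "", fill = TRUE, check.names = FALSE,
             stringsAsFactors = FALSE, colClasses = "character")
}

res <- parse_biomart_tsv(post_res)
str(res)
```

With this, the row is kept and the Entrez column for WBGene00166185 is left empty or missing rather than aborting the whole query. If a field other than the last one were dropped, the values would silently shift into the wrong columns, so this is at best a stopgap.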
I can replicate the issue on Linux with R 3.3.1 / biomaRt 2.28.0.