biomaRt crashes when result contains missing values
klmr • 0
@klmr-10984
Last seen 8.5 years ago
Cambridge

When trying to access certain attributes via ‹biomaRt›, I get an error which seems to indicate a faulty Biomart data transfer. Here’s an annotated MWE:

library(biomaRt)
ensembl = useMart('ensembl', 'celegans_gene_ensembl')
attr = c('ensembl_gene_id', 'external_gene_name', 'entrezgene', 'coding')
x = getBM(attributes = attr, mart = ensembl, bmHeader = TRUE,
        filters = 'ensembl_gene_id', values = 'WBGene00007063')
# this works; `x` contains a valid result. But this fails:
y = getBM(attributes = attr, mart = ensembl, bmHeader = TRUE,
        filters = 'ensembl_gene_id', values = 'WBGene00166185')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  line 1 did not have 4 elements
version
               _
platform       x86_64-apple-darwin13.4.0
arch           x86_64
os             darwin13.4.0
system         x86_64, darwin13.4.0
status
major          3
minor          3.0
year           2016
month          05
day            03
svn rev        70573
language       R
version.string R version 3.3.0 (2016-05-03)
nickname       Supposedly Educational
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.28.0 setwidth_1.0-4 nvimcom_0.9-15

loaded via a namespace (and not attached):
 [1] IRanges_2.6.0        XML_3.98-1.4         bitops_1.0-6         DBI_0.4-1            stats4_3.3.0         magrittr_1.5         RSQLite_1.0.0
 [8] S4Vectors_0.10.1     tools_3.3.0          Biobase_2.32.0       RCurl_1.95-4.8       parallel_3.3.0       BiocGenerics_0.18.0  AnnotationDbi_1.34.3

In addition, when running the above with attr = c('ensembl_gene_id', 'external_gene_name', 'coding'), i.e. omitting the “entrezgene” attribute, it works as well. Debugging reveals that Biomart does indeed return data with a missing column for the Entrez gene ID; here is the result split by '\t', inside getBM after postForm returned:

strsplit(strsplit(postRes, '\n')[[1]], '\t')
[[1]]
[1] "Coding sequence"      "Ensembl Gene ID"      "Associated Gene Name" "EntrezGene ID"

[[2]]
[1] "Sequence unavailable" "WBGene00166185"       "21ur-6527"

Clearly the last column is missing from the data, maybe because no sequence information is available. ‹biomaRt› needs to handle this appropriately. In my actual use case, I am downloading all sequences for C. elegans, without a filter on gene ID. As a consequence, this query always fails.
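As a rough illustration of what tolerant parsing could look like, here is a minimal base R sketch. It uses the response text reconstructed from the `strsplit()` output above; `post_res` is just a stand-in name, and this is not a description of how ‹biomaRt› parses results internally. With `fill = TRUE`, `read.table()` pads short rows with `NA` instead of raising the `scan()` error. The `NA` lands in the right column only if the missing field is at the end of the row, as it is here.

```r
# Response text reconstructed from the strsplit() output above (header + one row).
post_res <- paste(
  "Coding sequence\tEnsembl Gene ID\tAssociated Gene Name\tEntrezGene ID",
  "Sequence unavailable\tWBGene00166185\t21ur-6527",
  sep = "\n"
)

# fill = TRUE pads rows that have fewer fields than the header with NA.
# quote = "" and comment.char = "" keep quotes and '#' in fields from being interpreted.
parsed <- read.table(
  text = post_res, sep = "\t", header = TRUE, fill = TRUE,
  quote = "", comment.char = "",
  check.names = FALSE, stringsAsFactors = FALSE
)

parsed
```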

I can replicate the issue on Linux with R 3.3.1 / biomaRt 2.28.0.

Mike Smith ★ 6.6k
@mike-smith
Last seen 8 hours ago
EMBL Heidelberg

I'm not sure this is a 'valid' query, since I can't actually run it via the web interface to Biomart.  Biomart breaks the attributes you can select down into 'pages', and I think you should only be able to return attributes that are on the same page.  Some things appear on multiple pages, but if you look at the four things you're interested in, they can't all be returned from the same page.

idx <- listAttributes(ensembl, what = "name") %in% c('ensembl_gene_id', 'external_gene_name', 'entrezgene', 'coding')
listAttributes(ensembl)[idx,]
                   name          description         page
1       ensembl_gene_id      Ensembl Gene ID feature_page
15   external_gene_name Associated Gene Name feature_page
39           entrezgene        EntrezGene ID feature_page
122     ensembl_gene_id      Ensembl Gene ID    structure
133  external_gene_name Associated Gene Name    structure
156     ensembl_gene_id      Ensembl Gene ID     homologs
164  external_gene_name Associated Gene Name     homologs
1135    ensembl_gene_id      Ensembl Gene ID          snp
1143 external_gene_name Associated Gene Name          snp
1148    ensembl_gene_id      Ensembl Gene ID  snp_somatic
1156 external_gene_name Associated Gene Name  snp_somatic
1171             coding      Coding sequence    sequences
1175    ensembl_gene_id      Ensembl Gene ID    sequences
1177 external_gene_name Associated Gene Name    sequences

If you really want all four attributes, you can run two queries and combine the results:

values <- c('WBGene00166185', 'WBGene00007063')
attr1 <- c('ensembl_gene_id', 'external_gene_name', 'coding')
x <- getBM(attributes = attr1, mart = ensembl, bmHeader = TRUE,
          filters = 'ensembl_gene_id', values = values)

attr2 <- c('ensembl_gene_id', 'entrezgene')
y <- getBM(attributes = attr2, mart = ensembl, bmHeader = TRUE,
          filters = 'ensembl_gene_id', values = values)

library(dplyr)
full_join(x, y)
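For readers without dplyr, the same combination step can be sketched with base R's `merge()` and an explicit key column. The two data frames below are offline stand-ins for the query results, filled only with the values reported in this thread for WBGene00166185; the column names assume `bmHeader = TRUE` produces the headers shown in the thread, which may differ in other Ensembl releases.

```r
# Stand-in for the first query (sequence page attributes).
x <- data.frame(
  "Coding sequence" = "Sequence unavailable",
  "Ensembl Gene ID" = "WBGene00166185",
  "Associated Gene Name" = "21ur-6527",
  check.names = FALSE, stringsAsFactors = FALSE
)

# Stand-in for the second query (feature page attributes); no Entrez ID was returned.
y <- data.frame(
  "Ensembl Gene ID" = "WBGene00166185",
  "EntrezGene ID" = NA_integer_,
  check.names = FALSE, stringsAsFactors = FALSE
)

# all = TRUE keeps genes present in either result, like a full join.
combined <- merge(x, y, by = "Ensembl Gene ID", all = TRUE)
combined
```

Naming the key explicitly avoids relying on whichever columns happen to share a name between the two results.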

There's some commented-out code in the package that looks like it tried to check for this at one point; maybe it should be reimplemented.


I find it hard to infer that this shouldn’t work merely from the HTML interface, since it’s not documented anywhere (neither in the official documentation, nor in Ensembl Biomart or ‹biomaRt›) whether restrictions apply. At any rate, the query undeniably *works* and returns a result; ‹biomaRt› just fails to parse it.


I agree that the documentation on 'pages' is lacking (maybe they're called something else in the documentation?). I inferred that it isn't supported purely from the fact that I couldn't do it via the web interface.

I presume this behaviour is also linked to the fact that the attributes don't come back in the same order that you requested them.  Interestingly, if you ask for the attributes in a different order, you can run the query without issue:

ensembl = useMart('ensembl', 'celegans_gene_ensembl')
attr = c('entrezgene', 'ensembl_gene_id', 'external_gene_name', 'coding')
getBM(attributes = attr, mart = ensembl, bmHeader = TRUE,
          filters = 'ensembl_gene_id', values = 'WBGene00166185')
       Coding sequence EntrezGene ID Ensembl Gene ID Associated Gene Name
1 Sequence unavailable            NA  WBGene00166185            21ur-6527

It's not clear to me how biomaRt should handle this, other than by preventing the query or reporting a more helpful error message.  If a subset of rows are missing entries, can the absent columns reliably be identified?  Perhaps they're always at the end of a row, as in this example?
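One way the "more helpful error message" option might look is sketched below in base R. Both `find_short_rows()` and `check_response()` are hypothetical helpers, not part of ‹biomaRt›. The idea is to count tab-separated fields per line and report which lines are shorter than the header. This can say that a row is short, but not which column is absent, so it does not settle the question of identifying the missing field.

```r
# Hypothetical helper: indices of lines with fewer fields than the header line.
find_short_rows <- function(response_text, sep = "\t") {
  lines <- strsplit(response_text, "\n", fixed = TRUE)[[1]]
  con <- textConnection(lines)
  on.exit(close(con))
  n_fields <- count.fields(con, sep = sep, quote = "", comment.char = "")
  which(n_fields < n_fields[1])
}

# Hypothetical helper: stop with a descriptive message if any line is short.
check_response <- function(response_text) {
  short <- find_short_rows(response_text)
  if (length(short) > 0) {
    stop(sprintf(
      "BioMart response has fewer fields than the header on line(s): %s",
      paste(short, collapse = ", ")
    ))
  }
  invisible(TRUE)
}

# Response text reconstructed from the output quoted earlier in the thread.
response <- paste(
  "Coding sequence\tEnsembl Gene ID\tAssociated Gene Name\tEntrezGene ID",
  "Sequence unavailable\tWBGene00166185\t21ur-6527",
  sep = "\n"
)

short_rows <- find_short_rows(response)
msg <- tryCatch(check_response(response), error = function(e) conditionMessage(e))
```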


“the attributes don't come back in the same order that you requested them.” — That’s a known bug, and according to personal communication from the Ensembl team, the authors of Ensembl Biomart and biomaRt cannot agree on whose fault this is, hence no fix.
