Question: biomaRt crashes when result contains missing values
klmr (Cambridge) wrote, 3.3 years ago:

When trying to access certain attributes via ‹biomaRt›, I get an error which seems to indicate a faulty Biomart data transfer. Here’s an annotated MWE:

library(biomaRt)
ensembl = useMart('ensembl', 'celegans_gene_ensembl')
attr = c('ensembl_gene_id', 'external_gene_name', 'entrezgene', 'coding')
x = getBM(attributes = attr, mart = ensembl, bmHeader = TRUE,
        filters = 'ensembl_gene_id', values = 'WBGene00007063')
# this works; `x` contains a valid result. But this fails:
y = getBM(attributes = attr, mart = ensembl, bmHeader = TRUE,
        filters = 'ensembl_gene_id', values = 'WBGene00166185')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  line 1 did not have 4 elements
version
               _
platform       x86_64-apple-darwin13.4.0
arch           x86_64
os             darwin13.4.0
system         x86_64, darwin13.4.0
status
major          3
minor          3.0
year           2016
month          05
day            03
svn rev        70573
language       R
version.string R version 3.3.0 (2016-05-03)
nickname       Supposedly Educational
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.28.0 setwidth_1.0-4 nvimcom_0.9-15

loaded via a namespace (and not attached):
 [1] IRanges_2.6.0        XML_3.98-1.4         bitops_1.0-6         DBI_0.4-1            stats4_3.3.0         magrittr_1.5         RSQLite_1.0.0
 [8] S4Vectors_0.10.1     tools_3.3.0          Biobase_2.32.0       RCurl_1.95-4.8       parallel_3.3.0       BiocGenerics_0.18.0  AnnotationDbi_1.34.3

In addition, running the above with attr = c('ensembl_gene_id', 'external_gene_name', 'coding'), i.e. omitting the “entrezgene” attribute, works as well. Debugging reveals that Biomart does indeed return data with a missing field for the Entrez gene ID; here is the result split by '\t', inside getBM after postForm has returned:

strsplit(strsplit(postRes, '\n')[[1]], '\t')
[[1]]
[1] "Coding sequence"      "Ensembl Gene ID"      "Associated Gene Name" "EntrezGene ID"

[[2]]
[1] "Sequence unavailable" "WBGene00166185"       "21ur-6527"

Clearly the last column is missing from the data row, maybe because no sequence information is available. ‹biomaRt› needs to handle this appropriately. In my actual use case, I am downloading all sequences for C. elegans, without a filter on gene ID. As a consequence, this query always fails.
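
For what it's worth, a short row like this can be tolerated at the parsing stage. Below is a minimal sketch in base R; it is not what ‹biomaRt› does internally. `post_res` is a hand-built copy of the response shown above, and `read.table(fill = TRUE)` pads the short row with NA. This only helps if missing fields are always at the end of a row, which is an assumption.

```r
# Hand-built copy of the response above: 4 header fields, 3 data fields.
post_res <- paste(
  "Coding sequence\tEnsembl Gene ID\tAssociated Gene Name\tEntrezGene ID",
  "Sequence unavailable\tWBGene00166185\t21ur-6527",
  sep = "\n"
)

# fill = TRUE pads rows that are too short instead of failing in scan();
# na.strings = "" turns the padded empty field into NA.
parsed <- read.table(
  text = post_res, sep = "\t", header = TRUE, quote = "",
  comment.char = "", fill = TRUE, na.strings = "",
  check.names = FALSE, stringsAsFactors = FALSE
)

print(parsed)
```

The `EntrezGene ID` column comes back as NA and the other three fields keep their positions under the header.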

biomart bug
modified 3.3 years ago by Mike Smith • written 3.3 years ago by klmr

I can replicate the issue on Linux with R 3.3.1 / biomaRt 2.28.0.

written 3.3 years ago by Keith Hughitt
Answer: biomaRt crashes when result contains missing values
Mike Smith (EMBL Heidelberg / de.NBI) wrote, 3.3 years ago:

I'm not sure this is a 'valid' query, since I can't actually run it via the web interface to Biomart.  Biomart breaks the attributes you can select down into 'pages', and I think you should only be able to return attributes that are on the same page.  Some things appear on multiple pages, but if you look at the four things you're interested in, they can't all be returned from the same page.

# logical index of the rows whose attribute name is one of the four requested
idx <- listAttributes(ensembl)$name %in% c('ensembl_gene_id', 'external_gene_name', 'entrezgene', 'coding')
listAttributes(ensembl)[idx,]
                   name          description         page
1       ensembl_gene_id      Ensembl Gene ID feature_page
15   external_gene_name Associated Gene Name feature_page
39           entrezgene        EntrezGene ID feature_page
122     ensembl_gene_id      Ensembl Gene ID    structure
133  external_gene_name Associated Gene Name    structure
156     ensembl_gene_id      Ensembl Gene ID     homologs
164  external_gene_name Associated Gene Name     homologs
1135    ensembl_gene_id      Ensembl Gene ID          snp
1143 external_gene_name Associated Gene Name          snp
1148    ensembl_gene_id      Ensembl Gene ID  snp_somatic
1156 external_gene_name Associated Gene Name  snp_somatic
1171             coding      Coding sequence    sequences
1175    ensembl_gene_id      Ensembl Gene ID    sequences
1177 external_gene_name Associated Gene Name    sequences
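
The same check can be done programmatically. This is only a sketch: `attr_pages` is a hand-copied subset of the listing above (the homologs, snp and snp_somatic rows are left out for brevity; in a live session the table would come from listAttributes(ensembl)), and `common_pages` is a hypothetical helper, not part of biomaRt.

```r
# Subset of the attribute listing above, typed in by hand.
attr_pages <- data.frame(
  name = c("ensembl_gene_id", "external_gene_name", "entrezgene",
           "ensembl_gene_id", "external_gene_name",
           "coding", "ensembl_gene_id", "external_gene_name"),
  page = c("feature_page", "feature_page", "feature_page",
           "structure", "structure",
           "sequences", "sequences", "sequences"),
  stringsAsFactors = FALSE
)

wanted <- c("ensembl_gene_id", "external_gene_name", "entrezgene", "coding")

# Hypothetical helper: the pages on which every requested attribute is listed.
common_pages <- function(pages, attrs) {
  per_attr <- lapply(attrs, function(a) unique(pages$page[pages$name == a]))
  Reduce(intersect, per_attr)
}

all_four <- common_pages(attr_pages, wanted)
no_entrez <- common_pages(attr_pages, setdiff(wanted, "entrezgene"))
print(all_four)
print(no_entrez)
```

With all four attributes the intersection is empty; dropping 'entrezgene' leaves the sequences page.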

If you really want all four attributes, you can run two queries and combine the results:

values <- c('WBGene00166185', 'WBGene00007063')
attr1 <- c('ensembl_gene_id', 'external_gene_name', 'coding')
x <- getBM(attributes = attr1, mart = ensembl, bmHeader = TRUE,
          filters = 'ensembl_gene_id', values = values)

attr2 <- c('ensembl_gene_id', 'entrezgene')
y <- getBM(attributes = attr2, mart = ensembl, bmHeader = TRUE,
          filters = 'ensembl_gene_id', values = values)

library(dplyr)
full_join(x, y)

There's some commented-out code in the package that looks like it once tried to check for this; maybe it should be reimplemented.
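
If you'd rather not depend on dplyr for the join step, base R's merge() does the same with an explicit key. The sketch below uses stand-in data frames rather than real query results: the name, sequence and Entrez values for WBGene00007063 are invented placeholders, and the column names assume the bmHeader = TRUE labels seen elsewhere in this thread.

```r
# Stand-ins for the two getBM() results; values marked PLACEHOLDER are invented.
x <- data.frame(
  "Ensembl Gene ID" = c("WBGene00166185", "WBGene00007063"),
  "Associated Gene Name" = c("21ur-6527", "PLACEHOLDER-name"),
  "Coding sequence" = c("Sequence unavailable", "PLACEHOLDER-sequence"),
  check.names = FALSE, stringsAsFactors = FALSE
)
y <- data.frame(
  "Ensembl Gene ID" = c("WBGene00166185", "WBGene00007063"),
  "EntrezGene ID" = c(NA, 1L),  # 1L is a placeholder, not a real Entrez ID
  check.names = FALSE, stringsAsFactors = FALSE
)

# all = TRUE keeps genes that appear in only one of the two results (a full join).
combined <- merge(x, y, by = "Ensembl Gene ID", all = TRUE)
print(combined)
```

Naming the key explicitly avoids relying on whichever columns the two results happen to share.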

modified 3.3 years ago • written 3.3 years ago by Mike Smith

I find it hard to infer that this shouldn’t work merely from the HTML interface, since nothing documents whether such restrictions apply (neither the official documentation, nor that of Ensembl Biomart or ‹biomaRt›). At any rate, the query undeniably *works* and returns a result; ‹biomaRt› just fails to parse it.

modified 3.3 years ago • written 3.3 years ago by klmr

I agree that the documentation on 'pages' is lacking (maybe they're called something else in the documentation?). I inferred that it isn't supported purely from the fact that I couldn't do it via the web interface.

I presume this behaviour is also linked to the fact that the attributes don't come back in the same order that you requested them.  Interestingly, if you ask for the attributes in a different order, you can run the query without issue:

ensembl = useMart('ensembl', 'celegans_gene_ensembl')
attr = c('entrezgene', 'ensembl_gene_id', 'external_gene_name', 'coding')
getBM(attributes = attr, mart = ensembl, bmHeader = TRUE,
          filters = 'ensembl_gene_id', values = 'WBGene00166185')
       Coding sequence EntrezGene ID Ensembl Gene ID Associated Gene Name
1 Sequence unavailable            NA  WBGene00166185            21ur-6527

It's not clear to me how biomaRt should handle this, other than by preventing the query or reporting a more helpful error message.  If a subset of rows are missing entries, can the absent columns reliably be identified?  Perhaps they're always at the end of a row, as in this example?
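
As a rough illustration of the "more helpful error message" option, here is a sketch in base R. `check_field_counts` is a hypothetical helper, not an existing biomaRt function, and `post_res` is a hand-built copy of the response from the question. It can only report that a row is short; it cannot tell which field is the missing one.

```r
# Hand-built copy of the response from the question: 4 header fields, 3 data fields.
post_res <- paste(
  "Coding sequence\tEnsembl Gene ID\tAssociated Gene Name\tEntrezGene ID",
  "Sequence unavailable\tWBGene00166185\t21ur-6527",
  sep = "\n"
)

# Hypothetical helper: compare each data row's field count with the header's.
check_field_counts <- function(response_text) {
  con <- textConnection(strsplit(response_text, "\n", fixed = TRUE)[[1]])
  on.exit(close(con))
  n <- count.fields(con, sep = "\t", quote = "", comment.char = "")
  bad <- which(n[-1] != n[1])
  if (length(bad) > 0) {
    stop(sprintf(
      "%d of %d data rows do not have the %d fields announced by the header (first at data row %d)",
      length(bad), length(n) - 1L, n[1], bad[1]
    ), call. = FALSE)
  }
  invisible(TRUE)
}

# Capture the error message rather than aborting, to show what a user would see.
msg <- tryCatch(check_field_counts(post_res), error = conditionMessage)
print(msg)
```

count.fields() is used instead of strsplit() because strsplit() drops a trailing empty field, which would make a row that merely ends in a tab look short.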

written 3.3 years ago by Mike Smith

“the attributes don't come back in the same order that you requested them.” — That’s a known bug, and according to personal communication from the Ensembl team, the authors of Ensembl Biomart and biomaRt cannot agree on whose fault this is, hence no fix.
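
Until that is resolved, the columns can be put back into the requested order on the client side. A minimal sketch, assuming bmHeader = TRUE so that columns carry the description labels: `descriptions` is a lookup typed in by hand from the attribute listing earlier in the thread, and `result` is a stand-in for the data frame returned above. It reorders by column name, so it relies on the header labels being correctly aligned with the data.

```r
# Attribute name -> header label, typed in from the listing earlier in the thread.
descriptions <- c(
  ensembl_gene_id    = "Ensembl Gene ID",
  external_gene_name = "Associated Gene Name",
  entrezgene         = "EntrezGene ID",
  coding             = "Coding sequence"
)
requested <- c("entrezgene", "ensembl_gene_id", "external_gene_name", "coding")

# Stand-in for the getBM() result, in the column order it came back in.
result <- data.frame(
  "Coding sequence" = "Sequence unavailable",
  "EntrezGene ID" = NA_integer_,
  "Ensembl Gene ID" = "WBGene00166185",
  "Associated Gene Name" = "21ur-6527",
  check.names = FALSE, stringsAsFactors = FALSE
)

# Select columns by label, in the order the attributes were requested.
reordered <- result[, unname(descriptions[requested]), drop = FALSE]
print(names(reordered))
```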

written 3.3 years ago by klmr