I coerced warnings to errors, ran your code, then used traceback()
to see where things were going wrong.
options(warn=2)
annotation <- ...
+ mart = ensembl)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
(converted from warning) EOF within quoted string
> traceback()
7: doWithOneRestart(return(expr), restart)
6: withOneRestart(expr, restarts[[1L]])
5: withRestarts({
.Internal(.signalCondition(simpleWarning(msg, call), msg,
call))
.Internal(.dfltWarn(msg, call))
}, muffleWarning = function() NULL)
4: .signalSimpleWarning("EOF within quoted string", quote(scan(file = file,
what = what, sep = sep, quote = quote, dec = dec, nmax = nrows,
skip = 0, na.strings = na.strings, quiet = TRUE, fill = fill,
strip.white = strip.white, blank.lines.skip = blank.lines.skip,
multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes,
flush = flush, encoding = encoding, skipNul = skipNul)))
3: scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
nmax = nrows, skip = 0, na.strings = na.strings, quiet = TRUE,
fill = fill, strip.white = strip.white, blank.lines.skip = blank.lines.skip,
multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes,
flush = flush, encoding = encoding, skipNul = skipNul)
2: read.table(con, sep = "\t", header = bmHeader, quote = "\"",
comment.char = "", check.names = FALSE, stringsAsFactors = FALSE)
1: getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "description"),
filters = "ensembl_gene_id", values = c("ENSBTAG00000045918",
"ENSBTAG00000045919", "ENSBTAG00000045920", "ENSBTAG00000045921",
"ENSBTAG00000045922", "ENSBTAG00000045923"), mart = ensembl)
The warning is raised in scan(), so I flagged that function for debugging and ran the query again:
debug(scan)
annotation <- ...
...
debugging in: scan(file, what = "", sep = sep, quote = quote, nlines = 1, quiet = TRUE,
skip = 0, strip.white = TRUE, blank.lines.skip = blank.lines.skip,
comment.char = comment.char, allowEscapes = allowEscapes,
encoding = encoding, skipNul = skipNul)
...
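As an aside, stepping through scan() with debug() stops at every call, and options(warn=2) changes warning handling for the whole session. A calling handler can record where a warning comes from without either. This is only a minimal sketch: warning_calls, inner and outer are hypothetical names made up for illustration, and inner() merely stands in for a function that warns deep in a call stack.

```r
# Hypothetical helper (not part of base R or biomaRt): evaluate an expression
# and record the call stack at the moment each warning is signalled.
warning_calls <- function(expr) {
  stacks <- list()
  value <- withCallingHandlers(
    expr,
    warning = function(w) {
      # sys.calls() here still includes the frames that raised the warning
      stacks[[length(stacks) + 1L]] <<- as.list(sys.calls())
      invokeRestart("muffleWarning")  # carry on as if no warning had occurred
    }
  )
  list(value = value, stacks = stacks)
}

# Stand-ins for a nested call that warns, as scan() does inside read.table()
inner <- function() {
  warning("EOF within quoted string")
  "done"
}
outer <- function() inner()

res <- warning_calls(outer())

# One line per frame of the recorded stack
calls <- vapply(res$stacks[[1]],
                function(cl) paste(deparse(cl), collapse = " "),
                character(1))
print(calls)
```

The printed frames play the same role as the traceback() listing, with the outermost call first.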
Inside the browser, I took a look at file, guessing that it holds what biomaRt got back from the server:
Browse[2]> readLines(file)
[1] "ENSBTAG00000045918\t\t"
[2] "ENSBTAG00000045919\t\tUncharacterized protein [Source:UniProtKB/TrEMBL;Acc:G3N011]"
[3] "ENSBTAG00000045920\t\t5S ribosomal RNA [Source:RFAM;Acc:RF00001]"
[4] "ENSBTAG00000045921\t\tUncharacterized protein [Source:UniProtKB/TrEMBL;Acc:G3N082]"
[5] "ENSBTAG00000045922\t\tOlfactory receptor [Source:UniProtKB/TrEMBL;Acc:G3MX62]"
[6] "ENSBTAG00000045918\t\t"
[7] "ENSBTAG00000045919\t\tUncharacterized protein [Source:UniProtKB/TrEMBL;Acc:G3N011]"
[8] "ENSBTAG00000045920\t\t5S ribosomal RNA [Source:RFAM;Acc:RF00001]"
[9] "ENSBTAG00000045921\t\tUncharacterized protein [Source:UniProtKB/TrEMBL;Acc:G3N082]"
[10] "ENSBTAG00000045922\t\tOlfactory receptor [Source:UniProtKB/TrEMBL;Acc:G3MX62]"
[11] "ENSBTAG00000045923\t\tMHC class II antigen; Putative MHC class II antigen\"; Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:Q70IB5]"
[12] ""
Line 11 looks suspicious: the description contains a double quote (printed by R as \"), and frame 2 of the traceback() shows that read.table() is called with quote = "\"", so this character is treated as opening a quoted string. When I look the record up online I see the same quotation mark. I think (a) it's a data entry error in the BioMart data; (b) it's not easy to work around, in particular because the biomaRt author has explicitly indicated that a quotation mark should open a string; and (c) this can be quite bad. For instance, the reason you can't isolate the problem to a single ID is that with a single ID scan() guesses the wrong number of fields to parse, and somehow biomaRt quietly ignores the record. I think the best bet is to get this fixed upstream.
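The parsing behaviour can be explored without contacting BioMart at all. The sketch below feeds six records, copied from the readLines(file) output above, to read.table() with the same arguments that appear in frame 2 of the traceback (except header, which is fixed to FALSE here as an assumption). parse_records is a hypothetical helper, not part of biomaRt, and I would expect, but have not verified on every R version, that the quoted variant raises the same warning.

```r
# Six tab-separated records taken from the readLines(file) output above.
# Only the last one has a stray double quote in its description.
records <- c(
  "ENSBTAG00000045918\t\t",
  "ENSBTAG00000045919\t\tUncharacterized protein [Source:UniProtKB/TrEMBL;Acc:G3N011]",
  "ENSBTAG00000045920\t\t5S ribosomal RNA [Source:RFAM;Acc:RF00001]",
  "ENSBTAG00000045921\t\tUncharacterized protein [Source:UniProtKB/TrEMBL;Acc:G3N082]",
  "ENSBTAG00000045922\t\tOlfactory receptor [Source:UniProtKB/TrEMBL;Acc:G3MX62]",
  "ENSBTAG00000045923\t\tMHC class II antigen; Putative MHC class II antigen\"; Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:Q70IB5]"
)

# Hypothetical helper: parse the records with a given quote set and collect
# warning messages instead of printing them.
parse_records <- function(quote) {
  msgs <- character()
  df <- withCallingHandlers(
    read.table(text = records, sep = "\t", header = FALSE, quote = quote,
               comment.char = "", check.names = FALSE,
               stringsAsFactors = FALSE),
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(data = df, warnings = msgs)
}

with_quotes    <- parse_records("\"")  # quote setting seen in traceback frame 2
without_quotes <- parse_records("")    # quoting disabled entirely

print(with_quotes$warnings)
print(nrow(without_quotes$data))
```

Disabling quoting makes the stray character ordinary data, but that is a property of this local call only; it does not change what getBM() does internally, which is why a fix at the source is still the better route.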
Hi Martin, thanks for the quick response and the rationale. I'll drop the listed maintainer an email to try and get this resolved.
For the record, I received the following update from the biomaRt maintainer, in part:
I'm in contact with Ensembl. It is an issue in the Ensembl BioMart content as quotes should not be in data fields.
I think the best place to fix this is there and not try to figure out a workaround in biomaRt as then we would start to do this for every error they might include.
The way to report issues with the Ensembl data content is to email them at helpdesk@ensembl.org; they are usually quick to respond as they have a team just for that ;)
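When reporting such a problem it may help to name the exact records affected. A small sketch, assuming the raw tab-separated lines are available as a character vector (for example copied from the readLines(file) output in the browser session); lines_with_quotes is a hypothetical helper and not part of biomaRt.

```r
# Hypothetical helper: report which raw lines contain a double quote anywhere,
# together with the ID in the first tab-separated field.
lines_with_quotes <- function(lines) {
  hits <- grep("\"", lines, fixed = TRUE)
  data.frame(line = hits,
             id = sub("\t.*$", "", lines[hits]),  # keep text before first tab
             stringsAsFactors = FALSE)
}

# Two of the records shown earlier in the thread
raw <- c(
  "ENSBTAG00000045922\t\tOlfactory receptor [Source:UniProtKB/TrEMBL;Acc:G3MX62]",
  "ENSBTAG00000045923\t\tMHC class II antigen; Putative MHC class II antigen\"; Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:Q70IB5]"
)

flagged <- lines_with_quotes(raw)
print(flagged)
```

For these two lines only ENSBTAG00000045923 is flagged.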