Premature EOF within Description Field - Bos Taurus
@andrewjskelton73-7074
United Kingdom

Hi, 

I've been working with some Bos taurus samples and tried to use biomaRt's getBM function to retrieve HGNC symbols and descriptions from Ensembl gene identifiers.

library(biomaRt)

ensembl    <- useMart("ensembl")
ensembl    <- useDataset("btaurus_gene_ensembl",
                         mart = ensembl)
annotation <- getBM(attributes = c('ensembl_gene_id',  'hgnc_symbol', 'description'),
                    filters    = 'ensembl_gene_id',
                    values     = c("ENSBTAG00000045918", "ENSBTAG00000045919", 
                                   "ENSBTAG00000045920", "ENSBTAG00000045921", 
                                   "ENSBTAG00000045922", "ENSBTAG00000045923"),
                    mart       = ensembl)

Strangely, only this particular combination of IDs produces the following warning:

Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string

I can't isolate it to a single gene ID; it only appears with this vector of IDs. Any advice is appreciated!

Cheers!

@martin-morgan-1513
United States

I coerced warnings to errors, ran your code, then used traceback() to see where things were going wrong.

options(warn=2)
annotation <- ...
+                     mart       = ensembl)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  (converted from warning) EOF within quoted string
> traceback()
7: doWithOneRestart(return(expr), restart)
6: withOneRestart(expr, restarts[[1L]])
5: withRestarts({
       .Internal(.signalCondition(simpleWarning(msg, call), msg, 
           call))
       .Internal(.dfltWarn(msg, call))
   }, muffleWarning = function() NULL)
4: .signalSimpleWarning("EOF within quoted string", quote(scan(file = file, 
       what = what, sep = sep, quote = quote, dec = dec, nmax = nrows, 
       skip = 0, na.strings = na.strings, quiet = TRUE, fill = fill, 
       strip.white = strip.white, blank.lines.skip = blank.lines.skip, 
       multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes, 
       flush = flush, encoding = encoding, skipNul = skipNul)))
3: scan(file = file, what = what, sep = sep, quote = quote, dec = dec, 
       nmax = nrows, skip = 0, na.strings = na.strings, quiet = TRUE, 
       fill = fill, strip.white = strip.white, blank.lines.skip = blank.lines.skip, 
       multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes, 
       flush = flush, encoding = encoding, skipNul = skipNul)
2: read.table(con, sep = "\t", header = bmHeader, quote = "\"", 
       comment.char = "", check.names = FALSE, stringsAsFactors = FALSE)
1: getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "description"), 
       filters = "ensembl_gene_id", values = c("ENSBTAG00000045918", 
           "ENSBTAG00000045919", "ENSBTAG00000045920", "ENSBTAG00000045921", 
           "ENSBTAG00000045922", "ENSBTAG00000045923"), mart = ensembl)

The warning is raised in scan(), so I set debug(scan) and ran the query again:

debug(scan)
annotation <- ...
...
debugging in: scan(file, what = "", sep = sep, quote = quote, nlines = 1, quiet = TRUE, 
    skip = 0, strip.white = TRUE, blank.lines.skip = blank.lines.skip, 
    comment.char = comment.char, allowEscapes = allowEscapes, 
    encoding = encoding, skipNul = skipNul)
...

I took a look at the contents of file, guessing that it holds what biomaRt got back from the server:

Browse[2]> readLines(file)
 [1] "ENSBTAG00000045918\t\t"                                                                                                                    
 [2] "ENSBTAG00000045919\t\tUncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:G3N011]"                                                       
 [3] "ENSBTAG00000045920\t\t5S ribosomal RNA [Source:RFAM;Acc:RF00001]"                                                                          
 [4] "ENSBTAG00000045921\t\tUncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:G3N082]"                                                       
 [5] "ENSBTAG00000045922\t\tOlfactory receptor  [Source:UniProtKB/TrEMBL;Acc:G3MX62]"                                                            
 [6] "ENSBTAG00000045918\t\t"                                                                                                                    
 [7] "ENSBTAG00000045919\t\tUncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:G3N011]"                                                       
 [8] "ENSBTAG00000045920\t\t5S ribosomal RNA [Source:RFAM;Acc:RF00001]"                                                                          
 [9] "ENSBTAG00000045921\t\tUncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:G3N082]"                                                       
[10] "ENSBTAG00000045922\t\tOlfactory receptor  [Source:UniProtKB/TrEMBL;Acc:G3MX62]"                                                            
[11] "ENSBTAG00000045923\t\tMHC class II antigen; Putative MHC class II antigen\"; Uncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:Q70IB5]"
[12] ""                                                                                                                              

Line 11 looks suspicious: the description contains a double quote (printed as \"), and frame 2 of the traceback() shows that read.table() is called with quote = "\"", so this character is treated as the opening of a quoted string that is never closed. When I look the record up online I see the same quotation mark. I think (a) it's a data entry error in BioMart; (b) it's not easy to work around, in particular because the biomaRt author has explicitly specified that a quotation mark opens a string; and (c) this can be quite bad: the reason you can't isolate the problem to a single ID is that, with a single ID, scan() guesses the wrong number of fields to parse and biomaRt somehow quietly drops the record. I think the best bet is to get this fixed upstream.
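
The effect of the stray quote can be reproduced offline with base R alone. This is only a sketch: the IDs and most descriptions below are invented placeholders rather than real Ensembl records, and only the last line mimics the problematic description. It parses the same text twice, once with the quote setting seen in frame 2 of the traceback and once with quoting disabled. The bad line is deliberately placed after eleven clean rows, because read.table() looks at the first few lines separately to count columns and the warning is easiest to trigger in the main scan() pass.

```r
# Eleven clean placeholder rows followed by one row with a stray double quote.
# The IDs are invented for illustration only.
clean    <- sprintf("GENE%02d\t\tUncharacterized protein", 1:11)
bad_line <- "GENE12\t\tMHC class II antigen; Putative MHC class II antigen\"; Uncharacterized protein"
txt      <- c(clean, bad_line)

# Same quoting as in the traceback: the quote opens a string that never
# closes. tryCatch() returns the warning text instead of a data frame.
with_quote <- tryCatch(
  read.table(text = txt, sep = "\t", header = FALSE, quote = "\"",
             comment.char = "", stringsAsFactors = FALSE),
  warning = function(w) conditionMessage(w)
)

# Quoting disabled: the quote is kept as an ordinary character.
no_quote <- read.table(text = txt, sep = "\t", header = FALSE, quote = "",
                       comment.char = "", stringsAsFactors = FALSE)

print(with_quote)     # the warning message raised while parsing
print(dim(no_quote))  # all rows and columns are read
```

Note that quote = "" is used here only to show how the parse differs; as far as I can tell it is not something getBM lets you set, which is part of why a fix in the data is preferable.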

 

 

Hi Martin, thanks for the quick response and the rationale. I'll drop the listed maintainer an email to try and get this resolved.

For the record, I received the following update from the biomaRt maintainer, in part:

I'm in contact with Ensembl. It is an issue in the Ensembl BioMart content as quotes should not be in data fields.  

I think the best place to fix this is there and not try to figure out a workaround in biomaRt as then we would start to do this for every error they might include.

The way to report issues with the Ensembl data content is to email them at helpdesk@ensembl.org; they are usually quick to respond, as they have a team just for that ;)
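
Until the content is corrected, it may help to know which records are affected before reporting them. Below is a minimal base R sketch that flags lines containing an odd number of double quotes. find_unbalanced_quotes is a hypothetical helper, not part of biomaRt, and it assumes you already have the raw tab-separated lines in hand (for example from the readLines(file) step in the debugging session above). The example lines are shortened placeholders.

```r
# Return the indices of lines that contain an odd number of double quotes.
# Hypothetical helper for illustration; it is a heuristic and will not flag
# a line whose stray quotes happen to come in pairs.
find_unbalanced_quotes <- function(lines) {
  n_quotes <- lengths(regmatches(lines, gregexpr("\"", lines, fixed = TRUE)))
  which(n_quotes %% 2 == 1)
}

# Placeholder lines; the second one mimics the stray quote in the description.
raw_lines <- c(
  "ID1\t\tOlfactory receptor",
  "ID2\t\tMHC class II antigen; Putative MHC class II antigen\"; Uncharacterized protein",
  "ID3\t\tUncharacterized protein"
)

flagged <- find_unbalanced_quotes(raw_lines)
print(raw_lines[flagged])
```

The flagged lines give the gene IDs to include in a report to the Ensembl helpdesk.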
