Question: Premature EOF within Description Field - Bos Taurus
19 months ago by
United Kingdom
andrew.j.skelton73 wrote:

Hi, 

I've been working with some Bos taurus samples and tried to use biomaRt's getBM function to retrieve HGNC symbols and descriptions based on Ensembl gene identifiers.

library(biomaRt)
ensembl    <- useMart("ensembl")
ensembl    <- useDataset("btaurus_gene_ensembl",
                         mart = ensembl)
annotation <- getBM(attributes = c('ensembl_gene_id',  'hgnc_symbol', 'description'),
                    filters    = 'ensembl_gene_id',
                    values     = c("ENSBTAG00000045918", "ENSBTAG00000045919", 
                                   "ENSBTAG00000045920", "ENSBTAG00000045921", 
                                   "ENSBTAG00000045922", "ENSBTAG00000045923"),
                    mart       = ensembl)

Strangely, it seems that only this particular combination of IDs triggers the following warning:

Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string

I can't isolate it to a single gene ID; it only shows up with this vector of IDs. Any advice is appreciated!

Cheers!

19 months ago by
Martin Morgan ♦♦ 20k
United States
Martin Morgan ♦♦ 20k wrote:

I coerced warnings to errors, ran your code, then used traceback() to see where things were going wrong.

options(warn=2)
annotation <- ...
+                     mart       = ensembl)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  (converted from warning) EOF within quoted string
> traceback()
7: doWithOneRestart(return(expr), restart)
6: withOneRestart(expr, restarts[[1L]])
5: withRestarts({
       .Internal(.signalCondition(simpleWarning(msg, call), msg, 
           call))
       .Internal(.dfltWarn(msg, call))
   }, muffleWarning = function() NULL)
4: .signalSimpleWarning("EOF within quoted string", quote(scan(file = file, 
       what = what, sep = sep, quote = quote, dec = dec, nmax = nrows, 
       skip = 0, na.strings = na.strings, quiet = TRUE, fill = fill, 
       strip.white = strip.white, blank.lines.skip = blank.lines.skip, 
       multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes, 
       flush = flush, encoding = encoding, skipNul = skipNul)))
3: scan(file = file, what = what, sep = sep, quote = quote, dec = dec, 
       nmax = nrows, skip = 0, na.strings = na.strings, quiet = TRUE, 
       fill = fill, strip.white = strip.white, blank.lines.skip = blank.lines.skip, 
       multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes, 
       flush = flush, encoding = encoding, skipNul = skipNul)
2: read.table(con, sep = "\t", header = bmHeader, quote = "\"", 
       comment.char = "", check.names = FALSE, stringsAsFactors = FALSE)
1: getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "description"), 
       filters = "ensembl_gene_id", values = c("ENSBTAG00000045918", 
           "ENSBTAG00000045919", "ENSBTAG00000045920", "ENSBTAG00000045921", 
           "ENSBTAG00000045922", "ENSBTAG00000045923"), mart = ensembl)
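
For readers who want to try the warnings-to-errors step without a live BioMart query, here is a minimal sketch. It uses as.numeric("abc") purely as a stand-in for any call that warns (an assumption for illustration, unrelated to biomaRt) and restores the previous warn setting afterwards.

```r
# Promote warnings to errors, remembering the previous setting.
old <- options(warn = 2)

# as.numeric("abc") normally only warns; with warn = 2 it stops with an
# error, which tryCatch() captures here so the script keeps running.
res <- tryCatch(
  as.numeric("abc"),
  error = function(e) conditionMessage(e)
)

# Restore the original warning behaviour.
options(old)

print(res)
```

At an interactive prompt you would skip the tryCatch() and call traceback() right after the error, as in the transcript above.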

Looks like it's in scan, so I debugged that and tried again

debug(scan)
annotation <- ...
...
debugging in: scan(file, what = "", sep = sep, quote = quote, nlines = 1, quiet = TRUE, 
    skip = 0, strip.white = TRUE, blank.lines.skip = blank.lines.skip, 
    comment.char = comment.char, allowEscapes = allowEscapes, 
    encoding = encoding, skipNul = skipNul)
...

I took a look at file, guessing it is what was returned by biomaRt

Browse[2]> readLines(file)
 [1] "ENSBTAG00000045918\t\t"                                                                                                                    
 [2] "ENSBTAG00000045919\t\tUncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:G3N011]"                                                       
 [3] "ENSBTAG00000045920\t\t5S ribosomal RNA [Source:RFAM;Acc:RF00001]"                                                                          
 [4] "ENSBTAG00000045921\t\tUncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:G3N082]"                                                       
 [5] "ENSBTAG00000045922\t\tOlfactory receptor  [Source:UniProtKB/TrEMBL;Acc:G3MX62]"                                                            
 [6] "ENSBTAG00000045918\t\t"                                                                                                                    
 [7] "ENSBTAG00000045919\t\tUncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:G3N011]"                                                       
 [8] "ENSBTAG00000045920\t\t5S ribosomal RNA [Source:RFAM;Acc:RF00001]"                                                                          
 [9] "ENSBTAG00000045921\t\tUncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:G3N082]"                                                       
[10] "ENSBTAG00000045922\t\tOlfactory receptor  [Source:UniProtKB/TrEMBL;Acc:G3MX62]"                                                            
[11] "ENSBTAG00000045923\t\tMHC class II antigen; Putative MHC class II antigen\"; Uncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:Q70IB5]"
[12] ""                                                                                                                              
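
Raw lines like these can also be screened programmatically. The sketch below uses two of the lines copied from the output above; has_unbalanced_quote() is a hypothetical helper written for this illustration (it is not part of biomaRt) that flags lines holding an odd number of double-quote characters.

```r
# Two of the raw lines from the readLines() output, copied verbatim.
raw_lines <- c(
  "ENSBTAG00000045922\t\tOlfactory receptor  [Source:UniProtKB/TrEMBL;Acc:G3MX62]",
  "ENSBTAG00000045923\t\tMHC class II antigen; Putative MHC class II antigen\"; Uncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:Q70IB5]"
)

# Hypothetical helper: TRUE where a line contains an odd number of
# double-quote characters, i.e. a quote that is opened but never closed.
has_unbalanced_quote <- function(x) {
  n_quotes <- lengths(regmatches(x, gregexpr("\"", x, fixed = TRUE)))
  n_quotes %% 2L == 1L
}

# Index of the suspicious line(s) within raw_lines.
bad <- which(has_unbalanced_quote(raw_lines))
print(bad)
```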

Line 11 looks suspicious: the description contains a double quote (printed as the escaped \"), and the line numbered '2' in the traceback() shows that read.table() is called with quote = "\"", so this character is treated as the opening of a quoted string. When I look at the record online I see the same quotation mark. I think (a) it's a data entry error in the BioMart data; (b) it's not easy to work around, in particular because the biomaRt author has explicitly indicated that the quotation mark should open a string; and (c) this can be quite bad: the reason you can't isolate the problem to a single ID seems to be that, with a single ID, scan() guesses the wrong number of fields to parse and biomaRt somehow quietly ignores the record. I think the best bet is to get this fixed upstream.
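
The parsing behaviour can be reproduced offline. The sketch below feeds the six records from the readLines() output to read.table() using the arguments visible in the traceback, once with quote = "\"" and once with quote handling switched off. header = FALSE and the helper name read_collecting_warnings() are assumptions made for this illustration. This only demonstrates the diagnosis: per the traceback, getBM() hardcodes the quote argument, so it is not a setting you can change from outside.

```r
# The six records from the readLines() output, with the bad line last.
records <- c(
  "ENSBTAG00000045918\t\t",
  "ENSBTAG00000045919\t\tUncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:G3N011]",
  "ENSBTAG00000045920\t\t5S ribosomal RNA [Source:RFAM;Acc:RF00001]",
  "ENSBTAG00000045921\t\tUncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:G3N082]",
  "ENSBTAG00000045922\t\tOlfactory receptor  [Source:UniProtKB/TrEMBL;Acc:G3MX62]",
  "ENSBTAG00000045923\t\tMHC class II antigen; Putative MHC class II antigen\"; Uncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:Q70IB5]"
)

# Hypothetical helper: run read.table() with the given quote set and return
# both the data frame and any warning messages raised along the way.
read_collecting_warnings <- function(quote) {
  msgs <- character()
  df <- withCallingHandlers(
    read.table(text = records, sep = "\t", header = FALSE, quote = quote,
               comment.char = "", check.names = FALSE,
               stringsAsFactors = FALSE),
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(data = df, warnings = msgs)
}

# Same quoting rule as in the traceback: the stray quote opens a string
# that never closes.
with_quote <- read_collecting_warnings("\"")

# With quoting disabled the character is treated as ordinary data.
no_quote <- read_collecting_warnings("")

print(with_quote$warnings)
print(no_quote$data[6, 3])
```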

 

 


Hi Martin, thanks for the quick response and the rationale. I'll drop the listed maintainer an email to try and get this resolved.

written 19 months ago by andrew.j.skelton73

For the record, I received the following update from the biomaRt maintainer, in part:

I'm in contact with Ensembl. It is an issue in the Ensembl BioMart content as quotes should not be in data fields.  

I think the best place to fix this is there and not try to figure out a workaround in biomaRt as then we would start to do this for every error they might include.

The way to report issues with the Ensembl data content is to email them at helpdesk@ensembl.org; they are usually quick to respond as they have a team just for that ;)

written 19 months ago by Martin Morgan