Scan error using biomart getBM with version option and host set
Marc Saric ▴ 70
@marc-saric-1645

Dear all,

I ran the following code and hit the error already described in https://support.bioconductor.org/p/104454/, https://support.bioconductor.org/p/104845/, and https://support.bioconductor.org/p/106479/.

Additional note: the main Ensembl BioMart currently seems to be down for maintenance, so I had to use one of the mirror sites as advised.

My code is below (only the relevant portion; I can try to provide a minimal runnable example if needed). I am currently stuck.

Thank you.

# Other libraries
library("BiocParallel")
library("DESeq2")
library("ggrepel")
library("tidyverse")
library("readr")
library("stringr")
library("AnnotationDbi")
library("EnsDb.Hsapiens.v86")

library("biomaRt")

ENSEMBL_DB_HOST = "useast.ensembl.org" # Set back to default, once they are up and running again
ENSEMBL_VERSION = "Ensembl Genes 96"  # Try to fix https://support.bioconductor.org/p/104454/

mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", host = ENSEMBL_DB_HOST, version = ENSEMBL_VERSION)

go_sets <- getBM(attributes =  c("ensembl_gene_id", "hgnc_symbol", "entrezgene", "go_id", "name_1006", "definition_1006", "go_linkage_type", "namespace_1003"),
                    filters = "ensembl_gene_id",
                    values = gsub("\\..*", "", row.names(res)),
                    mart = mart
                  )

(res is a DESeq2 results object with Ensembl gene IDs as row names. The gsub() call strips the version suffix from each ID.)

The stack trace:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 5308 did not have 8 elements
Traceback:

1. getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene", 
 .     "go_id", "name_1006", "definition_1006", "go_linkage_type", 
 .     "namespace_1003"), filters = "ensembl_gene_id", values = gsub("\\..*", 
 .     "", row.names(res)), mart = mart)   # at line 4-8 of file <text>
2. read.table(con, sep = "\t", header = callHeader, quote = quote, 
 .     comment.char = "", check.names = FALSE, stringsAsFactors = FALSE)
3. scan(file = file, what = what, sep = sep, quote = quote, dec = dec, 
 .     nmax = nrows, skip = 0, na.strings = na.strings, quiet = TRUE, 
 .     fill = fill, strip.white = strip.white, blank.lines.skip = blank.lines.skip, 
 .     multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes, 
 .     flush = flush, encoding = encoding, skipNul = skipNul)

Version info

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] biomaRt_2.38.0              GO.db_3.7.0                
 [3] org.Hs.eg.db_3.7.0          EnsDb.Hsapiens.v86_2.99.0  
 [5] ensembldb_2.6.8             AnnotationFilter_1.6.0     
 [7] GenomicFeatures_1.34.8      AnnotationDbi_1.44.0       
 [9] forcats_0.4.0               stringr_1.4.0              
[11] dplyr_0.8.0.1               purrr_0.3.2                
[13] readr_1.3.1                 tidyr_0.8.3                
[15] tibble_2.1.1                tidyverse_1.2.1            
[17] ggrepel_0.8.0               ggplot2_3.1.1              
[19] DESeq2_1.22.2               SummarizedExperiment_1.12.0
[21] DelayedArray_0.8.0          matrixStats_0.54.0         
[23] Biobase_2.42.0              GenomicRanges_1.34.0       
[25] GenomeInfoDb_1.18.2         IRanges_2.16.0             
[27] S4Vectors_0.20.1            BiocGenerics_0.28.0        
[29] BiocParallel_1.16.6        

loaded via a namespace (and not attached):
 [1] colorspace_1.4-1         IRdisplay_0.7.0          htmlTable_1.13.1        
 [4] XVector_0.22.0           base64enc_0.1-3          rstudioapi_0.10         
 [7] bit64_0.9-7              lubridate_1.7.4          xml2_1.2.0              
[10] splines_3.5.1            geneplotter_1.60.0       knitr_1.22              
[13] IRkernel_0.8.15          Formula_1.2-3            jsonlite_1.6            
[16] Rsamtools_1.34.1         broom_0.5.2              annotate_1.60.1         
[19] cluster_2.0.9            compiler_3.5.1           httr_1.4.0              
[22] backports_1.1.4          assertthat_0.2.1         Matrix_1.2-17           
[25] lazyeval_0.2.2           cli_1.1.0                acepack_1.4.1           
[28] htmltools_0.3.6          prettyunits_1.0.2        tools_3.5.1             
[31] gtable_0.3.0             glue_1.3.1               GenomeInfoDbData_1.2.0  
[34] Rcpp_1.0.1               cellranger_1.1.0         Biostrings_2.50.2       
[37] nlme_3.1-139             rtracklayer_1.42.2       xfun_0.6                
[40] rvest_0.3.3              XML_3.98-1.19            zlibbioc_1.28.0         
[43] scales_1.0.0             hms_0.4.2                ProtGenerics_1.14.0     
[46] RColorBrewer_1.1-2       curl_3.3                 memoise_1.1.0           
[49] gridExtra_2.3            rpart_4.1-15             latticeExtra_0.6-28     
[52] stringi_1.4.3            RSQLite_2.1.1            genefilter_1.64.0       
[55] checkmate_1.9.3          repr_0.19.2              rlang_0.3.4             
[58] pkgconfig_2.0.2          bitops_1.0-6             evaluate_0.13           
[61] lattice_0.20-38          GenomicAlignments_1.18.1 htmlwidgets_1.2         
[64] bit_1.1-14               tidyselect_0.2.5         plyr_1.8.4              
[67] magrittr_1.5             R6_2.4.0                 generics_0.0.2          
[70] Hmisc_4.2-0              pbdZMQ_0.3-3             DBI_1.0.0               
[73] pillar_1.3.1             haven_2.1.0              foreign_0.8-71          
[76] withr_2.1.2              survival_2.44-1.1        RCurl_1.95-4.12         
[79] nnet_7.3-12              modelr_0.1.4             crayon_1.3.4            
[82] uuid_0.1-2               progress_1.2.0           locfit_1.5-9.1          
[85] grid_3.5.1               readxl_1.3.1             data.table_1.12.2       
[88] blob_1.1.1               digest_0.6.18            xtable_1.8-4            
[91] munsell_0.5.0
Mike Smith ★ 6.6k
@mike-smith
EMBL Heidelberg

I'm on my phone at the moment so I can't test this, but it looks like one of your result lines comes back from Ensembl missing a value, and biomaRt can't handle it. One thing you can try is leaving out the namespace_1003 attribute to check whether that's the one causing the problem.

You can also try changing the order of the attributes you request, which often forces Ensembl to insert a blank value in the gap.

Otherwise you'll probably need to provide the gene IDs you're using to help me debug this.
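
A minimal sketch of that elimination test, not part of biomaRt: the helper name `find_problem_attributes` and its `run_query` argument are hypothetical, introduced here only for illustration. It reruns the query once per attribute with that attribute left out and reports which omissions make the error go away. The query is passed in as a function so the logic can be tried without network access.

```r
# Hypothetical helper (not a biomaRt function): drop one attribute at a time
# and report the attributes whose removal lets the query succeed.
find_problem_attributes <- function(attributes, run_query, keep = attributes[1]) {
  # Never drop the attribute(s) in `keep`, e.g. the ID column used as filter.
  candidates <- setdiff(attributes, keep)
  succeeded <- vapply(candidates, function(dropped) {
    result <- tryCatch(run_query(setdiff(attributes, dropped)),
                       error = function(e) e)
    !inherits(result, "error")
  }, logical(1))
  candidates[succeeded]
}

# Possible usage with the objects from the question (needs network access,
# so it is left commented out; `mart` and `res` are assumed to exist):
# find_problem_attributes(
#   attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene", "go_id",
#                  "name_1006", "definition_1006", "go_linkage_type",
#                  "namespace_1003"),
#   run_query = function(attrs) {
#     getBM(attributes = attrs, filters = "ensembl_gene_id",
#           values = gsub("\\..*", "", row.names(res)), mart = mart)
#   }
# )
```

If the returned vector is empty, no single attribute is responsible on its own and the gene IDs themselves need a closer look.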

Hi Mike,

Thank you very much. I did that.

The error persists today using ENSEMBL_DB_HOST = "www.ensembl.org", so you were right. I scanned through the gene IDs to build a minimal example that reproduces the failure.

The culprit was ENSG00000100036:

go_sets <- getBM(attributes =  c("ensembl_gene_id", "hgnc_symbol", "entrezgene", "go_id", "name_1006", "go_linkage_type", "namespace_1003", "definition_1006"),
                     filters = "ensembl_gene_id",
                     #values = gsub("\\..*", "", row.names(res)[2170]),
                     values = c('ENSG00000100036'),
                     mart = mart,
                     verbose = TRUE,
                     bmHeader = FALSE
                    )

which yields (abbreviated due to the size restrictions of the post):

    <?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query  virtualSchemaName = 'default' uniqueRows = '1' count = '0' datasetConfigVersion = '0.6' header='1' requestid= 'biomaRt'> <Dataset name = 'hsapiens_gene_ensembl'><Attribute name = 'ensembl_gene_id'/><Attribute name = 'hgnc_symbol'/><Attribute name = 'entrezgene'/><Attribute name = 'go_id'/><Attribute name = 'name_1006'/><Attribute name = 'go_linkage_type'/><Attribute name = 'namespace_1003'/><Attribute name = 'definition_1006'/><Filter name = 'ensembl_gene_id' value = 'ENSG00000100036' /></Dataset></Query>

    #################
    Results from server:
    [1] "Gene stable ID\tHGNC symbol\tNCBI gene ID\tGO term accession\tGO term name\tGO term evidence code\tGO domain\tGO term definition\nENSG00000100036\tSLC35E4\t339665\tGO:0016020\tmembrane\tIEA\tcellular_component\tA lipid bilayer along with all the proteins and protein complexes embedded in it an attached to it.\nENSG00000100036\tSLC35E4\t339665\tGO:0016021\tintegral component of membrane\tIEA\tcellular_component\tThe component of a 
[...]    
    Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 9 did not have 8 elements
    Traceback:

    1. getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene", 
     .     "go_id", "name_1006", "go_linkage_type", "namespace_1003", 
     .     "definition_1006"), filters = "ensembl_gene_id", values = c("ENSG00000100036"), 
     .     mart = mart, verbose = TRUE, bmHeader = FALSE)   # at line 8-15 of file <text>
    2. read.table(con, sep = "\t", header = callHeader, quote = quote, 
     .     comment.char = "", check.names = FALSE, stringsAsFactors = FALSE)
    3. scan(file = file, what = what, sep = sep, quote = quote, dec = dec, 
     .     nmax = nrows, skip = 0, na.strings = na.strings, quiet = TRUE, 
     .     fill = fill, strip.white = strip.white, blank.lines.skip = blank.lines.skip, 
     .     multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes, 
     .     flush = flush, encoding = encoding, skipNul = skipNul)

This can be circumvented by not including the 'definition_1006' attribute.
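
The scan through the gene IDs can be sketched as a bisection. This is only an illustration, under the assumption that a single ID reproduces the error on its own; `find_failing_id` and `run_query` are hypothetical names, not biomaRt functions.

```r
# Hypothetical helper (not a biomaRt function): narrow a failing set of IDs
# down to one ID that still triggers the error when queried alone.
find_failing_id <- function(ids, run_query) {
  fails <- function(x) {
    inherits(tryCatch(run_query(x), error = function(e) e), "error")
  }
  ids <- unique(ids)
  if (length(ids) == 0 || !fails(ids)) return(NULL)
  while (length(ids) > 1) {
    first_half <- ids[seq_len(length(ids) %/% 2)]
    # Assumption: if the first half succeeds, the problem is in the rest.
    ids <- if (fails(first_half)) first_half else setdiff(ids, first_half)
  }
  # Confirm that the remaining ID really fails by itself.
  if (fails(ids)) ids else NULL
}

# Possible usage with the objects from this thread (needs network access,
# so it is left commented out; `mart` and `res` are assumed to exist):
# find_failing_id(
#   ids = gsub("\\..*", "", row.names(res)),
#   run_query = function(x) {
#     getBM(attributes = c("ensembl_gene_id", "go_id", "definition_1006"),
#           filters = "ensembl_gene_id", values = x, mart = mart)
#   }
# )
```

Each round halves the candidate set, so even several thousand IDs need only a dozen or so queries.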

Is there any chance that such issues could be handled more gracefully in the future?

This is an unusual case: the offending entry actually contains an extra \n in the GO description. That is why changing the column order doesn't help. It just shifts the erroneous 'new row' around, but the row will always be there.
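
The effect can be reproduced without contacting Ensembl. The sketch below uses made-up rows (not real Ensembl output) and calls read.table() with arguments similar to those in the traceback; `parse_response` is a hypothetical helper name used only for this illustration.

```r
# Toy tab-separated response (made-up data): the last field of the first
# record contains a stray line break, so its tail lands on a line of its own.
raw <- paste0(
  "gene\tgo_id\tdefinition\n",
  "G1\tGO:1\tfirst half of the definition\n",
  "second half of the definition\n",
  "G2\tGO:2\tanother definition\n"
)

# Hypothetical wrapper around read.table(), loosely mirroring the traceback.
parse_response <- function(txt, ...) {
  read.table(text = txt, sep = "\t", header = TRUE, quote = "",
             comment.char = "", check.names = FALSE,
             stringsAsFactors = FALSE, ...)
}

# Strict parsing stops with an error, because one line has fewer fields
# than the header.
strict <- tryCatch(parse_response(raw), error = function(e) e)

# fill = TRUE avoids the error, but the stray fragment silently becomes a
# bogus extra row whose first column holds the tail of the definition.
filled <- parse_response(raw, fill = TRUE)
```

Padding short lines hides the symptom rather than fixing the data, which is why the line break has to be dealt with at the source.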

The simplest solution might actually be to contact Ensembl and try to understand why there is a line break, and either remove it or escape it properly. Otherwise I'm not sure you can be 100% confident that a \n doesn't truly represent a new entry.

I've started an issue at https://github.com/grimbough/biomaRt/issues/16 and will keep track of any progress there.
