biomaRt fails to retrieve GO descriptions for particular genes
foehn ▴ 100
@foehn-16281
United States

Hello,

I'm having trouble retrieving GO terms and definitions for a list of genes using biomaRt. The problem seems to be specific to some genes rather than all of them. For example,

> library("tibble")
> library("biomaRt")
> BM = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

Below is a successful query.

> tibble(getBM(attributes = c("external_gene_name", "definition_1006"), filters = "external_gene_name", values = "RUNX1", mart = BM))
# A tibble: 39 x 1
   `getBM(...)`$external_ge… $definition_1006                                   
   <chr>                     <chr>                                              
 1 RUNX1                     Any molecular function by which a gene product int…
 2 RUNX1                     Any process that modulates the frequency, rate or …
 3 RUNX1                     A membrane-bounded organelle of eukaryotic cells i…
 4 RUNX1                     Interacting selectively and non-covalently with AT…
 5 RUNX1                     A protein or a member of a complex that interacts …
 6 RUNX1                     Any process that activates or increases the freque…
 7 RUNX1                     Organized structure of distinctive morphology and …
 8 RUNX1                     The part of the cytoplasm that does not contain or…
 9 RUNX1                     That part of the nuclear content other than the ch…
10 RUNX1                     Interacting selectively and non-covalently with an…
# … with 29 more rows

However, when I change RUNX1 to BCOR, the query fails:

> tibble(getBM(attributes = c("external_gene_name", "definition_1006"), filters = "external_gene_name", values = "BCOR", mart = BM))
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 22 did not have 2 elements

But I can confirm BCOR is a valid gene symbol,

> tibble(getBM(attributes = c("external_gene_name", "ensembl_gene_id", "go_id"), filters = "external_gene_name", values = "BCOR", mart = BM))
# A tibble: 24 x 1
   `getBM(...)`$external_gene_name $ensembl_gene_id $go_id    
   <chr>                           <chr>            <chr>     
 1 BCOR                            ENSG00000183337  GO:0005515
 2 BCOR                            ENSG00000183337  GO:0005634
 3 BCOR                            ENSG00000183337  GO:0006325
 4 BCOR                            ENSG00000183337  GO:0004842
 5 BCOR                            ENSG00000183337  GO:0000122
 6 BCOR                            ENSG00000183337  GO:0003714
 7 BCOR                            ENSG00000183337  GO:0007507
 8 BCOR                            ENSG00000183337  GO:0008134
 9 BCOR                            ENSG00000183337  GO:0044212
10 BCOR                            ENSG00000183337  GO:0045892
# … with 14 more rows

It appears that definition_1006 just does not work for BCOR. This baffles me. Does anybody know what went wrong here? Thanks.

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin16.7.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS: /usr/local/R/3.5.2/lib/R/lib/libRblas.dylib
LAPACK: /usr/local/R/3.5.2/lib/R/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tibble_2.1.1   biomaRt_2.38.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1           AnnotationDbi_1.44.0 magrittr_1.5        
 [4] BiocGenerics_0.28.0  hms_0.4.2            progress_1.2.0      
 [7] IRanges_2.16.0       bit_1.1-14           R6_2.4.0            
[10] rlang_0.3.4          fansi_0.4.0          httr_1.4.0          
[13] stringr_1.4.0        blob_1.1.1           tools_3.5.2         
[16] parallel_3.5.2       Biobase_2.42.0       utf8_1.1.4          
[19] cli_1.1.0            DBI_1.0.0            bit64_0.9-7         
[22] digest_0.6.18        assertthat_0.2.1     crayon_1.3.4        
[25] S4Vectors_0.20.1     bitops_1.0-6         curl_3.3            
[28] RCurl_1.95-4.12      memoise_1.1.0        RSQLite_2.1.1       
[31] stringi_1.4.3        pillar_1.3.1         compiler_3.5.2      
[34] prettyunits_1.0.2    stats4_3.5.2         XML_3.98-1.19       
[37] pkgconfig_2.0.2
Mike Smith ★ 6.5k
@mike-smith
EMBL Heidelberg

This occurs because a small number of the GO term descriptions contain newline characters. In the background, biomaRt is basically parsing a TSV file containing the results. An embedded newline splits one record across two lines, so one of those lines ends up with fewer fields than expected, and you get the error you're seeing.
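To make the failure mode concrete, here is a minimal sketch in base R that needs no network access. The TSV text is made up for illustration and is not real BioMart output; it only mimics a definition that contains a newline.

```r
# Two logical records; the second "definition" contains an embedded newline.
# (Illustrative text only, not actual BioMart output.)
tsv <- "BCOR\tFirst definition.\nBCOR\tSecond definition, part one\npart two.\n"

# read.table() works on physical lines, so it sees three lines,
# and the third has only one field instead of two.
result <- tryCatch(
  read.table(text = tsv, sep = "\t", quote = "", stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)

# 'result' holds the error message rather than a data frame.
print(result)
```

The message should be of the same "did not have 2 elements" kind as the one reported in the question, only with a different line number.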

There is a fix for this in the devel branch of biomaRt, e.g.

> getBM(attributes = c("external_gene_name", "definition_1006"), 
+       filters = "external_gene_name", 
+       values = "BCOR", 
+       mart = BM) %>% as_tibble()
# A tibble: 24 x 2
   external_gene_name definition_1006                                          
   <chr>              <chr>                                                    
 1 BCOR               Interacting selectively and non-covalently with any prot…
 2 BCOR               A membrane-bounded organelle of eukaryotic cells in whic…
 3 BCOR               Any process that results in the specification, formation…
 4 BCOR               Catalysis of the transfer of ubiquitin from one protein …
 5 BCOR               Any process that stops, prevents, or reduces the frequen…
 6 BCOR               A protein or a member of a complex that interacts specif…
 7 BCOR               The process whose specific outcome is the progression of…
 8 BCOR               Interacting selectively and non-covalently with a transc…
 9 BCOR               Interacting selectively and non-covalently with a DNA re…
10 BCOR               Any process that stops, prevents, or reduces the frequen…
# … with 14 more rows

The easiest way to install this with your current setup is from GitHub via devtools::install_github('grimbough/biomaRt'). However, there are quite a lot of changes in devel, so I can't guarantee it works with R-3.5.2. I would suggest at least upgrading to R-3.6.1 and using Bioconductor 3.8 to maximise the chances of it working. If there are still issues, please report back here and I'll try to port the relevant bits of code to the release version.
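If upgrading is not an option, one possible workaround (my own assumption, not part of the fix described above) is to rely on the go_id query, which already works for BCOR, and look up the definitions locally. The sketch below assumes the Bioconductor annotation package GO.db is installed and uses two GO IDs copied from the output in the question. Definitions in GO.db may differ in version from those served by Ensembl.

```r
# GO IDs taken from the successful go_id query for BCOR in the question.
go_ids <- c("GO:0005515", "GO:0005634")

# Assumption: GO.db is installed. If it is not, skip the lookup.
if (requireNamespace("GO.db", quietly = TRUE)) {
  # Look up term names and definitions locally, avoiding the TSV parsing step.
  defs <- AnnotationDbi::select(
    GO.db::GO.db,
    keys    = go_ids,
    columns = c("TERM", "DEFINITION"),
    keytype = "GOID"
  )
  print(defs)
} else {
  defs <- NULL
  message("GO.db is not installed; install it via BiocManager::install('GO.db').")
}
```

The resulting data frame can then be joined to the getBM() result on the GO ID column.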
