biomaRt returns multiple GO terms when filtering on a single GO term in the Ensembl database
henrydionne ▴ 10

I'm trying to get the descriptions for a vector of GO terms using biomaRt with the Ensembl database, and I'm receiving a timeout. I only have 200 GO terms, so I wouldn't expect a timeout. I believe the issue is that when I filter on those GO terms, I get back more GO terms than I should.

I've rerun the problematic code, filtering on a single GO term instead of the entire set.

library(biomaRt)

ensembl <- useMart(
    biomart = "ensembl", 
    dataset = "mmusculus_gene_ensembl"
)

go_id_description_key <- getBM(
    attributes = c('go_id', 'name_1006', 'namespace_1003'),
    filters = "go",
    values = "GO:0002181",
    mart = ensembl,
    uniqueRows = TRUE
)

Doing so, I received a table of 320 different GO terms. I would have expected the code to return a single GO term, since I'm filtering on a single GO term. The returned table does include a row for that GO term, but also 319 rows for other GO terms.

sessionInfo() results:

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggpubr_0.6.0                tidyr_1.3.0                
 [3] purrr_1.0.2                 dplyr_1.1.3                
 [5] BiocFileCache_2.10.1        dbplyr_2.4.0               
 [7] fgsea_1.28.0                clusterProfiler_4.10.0     
 [9] DESeq2_1.42.0               SummarizedExperiment_1.32.0
[11] Biobase_2.62.0              MatrixGenerics_1.14.0      
[13] matrixStats_1.0.0           GenomicRanges_1.54.1       
[15] GenomeInfoDb_1.38.0         IRanges_2.36.0             
[17] S4Vectors_0.40.1            BiocGenerics_0.48.0        
[19] rgl_1.2.1                   latex2exp_0.9.6            
[21] EnhancedVolcano_1.20.0      ggrepel_0.9.4              
[23] UpSetR_1.4.0                pheatmap_1.0.12            
[25] ggplot2_3.4.4               svglite_2.1.2              
[27] conflicted_1.2.0            BiocManager_1.30.22        
[29] biomaRt_2.58.0             

loaded via a namespace (and not attached):
  [1] splines_4.3.1                 later_1.3.1                  
  [3] bitops_1.0-7                  ggplotify_0.1.2              
  [5] filelock_1.0.2                tibble_3.2.1                 
  [7] polyclip_1.10-6               XML_3.99-0.14                
  [9] lifecycle_1.0.4               rstatix_0.7.2                
 [11] lattice_0.21-9                MASS_7.3-60                  
 [13] backports_1.4.1               magrittr_2.0.3               
 [15] rmarkdown_2.25                yaml_2.3.7                   
 [17] httpuv_1.6.12                 cowplot_1.1.1                
 [19] DBI_1.1.3                     RColorBrewer_1.1-3           
 [21] abind_1.4-5                   zlibbioc_1.48.0              
 [23] ggraph_2.1.0                  RCurl_1.98-1.12              
 [25] yulab.utils_0.1.0             tweenr_2.0.2                 
 [27] rappdirs_0.3.3                GenomeInfoDbData_1.2.11      
 [29] enrichplot_1.22.0             tidytree_0.4.5               
 [31] codetools_0.2-19              DelayedArray_0.28.0          
 [33] DOSE_3.28.0                   xml2_1.3.5                   
 [35] ggforce_0.4.1                 tidyselect_1.2.0             
 [37] aplot_0.2.2                   farver_2.1.1                 
 [39] viridis_0.6.4                 base64enc_0.1-3              
 [41] jsonlite_1.8.7                ellipsis_0.3.2               
 [43] tidygraph_1.2.3               systemfonts_1.0.5            
 [45] tools_4.3.1                   progress_1.2.2               
 [47] treeio_1.26.0                 HPO.db_0.99.2                
 [49] Rcpp_1.0.11                   glue_1.6.2                   
 [51] gridExtra_2.3                 SparseArray_1.2.0            
 [53] xfun_0.40                     qvalue_2.34.0                
 [55] withr_2.5.2                   fastmap_1.1.1                
 [57] fansi_1.0.5                   digest_0.6.33                
 [59] R6_2.5.1                      mime_0.12                    
 [61] gridGraphics_0.5-1            colorspace_2.1-0             
 [63] GO.db_3.18.0                  RSQLite_2.3.2                
 [65] utf8_1.2.4                    generics_0.1.3               
 [67] data.table_1.14.8             prettyunits_1.2.0            
 [69] graphlayouts_1.0.1            httr_1.4.7                   
 [71] htmlwidgets_1.6.2             S4Arrays_1.2.0               
 [73] scatterpie_0.2.1              pkgconfig_2.0.3              
 [75] gtable_0.3.4                  blob_1.2.4                   
 [77] XVector_0.42.0                shadowtext_0.1.2             
 [79] htmltools_0.5.6.1             carData_3.0-5                
 [81] scales_1.2.1                  png_0.1-8                    
 [83] ggfun_0.1.3                   knitr_1.45                   
 [85] rstudioapi_0.15.0             reshape2_1.4.4               
 [87] nlme_3.1-163                  curl_5.1.0                   
 [89] cachem_1.0.8                  stringr_1.5.1                
 [91] BiocVersion_3.18.1            parallel_4.3.1               
 [93] HDO.db_0.99.1                 AnnotationDbi_1.64.1         
 [95] pillar_1.9.0                  grid_4.3.1                   
 [97] vctrs_0.6.4                   promises_1.2.1               
 [99] car_3.1-2                     xtable_1.8-4                 
[101] evaluate_0.23                 cli_3.6.1                    
[103] locfit_1.5-9.8                compiler_4.3.1               
[105] rlang_1.1.1                   crayon_1.5.2                 
[107] ggsignif_0.6.4                labeling_0.4.3               
[109] plyr_1.8.9                    fs_1.6.3                     
[111] stringi_1.7.12                viridisLite_0.4.2            
[113] BiocParallel_1.36.0           MPO.db_0.99.7                
[115] munsell_0.5.0                 Biostrings_2.70.1            
[117] lazyeval_0.2.2                GOSemSim_2.28.0              
[119] Matrix_1.6-1.1                hms_1.1.3                    
[121] patchwork_1.1.3               bit64_4.0.5                  
[123] KEGGREST_1.42.0               shiny_1.7.5.1                
[125] interactiveDisplayBase_1.40.0 AnnotationHub_3.10.0         
[127] broom_1.0.5                   igraph_1.5.1                 
[129] memoise_2.0.1                 ggtree_3.10.0                
[131] fastmatch_1.1-4               bit_4.0.5                    
[133] ape_5.7-1                     gson_0.1.0
Mike Smith ★ 6.6k

The issue here is that Ensembl BioMart is not really the right resource for this information. It's structured to return results about transcripts, rather than all the information in Ensembl. So what your query actually does is find all transcripts annotated with "GO:0002181" and then return all the GO IDs, names, and namespaces associated with that collection of transcripts. However, this isn't obvious, because you aren't returning a column with transcript or gene IDs.

Consider these two queries, which are identical except that one includes the transcript IDs as well:

go_list <- getBM(attributes=c("go_id", "name_1006", "namespace_1003"),
                 filters = "go",
                 values = c("GO:0002181"),
                 mart = ensembl, 
                 uniqueRows = FALSE)

go_list2 <- getBM(attributes=c("go_id", "name_1006", "namespace_1003",
                              "ensembl_transcript_id"),
                 filters = "go",
                 values = c("GO:0002181"),
                 mart = ensembl, 
                 uniqueRows = FALSE)

identical( nrow(go_list), nrow(go_list2) )
[1] TRUE

Also note that I've used the argument uniqueRows = FALSE to demonstrate more clearly that the results are transcript-centric. The default, uniqueRows = TRUE, removes duplicate rows from the result. That removes more rows from the first query than from the second, because without the transcript IDs many more of its rows are identical.
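
The transcript-centric behaviour can be mimicked offline with a toy table. The sketch below is illustrative only: the `annotations` data frame, its transcript IDs, and the pairings in it are made up, and base R stands in for what BioMart does on the server.

```r
# Toy annotation table: one row per (transcript, GO term) pair.
# All transcript IDs and pairings here are invented for illustration.
annotations <- data.frame(
  ensembl_transcript_id = c("T1", "T1", "T2", "T2", "T3"),
  go_id = c("GO:0002181", "GO:0005840", "GO:0002181", "GO:0003735", "GO:0006412"),
  stringsAsFactors = FALSE
)

target <- "GO:0002181"

# Step 1: the filter selects the transcripts annotated with the target term.
hits <- unique(annotations$ensembl_transcript_id[annotations$go_id == target])

# Step 2: the attributes are reported for those transcripts, so every GO
# term attached to them comes back, not only the one used in the filter.
returned <- annotations[annotations$ensembl_transcript_id %in% hits, ]
go_only <- unique(returned["go_id"])

# Keeping only the rows that match the filter value recovers the single term.
go_only[go_only$go_id %in% target, , drop = FALSE]
```

The same kind of row filter, for instance `subset(go_id_description_key, go_id %in% "GO:0002181")`, could presumably be applied to a real getBM() result, although the query would still download all of the extra rows first.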


Fortunately there are many ways to actually get the information you're looking for; one is to use the Ensembl REST API. Here's a small function that I think does what you're looking for.

library(httr)
library(jsonlite)
library(xml2)
library(tibble)

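getGoDetails <- function(go_id) {
  # Ensembl REST endpoint for looking up an ontology term by its ID
  server <- "https://rest.ensembl.org/ontology/id/"

  r <- GET(paste0(server, go_id),
           content_type("application/json"))

  # Raise an R error if the request was not successful
  stop_for_status(r)

  # Round-trip through JSON so the parsed fields simplify to plain vectors
  res <- fromJSON(toJSON(content(r)))
  tibble(id = res$accession, 
         name = res$name, 
         definition = res$definition)
}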

getGoDetails("GO:0002181")
# A tibble: 1 x 3
  id         name                    definition
  <chr>      <chr>                   <chr>
1 GO:0002181 cytoplasmic translation The chemical reactions and pathways resulting in the formation of a protein...
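
For a modest number of terms, one option is to apply a per-term lookup over the whole vector. The sketch below is only an outline under stated assumptions: `lookup_terms()`, its `pause` argument, and `stub_fetch()` are hypothetical names, and the lookup function is passed in as an argument (it could be a function like getGoDetails), so the loop can be tried with a stub and no network access.

```r
# Hypothetical helper: apply a single-term lookup function to many IDs.
# `fetch_one` must take one ID and return a one-row data frame (or tibble)
# with columns id, name, and definition.
lookup_terms <- function(go_ids, fetch_one, pause = 0) {
  results <- lapply(unique(go_ids), function(id) {
    Sys.sleep(pause)  # optional delay between requests
    tryCatch(
      fetch_one(id),
      # On failure, record the ID with missing values instead of aborting.
      error = function(e) data.frame(id = id, name = NA_character_,
                                     definition = NA_character_,
                                     stringsAsFactors = FALSE)
    )
  })
  do.call(rbind, results)
}

# Offline stub standing in for a real lookup; it fails on unknown IDs.
stub_fetch <- function(id) {
  known <- c("GO:0002181" = "cytoplasmic translation")
  if (!id %in% names(known)) stop("unknown term: ", id)
  data.frame(id = id, name = unname(known[id]),
             definition = NA_character_, stringsAsFactors = FALSE)
}

# "GO:9999999" is a made-up ID used to exercise the error branch.
lookup_terms(c("GO:0002181", "GO:0002181", "GO:9999999"), stub_fetch)
```

Duplicate IDs are looked up once, and a failed lookup yields a row of missing values, so one bad term does not discard the rest of the results.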

If you're looking to do this for a large number of terms, it would be much more efficient to find a file or database that contains all the information and work with that offline. For example, the GO.db package contains this information.

library(GO.db)
go_ids <- c("GO:0002181", "GO:0000001")
# select() is qualified here because dplyr also exports a function named select()
AnnotationDbi::select(x = GO.db, keytype = "GOID", keys = go_ids, columns = c("TERM", "DEFINITION"))
#> 'select()' returned 1:1 mapping between keys and columns
#>         GOID                      TERM
#> 1 GO:0002181   cytoplasmic translation
#> 2 GO:0000001 mitochondrion inheritance
#>                                                                                                                                                                                                                                          DEFINITION
#> 1 The chemical reactions and pathways resulting in the formation of a protein in the cytoplasm. This is a ribosome-mediated process in which the information in messenger RNA (mRNA) is used to specify the sequence of amino acids in the protein.
#> 2  The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.

This worked! Thank you for your help.
