Question: biomaRt: filtering on attributes that aren't in listFilters
so wrote (11 weeks ago):

Hi, I have a “strategy” question.

I searched the documentation and forums, and I think it's not possible to filter by attributes that don’t appear in the output of listFilters() (e.g. GO description). (If it’s not clear what I want to do: I essentially want to follow this example, but filter on the GO description using the value “MAP kinase activity”, rather than on GO IDs using the value “GO:0004707”.)

My current solution is to download all the GO IDs and GO descriptions in a mart, search that table to get the unique GO IDs, then use those with biomaRt. Is this the recommended way to do it? I think I would really only need the unique GO IDs and descriptions (vs. downloading everything from each mart), but I'm not confident the data in e.g. GO.db would match the data in biomaRt.

I would appreciate any advice/comments. Thank you in advance for your help!

Answer: biomaRt: filtering on attributes that aren't in listFilters
Mike Smith (EMBL Heidelberg / de.NBI) wrote (11 weeks ago):

You can use the function searchFilters() to try and find a filter you're interested in. Since the filter ids can sometimes be a bit cryptic, it looks in both the id and the more verbose description to try and find a match, and hopefully returns a list that's a bit easier to look through than getting everything back via listFilters(). Here's an example with the Human Genes mart:

library(biomaRt)
mart <- useEnsembl('ensembl', dataset = 'hsapiens_gene_ensembl')
searchFilters(mart, 'go_')
#>                    name             description
#> 188      go_parent_term   Parent term accession
#> 189      go_parent_name        Parent term name
#> 190    go_evidence_code        GO Evidence code
#> 230 with_cdingo_homolog Orthologous Dingo Genes

Here the filter you want is probably the one described as 'Parent term name' (go_parent_name). The descriptions are still not a perfect match to how things are named on the Ensembl website, but with only a few options returned it should be easier to check each one if the right choice isn't immediately clear.
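
To make the idea of matching on both the id and the description concrete, here is a rough base R sketch of that kind of search. It does not call biomaRt and is not the package's implementation: filters_tbl is a toy table copied by hand from the output above, and search_filters_sketch() is a hypothetical helper.

```r
# Toy stand-in for the data frame returned by listFilters(mart);
# the rows are copied by hand from the searchFilters() output above.
filters_tbl <- data.frame(
  name = c("go_parent_term", "go_parent_name",
           "go_evidence_code", "with_cdingo_homolog"),
  description = c("Parent term accession", "Parent term name",
                  "GO Evidence code", "Orthologous Dingo Genes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of biomaRt): keep the rows where the
# pattern matches either the filter id or its description.
search_filters_sketch <- function(tbl, pattern) {
  hit <- grepl(pattern, tbl$name) | grepl(pattern, tbl$description)
  tbl[hit, , drop = FALSE]
}

# "term name" does not occur in any id, but it does occur in one description.
hits <- search_filters_sketch(filters_tbl, "term name")
print(hits)
```

Searching the description as well as the id is what lets a phrase such as "term name" find go_parent_name even though the id itself is written with underscores.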

You can then use go_parent_name as a filter on the mart, e.g.

# 'filters' is spelled out in full rather than relying on partial argument matching
getBM(mart = mart,
      filters = "go_parent_name",
      values = "MAP kinase activity",
      attributes = c("ensembl_gene_id"))
#>    ensembl_gene_id
#> 1  ENSG00000188130
#> 2  ENSG00000166484
#> 3  ENSG00000185386
#> 4  ENSG00000181085
#> 5  ENSG00000141639
#> ...

One thing to bear in mind is that the search here is case sensitive, so it's very easy to get zero results for an otherwise fine-looking query, e.g.

# Same query, but with different capitalisation in the value
getBM(mart = mart,
      filters = "go_parent_name",
      values = "MAP Kinase Activity",
      attributes = c("ensembl_gene_id"))
#> [1] ensembl_gene_id
#> <0 rows> (or 0-length row.names)

It might be preferable to stick with using GO IDs unless you're confident that your list of description terms matches the form used internally by Ensembl.
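
If you do want to start from descriptions, one way to guard against the capitalisation problem is to map each term onto a reference spelling before querying. The sketch below is only an illustration in base R: known_terms is a toy vector of term names that appear in this thread (in practice it would have to come from a GO annotation source that you trust to match Ensembl), and match_term_case() is a hypothetical helper.

```r
# Toy reference vector of term names; an assumption for illustration only.
known_terms <- c("MAP kinase activity", "JUN kinase activity",
                 "regulation of MAP kinase activity")

# Hypothetical helper: return the reference spelling of each query term,
# comparing whole terms without regard to case. Terms with no match give NA.
match_term_case <- function(query, known) {
  known[match(tolower(query), tolower(known))]
}

fixed <- match_term_case(c("MAP Kinase Activity", "not a term"), known_terms)
print(fixed)
```

Any NA in the result flags a term that would have silently returned zero rows, so it can be checked by hand before it is passed to getBM().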


Hi Mike,

thank you so much for the detailed answer and so so sorry about the very late reply (I'm not sure what I did to my notifications)!

I see where I went wrong now: I didn’t realise that’s what go_parent_name/"Parent term name" meant. searchFilters definitely sounds very useful!

For anyone else who might find this useful, note the difference between go_parent_name (which can only be used as a filter) vs name_1006 (which can only be used as an attribute):

library(biomaRt)

mart <- useEnsembl('ensembl', dataset = 'hsapiens_gene_ensembl')

strict_res <- getBM(mart = mart,
  filters = "go_parent_name",
  values = "MAP kinase activity",
  attributes = c("ensembl_gene_id", "go_id", "name_1006"))

dim(strict_res)
#> [1] 18  3
unique(strict_res$name_1006)
#> [1] "MAP kinase activity" "JUN kinase activity"
unique(strict_res$go_id)
#> [1] "GO:0004707" "GO:0004705"

Created on 2019-03-27 by the reprex package (v0.2.1)
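
The filter/attribute distinction noted above can also be checked programmatically. This is a minimal sketch under stated assumptions: where_usable() is a hypothetical helper, and the two toy tables stand in for the data frames returned by listFilters(mart) and listAttributes(mart), both of which have a name column. The toy rows only include ids mentioned in this thread.

```r
# Hypothetical helper: report whether an id can be used as a filter,
# as an attribute, or both, given tables that each have a `name` column.
where_usable <- function(id, filters_tbl, attributes_tbl) {
  c(filter = id %in% filters_tbl$name,
    attribute = id %in% attributes_tbl$name)
}

# Toy stand-ins for listFilters(mart) and listAttributes(mart).
toy_filters <- data.frame(
  name = c("go_parent_term", "go_parent_name"),
  stringsAsFactors = FALSE
)
toy_attributes <- data.frame(
  name = c("ensembl_gene_id", "go_id", "name_1006"),
  stringsAsFactors = FALSE
)

res_filter <- where_usable("go_parent_name", toy_filters, toy_attributes)
res_attr <- where_usable("name_1006", toy_filters, toy_attributes)
print(res_filter)
print(res_attr)
```

With a live mart, the same helper could be called as where_usable("name_1006", listFilters(mart), listAttributes(mart)).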

For my particular use case, I’d prefer to make the search case insensitive and to detect a word/phrase rather than look for an exact match. In case someone else has a similar use case, this is what I’m using (any feedback is always welcome!):

library(biomaRt)
library(GO.db)
library(dplyr)

mart <- useEnsembl('ensembl', dataset = 'hsapiens_gene_ensembl')

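# Build a table of all GO IDs and descriptions from GO.db, then keep the
# descriptions that contain the phrase, ignoring case. grepl() is vectorised,
# so no row-wise step is needed.
flexible_query <-
  Term(GOTERM) %>%
  tibble(go_id = names(.), go_description = .) %>%
  filter(grepl("MAP Kinase Activity", go_description, ignore.case = TRUE)) %>%
  pull(go_description)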

flexible_res <- getBM(mart = mart,
  filters = "go_parent_name",
  values = flexible_query,
  attributes = c("ensembl_gene_id", "go_id", "name_1006"))

dim(flexible_res)
#> [1] 484   3
unique(flexible_res$name_1006)
#>  [1] "activation of MAPK activity"                                            
#>  [2] "positive regulation of MAP kinase activity"                             
#>  [3] "positive regulation of JUN kinase activity"                             
#>  [4] "activation of MAPKKK activity"                                          
#>  [5] "negative regulation of MAP kinase activity"                             
#>  [6] "negative regulation of JUN kinase activity"                             
#>  [7] "activation of MAPKK activity"                                           
#>  [8] "inactivation of MAPK activity"                                          
#>  [9] "inactivation of MAPKK activity"                                         
#> [10] "activation of JUN kinase activity"                                      
#> [11] "MAP kinase activity"                                                    
#> [12] "regulation of MAP kinase activity"                                      
#> [13] "activation of JNKK activity"                                            
#> [14] "inactivation of MAPK activity involved in osmosensory signaling pathway"
#> [15] "JUN kinase activity"                                                    
#> [16] "regulation of JUN kinase activity"
unique(flexible_res$go_id)
#>  [1] "GO:0000187" "GO:0043406" "GO:0043507" "GO:0000185" "GO:0043407"
#>  [6] "GO:0043508" "GO:0000186" "GO:0000188" "GO:0051389" "GO:0007257"
#> [11] "GO:0004707" "GO:0043405" "GO:0007256" "GO:0000173" "GO:0004705"
#> [16] "GO:0043506"

# filtering for just the results from the strict search:
filtered_flexible_res <- 
  filter(flexible_res, name_1006 %in% c("MAP kinase activity", "JUN kinase activity"))
dim(filtered_flexible_res)
#> [1] 18  3
unique(filtered_flexible_res$name_1006)
#> [1] "MAP kinase activity" "JUN kinase activity"
unique(filtered_flexible_res$go_id)
#> [1] "GO:0004707" "GO:0004705"

Created on 2019-03-27 by the reprex package (v0.2.1)

written 24 days ago by so