Question: Getting R to talk to BioMart using biomaRt
0
12 months ago by
lm7950
United Kingdom
lm7950 wrote:

Greetings! I am new to this forum and posted a question recently to the RStudio community. A kind soul, Mara Averick pointed me to this forum.

My objective is to annotate a table of gene expression data "say for organism A" with information on orthologues from "organism B", as an additional column on a table. I watched tutorials online (YouTube) on how do do this using the BioMart GUI. Many clicks involved...

I have installed biomaRt, a Bioconductor package but have found no guidance on how to download orthologue genes (IDs, names, etc) from a specific species to my local table containing "organism A". As an example, I have table with expression data from Drosophila melanogaster and wanted to insert a new vector containing the names of orthologues in another species, mus musculus, for example. Anyone has a code example for this?

Thank you in advance.

Miguel

biomart • 609 views
modified 12 months ago by James W. MacDonald51k • written 12 months ago by lm7950
Answer: Getting R to talk to BioMart using biomaRt
1
12 months ago by
Mike Smith4.0k
EMBL Heidelberg / de.NBI
Mike Smith4.0k wrote:

Any BioMart query you can do in the browser you can also do via the biomaRt package, although I'll agree that it's not always obvious what the correct options are.  Here's a little example trying to get mouse orthologs for three fly genes we're interest in.

First, lets load biomaRt and define that we want to use the fruit fly dataset:

library(biomaRt)
ensembl <- useEnsembl(biomart = "ensembl",
dataset = "dmelanogaster_gene_ensembl")


Now lets define our three genes we're interest in. I'm using what Ensembl refers to as the 'Gene Stable ID' here, which I guess is also the FlyBase ID. You'll have to adapt the filters argument in final query if your IDs are of a different type.

genes_of_interest <- c("FBgn0036531","FBgn0037375","FBgn0035252")


You can now query BioMart using something like:

getBM(mart = ensembl,
filters = "ensembl_gene_id",
values = genes_of_interest,
attributes = c("ensembl_gene_id",
"external_gene_name",
"mmusculus_homolog_ensembl_gene",
"mmusculus_homolog_associated_gene_name")
)

  ensembl_gene_id external_gene_name mmusculus_homolog_ensembl_gene mmusculus_homolog_associated_gene_name
1     FBgn0035252             CG7970             ENSMUSG00000029499                                  Pxmp2
2     FBgn0036531             CG6244             ENSMUSG00000013822                                  Elof1
3     FBgn0037375           kat-60L1


What information you retrieve is determined by the attributes argument. Here we're getting back same ID we used to search and the fly gene name, plus the analogous values for orthologs found in mouse. If you want different information, then you can use the functions listAttributes() and searchAttributes() to find what else is available (or you can look on the web interface). Missing values presumably indicate that Ensembl does not have an ortholog mapping between these two species for that gene.

You should also note that I think this may return duplicate matches, because paralogs within one genome may be matched to paralogs in the second genome. e.g. if FlyGeneA & FlyGeneB are paralogs, and MouseGeneA & MouseGeneB are also paralogs then you'll end up with 4 reported parings.

ADD COMMENTlink modified 12 months ago • written 12 months ago by Mike Smith4.0k

Many thanks for the very useful explanation Mike, I will give it a go with my table. I will come back if I encounter problems. I am a bit of a rookie

Hi again, I used the getBM function using the following code:

library(tidyverse)
library(dplyr)
gene_ids <- gene_list %>%
select (Gene_ID)

# ****Note that there is a bug and if biomaRt is loaded, select() will not work in dplyr****
# Now,load biomaRt and define that we want to use the fruit fly dataset
library(biomaRt)
fly = useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
fly_attributes <- listAttributes (fly)
getBM(mart = fly,
filters = "ensembl_gene_id",
values = gene_ids,
attributes = c("mmusculus_homolog_ensembl_gene", "hsapiens_homolog_ensembl_gene"))

I got the following error message at the end, any thoughts?

Error in [[<-.data.frame(*tmp*, vIdx, value = c("FBgn0013276", "FBgn0001223",  :
replacement has 492 rows, data has 19669

1

There isn't a bug. This is known as masking, where you can have more than one package defining a function with the same name. You can get around that using the double colon function. Instead of using select directly, you would use dplyr::select(Gene_ID). Or you could just roll old school and use base R:

gene_ids <- gene_list$Gene_ID In which case you don't need either tidyverse or dplyr. Also do note that Mike's code is useful whereas yours is not! The query to the Biomart will return data in random order, so you need to ask for the filter to be returned as well in order to complete your mappings (you will need to probably get rid of some of the duplicated paralogs, as well as re-order the results to conform to your input data). Anyway, you need to show a self-contained example that does what you show, as this works for me, using the genes that Mike originally used: > library(biomaRt) > genes_of_interest <- c("FBgn0036531","FBgn0037375","FBgn0035252") > ensembl <- useEnsembl(biomart = "ensembl", dataset = "dmelanogaster_gene_ensembl") ensembl <- useEnsembl(biomart = "ensembl", + dataset = "dmelanogaster_gene_ensembl") > getBM(c("ensembl_gene_id","external_gene_name", "mmusculus_homolog_ensembl_gene","hsapiens_homolog_ensembl_gene"), "ensembl_gene_id", genes_of_interest, ensembl) ensembl_gene_id external_gene_name mmusculus_homolog_ensembl_gene 1 FBgn0035252 CG7970 ENSMUSG00000029499 2 FBgn0036531 CG6244 ENSMUSG00000013822 3 FBgn0037375 kat-60L1 hsapiens_homolog_ensembl_gene 1 ENSG00000176894 2 ENSG00000130165 Your original code will return this: > getBM(c("mmusculus_homolog_ensembl_gene","hsapiens_homolog_ensembl_gene"), "ensembl_gene_id", genes_of_interest, ensembl) mmusculus_homolog_ensembl_gene hsapiens_homolog_ensembl_gene 1 ENSMUSG00000029499 ENSG00000176894 2 ENSMUSG00000013822 ENSG00000130165 3  And you don't know what the paralogs are matched to. You could run your code again, and after getting the error, run traceback() and show what that output is. But all things equal it's better if you present a self-contained bit of code that people can use to debug. ADD REPLYlink written 12 months ago by James W. MacDonald51k Thanks James, I will troubleshoot my code and when I do so post an update here. I am unfamiliar on how to post properly formatted code in this platform, is there a tutorial available? ADD REPLYlink written 12 months ago by lm7950 When you start composing a response, note that at the top left there is a dropdown that by default says 'Text'. If you want to add code, put it in a separate paragraph block, highlight, and then change to 'Code' using that dropdown. You can also highlight inline code blocks and use 'Inline Code'. ADD REPLYlink written 12 months ago by James W. MacDonald51k Hi again, I have troubleshooted and got it to work with Mike's example. Here is my code: # Project 01: Annotate Drosophila gene IDs with human IDs from BioMart # Loading the required libraries library(tidyverse) library(biomaRt) gene_list <- read_csv("/Users/miguel/Dropbox/RStudio/Martins_Sanz/data/Thermal_Stress_GSA.csv") genes_of_interest <- c("FBgn0036531","FBgn0037375","FBgn0035252") #Test data gene_ids <- gene_list %>% dplyr::select(Gene_ID) # Load gene IDs in a single vector ensembl <- useEnsembl(biomart = "ensembl", dataset = "dmelanogaster_gene_ensembl") #Load relevant database from Ensembl fly = useMart("ensembl", dataset = "dmelanogaster_gene_ensembl") #Equivalent to the line above getBM(mart = fly, filters = "ensembl_gene_id", values = genes_of_interest, #replace this line with relevant input data attributes = c("ensembl_gene_id","mmusculus_homolog_ensembl_gene", "hsapiens_homolog_ensembl_gene")) It all goes well with the test set, but my data returns an error: Error in [[<-.data.frame(*tmp*, vIdx, value = c("FBgn0013276", "FBgn0001223", : replacement has 492 rows, data has 19669 If I follow your advice an run traceback(), the output is the following: traceback_output I had to create a link to the output as it has more than 5000 characters. As the reason for using tidyverse, I was thinking of combining the data I get from the Biomart query with the original CSV file using joins. Thanks again for helping me out on my first project! ADD REPLYlink written 12 months ago by lm7950 You'll probably have to put gene_ids on DropBox so people can troubleshoot. ADD REPLYlink written 12 months ago by James W. MacDonald51k Ok, I have Dropboxed gene_ids, with the header as a csv file here. Thanks again! ADD REPLYlink written 12 months ago by lm7950 This is because getBM is expecting a character vector, but gene_ids is a tibble. dplyr::select always returns a tibble even if you're only selecting a single column. The simplest thing to do is forget using dplyr::select and use the $ operator to extract the correct column.

In the call to getBM you could just do values = gene_list\$Gene_ID

Many thanks Mike, the issue is now resolved.

Answer: Getting R to talk to BioMart using biomaRt
0
12 months ago by
United States
James W. MacDonald51k wrote:

See ?getLDS particularly the example.