Question: Getting R to talk to BioMart using biomaRt
gravatar for lm795
9 weeks ago by
United Kingdom
lm7950 wrote:

Greetings! I am new to this forum and posted a question recently to the RStudio community. A kind soul, Mara Averick pointed me to this forum.

My objective is to annotate a table of gene expression data "say for organism A" with information on orthologues from "organism B", as an additional column on a table. I watched tutorials online (YouTube) on how do do this using the BioMart GUI. Many clicks involved...

I have installed biomaRt, a Bioconductor package but have found no guidance on how to download orthologue genes (IDs, names, etc) from a specific species to my local table containing "organism A". As an example, I have table with expression data from Drosophila melanogaster and wanted to insert a new vector containing the names of orthologues in another species, mus musculus, for example. Anyone has a code example for this?

Thank you in advance.


ADD COMMENTlink modified 9 weeks ago by James W. MacDonald48k • written 9 weeks ago by lm7950
gravatar for Mike Smith
9 weeks ago by
Mike Smith3.1k
EMBL Heidelberg / de.NBI
Mike Smith3.1k wrote:

Any BioMart query you can do in the browser you can also do via the biomaRt package, although I'll agree that it's not always obvious what the correct options are.  Here's a little example trying to get mouse orthologs for three fly genes we're interest in.

First, lets load biomaRt and define that we want to use the fruit fly dataset:

ensembl <- useEnsembl(biomart = "ensembl", 
                      dataset = "dmelanogaster_gene_ensembl")

Now lets define our three genes we're interest in. I'm using what Ensembl refers to as the 'Gene Stable ID' here, which I guess is also the FlyBase ID. You'll have to adapt the filters argument in final query if your IDs are of a different type.

genes_of_interest <- c("FBgn0036531","FBgn0037375","FBgn0035252")

You can now query BioMart using something like:

getBM(mart = ensembl,
      filters = "ensembl_gene_id",
      values = genes_of_interest,
      attributes = c("ensembl_gene_id", 
  ensembl_gene_id external_gene_name mmusculus_homolog_ensembl_gene mmusculus_homolog_associated_gene_name
1     FBgn0035252             CG7970             ENSMUSG00000029499                                  Pxmp2
2     FBgn0036531             CG6244             ENSMUSG00000013822                                  Elof1
3     FBgn0037375           kat-60L1

What information you retrieve is determined by the attributes argument. Here we're getting back same ID we used to search and the fly gene name, plus the analogous values for orthologs found in mouse. If you want different information, then you can use the functions listAttributes() and searchAttributes() to find what else is available (or you can look on the web interface). Missing values presumably indicate that Ensembl does not have an ortholog mapping between these two species for that gene.

You should also note that I think this may return duplicate matches, because paralogs within one genome may be matched to paralogs in the second genome. e.g. if FlyGeneA & FlyGeneB are paralogs, and MouseGeneA & MouseGeneB are also paralogs then you'll end up with 4 reported parings.

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by Mike Smith3.1k

Many thanks for the very useful explanation Mike, I will give it a go with my table. I will come back if I encounter problems. I am a bit of a rookie

ADD REPLYlink written 9 weeks ago by lm7950


Hi again, I used the getBM function using the following code:

gene_list <- read_csv("/Users/miguel/Dropbox/RStudio/Martins_Sanz/data/Thermal_Stress_GSA.csv")
gene_ids <- gene_list %>%
  select (Gene_ID)

# ****Note that there is a bug and if biomaRt is loaded, select() will not work in dplyr****
# Now,load biomaRt and define that we want to use the fruit fly dataset
fly = useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
fly_attributes <- listAttributes (fly)
getBM(mart = fly,
      filters = "ensembl_gene_id",
      values = gene_ids,
      attributes = c("mmusculus_homolog_ensembl_gene", "hsapiens_homolog_ensembl_gene"))

I got the following error message at the end, any thoughts?

Error in `[[<`(`*tmp*`, vIdx, value = c("FBgn0013276", "FBgn0001223",  : 
  replacement has 492 rows, data has 19669

ADD REPLYlink written 9 weeks ago by lm7950

There isn't a bug. This is known as masking, where you can have more than one package defining a function with the same name. You can get around that using the double colon function. Instead of using select directly, you would use dplyr::select(Gene_ID). Or you could just roll old school and use base R:

gene_ids <- gene_list$Gene_ID

In which case you don't need either tidyverse or dplyr.

Also do note that Mike's code is useful whereas yours is not! The query to the Biomart will return data in random order, so you need to ask for the filter to be returned as well in order to complete your mappings (you will need to probably get rid of some of the duplicated paralogs, as well as re-order the results to conform to your input data).

Anyway, you need to show a self-contained example that does what you show, as this works for me, using the genes that Mike originally used:

> library(biomaRt)
>  genes_of_interest <- c("FBgn0036531","FBgn0037375","FBgn0035252")
> ensembl <- useEnsembl(biomart = "ensembl",
                      dataset = "dmelanogaster_gene_ensembl")
ensembl <- useEnsembl(biomart = "ensembl",
+                       dataset = "dmelanogaster_gene_ensembl")
> getBM(c("ensembl_gene_id","external_gene_name", "mmusculus_homolog_ensembl_gene","hsapiens_homolog_ensembl_gene"), "ensembl_gene_id", genes_of_interest, ensembl)
  ensembl_gene_id external_gene_name mmusculus_homolog_ensembl_gene
1     FBgn0035252             CG7970             ENSMUSG00000029499
2     FBgn0036531             CG6244             ENSMUSG00000013822
3     FBgn0037375           kat-60L1                               
1               ENSG00000176894
2               ENSG00000130165

Your original code will return this:

> getBM(c("mmusculus_homolog_ensembl_gene","hsapiens_homolog_ensembl_gene"), "ensembl_gene_id", genes_of_interest, ensembl)
  mmusculus_homolog_ensembl_gene hsapiens_homolog_ensembl_gene
1             ENSMUSG00000029499               ENSG00000176894
2             ENSMUSG00000013822               ENSG00000130165

And you don't know what the paralogs are matched to.

You could run your code again, and after getting the error, run traceback() and show what that output is. But all things equal it's better if you present a self-contained bit of code that people can use to debug.

ADD REPLYlink written 9 weeks ago by James W. MacDonald48k

Thanks James, I will troubleshoot my code and when I do so post an update here. I am unfamiliar on how to post properly formatted code in this platform, is there a tutorial available?

ADD REPLYlink written 9 weeks ago by lm7950

When you start composing a response, note that at the top left there is a dropdown that by default says 'Text'. If you want to add code, put it in a separate paragraph block, highlight, and then change to 'Code' using that dropdown. You can also highlight inline code blocks and use 'Inline Code'.

ADD REPLYlink written 9 weeks ago by James W. MacDonald48k


Hi again, I have troubleshooted and got it to work with Mike's example. Here is my code:

# Project 01: Annotate Drosophila gene IDs with human IDs from BioMart

# Loading the required libraries

gene_list <- read_csv("/Users/miguel/Dropbox/RStudio/Martins_Sanz/data/Thermal_Stress_GSA.csv")
genes_of_interest <- c("FBgn0036531","FBgn0037375","FBgn0035252") #Test data

gene_ids <- gene_list %>%
  dplyr::select(Gene_ID) # Load gene IDs in a single vector

ensembl <- useEnsembl(biomart = "ensembl",
                      dataset = "dmelanogaster_gene_ensembl") #Load relevant database from Ensembl
fly = useMart("ensembl", dataset = "dmelanogaster_gene_ensembl") #Equivalent to the line above

getBM(mart = fly,
      filters = "ensembl_gene_id",
      values = genes_of_interest, #replace this line with relevant input data
      attributes = c("ensembl_gene_id","mmusculus_homolog_ensembl_gene", "hsapiens_homolog_ensembl_gene"))

It all goes well with the test set, but my data returns an error:

Error in `[[<`(`*tmp*`, vIdx, value = c("FBgn0013276", "FBgn0001223",  : 
  replacement has 492 rows, data has 19669

If I follow your advice an run traceback(), the output is the following:


I had to create a link to the output as it has more than 5000 characters. As the reason for using tidyverse, I was thinking of combining the data I get from the Biomart query with the original CSV file using joins.

Thanks again for helping me out on my first project!


ADD REPLYlink written 9 weeks ago by lm7950

You'll probably have to put gene_ids on DropBox so people can troubleshoot.

ADD REPLYlink written 9 weeks ago by James W. MacDonald48k

Ok, I have Dropboxed gene_ids, with the header as a csv file here. Thanks again!

ADD REPLYlink written 9 weeks ago by lm7950

This is because getBM is expecting a character vector, but gene_ids is a tibble. dplyr::select always returns a tibble even if you're only selecting a single column.

The simplest thing to do is forget using dplyr::select and use the $ operator to extract the correct column.

In the call to getBM you could just do values = gene_list$Gene_ID

ADD REPLYlink written 9 weeks ago by Mike Smith3.1k

Many thanks Mike, the issue is now resolved.

ADD REPLYlink written 8 weeks ago by lm7950
gravatar for James W. MacDonald
9 weeks ago by
United States
James W. MacDonald48k wrote:

See ?getLDS particularly the example.

ADD COMMENTlink written 9 weeks ago by James W. MacDonald48k

Thanks James, I have tried getBM but will dive into getLDS soon. As you see from above I am still having a few issues with getBM

ADD REPLYlink written 9 weeks ago by lm7950
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 189 users visited in the last hour