Hi all,
I am using biomaRt to annotate Ensembl IDs from the Mus musculus genome, Ensembl version 80.
As the current version is 81, I am using an archived version. Here is how I proceed:
# connecting to the right version of Ensembl, this works well:
my_mart <- useMart(host="may2015.archive.ensembl.org", biomart="ENSEMBL_MART_ENSEMBL", dataset="mmusculus_gene_ensembl")
# mapping Ensembl IDs to retrieve more detailed annotation:
getBM(attributes=c("ensembl_gene_id", "chromosome_name", "start_position", "end_position", "strand", "description", "external_gene_name"), filters ="ensembl_gene_id", values = "ENSMUSG00000071528", mart=my_mart)
Here I get the following error:
"
Error in getBM(attributes = c("ensembl_gene_id", "chromosome_name", "start_position", :
Query ERROR: caught BioMart::Exception::Database: Could not connect to mysql database ensembl_mart_80: DBI connect('database=ensembl_mart_80;host=ensdb-web-13;port=5314','ensro',...) failed: Can't connect to MySQL server on 'ensdb-web-13' (110) at /ensemblweb/archive/www_80/biomart-perl/lib/BioMart/Configuration/DBLocation.pm line 98.
"
Trying the same thing with the current version (81), I do not get this error:
my_mart2 <- useMart(host="www.ensembl.org", biomart="ENSEMBL_MART_ENSEMBL", dataset="mmusculus_gene_ensembl")
getBM(attributes=c("ensembl_gene_id", "chromosome_name", "start_position", "end_position", "strand", "description", "external_gene_name"), filters ="ensembl_gene_id", values = "ENSMUSG00000071528", mart=my_mart2)
     ensembl_gene_id chromosome_name start_position end_position strand
1 ENSMUSG00000071528              19       47083471     47090625     -1
                                                                      description
1 upregulated during skeletal muscle growth 5 [Source:MGI Symbol;Acc:MGI:1891435]
  external_gene_name
1              Usmg5
Anything I can do about it?
Thanks!
Sarah
Hi Sarah,
An alternative way of annotating your Ensembl gene would be to use the Bioconductor package AnnotationHub.
It contains GTF files from Ensembl releases 69 to 81 for all organisms released by Ensembl.
The data is presented as GRanges objects, which can easily be manipulated to get information about genes, exons, CDS, etc.
Load the package
> library(AnnotationHub)
> ah = AnnotationHub()
snapshotDate(): 2015-08-26
Search for a GTF file from Ensembl for Mus musculus, release 80
> gtf <- query(ah, c("gtf","mus musculus", "80", "ensembl"))
> gtf
AnnotationHub with 1 record
# snapshotDate(): 2015-08-26
# names(): AH47076
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: GRanges
# $title: Mus_musculus.GRCm38.80.gtf
# $description: Gene Annotation for Mus musculus
# $taxonomyid: 10090
# $genome: GRCm38
# $sourcetype: GTF
# $sourceurl: ftp://ftp.ensembl.org/pub/release-80/gtf/mus_musculus/Mus_musculus.GRCm38.80.gtf.gz
# $sourcelastmodifieddate: 2015-05-01
# $sourcesize: 25292510
# $tags: GTF, ensembl, Gene, Transcript, Annotation
# retrieve record with 'object[["AH47076"]]'
Download the file
> gtfFile <- gtf[[1]]
require("GenomicRanges")
retrieving 1 resource
  |===========================================================================================| 100%
using guess work to populate seqinfo
There were 50 or more warnings (use warnings() to see the first 50)
The file is downloaded as a GRanges object which contains data on all the genes. The Ensembl gene IDs are stored in the "gene_id" metadata column, accessible through mcols().
> gtfFile
GRanges object with 1524100 ranges and 22 metadata columns:
              seqnames             ranges strand   |   source       type     score     phase
                 <Rle>          <IRanges>  <Rle>   | <factor>   <factor> <numeric> <integer>
        [1]          1 [3073253, 3074322]      +   |   havana       gene      <NA>      <NA>
        [2]          1 [3073253, 3074322]      +   |   havana transcript      <NA>      <NA>
        [3]          1 [3073253, 3074322]      +   |   havana       exon      <NA>      <NA>
        [4]          1 [3102016, 3102125]      +   |  ensembl       gene      <NA>      <NA>
        [5]          1 [3102016, 3102125]      +   |  ensembl transcript      <NA>      <NA>
        ...        ...                ...    ... ...      ...        ...       ...       ...
  [1524096] JH584295.1         [708, 752]      -   |  ensembl        CDS      <NA>         2
  [1524097] JH584295.1         [565, 633]      -   |  ensembl       exon      <NA>      <NA>
  [1524098] JH584295.1         [565, 633]      -   |  ensembl        CDS      <NA>         2
  [1524099] JH584295.1         [ 66, 109]      -   |  ensembl       exon      <NA>      <NA>
  [1524100] JH584295.1         [ 66, 109]      -   |  ensembl        CDS      <NA>         2
                       gene_id gene_version      gene_name gene_source   gene_biotype
                   <character>    <numeric>    <character> <character>    <character>
        [1] ENSMUSG00000102693            1  4933401J01Rik      havana            TEC
        [2] ENSMUSG00000102693            1  4933401J01Rik      havana            TEC
        [3] ENSMUSG00000102693            1  4933401J01Rik      havana            TEC
        [4] ENSMUSG00000064842            1        Gm26206     ensembl          snRNA
        [5] ENSMUSG00000064842            1        Gm26206     ensembl          snRNA
        ...                ...          ...            ...         ...            ...
  [1524096] ENSMUSG00000095742            1 CAAA01147332.1     ensembl protein_coding
  [1524097] ENSMUSG00000095742            1 CAAA01147332.1     ensembl protein_coding
  [1524098] ENSMUSG00000095742            1 CAAA01147332.1     ensembl protein_coding
  [1524099] ENSMUSG00000095742            1 CAAA01147332.1     ensembl protein_coding
  [1524100] ENSMUSG00000095742            1 CAAA01147332.1     ensembl protein_coding
                 transcript_id transcript_version    transcript_name transcript_source
                   <character>          <numeric>        <character>       <character>
        [1]               <NA>               <NA>               <NA>              <NA>
        [2] ENSMUST00000193812                  1  4933401J01Rik-001            havana
        [3] ENSMUST00000193812                  1  4933401J01Rik-001            havana
        [4]               <NA>               <NA>               <NA>              <NA>
        [5] ENSMUST00000082908                  1        Gm26206-201           ensembl
        ...                ...                ...                ...               ...
  [1524096] ENSMUST00000179436                  1 CAAA01147332.1-201           ensembl
  [1524097] ENSMUST00000179436                  1 CAAA01147332.1-201           ensembl
  [1524098] ENSMUST00000179436                  1 CAAA01147332.1-201           ensembl
  [1524099] ENSMUST00000179436                  1 CAAA01147332.1-201           ensembl
  [1524100] ENSMUST00000179436                  1 CAAA01147332.1-201           ensembl
            transcript_biotype         tag exon_number            exon_id exon_version
                   <character> <character>   <numeric>        <character>    <numeric>
        [1]               <NA>        <NA>        <NA>               <NA>         <NA>
        [2]                TEC       basic        <NA>               <NA>         <NA>
        [3]                TEC       basic           1 ENSMUSE00001343744            1
        [4]               <NA>        <NA>        <NA>               <NA>         <NA>
        [5]              snRNA       basic        <NA>               <NA>         <NA>
        ...                ...         ...         ...                ...          ...
  [1524096]     protein_coding       basic           5               <NA>         <NA>
  [1524097]     protein_coding       basic           6 ENSMUSE00000997159            1
  [1524098]     protein_coding       basic           6               <NA>         <NA>
  [1524099]     protein_coding       basic           7 ENSMUSE00001007635            1
  [1524100]     protein_coding       basic           7               <NA>         <NA>
            transcript_support_level     ccds_id         protein_id protein_version
                         <character> <character>        <character>       <numeric>
        [1]                     <NA>        <NA>               <NA>            <NA>
        [2]                     <NA>        <NA>               <NA>            <NA>
        [3]                     <NA>        <NA>               <NA>            <NA>
        [4]                     <NA>        <NA>               <NA>            <NA>
        [5]                       NA        <NA>               <NA>            <NA>
        ...                      ...         ...                ...             ...
  [1524096]                        5        <NA> ENSMUSP00000137004               1
  [1524097]                        5        <NA>               <NA>            <NA>
  [1524098]                        5        <NA> ENSMUSP00000137004               1
  [1524099]                        5        <NA>               <NA>            <NA>
  [1524100]                        5        <NA> ENSMUSP00000137004               1
  -------
  seqinfo: 61 sequences (1 circular) from GRCm38 genome; no seqlengths
A simple search shows whether your gene of interest is present:
> which(mcols(gtfFile)$gene_id=="ENSMUSG00000071528")
 [1] 1512778 1512779 1512780 1512781 1512782 1512783 1512784 1512785 1512786 1512787 1512788 1512789
[13] 1512790 1512791
Subset the GRanges object to make a smaller one which contains data only for your gene of interest, and store it in want.
> want <- gtfFile[which(mcols(gtfFile)$gene_id=="ENSMUSG00000071528"),]
> want
GRanges object with 14 ranges and 22 metadata columns:
       seqnames               ranges strand   |   source       type     score     phase
          <Rle>            <IRanges>  <Rle>   | <factor>   <factor> <numeric> <integer>
   [1]       19 [47083471, 47090625]      -   |  ensembl       gene      <NA>      <NA>
   [2]       19 [47083471, 47090625]      -   |  ensembl transcript      <NA>      <NA>
   [3]       19 [47090573, 47090625]      -   |  ensembl       exon      <NA>      <NA>
   [4]       19 [47086134, 47086229]      -   |  ensembl       exon      <NA>      <NA>
   [5]       19 [47086134, 47086220]      -   |  ensembl        CDS      <NA>         0
   ...      ...                  ...    ... ...      ...        ...       ...       ...
  [10]       19 [47083471, 47083569]      -   |  ensembl       exon      <NA>      <NA>
  [11]       19 [47090573, 47090625]      -   |  ensembl        UTR      <NA>      <NA>
  [12]       19 [47086221, 47086229]      -   |  ensembl        UTR      <NA>      <NA>
  [13]       19 [47085955, 47085957]      -   |  ensembl        UTR      <NA>      <NA>
  [14]       19 [47083471, 47083569]      -   |  ensembl        UTR      <NA>      <NA>
                  gene_id gene_version   gene_name gene_source   gene_biotype      transcript_id
              <character>    <numeric> <character> <character>    <character>        <character>
   [1] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding               <NA>
   [2] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
   [3] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
   [4] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
   [5] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
   ...                ...          ...         ...         ...            ...                ...
  [10] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
  [11] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
  [12] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
  [13] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
  [14] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
       transcript_version transcript_name transcript_source transcript_biotype         tag
                <numeric>     <character>       <character>        <character> <character>
   [1]               <NA>            <NA>              <NA>               <NA>        <NA>
   [2]                  3       Usmg5-201           ensembl     protein_coding       basic
   [3]                  3       Usmg5-201           ensembl     protein_coding       basic
   [4]                  3       Usmg5-201           ensembl     protein_coding       basic
   [5]                  3       Usmg5-201           ensembl     protein_coding       basic
   ...                ...             ...               ...                ...         ...
  [10]                  3       Usmg5-201           ensembl     protein_coding       basic
  [11]                  3       Usmg5-201           ensembl     protein_coding       basic
  [12]                  3       Usmg5-201           ensembl     protein_coding       basic
  [13]                  3       Usmg5-201           ensembl     protein_coding       basic
  [14]                  3       Usmg5-201           ensembl     protein_coding       basic
       exon_number            exon_id exon_version transcript_support_level     ccds_id
         <numeric>        <character>    <numeric>              <character> <character>
   [1]        <NA>               <NA>         <NA>                     <NA>        <NA>
   [2]        <NA>               <NA>         <NA>                        1   CCDS38014
   [3]           1 ENSMUSE00000617995            3                        1   CCDS38014
   [4]           2 ENSMUSE00000617994            1                        1   CCDS38014
   [5]           2               <NA>         <NA>                        1   CCDS38014
   ...         ...                ...          ...                      ...         ...
  [10]           4 ENSMUSE00000617992            3                        1   CCDS38014
  [11]        <NA>               <NA>         <NA>                        1   CCDS38014
  [12]        <NA>               <NA>         <NA>                        1   CCDS38014
  [13]        <NA>               <NA>         <NA>                        1   CCDS38014
  [14]        <NA>               <NA>         <NA>                        1   CCDS38014
               protein_id protein_version
              <character>       <numeric>
   [1]               <NA>            <NA>
   [2]               <NA>            <NA>
   [3]               <NA>            <NA>
   [4]               <NA>            <NA>
   [5] ENSMUSP00000093713               3
   ...                ...             ...
  [10]               <NA>            <NA>
  [11]               <NA>            <NA>
  [12]               <NA>            <NA>
  [13]               <NA>            <NA>
  [14]               <NA>            <NA>
  -------
  seqinfo: 61 sequences (1 circular) from GRCm38 genome; no seqlengths
The "type" column tells you what information is available for the Ensembl gene ID that you're interested in:
> mcols(want)$type
 [1] gene        transcript  exon        exon        CDS         start_codon exon        CDS
 [9] stop_codon  exon        UTR         UTR         UTR         UTR
Levels: CDS exon gene Selenocysteine start_codon stop_codon transcript UTR
Most of the information that you want is found here: the gene's chromosome, start and end coordinates, strand and external gene name (gene_name) are all in the row with type=="gene". Note that the 22 metadata columns do not include a description field.
> want[mcols(want)$type=="gene",]
GRanges object with 1 range and 22 metadata columns:
      seqnames               ranges strand |   source     type     score     phase
         <Rle>            <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  [1]       19 [47083471, 47090625]      - |  ensembl     gene      <NA>      <NA>
                 gene_id gene_version   gene_name gene_source   gene_biotype transcript_id
             <character>    <numeric> <character> <character>    <character>   <character>
  [1] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding          <NA>
      transcript_version transcript_name transcript_source transcript_biotype         tag
               <numeric>     <character>       <character>        <character> <character>
  [1]               <NA>            <NA>              <NA>               <NA>        <NA>
      exon_number     exon_id exon_version transcript_support_level     ccds_id  protein_id
        <numeric> <character>    <numeric>              <character> <character> <character>
  [1]        <NA>        <NA>         <NA>                     <NA>        <NA>        <NA>
      protein_version
            <numeric>
  [1]            <NA>
  -------
  seqinfo: 61 sequences (1 circular) from GRCm38 genome; no seqlengths
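If a plain table in the shape of the original getBM() result is more convenient, the gene row can be converted with as.data.frame(). The following is only a sketch: gene_row is a hand-built stand-in for the one-row GRanges shown above (values copied from that output, so nothing needs to be downloaded), and the names gene_row and annot are made up for illustration. The description attribute is left out because it is not among the GTF columns.

```r
# Sketch only: 'gene_row' and 'annot' are illustrative names, not from the thread.
library(GenomicRanges)

# Hand-built stand-in for want[mcols(want)$type == "gene", ],
# using the coordinates and names printed in the output above.
gene_row <- GRanges(
  seqnames  = "19",
  ranges    = IRanges(start = 47083471, end = 47090625),
  strand    = "-",
  type      = factor("gene"),
  gene_id   = "ENSMUSG00000071528",
  gene_name = "Usmg5"
)

# as.data.frame() exposes seqnames/start/end/width/strand plus the metadata columns.
df <- as.data.frame(gene_row)

# Rename the columns to the biomaRt attribute names used in the getBM() call.
annot <- data.frame(
  ensembl_gene_id    = df$gene_id,
  chromosome_name    = as.character(df$seqnames),
  start_position     = df$start,
  end_position       = df$end,
  strand             = ifelse(df$strand == "-", -1L, 1L),  # biomaRt reports strand as -1 / 1
  external_gene_name = df$gene_name,
  stringsAsFactors   = FALSE
)
annot
```

With the real data, the same conversion should work after replacing gene_row with want[mcols(want)$type=="gene",].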
Hope that helps!
Sonali.
Or a similar approach, based on Sonali's answer above:
Generate an EnsDb (Ensembl DB annotation object/database from the ensembldb package) for the specified Ensembl version.
First, use Sonali's code to get the GRanges object:
> library(ensembldb)
> library(AnnotationHub)
> ah = AnnotationHub()
snapshotDate(): 2015-08-26
> gtf <- query(ah, c("gtf","mus musculus", "80", "ensembl"))
> gtfFile <- gtf[[1]]
Then build an EnsDb database file from that:
> edb <- ensDbFromGRanges(gtfFile, organism="Mus_musculus", version="80",
+                         genomeVersion="GRCm38")
> makeEnsembldbPackage(edb, version="0.1.0", maintainer="S. Bonnin",
+                      author="S. Bonnin",
+                      destDir=".", license="Artistic-2.0")
Creating package in ./EnsDb.Mmusculus.v80
You can then R CMD build and R CMD INSTALL this package and thus have it always available locally, or just use the database right away:
> ensMm80 <- EnsDb(edb)
> genes(ensMm80, filter=GeneidFilter("ENSMUSG00000071528"))
GRanges object with 1 range and 5 metadata columns:
                     seqnames               ranges strand |            gene_id
                        <Rle>            <IRanges>  <Rle> |        <character>
  ENSMUSG00000071528       19 [47083471, 47090625]      - | ENSMUSG00000071528
                       gene_name  entrezid   gene_biotype seq_coord_system
                     <character> <integer>    <character>        <integer>
  ENSMUSG00000071528       Usmg5      <NA> protein_coding             <NA>
  -------
  seqinfo: 1 sequence from GRCm38 genome
Check the vignette of the ensembldb package for some more use cases.
Hope this helps!
cheers, jo