Question

biomaRt fails for M.musculus annotation, Ensembl version 80

sbcn ▴ 80

Hi all,

I am using biomaRt to annotate Ensembl IDs from the Mus musculus genome, Ensembl version 80.

As the current version is 81, I am using an archived version. Here is how I proceed:

library(biomaRt)

# connecting to the right version of Ensembl, this works well:

my_mart <- useMart(host="may2015.archive.ensembl.org", biomart="ENSEMBL_MART_ENSEMBL", dataset="mmusculus_gene_ensembl")

# mapping Ensembl IDs to retrieve more detailed annotation:

getBM(attributes=c("ensembl_gene_id", "chromosome_name", "start_position", "end_position", "strand", "description", "external_gene_name"), filters ="ensembl_gene_id", values = "ENSMUSG00000071528", mart=my_mart)

Here I get the following error:

"
Error in getBM(attributes = c("ensembl_gene_id", "chromosome_name", "start_position", :
Query ERROR: caught BioMart::Exception::Database: Could not connect to mysql database ensembl_mart_80: DBI connect('database=ensembl_mart_80;host=ensdb-web-13;port=5314','ensro',...) failed: Can't connect to MySQL server on 'ensdb-web-13' (110) at /ensemblweb/archive/www_80/biomart-perl/lib/BioMart/Configuration/DBLocation.pm line 98.
"

Trying the same thing with the current version (81), I do not get this error:

my_mart2 <- useMart(host="www.ensembl.org", biomart="ENSEMBL_MART_ENSEMBL", dataset="mmusculus_gene_ensembl")

getBM(attributes=c("ensembl_gene_id", "chromosome_name", "start_position", "end_position", "strand", "description", "external_gene_name"), filters ="ensembl_gene_id", values = "ENSMUSG00000071528", mart=my_mart2)

     ensembl_gene_id chromosome_name start_position end_position strand
1 ENSMUSG00000071528              19       47083471     47090625     -1
                                                                      description
1 upregulated during skeletal muscle growth 5 [Source:MGI Symbol;Acc:MGI:1891435]
  external_gene_name
1              Usmg5
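
Since the two calls differ only in the host, it can help to run the same query against each host and record which ones fail, rather than letting the first error stop the script. Below is a minimal sketch in base R. `try_hosts` is a hypothetical helper, not part of biomaRt, and `fake_query` is a stand-in that only imitates the behaviour reported here; in real use it would wrap the `useMart()` and `getBM()` calls for a given host.

```r
# Run `query_fun(host)` for each host and collect either the result or the
# error message, so one unreachable server does not abort the whole script.
# `try_hosts` is a hypothetical helper, not a biomaRt function.
try_hosts <- function(hosts, query_fun) {
  lapply(stats::setNames(hosts, hosts), function(host) {
    tryCatch(
      list(ok = TRUE, value = query_fun(host)),
      error = function(e) list(ok = FALSE, value = conditionMessage(e))
    )
  })
}

# Stand-in query used only to show the control flow: it fails for the archive
# host and succeeds otherwise. It does not contact any server.
fake_query <- function(host) {
  if (grepl("archive", host)) stop("Could not connect to mysql database")
  data.frame(ensembl_gene_id = "ENSMUSG00000071528", stringsAsFactors = FALSE)
}

status <- try_hosts(c("may2015.archive.ensembl.org", "www.ensembl.org"),
                    fake_query)

# Named logical vector: which hosts answered the query?
vapply(status, function(x) x$ok, logical(1))
```

Keep in mind that a result obtained from www.ensembl.org reflects the current release, not release 80, so it is a diagnostic rather than a substitute.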

Anything I can do about it?

Thanks!
Sarah

biomart ensembl • 3.1k views

updated 10.2 years ago by amonida ▴ 20 • written 10.2 years ago by sbcn ▴ 80

score 0 · Answer 1 · 2015-09-10

Hi Sarah,

An alternative way of annotating your Ensembl gene would be to use the Bioconductor package AnnotationHub.
It contains GTF files from Ensembl releases 69 to 81 for all organisms released by Ensembl.
The data is presented as GRanges objects, which can easily be manipulated to get information about the gene, exons, CDS, etc.

Load the package

> library(AnnotationHub)
> ah = AnnotationHub()
snapshotDate(): 2015-08-26

Search for the Ensembl GTF file for Mus musculus, release 80:

> gtf <- query(ah, c("gtf","mus musculus", "80", "ensembl"))
> gtf
AnnotationHub with 1 record
# snapshotDate(): 2015-08-26 
# names(): AH47076
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: GRanges
# $title: Mus_musculus.GRCm38.80.gtf
# $description: Gene Annotation for Mus musculus
# $taxonomyid: 10090
# $genome: GRCm38
# $sourcetype: GTF
# $sourceurl: ftp://ftp.ensembl.org/pub/release-80/gtf/mus_musculus/Mus_musculus.GRCm38.80.gtf.gz
# $sourcelastmodifieddate: 2015-05-01
# $sourcesize: 25292510
# $tags: GTF, ensembl, Gene, Transcript, Annotation 
# retrieve record with 'object[["AH47076"]]'

Download the file:

> gtfFile <- gtf[[1]]
require(“GenomicRanges”)
retrieving 1 resource
  |===========================================================================================| 100%
using guess work to populate seqinfo
There were 50 or more warnings (use warnings() to see the first 50)

The resource is downloaded as a GRanges object (from the GenomicRanges package) containing data on all the genes. The Ensembl gene IDs are stored in the "gene_id" metadata column, accessible via mcols().

> gtfFile
GRanges object with 1524100 ranges and 22 metadata columns:
              seqnames             ranges strand   |   source       type     score     phase
                 <Rle>          <IRanges>  <Rle>   | <factor>   <factor> <numeric> <integer>
        [1]          1 [3073253, 3074322]      +   |   havana       gene      <NA>      <NA>
        [2]          1 [3073253, 3074322]      +   |   havana transcript      <NA>      <NA>
        [3]          1 [3073253, 3074322]      +   |   havana       exon      <NA>      <NA>
        [4]          1 [3102016, 3102125]      +   |  ensembl       gene      <NA>      <NA>
        [5]          1 [3102016, 3102125]      +   |  ensembl transcript      <NA>      <NA>
        ...        ...                ...    ... ...      ...        ...       ...       ...
  [1524096] JH584295.1         [708, 752]      -   |  ensembl        CDS      <NA>         2
  [1524097] JH584295.1         [565, 633]      -   |  ensembl       exon      <NA>      <NA>
  [1524098] JH584295.1         [565, 633]      -   |  ensembl        CDS      <NA>         2
  [1524099] JH584295.1         [ 66, 109]      -   |  ensembl       exon      <NA>      <NA>
  [1524100] JH584295.1         [ 66, 109]      -   |  ensembl        CDS      <NA>         2
                       gene_id gene_version      gene_name gene_source   gene_biotype
                   <character>    <numeric>    <character> <character>    <character>
        [1] ENSMUSG00000102693            1  4933401J01Rik      havana            TEC
        [2] ENSMUSG00000102693            1  4933401J01Rik      havana            TEC
        [3] ENSMUSG00000102693            1  4933401J01Rik      havana            TEC
        [4] ENSMUSG00000064842            1        Gm26206     ensembl          snRNA
        [5] ENSMUSG00000064842            1        Gm26206     ensembl          snRNA
        ...                ...          ...            ...         ...            ...
  [1524096] ENSMUSG00000095742            1 CAAA01147332.1     ensembl protein_coding
  [1524097] ENSMUSG00000095742            1 CAAA01147332.1     ensembl protein_coding
  [1524098] ENSMUSG00000095742            1 CAAA01147332.1     ensembl protein_coding
  [1524099] ENSMUSG00000095742            1 CAAA01147332.1     ensembl protein_coding
  [1524100] ENSMUSG00000095742            1 CAAA01147332.1     ensembl protein_coding
                 transcript_id transcript_version    transcript_name transcript_source
                   <character>          <numeric>        <character>       <character>
        [1]               <NA>               <NA>               <NA>              <NA>
        [2] ENSMUST00000193812                  1  4933401J01Rik-001            havana
        [3] ENSMUST00000193812                  1  4933401J01Rik-001            havana
        [4]               <NA>               <NA>               <NA>              <NA>
        [5] ENSMUST00000082908                  1        Gm26206-201           ensembl
        ...                ...                ...                ...               ...
  [1524096] ENSMUST00000179436                  1 CAAA01147332.1-201           ensembl
  [1524097] ENSMUST00000179436                  1 CAAA01147332.1-201           ensembl
  [1524098] ENSMUST00000179436                  1 CAAA01147332.1-201           ensembl
  [1524099] ENSMUST00000179436                  1 CAAA01147332.1-201           ensembl
  [1524100] ENSMUST00000179436                  1 CAAA01147332.1-201           ensembl
            transcript_biotype         tag exon_number            exon_id exon_version
                   <character> <character>   <numeric>        <character>    <numeric>
        [1]               <NA>        <NA>        <NA>               <NA>         <NA>
        [2]                TEC       basic        <NA>               <NA>         <NA>
        [3]                TEC       basic           1 ENSMUSE00001343744            1
        [4]               <NA>        <NA>        <NA>               <NA>         <NA>
        [5]              snRNA       basic        <NA>               <NA>         <NA>
        ...                ...         ...         ...                ...          ...
  [1524096]     protein_coding       basic           5               <NA>         <NA>
  [1524097]     protein_coding       basic           6 ENSMUSE00000997159            1
  [1524098]     protein_coding       basic           6               <NA>         <NA>
  [1524099]     protein_coding       basic           7 ENSMUSE00001007635            1
  [1524100]     protein_coding       basic           7               <NA>         <NA>
            transcript_support_level     ccds_id         protein_id protein_version
                         <character> <character>        <character>       <numeric>
        [1]                     <NA>        <NA>               <NA>            <NA>
        [2]                     <NA>        <NA>               <NA>            <NA>
        [3]                     <NA>        <NA>               <NA>            <NA>
        [4]                     <NA>        <NA>               <NA>            <NA>
        [5]                       NA        <NA>               <NA>            <NA>
        ...                      ...         ...                ...             ...
  [1524096]                        5        <NA> ENSMUSP00000137004               1
  [1524097]                        5        <NA>               <NA>            <NA>
  [1524098]                        5        <NA> ENSMUSP00000137004               1
  [1524099]                        5        <NA>               <NA>            <NA>
  [1524100]                        5        <NA> ENSMUSP00000137004               1
  -------
  seqinfo: 61 sequences (1 circular) from GRCm38 genome; no seqlengths

A simple search shows whether your gene of interest is present:

> which(mcols(gtfFile)$gene_id=="ENSMUSG00000071528")
 [1] 1512778 1512779 1512780 1512781 1512782 1512783 1512784 1512785 1512786 1512787 1512788 1512789
[13] 1512790 1512791

Subset the GenomicRanges object to make a smaller one which contains data only for your gene of interest
and store it in want.

> want <- gtfFile[which(mcols(gtfFile)$gene_id=="ENSMUSG00000071528"),]

> want
GRanges object with 14 ranges and 22 metadata columns:
       seqnames               ranges strand   |   source       type     score     phase
          <Rle>            <IRanges>  <Rle>   | <factor>   <factor> <numeric> <integer>
   [1]       19 [47083471, 47090625]      -   |  ensembl       gene      <NA>      <NA>
   [2]       19 [47083471, 47090625]      -   |  ensembl transcript      <NA>      <NA>
   [3]       19 [47090573, 47090625]      -   |  ensembl       exon      <NA>      <NA>
   [4]       19 [47086134, 47086229]      -   |  ensembl       exon      <NA>      <NA>
   [5]       19 [47086134, 47086220]      -   |  ensembl        CDS      <NA>         0
   ...      ...                  ...    ... ...      ...        ...       ...       ...
  [10]       19 [47083471, 47083569]      -   |  ensembl       exon      <NA>      <NA>
  [11]       19 [47090573, 47090625]      -   |  ensembl        UTR      <NA>      <NA>
  [12]       19 [47086221, 47086229]      -   |  ensembl        UTR      <NA>      <NA>
  [13]       19 [47085955, 47085957]      -   |  ensembl        UTR      <NA>      <NA>
  [14]       19 [47083471, 47083569]      -   |  ensembl        UTR      <NA>      <NA>
                  gene_id gene_version   gene_name gene_source   gene_biotype      transcript_id
              <character>    <numeric> <character> <character>    <character>        <character>
   [1] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding               <NA>
   [2] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
   [3] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
   [4] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
   [5] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
   ...                ...          ...         ...         ...            ...                ...
  [10] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
  [11] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
  [12] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
  [13] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
  [14] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding ENSMUST00000096014
       transcript_version transcript_name transcript_source transcript_biotype         tag
                <numeric>     <character>       <character>        <character> <character>
   [1]               <NA>            <NA>              <NA>               <NA>        <NA>
   [2]                  3       Usmg5-201           ensembl     protein_coding       basic
   [3]                  3       Usmg5-201           ensembl     protein_coding       basic
   [4]                  3       Usmg5-201           ensembl     protein_coding       basic
   [5]                  3       Usmg5-201           ensembl     protein_coding       basic
   ...                ...             ...               ...                ...         ...
  [10]                  3       Usmg5-201           ensembl     protein_coding       basic
  [11]                  3       Usmg5-201           ensembl     protein_coding       basic
  [12]                  3       Usmg5-201           ensembl     protein_coding       basic
  [13]                  3       Usmg5-201           ensembl     protein_coding       basic
  [14]                  3       Usmg5-201           ensembl     protein_coding       basic
       exon_number            exon_id exon_version transcript_support_level     ccds_id
         <numeric>        <character>    <numeric>              <character> <character>
   [1]        <NA>               <NA>         <NA>                     <NA>        <NA>
   [2]        <NA>               <NA>         <NA>                        1   CCDS38014
   [3]           1 ENSMUSE00000617995            3                        1   CCDS38014
   [4]           2 ENSMUSE00000617994            1                        1   CCDS38014
   [5]           2               <NA>         <NA>                        1   CCDS38014
   ...         ...                ...          ...                      ...         ...
  [10]           4 ENSMUSE00000617992            3                        1   CCDS38014
  [11]        <NA>               <NA>         <NA>                        1   CCDS38014
  [12]        <NA>               <NA>         <NA>                        1   CCDS38014
  [13]        <NA>               <NA>         <NA>                        1   CCDS38014
  [14]        <NA>               <NA>         <NA>                        1   CCDS38014
               protein_id protein_version
              <character>       <numeric>
   [1]               <NA>            <NA>
   [2]               <NA>            <NA>
   [3]               <NA>            <NA>
   [4]               <NA>            <NA>
   [5] ENSMUSP00000093713               3
   ...                ...             ...
  [10]               <NA>            <NA>
  [11]               <NA>            <NA>
  [12]               <NA>            <NA>
  [13]               <NA>            <NA>
  [14]               <NA>            <NA>
  -------
  seqinfo: 61 sequences (1 circular) from GRCm38 genome; no seqlengths
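
The `which()` wrapper in the subsetting step matters whenever the column being compared can contain `NA`, as `transcript_id` does in the gene-level rows: a bare comparison returns `NA` for those rows, while `which()` keeps only the positions that are `TRUE`. The same pattern extends to several IDs with `%in%`. A small base R sketch on a toy vector (the IDs are copied from the output of the full object; the vector itself is only illustrative):

```r
# Toy stand-in for mcols(gtfFile)$transcript_id; gene-level rows carry NA.
transcript_id <- c(NA, "ENSMUST00000193812", "ENSMUST00000193812",
                   NA, "ENSMUST00000082908")

# A bare comparison propagates NA for the gene-level rows ...
transcript_id == "ENSMUST00000082908"

# ... whereas which() returns only the positions where the test is TRUE.
hits <- which(transcript_id == "ENSMUST00000082908")
hits

# %in% never returns NA, so it is convenient for looking up several IDs at once.
wanted <- c("ENSMUST00000193812", "ENSMUST00000082908")
multi_hits <- which(transcript_id %in% wanted)
multi_hits
```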

The "type" column tells you what information is available for the Ensembl gene id that you're interested in

> mcols(want)$type
 [1] gene        transcript  exon        exon        CDS         start_codon exon        CDS        
 [9] stop_codon  exon        UTR         UTR         UTR         UTR        
Levels: CDS exon gene Selenocysteine start_codon stop_codon transcript UTR

All the information that you want is found here: the gene's chromosome, start and end coordinates,
strand and external gene name (gene_name) are in the row with type=="gene".

> want[mcols(want)$type=="gene",]
GRanges object with 1 range and 22 metadata columns:
      seqnames               ranges strand |   source     type     score     phase
         <Rle>            <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  [1]       19 [47083471, 47090625]      - |  ensembl     gene      <NA>      <NA>
                 gene_id gene_version   gene_name gene_source   gene_biotype transcript_id
             <character>    <numeric> <character> <character>    <character>   <character>
  [1] ENSMUSG00000071528            3       Usmg5     ensembl protein_coding          <NA>
      transcript_version transcript_name transcript_source transcript_biotype         tag
               <numeric>     <character>       <character>        <character> <character>
  [1]               <NA>            <NA>              <NA>               <NA>        <NA>
      exon_number     exon_id exon_version transcript_support_level     ccds_id  protein_id
        <numeric> <character>    <numeric>              <character> <character> <character>
  [1]        <NA>        <NA>         <NA>                     <NA>        <NA>        <NA>
      protein_version
            <numeric>
  [1]            <NA>
  -------
  seqinfo: 61 sequences (1 circular) from GRCm38 genome; no seqlengths
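
To get this gene-level row into the same shape as the `getBM()` result in the question, the GRanges can be converted with `as.data.frame()` and the columns renamed. A minimal sketch in base R follows. `to_biomart_layout` is a hypothetical helper, and the small data frame stands in for `as.data.frame(want[mcols(want)$type == "gene", ])` so the example runs without downloading anything. The GTF metadata columns listed here do not include a `description` field, so that attribute is not reproduced.

```r
# Rename and recode the columns of a GRanges-derived data frame so they match
# the biomaRt attribute names used in the question.
# `to_biomart_layout` is a hypothetical helper, not part of any package.
to_biomart_layout <- function(df) {
  data.frame(
    ensembl_gene_id    = df$gene_id,
    chromosome_name    = as.character(df$seqnames),
    start_position     = df$start,
    end_position       = df$end,
    # biomaRt reports strand as 1 / -1; GRanges uses "+" / "-" / "*"
    strand             = ifelse(df$strand == "-", -1L,
                                ifelse(df$strand == "+", 1L, NA_integer_)),
    external_gene_name = df$gene_name,
    stringsAsFactors   = FALSE
  )
}

# Stand-in for as.data.frame(want[mcols(want)$type == "gene", ]),
# filled with the values shown in the GRanges output.
gene_df <- data.frame(
  seqnames = factor("19"), start = 47083471L, end = 47090625L,
  width = 7155L, strand = factor("-", levels = c("+", "-", "*")),
  gene_id = "ENSMUSG00000071528", gene_name = "Usmg5",
  stringsAsFactors = FALSE
)

annot <- to_biomart_layout(gene_df)
annot
```

Recoding the strand keeps downstream code that expects biomaRt's 1 / -1 convention working unchanged.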

Hope that helps!

Sonali.

score 0 · Answer 2 · 2015-09-10

Or a similar approach, based on Sonali's answer above:

Generate an EnsDb (Ensembl DB annotation object/database from the ensembldb package) for the specified Ensembl version:

First using Sonali's code to get the GRanges object:

> library(ensembldb)
> library(AnnotationHub)
> ah = AnnotationHub()
snapshotDate(): 2015-08-26
> gtf <- query(ah, c("gtf","mus musculus", "80", "ensembl"))
> gtfFile <- gtf[[1]]

Then build an EnsDb database file from that

> edb <- ensDbFromGRanges(gtfFile, organism="Mus_musculus", version="80",
+                         genomeVersion="GRCm38")

> makeEnsembldbPackage(edb, version="0.1.0", maintainer="S. Bonnin",
+                      author="S. Bonnin",
+                      destDir=".", license="Artistic-2.0")
Creating package in ./EnsDb.Mmusculus.v80

You can then R CMD build and R CMD INSTALL this package to have it always available locally, or just use the database right away:

> ensMm80 <- EnsDb(edb)
> genes(ensMm80, filter=GeneidFilter("ENSMUSG00000071528"))
GRanges object with 1 range and 5 metadata columns:
                     seqnames               ranges strand |            gene_id
                        <Rle>            <IRanges>  <Rle> |        <character>
  ENSMUSG00000071528       19 [47083471, 47090625]      - | ENSMUSG00000071528
                       gene_name  entrezid   gene_biotype seq_coord_system
                     <character> <integer>    <character>        <integer>
  ENSMUSG00000071528       Usmg5      <NA> protein_coding             <NA>
  -------
  seqinfo: 1 sequence from GRCm38 genome

Check the vignette of the ensembldb package for some more use cases.

hope this helps!

cheers, jo

score 0 · Answer 3 · 2015-09-10

Hi Sarah,

Sorry that you're having problems with Ensembl data. As it happens, it seems that a redirection was not fully in place. The problem should now be fixed.

Again, sorry for the inconvenience.

Best regards,
Amonida

--
Amonida Zadissa
Ensembl Production Team
EMBL-EBI
Hinxton
England

On 09/09/2015 16:12, Sarah Bonnin [bioc] wrote:
> Sarah Bonnin posted the Question: "biomaRt fails for M.musculus annotation, Ensembl version 80"