Question

How to get genomic features in TxDb format from BioMart for the sacCer2 (EF4) assembly?

vinod.acear ▴ 50

Hi, I want to get genomic features in TxDb format using makeTxDbFromBiomart(). The command below gives me the features for the sacCer3 assembly (R64-1-1), but I want them for sacCer2 (EF4), still from BioMart.

Please suggest a possible way to do this.

> txdb <- makeTxDbFromBiomart(biomart = "ensembl", dataset = "scerevisiae_gene_ensembl")
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: BioMart
# Organism: Saccharomyces cerevisiae
# Taxonomy ID: 4932
# Resource URL: www.biomart.org:80
# BioMart database: ensembl
# BioMart database version: ENSEMBL GENES 82 (SANGER UK)
# BioMart dataset: scerevisiae_gene_ensembl
# BioMart dataset description: Saccharomyces cerevisiae genes (R64-1-1)
# BioMart dataset version: R64-1-1
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 7126
# exon_nrow: 7553
# cds_nrow: 7034
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2015-10-28 11:44:12 +0500 (Wed, 28 Oct 2015)
# GenomicFeatures version at creation time: 1.22.0
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1

updated 9.2 years ago by James W. MacDonald 67k • written 9.2 years ago by vinod.acear ▴ 50

score 4 · Accepted Answer · 2015-10-28

This takes some fiddling to get what you want. Since you need an archived version of a species, first go to www.ensembl.org and, in the 'Browse a Genome' box on the upper left, choose S. cerevisiae. When the new page loads, click the 'view in archive site' link at the bottom right. That brings up a box listing all the available archives; the last one for EF4 was Ensembl 74 from December 2013. Click that link. The URI for the page that loads is http://dec2013.archive.ensembl.org/Saccharomyces_cerevisiae/Info/Index. We want the host part of that, dec2013.archive.ensembl.org.
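If you would rather not copy the host out of the address bar by hand, a minimal base R sketch like the one below pulls it out of the archive URI. The helper name archive_host is made up for illustration; it is not part of biomaRt or GenomicFeatures.

```r
# Hypothetical helper (not from any Bioconductor package):
# return the host part of a URL such as the Ensembl archive page address.
archive_host <- function(url) {
  no_scheme <- sub("^[a-zA-Z]+://", "", url)  # drop "http://" or similar
  sub("/.*$", "", no_scheme)                  # drop everything from the first "/" on
}

url <- "http://dec2013.archive.ensembl.org/Saccharomyces_cerevisiae/Info/Index"
host <- archive_host(url)
print(host)
```

The resulting string is what gets passed as the host argument in the calls that follow.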

Now we need to figure out the correct mart name, so do this:

> listMarts(host = "dec2013.archive.ensembl.org")
               biomart               version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 74
2     ENSEMBL_MART_SNP  Ensembl Variation 74
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 74
4    ENSEMBL_MART_VEGA               Vega 54
5                pride        PRIDE (EBI UK)

So we want "ENSEMBL_MART_ENSEMBL". Now note that you can specify things to makeTxDbFromBiomart(), like the mart or host:

> args(makeTxDbFromBiomart)
function (biomart = "ensembl", dataset = "hsapiens_gene_ensembl",
    transcript_ids = NULL, circ_seqs = DEFAULT_CIRC_SEQS, filters = "",
    id_prefix = "ensembl_", host = "www.biomart.org", port = 80,
    taxonomyId = NA, miRBaseBuild = NA)

So we now plug in the information we have.

> z <- makeTxDbFromBiomart("ENSEMBL_MART_ENSEMBL", "scerevisiae_gene_ensembl", host = "dec2013.archive.ensembl.org")
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped)
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .normarg_makeTxDb_chrominfo(chrominfo, transcripts$tx_chrom,  :
  chromosome lengths and circularity flags are not available for this TxDb object
> z
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: BioMart
# Organism: Saccharomyces cerevisiae
# Taxonomy ID: 4932
# Resource URL: Dec2013.archive.ensembl.org:80
# BioMart database: ENSEMBL_MART_ENSEMBL
# BioMart database version: Ensembl Genes 74
# BioMart dataset: scerevisiae_gene_ensembl
# BioMart dataset description: Saccharomyces cerevisiae genes (EF4)
# BioMart dataset version: EF4
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 7126
# exon_nrow: 7553
# cds_nrow: 7034
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2015-10-28 06:38:52 -0700 (Wed, 28 Oct 2015)
# GenomicFeatures version at creation time: 1.22.0
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1

Or you could use the AnnotationHub package:

> library(AnnotationHub)

Attaching package: 'AnnotationHub'

The following object is masked from 'package:Biobase':

    cache

> hub <- AnnotationHub()
snapshotDate(): 2015-08-26
> query(hub, c("cerevisiae", "GRanges", "Ensembl"))
AnnotationHub with 15 records
# snapshotDate(): 2015-08-26
# $dataprovider: Ensembl, UCSC
# $species: Saccharomyces cerevisiae
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH7049"]]'

            title                                  
  AH7049  | Ensembl Genes                          
  AH7057  | Ensembl Genes                          
  AH7523  | Saccharomyces_cerevisiae.EF4.69.gtf    
  AH7584  | Saccharomyces_cerevisiae.EF4.70.gtf    
  AH7692  | Saccharomyces_cerevisiae.EF4.71.gtf    
  ...       ...                                    
  AH28704 | Saccharomyces_cerevisiae.R64-1-1.76.gtf
  AH28773 | Saccharomyces_cerevisiae.R64-1-1.79.gtf
  AH28842 | Saccharomyces_cerevisiae.R64-1-1.77.gtf
  AH47096 | Saccharomyces_cerevisiae.R64-1-1.80.gtf
  AH47993 | Saccharomyces_cerevisiae.R64-1-1.81.gtf

So that looks pretty close, but we can't see all the available things, so let's look closer.

> mcols(query(hub, c("cerevisiae", "GRanges", "Ensembl")))$title
 [1] "Ensembl Genes"                          
 [2] "Ensembl Genes"                          
 [3] "Saccharomyces_cerevisiae.EF4.69.gtf"    
 [4] "Saccharomyces_cerevisiae.EF4.70.gtf"    
 [5] "Saccharomyces_cerevisiae.EF4.71.gtf"    
 [6] "Saccharomyces_cerevisiae.EF4.72.gtf"    
 [7] "Saccharomyces_cerevisiae.EF4.73.gtf"    
 [8] "Saccharomyces_cerevisiae.EF4.74.gtf"    
 [9] "Saccharomyces_cerevisiae.R64-1-1.75.gtf"
[10] "Saccharomyces_cerevisiae.R64-1-1.78.gtf"
[11] "Saccharomyces_cerevisiae.R64-1-1.76.gtf"
[12] "Saccharomyces_cerevisiae.R64-1-1.79.gtf"
[13] "Saccharomyces_cerevisiae.R64-1-1.77.gtf"
[14] "Saccharomyces_cerevisiae.R64-1-1.80.gtf"
[15] "Saccharomyces_cerevisiae.R64-1-1.81.gtf"

> mcols(query(hub, c("cerevisiae", "GRanges", "Ensembl")))$sourceurl
 [1] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/sacCer2/database/ensGene"                                   
 [2] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/sacCer1/database/ensGene"                                   
 [3] "ftp://ftp.ensembl.org/pub/release-69/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.69.gtf.gz"    
 [4] "ftp://ftp.ensembl.org/pub/release-70/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.70.gtf.gz"    
 [5] "ftp://ftp.ensembl.org/pub/release-71/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.71.gtf.gz"    
 [6] "ftp://ftp.ensembl.org/pub/release-72/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.72.gtf.gz"    
 [7] "ftp://ftp.ensembl.org/pub/release-73/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.73.gtf.gz"    
 [8] "ftp://ftp.ensembl.org/pub/release-74/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.74.gtf.gz"    
 [9] "ftp://ftp.ensembl.org/pub/release-75/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.75.gtf.gz"
[10] "ftp://ftp.ensembl.org/pub/release-78/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.78.gtf.gz"
[11] "ftp://ftp.ensembl.org/pub/release-76/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.76.gtf.gz"
[12] "ftp://ftp.ensembl.org/pub/release-79/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.79.gtf.gz"
[13] "ftp://ftp.ensembl.org/pub/release-77/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.77.gtf.gz"
[14] "ftp://ftp.ensembl.org/pub/release-80/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.80.gtf.gz"
[15] "ftp://ftp.ensembl.org/pub/release-81/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.81.gtf.gz"

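So we want either the first item (the UCSC sacCer2 ensGene track) or one of the EF4 GTF files, for example the seventh item (release 73). Note that the eighth item (release 74) is the last EF4 release.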

> names(query(hub, c("cerevisiae", "GRanges", "Ensembl")))
 [1] "AH7049"  "AH7057"  "AH7523"  "AH7584"  "AH7692"  "AH7753"  "AH7816"
 [8] "AH8781"  "AH10712" "AH28636" "AH28704" "AH28773" "AH28842" "AH47096"
[15] "AH47993"
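Rather than counting positions by eye, you can match titles to IDs programmatically. The sketch below rebuilds the two vectors printed above by hand so that it runs without a hub connection; in a live session you would take them from mcols(q)$title and names(q) for the query object q instead (q is just an illustrative name).

```r
# Titles and record IDs copied from the query output above.
titles <- c(
  "Ensembl Genes", "Ensembl Genes",
  paste0("Saccharomyces_cerevisiae.EF4.", 69:74, ".gtf"),
  paste0("Saccharomyces_cerevisiae.R64-1-1.", c(75, 78, 76, 79, 77, 80, 81), ".gtf")
)
names(titles) <- c(
  "AH7049", "AH7057", "AH7523", "AH7584", "AH7692", "AH7753", "AH7816",
  "AH8781", "AH10712", "AH28636", "AH28704", "AH28773", "AH28842",
  "AH47096", "AH47993"
)

# Keep only the records whose title mentions the EF4 assembly.
ef4 <- titles[grepl("EF4", titles, fixed = TRUE)]

# Extract the Ensembl release number from each title and pick the newest one.
release <- as.integer(sub("^.*EF4\\.([0-9]+)\\.gtf$", "\\1", ef4))
latest_ef4_id <- names(ef4)[which.max(release)]

print(ef4)
print(latest_ef4_id)
```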

> hub[["AH7049"]]
GRanges object with 7130 ranges and 5 metadata columns:
         seqnames           ranges strand   |        name     score     itemRgb
            <Rle>        <IRanges>  <Rle>   | <character> <numeric> <character>
     [1]     chrI [130802, 131986]      +   |     YAL012W         0        <NA>
     [2]     chrI [   335,    649]      +   |     YAL069W         0        <NA>
     [3]     chrI [   538,    792]      +   |   YAL068W-A         0        <NA>
     [4]     chrI [  1807,   2169]      -   |     YAL068C         0        <NA>
     [5]     chrI [  2480,   2707]      +   |   YAL067W-A         0        <NA>
     ...      ...              ...    ... ...         ...       ...         ...
  [7126]  chrXIII [923492, 923800]      -   |     YMR326C         0        <NA>
  [7127]  2micron [   252,   1523]      +   |      R0010W         0        <NA>
  [7128]  2micron [  1887,   3008]      -   |      R0020C         0        <NA>
  [7129]  2micron [  3271,   3816]      +   |      R0030W         0        <NA>
  [7130]  2micron [  5308,   6198]      -   |      R0040C         0        <NA>
                    thick        blocks
                <IRanges> <IRangesList>
     [1] [130802, 131986]     [1, 1185]
     [2] [   335,    649]      [1, 315]
     [3] [   538,    792]      [1, 255]
     [4] [  1807,   2169]      [1, 363]
     [5] [  2480,   2707]      [1, 228]
     ...              ...           ...
  [7126] [923492, 923800]      [1, 309]
  [7127] [   252,   1523]     [1, 1272]
  [7128] [  1887,   3008]     [1, 1122]
  [7129] [  3271,   3816]      [1, 546]
  [7130] [  5308,   6198]      [1, 891]
  -------
  seqinfo: 18 sequences (2 circular) from sacCer2 genome

> hub[["AH7816"]]
downloading from  https://annotationhub.bioconductor.org/fetch/7816
retrieving 1 resource
  |======================================================================| 100%
using guess work to populate seqinfo
GRanges object with 27995 ranges and 12 metadata columns:
          seqnames         ranges strand   |         source        type
             <Rle>      <IRanges>  <Rle>   |       <factor>    <factor>
      [1]       IV   [1802, 2953]      +   | protein_coding        exon
      [2]       IV   [1802, 2950]      +   | protein_coding         CDS
      [3]       IV   [1802, 1804]      +   | protein_coding start_codon
      [4]       IV   [2951, 2953]      +   | protein_coding  stop_codon
      [5]       IV   [3762, 3836]      +   | protein_coding        exon
      ...      ...            ...    ... ...            ...         ...
  [27991]     Mito [85295, 85777]      +   |          ncRNA        exon
  [27992]     Mito [85554, 85709]      +   | protein_coding        exon
  [27993]     Mito [85554, 85706]      +   | protein_coding         CDS
  [27994]     Mito [85554, 85556]      +   | protein_coding start_codon
  [27995]     Mito [85707, 85709]      +   | protein_coding  stop_codon
              score     phase     gene_id transcript_id exon_number   gene_name
          <numeric> <integer> <character>   <character>   <numeric> <character>
      [1]      <NA>      <NA>     YDL248W       YDL248W           1        COS7
      [2]      <NA>         0     YDL248W       YDL248W           1        COS7
      [3]      <NA>         0     YDL248W       YDL248W           1        COS7
      [4]      <NA>         0     YDL248W       YDL248W           1        COS7
      [5]      <NA>      <NA>   YDL247W-A     YDL247W-A           1   YDL247W-A
      ...       ...       ...         ...           ...         ...         ...
  [27991]      <NA>      <NA>        RPM1          RPM1           1        RPM1
  [27992]      <NA>      <NA>       Q0297         Q0297           1       Q0297
  [27993]      <NA>         0       Q0297         Q0297           1       Q0297
  [27994]      <NA>         0       Q0297         Q0297           1       Q0297
  [27995]      <NA>         0       Q0297         Q0297           1       Q0297
            gene_biotype transcript_name     exon_id  protein_id
             <character>     <character> <character> <character>
      [1] protein_coding            COS7   YDL248W.1        <NA>
      [2] protein_coding            COS7        <NA>     YDL248W
      [3] protein_coding            COS7        <NA>        <NA>
      [4] protein_coding            COS7        <NA>        <NA>
      [5] protein_coding       YDL247W-A YDL247W-A.1        <NA>
      ...            ...             ...         ...         ...
  [27991]          ncRNA            RPM1      RPM1.1        <NA>
  [27992] protein_coding           Q0297     Q0297.1        <NA>
  [27993] protein_coding           Q0297        <NA>       Q0297
  [27994] protein_coding           Q0297        <NA>        <NA>
  [27995] protein_coding           Q0297        <NA>        <NA>
  -------
  seqinfo: 17 sequences (1 circular) from EF4 genome
There were 50 or more warnings (use warnings() to see the first 50)

So depending on what you want, you can often get things from AnnotationHub() without having to 'roll your own'.
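Since the original question asked for a TxDb rather than a GRanges, here is a hedged sketch of how the GTF-derived GRanges might be converted with makeTxDbFromGRanges(). Assumptions: the record ID is still present in your hub snapshot, you have network access, and the function is available either from GenomicFeatures (as in the version used in this thread) or from the txdbmaker package, where it lives in recent Bioconductor releases. The wrapper name ef4_txdb_from_hub is invented for illustration.

```r
# Sketch only: needs network access plus AnnotationHub and either
# GenomicFeatures or txdbmaker installed. Nothing is downloaded until
# the function is actually called.
ef4_txdb_from_hub <- function(record_id = "AH7816") {
  hub <- AnnotationHub::AnnotationHub()
  gr <- hub[[record_id]]  # GRanges imported from the Ensembl EF4 GTF file

  # Assumption: newer Bioconductor ships makeTxDbFromGRanges() in txdbmaker;
  # older releases (such as the one in this thread) have it in GenomicFeatures.
  make_txdb <- if (requireNamespace("txdbmaker", quietly = TRUE)) {
    txdbmaker::makeTxDbFromGRanges
  } else {
    GenomicFeatures::makeTxDbFromGRanges
  }
  make_txdb(gr)
}

# Usage (not run here because it downloads data):
# txdb <- ef4_txdb_from_hub("AH7816")
```

The UCSC record (AH7049) is a BED-style track without the type, gene_id and transcript_id columns, so this conversion is only expected to work for the GTF-derived records.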