How to get Genomic feature in txdb format from biomart for sacCer2 (EF4) assembly?
1
0
Entering edit mode
vinod.acear ▴ 50
@vinodacear-8884
Last seen 4.4 years ago
India

Hi i want to have genomic features in txdb format using "makeTxDbFromBiomart" but when i used following command i got it for sacCer3 assembly (R64-1-1) but i want these fetaures for sacCer2 (EF4) using biomart

Please suggest me the possible way to do this

 

> txdb <- makeTxDbFromBiomart(biomart = "ensembl", dataset = "scerevisiae_gene_ensembl")
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: BioMart
# Organism: Saccharomyces cerevisiae
# Taxonomy ID: 4932
# Resource URL: www.biomart.org:80
# BioMart database: ensembl
# BioMart database version: ENSEMBL GENES 82 (SANGER UK)
# BioMart dataset: scerevisiae_gene_ensembl
# BioMart dataset description: Saccharomyces cerevisiae genes (R64-1-1)
# BioMart dataset version: R64-1-1
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 7126
# exon_nrow: 7553
# cds_nrow: 7034
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2015-10-28 11:44:12 +0500 (Wed, 28 Oct 2015)
# GenomicFeatures version at creation time: 1.22.0
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1
makeTxDbFromBiomart genomicfeatures txdb biomart • 3.0k views
ADD COMMENT
4
Entering edit mode
@james-w-macdonald-5106
Last seen 24 minutes ago
United States

This takes some fiddling around in order to get what you want. Since you need an archive version of a species, you first go to www.ensembl.org, then under the 'Browse a Genome' box on the upper left, choose S. cerevisiae. When the new page loads, click on the 'view in archive site' link at the bottom right. That will bring up a box listing all the available archives, the last one for EF4 was Ensembl74 from December 2013. Click that link. The URI for the page that loads is http://dec2013.archive.ensembl.org/Saccharomyces_cerevisiae/Info/Index. We want the first part of that, dec2013.archive.ensembl.org

Now we need to figure out the correct mart name, so do this:

> listMarts(host = "dec2013.archive.ensembl.org")
               biomart               version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 74
2     ENSEMBL_MART_SNP  Ensembl Variation 74
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 74
4    ENSEMBL_MART_VEGA               Vega 54
5                pride        PRIDE (EBI UK)

So we want "ENSEMBL_MART_ENSEMBL". Now note that you can specify things to makeTxDbFromBiomart(), like the mart or host:

> args(makeTxDbFromBiomart)
function (biomart = "ensembl", dataset = "hsapiens_gene_ensembl",
    transcript_ids = NULL, circ_seqs = DEFAULT_CIRC_SEQS, filters = "",
    id_prefix = "ensembl_", host = "www.biomart.org", port = 80,
    taxonomyId = NA, miRBaseBuild = NA)

So we now plug in the information we have.

> z <- makeTxDbFromBiomart("ENSEMBL_MART_ENSEMBL", "scerevisiae_gene_ensembl", host = "dec2013.archive.ensembl.org")
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped)
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .normarg_makeTxDb_chrominfo(chrominfo, transcripts$tx_chrom,  :
  chromosome lengths and circularity flags are not available for this TxDb object
> z
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: BioMart
# Organism: Saccharomyces cerevisiae
# Taxonomy ID: 4932
# Resource URL: Dec2013.archive.ensembl.org:80
# BioMart database: ENSEMBL_MART_ENSEMBL
# BioMart database version: Ensembl Genes 74
# BioMart dataset: scerevisiae_gene_ensembl
# BioMart dataset description: Saccharomyces cerevisiae genes (EF4)
# BioMart dataset version: EF4
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 7126
# exon_nrow: 7553
# cds_nrow: 7034
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2015-10-28 06:38:52 -0700 (Wed, 28 Oct 2015)
# GenomicFeatures version at creation time: 1.22.0
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1

Or you could use the AnnotationHub package:

> library(AnnotationHub)

Attaching package:  AnnotationHub

The following object is masked from  package:Biobase :

    cache

> hub <- AnnotationHub()
snapshotDate(): 2015-08-26
> query(hub, c("cerevisiae", "GRanges", "Ensembl"))
AnnotationHub with 15 records
# snapshotDate(): 2015-08-26
# $dataprovider: Ensembl, UCSC
# $species: Saccharomyces cerevisiae
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH7049"]]'

            title                                  
  AH7049  | Ensembl Genes                          
  AH7057  | Ensembl Genes                          
  AH7523  | Saccharomyces_cerevisiae.EF4.69.gtf    
  AH7584  | Saccharomyces_cerevisiae.EF4.70.gtf    
  AH7692  | Saccharomyces_cerevisiae.EF4.71.gtf    
  ...       ...                                    
  AH28704 | Saccharomyces_cerevisiae.R64-1-1.76.gtf
  AH28773 | Saccharomyces_cerevisiae.R64-1-1.79.gtf
  AH28842 | Saccharomyces_cerevisiae.R64-1-1.77.gtf
  AH47096 | Saccharomyces_cerevisiae.R64-1-1.80.gtf
  AH47993 | Saccharomyces_cerevisiae.R64-1-1.81.gtf

So that looks pretty close, but we can't see all the available things, so let's look closer.

> mcols(query(hub, c("cerevisiae", "GRanges", "Ensembl")))$title
 [1] "Ensembl Genes"                          
 [2] "Ensembl Genes"                          
 [3] "Saccharomyces_cerevisiae.EF4.69.gtf"    
 [4] "Saccharomyces_cerevisiae.EF4.70.gtf"    
 [5] "Saccharomyces_cerevisiae.EF4.71.gtf"    
 [6] "Saccharomyces_cerevisiae.EF4.72.gtf"    
 [7] "Saccharomyces_cerevisiae.EF4.73.gtf"    
 [8] "Saccharomyces_cerevisiae.EF4.74.gtf"    
 [9] "Saccharomyces_cerevisiae.R64-1-1.75.gtf"
[10] "Saccharomyces_cerevisiae.R64-1-1.78.gtf"
[11] "Saccharomyces_cerevisiae.R64-1-1.76.gtf"
[12] "Saccharomyces_cerevisiae.R64-1-1.79.gtf"
[13] "Saccharomyces_cerevisiae.R64-1-1.77.gtf"
[14] "Saccharomyces_cerevisiae.R64-1-1.80.gtf"
[15] "Saccharomyces_cerevisiae.R64-1-1.81.gtf"

> mcols(query(hub, c("cerevisiae", "GRanges", "Ensembl")))$sourceurl
 [1] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/sacCer2/database/ensGene"                                   
 [2] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/sacCer1/database/ensGene"                                   
 [3] "ftp://ftp.ensembl.org/pub/release-69/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.69.gtf.gz"    
 [4] "ftp://ftp.ensembl.org/pub/release-70/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.70.gtf.gz"    
 [5] "ftp://ftp.ensembl.org/pub/release-71/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.71.gtf.gz"    
 [6] "ftp://ftp.ensembl.org/pub/release-72/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.72.gtf.gz"    
 [7] "ftp://ftp.ensembl.org/pub/release-73/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.73.gtf.gz"    
 [8] "ftp://ftp.ensembl.org/pub/release-74/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.74.gtf.gz"    
 [9] "ftp://ftp.ensembl.org/pub/release-75/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.75.gtf.gz"
[10] "ftp://ftp.ensembl.org/pub/release-78/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.78.gtf.gz"
[11] "ftp://ftp.ensembl.org/pub/release-76/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.76.gtf.gz"
[12] "ftp://ftp.ensembl.org/pub/release-79/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.79.gtf.gz"
[13] "ftp://ftp.ensembl.org/pub/release-77/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.77.gtf.gz"
[14] "ftp://ftp.ensembl.org/pub/release-80/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.80.gtf.gz"
[15] "ftp://ftp.ensembl.org/pub/release-81/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.81.gtf.gz"

So we want either the first or the seventh item.

> names(query(hub, c("cerevisiae", "GRanges", "Ensembl")))
 [1] "AH7049"  "AH7057"  "AH7523"  "AH7584"  "AH7692"  "AH7753"  "AH7816"
 [8] "AH8781"  "AH10712" "AH28636" "AH28704" "AH28773" "AH28842" "AH47096"
[15] "AH47993"

> hub[["AH7049"]]
GRanges object with 7130 ranges and 5 metadata columns:
         seqnames           ranges strand   |        name     score     itemRgb
            <Rle>        <IRanges>  <Rle>   | <character> <numeric> <character>
     [1]     chrI [130802, 131986]      +   |     YAL012W         0        <NA>
     [2]     chrI [   335,    649]      +   |     YAL069W         0        <NA>
     [3]     chrI [   538,    792]      +   |   YAL068W-A         0        <NA>
     [4]     chrI [  1807,   2169]      -   |     YAL068C         0        <NA>
     [5]     chrI [  2480,   2707]      +   |   YAL067W-A         0        <NA>
     ...      ...              ...    ... ...         ...       ...         ...
  [7126]  chrXIII [923492, 923800]      -   |     YMR326C         0        <NA>
  [7127]  2micron [   252,   1523]      +   |      R0010W         0        <NA>
  [7128]  2micron [  1887,   3008]      -   |      R0020C         0        <NA>
  [7129]  2micron [  3271,   3816]      +   |      R0030W         0        <NA>
  [7130]  2micron [  5308,   6198]      -   |      R0040C         0        <NA>
                    thick        blocks
                <IRanges> <IRangesList>
     [1] [130802, 131986]     [1, 1185]
     [2] [   335,    649]      [1, 315]
     [3] [   538,    792]      [1, 255]
     [4] [  1807,   2169]      [1, 363]
     [5] [  2480,   2707]      [1, 228]
     ...              ...           ...
  [7126] [923492, 923800]      [1, 309]
  [7127] [   252,   1523]     [1, 1272]
  [7128] [  1887,   3008]     [1, 1122]
  [7129] [  3271,   3816]      [1, 546]
  [7130] [  5308,   6198]      [1, 891]
  -------
  seqinfo: 18 sequences (2 circular) from sacCer2 genome

> hub[["AH7816"]]
downloading from  https://annotationhub.bioconductor.org/fetch/7816
retrieving 1 resource
  |======================================================================| 100%
using guess work to populate seqinfo
GRanges object with 27995 ranges and 12 metadata columns:
          seqnames         ranges strand   |         source        type
             <Rle>      <IRanges>  <Rle>   |       <factor>    <factor>
      [1]       IV   [1802, 2953]      +   | protein_coding        exon
      [2]       IV   [1802, 2950]      +   | protein_coding         CDS
      [3]       IV   [1802, 1804]      +   | protein_coding start_codon
      [4]       IV   [2951, 2953]      +   | protein_coding  stop_codon
      [5]       IV   [3762, 3836]      +   | protein_coding        exon
      ...      ...            ...    ... ...            ...         ...
  [27991]     Mito [85295, 85777]      +   |          ncRNA        exon
  [27992]     Mito [85554, 85709]      +   | protein_coding        exon
  [27993]     Mito [85554, 85706]      +   | protein_coding         CDS
  [27994]     Mito [85554, 85556]      +   | protein_coding start_codon
  [27995]     Mito [85707, 85709]      +   | protein_coding  stop_codon
              score     phase     gene_id transcript_id exon_number   gene_name
          <numeric> <integer> <character>   <character>   <numeric> <character>
      [1]      <NA>      <NA>     YDL248W       YDL248W           1        COS7
      [2]      <NA>         0     YDL248W       YDL248W           1        COS7
      [3]      <NA>         0     YDL248W       YDL248W           1        COS7
      [4]      <NA>         0     YDL248W       YDL248W           1        COS7
      [5]      <NA>      <NA>   YDL247W-A     YDL247W-A           1   YDL247W-A
      ...       ...       ...         ...           ...         ...         ...
  [27991]      <NA>      <NA>        RPM1          RPM1           1        RPM1
  [27992]      <NA>      <NA>       Q0297         Q0297           1       Q0297
  [27993]      <NA>         0       Q0297         Q0297           1       Q0297
  [27994]      <NA>         0       Q0297         Q0297           1       Q0297
  [27995]      <NA>         0       Q0297         Q0297           1       Q0297
            gene_biotype transcript_name     exon_id  protein_id
             <character>     <character> <character> <character>
      [1] protein_coding            COS7   YDL248W.1        <NA>
      [2] protein_coding            COS7        <NA>     YDL248W
      [3] protein_coding            COS7        <NA>        <NA>
      [4] protein_coding            COS7        <NA>        <NA>
      [5] protein_coding       YDL247W-A YDL247W-A.1        <NA>
      ...            ...             ...         ...         ...
  [27991]          ncRNA            RPM1      RPM1.1        <NA>
  [27992] protein_coding           Q0297     Q0297.1        <NA>
  [27993] protein_coding           Q0297        <NA>       Q0297
  [27994] protein_coding           Q0297        <NA>        <NA>
  [27995] protein_coding           Q0297        <NA>        <NA>
  -------
  seqinfo: 17 sequences (1 circular) from EF4 genome
There were 50 or more warnings (use warnings() to see the first 50)

So depending on what you want, you can often get things from AnnotationHub() without having to 'roll your own'.

ADD COMMENT
0
Entering edit mode

Just one small addition to Jim's excellent answer: then use makeTxDbFromGRanges() on the GRanges object you got from AnnotationHub to turn it into a TxDb object.

H.

ADD REPLY
0
Entering edit mode

Hi Herve, I tried your method too. Here is output ..please make me correct if i am wrong

netxdb=hub[["AH7049"]]
> netxdb
GRanges object with 7130 ranges and 5 metadata columns:
         seqnames           ranges strand   |        name     score     itemRgb            thick
            <Rle>        <IRanges>  <Rle>   | <character> <numeric> <character>        <IRanges>
     [1]     chrI [130802, 131986]      +   |     YAL012W         0        <NA> [130802, 131986]
     [2]     chrI [   335,    649]      +   |     YAL069W         0        <NA> [   335,    649]
     [3]     chrI [   538,    792]      +   |   YAL068W-A         0        <NA> [   538,    792]
     [4]     chrI [  1807,   2169]      -   |     YAL068C         0        <NA> [  1807,   2169]
     [5]     chrI [  2480,   2707]      +   |   YAL067W-A         0        <NA> [  2480,   2707]
     ...      ...              ...    ... ...         ...       ...         ...              ...
  [7126]  chrXIII [923492, 923800]      -   |     YMR326C         0        <NA> [923492, 923800]
  [7127]  2micron [   252,   1523]      +   |      R0010W         0        <NA> [   252,   1523]
  [7128]  2micron [  1887,   3008]      -   |      R0020C         0        <NA> [  1887,   3008]
  [7129]  2micron [  3271,   3816]      +   |      R0030W         0        <NA> [  3271,   3816]
  [7130]  2micron [  5308,   6198]      -   |      R0040C         0        <NA> [  5308,   6198]
                blocks
         <IRangesList>
     [1]     [1, 1185]
     [2]      [1, 315]
     [3]      [1, 255]
     [4]      [1, 363]
     [5]      [1, 228]
     ...           ...
  [7126]      [1, 309]
  [7127]     [1, 1272]
  [7128]     [1, 1122]
  [7129]      [1, 546]
  [7130]      [1, 891]
  -------
  seqinfo: 18 sequences (2 circular) from sacCer2 genome
> makeTxDbFromGRanges(netxdb)
Error in .get_type(gr_mcols) : 'gr' must have a "type" metadata column

 

ADD REPLY
0
Entering edit mode

makeTxDbFromGRanges() can only be used on a GRanges object that was obtained by importing a GFF or GTF file with rtracklayer::import(). That seems to be the case for 13 of the 15 resources returned by Jim's search but not for the first 2 (AH7049, AH7057). However with the 13 others, makeTxDbFromGRanges() should work. For example, with AH8781 (Saccharomyces_cerevisiae.EF4.74.gtf):

EF4_gtf <- hub[["AH8781"]]
library(GenomicFeatures)
EF4_txdb <- makeTxDbFromGRanges(EF4_gtf)

For the resource you tried to fetch (AH7049), its sourceurl field is

rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/sacCer2/database/ensGene

which seems to indicate that it was downloaded from the UCSC Genome Browser and turned into a GRanges object with the rtracklayer package but it's not clear to me how it was obtained exactly. The problem is that the resulting GRanges object doesn't seem to have any useful information in it so maybe the recipe that was used to generate it is broken, hard to tell.

Anyway If you want to import the ensGene track for sacCer2 as a TxDb object, an easy way is to use makeTxDbFromUCSC(). It will grab the data directly from UCSC:

sacCer2_txdb <- makeTxDbFromUCSC("sacCer2", "ensGene")

See ?makeTxDbFromUCSC for more information.

Cheers,
H.

ADD REPLY
0
Entering edit mode

Hi Herve, Thanks for suggesting other ways to solve the problem. i tried both of your methods those worked fine. But  i was required to txdb file from biomart format because genfeatures() works well in that.

ADD REPLY
0
Entering edit mode

OK. Then I'm not sure why you tried to make a TxDb from hub[["AH7049"]]...

ADD REPLY
0
Entering edit mode

Thanks James,  it worked exactly as per my requirement...

ADD REPLY

Login before adding your answer.

Traffic: 596 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6