Getting this to work takes a bit of fiddling. Since you need an archived version of the annotation, first go to www.ensembl.org and, in the 'Browse a Genome' box at the upper left, choose S. cerevisiae. When the new page loads, click the 'view in archive site' link at the bottom right. That brings up a box listing all the available archives; the last one for EF4 was Ensembl 74, from December 2013. Click that link. The URL of the page that loads is http://dec2013.archive.ensembl.org/Saccharomyces_cerevisiae/Info/Index, and we want the host part of it, dec2013.archive.ensembl.org.
Now we need to figure out the correct mart name, so do this:
> listMarts(host = "dec2013.archive.ensembl.org")
               biomart               version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 74
2     ENSEMBL_MART_SNP  Ensembl Variation 74
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 74
4    ENSEMBL_MART_VEGA               Vega 54
5                pride        PRIDE (EBI UK)
So we want "ENSEMBL_MART_ENSEMBL". Note that makeTxDbFromBiomart() lets you specify things like the mart and the host:
> args(makeTxDbFromBiomart)
function (biomart = "ensembl", dataset = "hsapiens_gene_ensembl",
transcript_ids = NULL, circ_seqs = DEFAULT_CIRC_SEQS, filters = "",
id_prefix = "ensembl_", host = "www.biomart.org", port = 80,
taxonomyId = NA, miRBaseBuild = NA)
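The call also needs a dataset name, and the steps so far don't show how to find one. Here is a minimal sketch of one way to look it up with biomaRt. It assumes the archive host still serves BioMart queries (old archives do get retired) and that your biomaRt version accepts a bare host name; newer versions may want an explicit http:// or https:// prefix. The helper name find_datasets is made up for illustration.

```r
# Hypothetical helper (not part of biomaRt): list the datasets in a mart whose
# name or description matches a pattern.
# Requires the biomaRt package; nothing is contacted until the function is called.
find_datasets <- function(pattern,
                          biomart = "ENSEMBL_MART_ENSEMBL",
                          host = "dec2013.archive.ensembl.org") {
  mart <- biomaRt::useMart(biomart, host = host)
  ds <- biomaRt::listDatasets(mart)   # data.frame: dataset, description, version
  hit <- grepl(pattern, ds$dataset, ignore.case = TRUE) |
    grepl(pattern, ds$description, ignore.case = TRUE)
  ds[hit, , drop = FALSE]
}

# Usage (needs network access):
# find_datasets("cerevisiae")
```

The value in the dataset column of the matching row is what goes into the second argument of makeTxDbFromBiomart().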
So we now plug in the information we have.
> z <- makeTxDbFromBiomart("ENSEMBL_MART_ENSEMBL", "scerevisiae_gene_ensembl", host = "dec2013.archive.ensembl.org")
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped)
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .normarg_makeTxDb_chrominfo(chrominfo, transcripts$tx_chrom, :
chromosome lengths and circularity flags are not available for this TxDb object
> z
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: BioMart
# Organism: Saccharomyces cerevisiae
# Taxonomy ID: 4932
# Resource URL: Dec2013.archive.ensembl.org:80
# BioMart database: ENSEMBL_MART_ENSEMBL
# BioMart database version: Ensembl Genes 74
# BioMart dataset: scerevisiae_gene_ensembl
# BioMart dataset description: Saccharomyces cerevisiae genes (EF4)
# BioMart dataset version: EF4
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 7126
# exon_nrow: 7553
# cds_nrow: 7034
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2015-10-28 06:38:52 -0700 (Wed, 28 Oct 2015)
# GenomicFeatures version at creation time: 1.22.0
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1
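Once the TxDb exists, it can be queried and saved so that the archive does not have to be contacted again. The following is only a sketch: it assumes z is the object created above, the helper name and the default file name are arbitrary, and because the chrominfo step failed the sequence lengths are expected to come back as NA.

```r
# Hypothetical helper: take a quick look at a TxDb and save it to disk for reuse.
# Requires GenomicFeatures, GenomeInfoDb and AnnotationDbi.
inspect_and_save_txdb <- function(txdb,
                                  file = "scerevisiae_EF4_ensembl74.sqlite") {
  tx <- GenomicFeatures::transcripts(txdb)                    # GRanges of transcripts
  ex_by_gene <- GenomicFeatures::exonsBy(txdb, by = "gene")   # GRangesList, one element per gene
  message(length(tx), " transcripts in ", length(ex_by_gene), " genes")
  # Probably all NA here, since chromosome lengths were not available
  print(GenomeInfoDb::seqlengths(txdb))
  AnnotationDbi::saveDb(txdb, file = file)
  invisible(file)
}

# Usage:
# f <- inspect_and_save_txdb(z)
# z2 <- AnnotationDbi::loadDb(f)   # reload in a later session
```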
Or you could use the AnnotationHub package:
> library(AnnotationHub)
Attaching package: AnnotationHub
The following object is masked from package:Biobase :
cache
> hub <- AnnotationHub()
snapshotDate(): 2015-08-26
> query(hub, c("cerevisiae", "GRanges", "Ensembl"))
AnnotationHub with 15 records
# snapshotDate(): 2015-08-26
# $dataprovider: Ensembl, UCSC
# $species: Saccharomyces cerevisiae
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
# sourcetype
# retrieve records with, e.g., 'object[["AH7049"]]'
title
AH7049 | Ensembl Genes
AH7057 | Ensembl Genes
AH7523 | Saccharomyces_cerevisiae.EF4.69.gtf
AH7584 | Saccharomyces_cerevisiae.EF4.70.gtf
AH7692 | Saccharomyces_cerevisiae.EF4.71.gtf
... ...
AH28704 | Saccharomyces_cerevisiae.R64-1-1.76.gtf
AH28773 | Saccharomyces_cerevisiae.R64-1-1.79.gtf
AH28842 | Saccharomyces_cerevisiae.R64-1-1.77.gtf
AH47096 | Saccharomyces_cerevisiae.R64-1-1.80.gtf
AH47993 | Saccharomyces_cerevisiae.R64-1-1.81.gtf
So that looks pretty close, but the listing is truncated, so let's look at all the titles and source URLs.
> mcols(query(hub, c("cerevisiae", "GRanges", "Ensembl")))$title
[1] "Ensembl Genes"
[2] "Ensembl Genes"
[3] "Saccharomyces_cerevisiae.EF4.69.gtf"
[4] "Saccharomyces_cerevisiae.EF4.70.gtf"
[5] "Saccharomyces_cerevisiae.EF4.71.gtf"
[6] "Saccharomyces_cerevisiae.EF4.72.gtf"
[7] "Saccharomyces_cerevisiae.EF4.73.gtf"
[8] "Saccharomyces_cerevisiae.EF4.74.gtf"
[9] "Saccharomyces_cerevisiae.R64-1-1.75.gtf"
[10] "Saccharomyces_cerevisiae.R64-1-1.78.gtf"
[11] "Saccharomyces_cerevisiae.R64-1-1.76.gtf"
[12] "Saccharomyces_cerevisiae.R64-1-1.79.gtf"
[13] "Saccharomyces_cerevisiae.R64-1-1.77.gtf"
[14] "Saccharomyces_cerevisiae.R64-1-1.80.gtf"
[15] "Saccharomyces_cerevisiae.R64-1-1.81.gtf"
> mcols(query(hub, c("cerevisiae", "GRanges", "Ensembl")))$sourceurl
[1] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/sacCer2/database/ensGene"
[2] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/sacCer1/database/ensGene"
[3] "ftp://ftp.ensembl.org/pub/release-69/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.69.gtf.gz"
[4] "ftp://ftp.ensembl.org/pub/release-70/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.70.gtf.gz"
[5] "ftp://ftp.ensembl.org/pub/release-71/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.71.gtf.gz"
[6] "ftp://ftp.ensembl.org/pub/release-72/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.72.gtf.gz"
[7] "ftp://ftp.ensembl.org/pub/release-73/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.73.gtf.gz"
[8] "ftp://ftp.ensembl.org/pub/release-74/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.74.gtf.gz"
[9] "ftp://ftp.ensembl.org/pub/release-75/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.75.gtf.gz"
[10] "ftp://ftp.ensembl.org/pub/release-78/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.78.gtf.gz"
[11] "ftp://ftp.ensembl.org/pub/release-76/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.76.gtf.gz"
[12] "ftp://ftp.ensembl.org/pub/release-79/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.79.gtf.gz"
[13] "ftp://ftp.ensembl.org/pub/release-77/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.77.gtf.gz"
[14] "ftp://ftp.ensembl.org/pub/release-80/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.80.gtf.gz"
[15] "ftp://ftp.ensembl.org/pub/release-81/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.81.gtf.gz"
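So we want either the first item (the UCSC ensGene track for sacCer2) or one of the EF4 GTF files, such as the seventh item (EF4.73). The eighth item (EF4.74) is the one that corresponds to the Ensembl 74 archive used above.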
> names(query(hub, c("cerevisiae", "GRanges", "Ensembl")))
[1] "AH7049" "AH7057" "AH7523" "AH7584" "AH7692" "AH7753" "AH7816"
[8] "AH8781" "AH10712" "AH28636" "AH28704" "AH28773" "AH28842" "AH47096"
[15] "AH47993"
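Rather than matching titles to IDs by eye, the lookup can be done in code. This is a small sketch in plain R, using a few of the titles and IDs copied from the listings above; the helper name pick_records is made up for illustration.

```r
# Titles and IDs copied from part of the listings above (same order).
ids <- c("AH7049", "AH7753", "AH7816", "AH8781", "AH10712")
titles <- c("Ensembl Genes",
            "Saccharomyces_cerevisiae.EF4.72.gtf",
            "Saccharomyces_cerevisiae.EF4.73.gtf",
            "Saccharomyces_cerevisiae.EF4.74.gtf",
            "Saccharomyces_cerevisiae.R64-1-1.75.gtf")

# Hypothetical helper: return the IDs whose title contains a literal string.
pick_records <- function(ids, titles, text) {
  ids[grepl(text, titles, fixed = TRUE)]
}

pick_records(ids, titles, "EF4.74")   # "AH8781"
pick_records(ids, titles, "EF4")      # "AH7753" "AH7816" "AH8781"
```

With a live hub the same helper could be fed from the query result, e.g. q <- query(hub, c("cerevisiae", "GRanges", "Ensembl")) followed by pick_records(names(q), mcols(q)$title, "EF4.74").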
> hub[["AH7049"]]
GRanges object with 7130 ranges and 5 metadata columns:
seqnames ranges strand | name score itemRgb
<Rle> <IRanges> <Rle> | <character> <numeric> <character>
[1] chrI [130802, 131986] + | YAL012W 0 <NA>
[2] chrI [ 335, 649] + | YAL069W 0 <NA>
[3] chrI [ 538, 792] + | YAL068W-A 0 <NA>
[4] chrI [ 1807, 2169] - | YAL068C 0 <NA>
[5] chrI [ 2480, 2707] + | YAL067W-A 0 <NA>
... ... ... ... ... ... ... ...
[7126] chrXIII [923492, 923800] - | YMR326C 0 <NA>
[7127] 2micron [ 252, 1523] + | R0010W 0 <NA>
[7128] 2micron [ 1887, 3008] - | R0020C 0 <NA>
[7129] 2micron [ 3271, 3816] + | R0030W 0 <NA>
[7130] 2micron [ 5308, 6198] - | R0040C 0 <NA>
thick blocks
<IRanges> <IRangesList>
[1] [130802, 131986] [1, 1185]
[2] [ 335, 649] [1, 315]
[3] [ 538, 792] [1, 255]
[4] [ 1807, 2169] [1, 363]
[5] [ 2480, 2707] [1, 228]
... ... ...
[7126] [923492, 923800] [1, 309]
[7127] [ 252, 1523] [1, 1272]
[7128] [ 1887, 3008] [1, 1122]
[7129] [ 3271, 3816] [1, 546]
[7130] [ 5308, 6198] [1, 891]
-------
seqinfo: 18 sequences (2 circular) from sacCer2 genome
> hub[["AH7816"]]
downloading from https://annotationhub.bioconductor.org/fetch/7816
retrieving 1 resource
|======================================================================| 100%
using guess work to populate seqinfo
GRanges object with 27995 ranges and 12 metadata columns:
seqnames ranges strand | source type
<Rle> <IRanges> <Rle> | <factor> <factor>
[1] IV [1802, 2953] + | protein_coding exon
[2] IV [1802, 2950] + | protein_coding CDS
[3] IV [1802, 1804] + | protein_coding start_codon
[4] IV [2951, 2953] + | protein_coding stop_codon
[5] IV [3762, 3836] + | protein_coding exon
... ... ... ... ... ... ...
[27991] Mito [85295, 85777] + | ncRNA exon
[27992] Mito [85554, 85709] + | protein_coding exon
[27993] Mito [85554, 85706] + | protein_coding CDS
[27994] Mito [85554, 85556] + | protein_coding start_codon
[27995] Mito [85707, 85709] + | protein_coding stop_codon
score phase gene_id transcript_id exon_number gene_name
<numeric> <integer> <character> <character> <numeric> <character>
[1] <NA> <NA> YDL248W YDL248W 1 COS7
[2] <NA> 0 YDL248W YDL248W 1 COS7
[3] <NA> 0 YDL248W YDL248W 1 COS7
[4] <NA> 0 YDL248W YDL248W 1 COS7
[5] <NA> <NA> YDL247W-A YDL247W-A 1 YDL247W-A
... ... ... ... ... ... ...
[27991] <NA> <NA> RPM1 RPM1 1 RPM1
[27992] <NA> <NA> Q0297 Q0297 1 Q0297
[27993] <NA> 0 Q0297 Q0297 1 Q0297
[27994] <NA> 0 Q0297 Q0297 1 Q0297
[27995] <NA> 0 Q0297 Q0297 1 Q0297
gene_biotype transcript_name exon_id protein_id
<character> <character> <character> <character>
[1] protein_coding COS7 YDL248W.1 <NA>
[2] protein_coding COS7 <NA> YDL248W
[3] protein_coding COS7 <NA> <NA>
[4] protein_coding COS7 <NA> <NA>
[5] protein_coding YDL247W-A YDL247W-A.1 <NA>
... ... ... ... ...
[27991] ncRNA RPM1 RPM1.1 <NA>
[27992] protein_coding Q0297 Q0297.1 <NA>
[27993] protein_coding Q0297 <NA> Q0297
[27994] protein_coding Q0297 <NA> <NA>
[27995] protein_coding Q0297 <NA> <NA>
-------
seqinfo: 17 sequences (1 circular) from EF4 genome
There were 50 or more warnings (use warnings() to see the first 50)
So depending on what you want, you can often get things from AnnotationHub() without having to 'roll your own'.
Just one small addition to Jim's excellent answer: you can then use makeTxDbFromGRanges() on the GRanges object you got from AnnotationHub to turn it into a TxDb object.
H.
Hi Herve, I tried your method too. Here is the output; please correct me if I am wrong.
makeTxDbFromGRanges() can only be used on a GRanges object that was obtained by importing a GFF or GTF file with rtracklayer::import(). That seems to be the case for 13 of the 15 resources returned by Jim's search, but not for the first 2 (AH7049, AH7057). With the 13 others, makeTxDbFromGRanges() should work, for example with AH8781 (Saccharomyces_cerevisiae.EF4.74.gtf).
For the resource you tried to fetch (AH7049), its sourceurl field is rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/sacCer2/database/ensGene, which seems to indicate that it was downloaded from the UCSC Genome Browser and turned into a GRanges object with the rtracklayer package, but it's not clear to me how exactly it was obtained. The problem is that the resulting GRanges object doesn't seem to have any useful information in it, so maybe the recipe that was used to generate it is broken; hard to tell.
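For the AH8781 case mentioned above, a minimal sketch of the conversion might look like the following. It assumes the record is still present in your AnnotationHub snapshot, and the helper name txdb_from_hub is made up for illustration. In newer Bioconductor releases the makeTxDbFrom*() constructors have moved to the txdbmaker package, so the namespace prefix may need changing.

```r
# Hypothetical helper: fetch a GTF-derived GRanges from AnnotationHub and
# convert it to a TxDb. Requires AnnotationHub and GenomicFeatures.
# Nothing is downloaded until the function is called.
txdb_from_hub <- function(id = "AH8781") {
  hub <- AnnotationHub::AnnotationHub()
  gr <- hub[[id]]                            # GRanges imported from the GTF file
  GenomicFeatures::makeTxDbFromGRanges(gr)   # assumption: use txdbmaker:: on newer releases
}

# Usage (needs network access):
# txdb <- txdb_from_hub("AH8781")
# txdb
```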
Anyway, if you want to import the ensGene track for sacCer2 as a TxDb object, an easy way is to use makeTxDbFromUCSC(). It will grab the data directly from UCSC. See ?makeTxDbFromUCSC for more information.
Cheers,
H.
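A sketch of the makeTxDbFromUCSC() route for the sacCer2 ensGene track could look like this. It assumes UCSC still serves that track for that genome, and the wrapper name txdb_from_ucsc is made up for illustration.

```r
# Hypothetical wrapper: build a TxDb straight from a UCSC track.
# Requires GenomicFeatures (on newer Bioconductor releases the constructor
# may live in the txdbmaker package instead).
txdb_from_ucsc <- function(genome = "sacCer2", tablename = "ensGene") {
  GenomicFeatures::makeTxDbFromUCSC(genome = genome, tablename = tablename)
}

# Usage (needs network access):
# txdb <- txdb_from_ucsc()
# GenomicFeatures::supportedUCSCtables("sacCer2")   # check which tables are available
```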
Hi Herve, thanks for suggesting other ways to solve the problem. I tried both of your methods and they worked fine. But I needed a TxDb made from BioMart, because genfeatures() works well with that.
OK. Then I'm not sure why you tried to make a TxDb from hub[["AH7049"]]...
Thanks James, it worked exactly as per my requirement.