Getting errors with makeTxDbFromBiomart
Didi ▴ 10
@didi-10905

Hi,

I'm trying to make a TxDb object from BioMart.

When running this command:

library(GenomicFeatures)
txdb <- makeTxDbFromBiomart(biomart="plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org")

I get this:

Prepare the 'metadata' data frame ... Error in .Ensembl_getMySQLCoreDir(dataset) :
found 0 or more than 1 subdir for "athaliana_eg_gene" dataset at ftp://ftp.ensembl.org/pub/current_mysql/

Could you please tell me what I am doing wrong?

I get the same error when running makeTxDb package.

Thanks a lot.

genomicfeatures biomart maketxdbfrombiomart
Johannes Rainer ★ 1.9k
@johannes-rainer-6987
Italy

The problem here is that the function tries to extract the sequence (chromosome) lengths from the MySQL database dumps on the main Ensembl FTP site, but for this species only ensemblgenomes provides them. Alternatively, you could use ensembldb to create an EnsDb database for this species (it has the same functionality as a TxDb). The code below assumes that you have downloaded the GTF file from ftp.ensemblgenomes.org/pub/plants/release-34/gtf:

library(ensembldb)
dbFile <- ensDbFromGtf("Arabidopsis_thaliana.TAIR10.34.gtf.gz")
edb <- EnsDb(dbFile)

You will also get an error message saying that the sequence lengths cannot be retrieved from Ensembl. That is expected: the function looks for the corresponding files first on the Ensembl FTP server and then on the ensemblgenomes FTP server, and the lengths are fetched from the latter. You can see that they are present if you look at the seqinfo:

seqinfo(edb)
Seqinfo object with 7 sequences from TAIR10 genome:
seqnames seqlengths isCircular genome
1          30427671         NA TAIR10
5          26975502         NA TAIR10
3          23459830         NA TAIR10
2          19698289         NA TAIR10
4          18585056         NA TAIR10
Mt           366924         NA TAIR10
Pt           154478         NA TAIR10

Hope this helps.

cheers, jo


Is it normal to get these warnings:

Warning messages:
1: In readLines(gtf, n = 10) : line 1 appears to contain an embedded nul
2: In readLines(gtf, n = 10) : line 3 appears to contain an embedded nul
3: In readLines(gtf, n = 10) : line 5 appears to contain an embedded nul
4: In grep(tmp, pattern = "^#") : input string 1 is invalid in this locale
5: In grep(tmp, pattern = "^#") : input string 2 is invalid in this locale
6: In grep(tmp, pattern = "^#") : input string 3 is invalid in this locale
7: In grep(tmp, pattern = "^#") : input string 4 is invalid in this locale
8: In grep(tmp, pattern = "^#") : input string 5 is invalid in this locale
9: In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :
I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!


The readLines and grep warnings are strange. The one about entrezid is fine: Entrez IDs are not provided in the GTF file, so the database column will be empty. Warnings about not fetching the sequence lengths should also be OK: the function first tries to get them from Ensembl and fails, but should then be able to fetch them from ensemblgenomes. Just check afterwards with seqinfo whether you have sequence lengths (I did, using ensembldb from BioC 3.4).
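
A minimal sketch of that check in base R. The helper missing_seqlengths is hypothetical (it is not part of ensembldb), and the lengths are copied from the seqinfo(edb) output earlier in this thread. With a real database the input would presumably come from seqlengths(seqinfo(edb)).

```r
# Hypothetical helper (not part of ensembldb): return the names of the
# sequences that have no length. 'sl' is a named vector of lengths.
missing_seqlengths <- function(sl) {
    names(sl)[is.na(sl)]
}

# Lengths copied from the seqinfo(edb) output shown earlier in the thread.
sl <- c("1" = 30427671L, "5" = 26975502L, "3" = 23459830L,
        "2" = 19698289L, "4" = 18585056L, Mt = 366924L, Pt = 154478L)

missing_seqlengths(sl)   # character(0): every sequence has a length

# With a real database the input would be (assumption, requires ensembldb):
#   sl <- seqlengths(seqinfo(edb))
```

An empty result means the fallback to ensemblgenomes worked; any names returned are the sequences whose lengths could not be fetched.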

Could you provide the output of sessionInfo()? And what exactly are you doing? Did you download the GTF file locally and use the ensDbFromGtf function?


Hi,

I used this script.

dbFile <- ensDbFromGtf("ftp://ftp.ensemblgenomes.org/pub/plants/release-34/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.34.gtf.gz")
edb <- EnsDb(dbFile)
seqinfo(edb)

And I've got this:

Importing GTF file...trying URL 'ftp://ftp.ensemblgenomes.org/pub/plants/release-34/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.34.gtf.gz'

OK
Processing genes...
Attribute availability:
o gene_id... OK
o gene_name... OK
o entrezid... Nope
o gene_biotype... OK
OK
Processing transcripts...
Attribute availability:
o transcript_id... OK
o gene_id... OK
o transcript_biotype... OK
OK
Processing exons...OK
Processing chromosomes...Fetch seqlengths from ensembl, dataset athaliana_gene_ensembl version 34...Error in function (type, msg, asError = TRUE)  :
Server denied you to change to the given directory

Unable to get sequence lengths from Ensembl for dataset: athaliana_gene_ensembl. Error was:

OK
OK
Generating index...OK
-------------
Verifying validity of the information in the database:
Checking transcripts...OK
Checking exons...OK
Warning messages:
1: In readLines(gtf, n = 10) : line 1 appears to contain an embedded nul
2: In readLines(gtf, n = 10) : line 5 appears to contain an embedded nul
3: In readLines(gtf, n = 10) : line 6 appears to contain an embedded nul
4: In readLines(gtf, n = 10) : line 7 appears to contain an embedded nul
5: In readLines(gtf, n = 10) : line 8 appears to contain an embedded nul
6: In readLines(gtf, n = 10) : line 9 appears to contain an embedded nul
7: In readLines(gtf, n = 10) : line 10 appears to contain an embedded nul
8: In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :
I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!

I want to make a TxDb object.

Thanks a lot.

D.


I proposed this as an alternative to the TxDb object. EnsDb objects provide the same annotations, the same methods and the same functionality, but are specifically designed for Ensembl annotations. There is no way to convert an EnsDb to a TxDb, but you should be able to use the EnsDb as if it were a TxDb.
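
For illustration, a short sketch of using the EnsDb through the usual TxDb-style accessors. It assumes that ensembldb is installed and that the GTF file named earlier in the thread has been downloaded to the working directory; the variable names are arbitrary.

```r
library(ensembldb)

# Assumption: the GTF file was downloaded to the working directory.
dbFile <- ensDbFromGtf("Arabidopsis_thaliana.TAIR10.34.gtf.gz")
edb <- EnsDb(dbFile)

# The same accessors one would call on a TxDb.
gn <- genes(edb)                             # GRanges, one range per gene
tx <- transcripts(edb)                       # GRanges, one range per transcript
ex <- exonsBy(edb, by = "tx")                # GRangesList, exons grouped by transcript
txByGene <- transcriptsBy(edb, by = "gene")  # GRangesList, transcripts grouped by gene
```

Code written against these accessors should need little or no change when it is handed an EnsDb instead of a TxDb.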


Yes, it works, thanks. Is there a way to add entrezid?

Thanks a lot.

D.


I can provide you with an EnsDb package for A. thaliana for ensemblgenomes-34 (corresponds to Ensembl 87), built from the MySQL database dumps using the Ensembl Perl API. But I checked, and there is no entrezid available there either. It seems NCBI does not provide annotations for plants?

Thomas Maurel ▴ 790
@thomas-maurel-5295
United Kingdom

Hello,

I am afraid that only the MySQL files for the 69 vertebrate species and the 16 mouse strains can be found at this location: ftp://ftp.ensembl.org/pub/current_mysql/.

The other ensembl divisions can be found on their respective FTP site spaces:

I don't know if makeTxDbFromBiomart supports the 5 Ensembl divisions.

Kind Regards,

Thomas


makeTxDbFromBiomart() uses biomaRt under the hood so, yes, it supports these sister Ensembl sites. You can specify a different host with the 'host' argument; see also the 'biomart' and 'dataset' arguments on the ?makeTxDbFromBiomart man page.

> listMarts(host="plants.ensembl.org")
            biomart              version
1       plants_mart           Plant Mart
2 plants_variations Plant Variation Mart
> listMarts(host="fungi.ensembl.org")
            biomart               version
1       fungal_mart           Fungal Mart
2 fungal_variations Fungal Variation Mart

Valerie


The OP did use the correct host and mart, but the underlying code is insensitive to the choice made. In other words, the base URI for the FTP site is set in Ensembl.utils.R of GenomicFeatures as

.ENSEMBL.PUB_FTP_URL <- "ftp://ftp.ensembl.org/pub/"

and in the function ftp_url_to_Ensembl_mysql, this is used to create the FTP URI:

ftp_url_to_Ensembl_mysql <- function (release = NA, use.grch37 = FALSE)
{
    if (is.na(release)) {
        if (use.grch37)
            pub_subdir <- "current/mysql"
        else pub_subdir <- "current_mysql"
    }
    else {
        pub_subdir <- paste0("release-", release, "/mysql")
    }
    if (use.grch37)
        pub_ftp_url <- .ENSEMBLGRCh37.PUB_FTP_URL
    else pub_ftp_url <- .ENSEMBL.PUB_FTP_URL
    paste0(pub_ftp_url, pub_subdir, "/")
}

So regardless of the host and mart chosen, you get the 'main' ftp site.
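
To make this concrete, here is a condensed restatement of the non-GRCh37 branch of that function in base R. The name mysql_url is made up for this sketch, and it is not the GenomicFeatures code itself. Note that neither the host nor the mart appears anywhere in the signature.

```r
# Constant quoted from Ensembl.utils.R above.
.ENSEMBL.PUB_FTP_URL <- "ftp://ftp.ensembl.org/pub/"

# Hypothetical, condensed version of ftp_url_to_Ensembl_mysql()
# covering only the use.grch37 = FALSE case.
mysql_url <- function(release = NA) {
    pub_subdir <- if (is.na(release)) {
        "current_mysql"
    } else {
        paste0("release-", release, "/mysql")
    }
    paste0(.ENSEMBL.PUB_FTP_URL, pub_subdir, "/")
}

mysql_url()     # "ftp://ftp.ensembl.org/pub/current_mysql/"
mysql_url(87)   # "ftp://ftp.ensembl.org/pub/release-87/mysql/"
```

The first result is the same location that appears in the OP's error message, which is consistent with the plants host never reaching this part of the code.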


Thanks a lot. So what exactly do I need to do to make makeTxDbFromBiomart work with host="plants.ensembl.org"?


Yes. From ?useMart:

ensemblRedirect: By default when you access Ensembl BioMart it will
redirect you to your local mirror, even if you have set a
region specific mirror in the 'host' argument.  By setting
this argument to 'FALSE' you can override this behaviour and
prevent overwhelming the main Ensembl site. If you are
accessing a BioMart instance other than Ensembl this should
have no effect.