Question

Getting errors with makeTxDbFromBiomart

1

Didi ▴ 10

@didi-10905

Last seen 2.2 years ago

Spain

Hi,

I'm trying to make a TxDb object from BioMart.

When running this command:

txdb <- makeTxDbFromBiomart (biomart="plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org")

I get this:

Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... Error in .Ensembl_getMySQLCoreDir(dataset) :
found 0 or more than 1 subdir for "athaliana_eg_gene" dataset at ftp://ftp.ensembl.org/pub/current_mysql/

Please can you tell me what I am doing wrong.

I get the same error when running makeTxDb package.

Thanks a lot.

genomicfeatures biomart maketxdbfrombiomart • 3.1k views

ADD COMMENT • link updated 7.3 years ago by Thomas Maurel ▴ 800 • written 7.3 years ago by Didi ▴ 10

score 0 · Answer 1 · 2017-01-25

0

Johannes Rainer ★ 2.0k

@johannes-rainer-6987

Last seen 23 days ago

Italy

The problem here is that the function tries to extract the sequence (chromosome) lengths from the MySQL database dumps from Ensembl, but only ensemblgenomes provides these. Alternatively, you could use ensembldb to create an EnsDb database for this species (it has the same functionality as a TxDb). The code below assumes that you have downloaded the GTF file from ftp.ensemblgenomes.org/pub/plants/release-34/gtf

library(ensembldb)
dbFile <- ensDbFromGtf("Arabidopsis_thaliana.TAIR10.34.gtf.gz")
## Load the database.
edb <- EnsDb(dbFile)

You will also see an error message saying that sequence lengths cannot be retrieved from Ensembl, but they are then retrieved from ensemblgenomes, as you can see if you look at the seqinfo (the function looks for the corresponding files first on the Ensembl FTP server, then on the ensemblgenomes FTP server):

seqinfo(edb)
Seqinfo object with 7 sequences from TAIR10 genome:
  seqnames seqlengths isCircular genome
  1          30427671         NA TAIR10
  5          26975502         NA TAIR10
  3          23459830         NA TAIR10
  2          19698289         NA TAIR10
  4          18585056         NA TAIR10
  Mt           366924         NA TAIR10
  Pt           154478         NA TAIR10

Hope this helps.

cheers, jo

ADD COMMENT • link 7.3 years ago Johannes Rainer ★ 2.0k

0

Thanks a lot, really helpful.

Is it normal to get these warnings:

Warning messages:
1: In readLines(gtf, n = 10) : line 1 appears to contain an embedded nul
2: In readLines(gtf, n = 10) : line 3 appears to contain an embedded nul
3: In readLines(gtf, n = 10) : line 5 appears to contain an embedded nul
4: In grep(tmp, pattern = "^#") : input string 1 is invalid in this locale
5: In grep(tmp, pattern = "^#") : input string 2 is invalid in this locale
6: In grep(tmp, pattern = "^#") : input string 3 is invalid in this locale
7: In grep(tmp, pattern = "^#") : input string 4 is invalid in this locale
8: In grep(tmp, pattern = "^#") : input string 5 is invalid in this locale
9: In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism, :
I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!

ADD REPLY • link 7.3 years ago Didi ▴ 10

0

These warnings are strange. The one about entrezid is fine: Entrez IDs are not provided in the GTF file, so the database column will be empty. The messages about failing to fetch the sequence lengths should also be OK: the function first tries to get them from Ensembl and fails, but should then be able to fetch them from ensemblgenomes. Just check afterwards with seqinfo whether you've got sequence lengths (I did, using ensembldb from BioC 3.4).
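One way to do that check, as a minimal sketch: `has_seqlengths` is a hypothetical helper (it is not part of ensembldb), and the lengths below are simply copied from the seqinfo(edb) output shown earlier in this thread so that the snippet runs without a database.

```r
# Hypothetical helper (not an ensembldb function): TRUE if there is at least
# one sequence and none of the sequence lengths is NA.
has_seqlengths <- function(lens) {
    length(lens) > 0 && !anyNA(lens)
}

# Lengths copied from the seqinfo(edb) output shown earlier in this thread.
lens <- c("1" = 30427671, "5" = 26975502, "3" = 23459830, "2" = 19698289,
          "4" = 18585056, Mt = 366924, Pt = 154478)
has_seqlengths(lens)

# With a real database (assuming ensembldb is loaded and `edb` exists):
# has_seqlengths(seqlengths(seqinfo(edb)))
```

If this returns FALSE for your own database, the lengths could not be fetched from either FTP server.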

Could you provide the output of sessionInfo()? And what exactly are you doing? Did you download the GTF file locally and use the ensDbFromGtf function?

ADD REPLY • link 7.3 years ago Johannes Rainer ★ 2.0k

0

Hi,

I used this script.

dbFile <- ensDbFromGtf("ftp://ftp.ensemblgenomes.org/pub/plants/release-34/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.34.gtf.gz")
## Load the database.
edb <- EnsDb(dbFile)
seqinfo(edb)

And I've got this:

Importing GTF file...trying URL 'ftp://ftp.ensemblgenomes.org/pub/plants/release-34/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.34.gtf.gz'
downloaded 9.5 MB

OK
Processing metadata...OK
Processing genes...
Attribute availability:
o gene_id... OK
o gene_name... OK
o entrezid... Nope
o gene_biotype... OK
OK
Processing transcripts...
Attribute availability:
o transcript_id... OK
o gene_id... OK
o transcript_biotype... OK
OK
Processing exons...OK
Processing chromosomes...Fetch seqlengths from ensembl, dataset athaliana_gene_ensembl version 34...Error in function (type, msg, asError = TRUE) :
Server denied you to change to the given directory

Unable to get sequence lengths from Ensembl for dataset: athaliana_gene_ensembl. Error was:

OK
OK
Generating index...OK
-------------
Verifying validity of the information in the database:
Checking transcripts...OK
Checking exons...OK
Warning messages:
1: In readLines(gtf, n = 10) : line 1 appears to contain an embedded nul
2: In readLines(gtf, n = 10) : line 5 appears to contain an embedded nul
3: In readLines(gtf, n = 10) : line 6 appears to contain an embedded nul
4: In readLines(gtf, n = 10) : line 7 appears to contain an embedded nul
5: In readLines(gtf, n = 10) : line 8 appears to contain an embedded nul
6: In readLines(gtf, n = 10) : line 9 appears to contain an embedded nul
7: In readLines(gtf, n = 10) : line 10 appears to contain an embedded nul
8: In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism, :
I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!

I want to make a TxDb object.

Thanks a lot.

D.

ADD REPLY • link 7.2 years ago Didi ▴ 10

0

I proposed this as an alternative to the TxDb object. EnsDb objects provide the same annotations, the same methods and the same functionality, but are specifically designed for Ensembl annotations. There is no way to convert an EnsDb to a TxDb, but you should be able to use the EnsDb as if it were a TxDb.
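For illustration, a minimal sketch of the usual TxDb-style accessors applied to an EnsDb. It assumes that ensembldb is installed and that the GTF file named below has been downloaded into the working directory; the guard simply skips the code otherwise, and the variable names are only examples.

```r
# Local copy of the GTF file (assumed to be in the working directory).
gtf <- "Arabidopsis_thaliana.TAIR10.34.gtf.gz"

if (requireNamespace("ensembldb", quietly = TRUE) && file.exists(gtf)) {
    library(ensembldb)
    ## Build the SQLite database from the GTF and load it.
    edb <- EnsDb(ensDbFromGtf(gtf))
    ## Same accessors as for a TxDb:
    gns <- genes(edb)                 # genes as a GRanges
    txs <- transcripts(edb)           # transcripts as a GRanges
    exs <- exonsBy(edb, by = "tx")    # exons grouped by transcript
}
```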

ADD REPLY • link 7.2 years ago Johannes Rainer ★ 2.0k

0

Yes, it works, thanks. Is there a way to add entrezid?

Thanks a lot.

D.

ADD REPLY • link 7.2 years ago Didi ▴ 10

0

I can provide you with an EnsDb package for A. thaliana for ensemblgenomes-34 (corresponds to Ensembl 87), built from the MySQL database dumps using the Ensembl Perl API. But I checked, and there is no entrezid available there either. It seems NCBI does not provide annotations for plants?

ADD REPLY • link 7.2 years ago Johannes Rainer ★ 2.0k

score 0 · Answer 2 · 2017-01-25

0

Thomas Maurel ▴ 800

@thomas-maurel-5295

Last seen 14 months ago

United Kingdom

Hello,

I am afraid that only the MySQL files for the 69 vertebrate species and the 16 mouse strains can be found at this location: ftp://ftp.ensembl.org/pub/current_mysql/.

The other ensembl divisions can be found on their respective FTP site spaces:

Plants: ftp://ftp.ensemblgenomes.org/pub/release-34/plants/
Metazoa: ftp://ftp.ensemblgenomes.org/pub/release-34/metazoa/
Fungi: ftp://ftp.ensemblgenomes.org/pub/release-34/fungi/
Protists: ftp://ftp.ensemblgenomes.org/pub/release-34/protists/
Bacteria: ftp://ftp.ensemblgenomes.org/pub/release-34/bacteria/

I don't know if makeTxDbFromBiomart supports the 5 Ensembl divisions.

Kind Regards,

Thomas

ADD COMMENT • link 7.3 years ago Thomas Maurel ▴ 800

0

makeTxDbFromBiomart() uses biomaRt under the hood so, yes, it supports these sister Ensembl sites. You can specify a different host with the 'host' argument; see also the 'biomart' and 'dataset' arguments on the ?makeTxDbFromBiomart man page.

> listMarts(host="plants.ensembl.org")
            biomart              version
1       plants_mart           Plant Mart
2 plants_variations Plant Variation Mart
> listMarts(host="fungi.ensembl.org")
            biomart               version
1       fungal_mart           Fungal Mart
2 fungal_variations Fungal Variation Mart

Valerie

ADD REPLY • link 7.2 years ago Valerie Obenchain ★ 6.8k

2

The OP did use the correct host and mart, but the underlying code is insensitive to the choice made. In other words, the base URI for the FTP site is set in Ensembl.utils.R of GenomicFeatures as

.ENSEMBL.PUB_FTP_URL <- "ftp://ftp.ensembl.org/pub/"

and in the function ftp_url_to_Ensembl_mysql, this is used to create the FTP URI:

ftp_url_to_Ensembl_mysql <- function (release = NA, use.grch37 = FALSE)
{
    if (is.na(release)) {
        if (use.grch37)
            pub_subdir <- "current/mysql"
        else pub_subdir <- "current_mysql"
    }
    else {
        pub_subdir <- paste0("release-", release, "/mysql")
    }
    if (use.grch37)
        pub_ftp_url <- .ENSEMBLGRCh37.PUB_FTP_URL
    else pub_ftp_url <- .ENSEMBL.PUB_FTP_URL
    paste0(pub_ftp_url, pub_subdir, "/")
}

So regardless of the host and mart chosen, you get the 'main' ftp site.
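To make that concrete, here is a simplified stand-alone sketch of the URL construction. `mysql_url_sketch` is a hypothetical name and the use.grch37 branch is left out, so this is not the package's actual function, only an illustration of why the host never enters into it.

```r
# Constant as quoted above from Ensembl.utils.R of GenomicFeatures.
.ENSEMBL.PUB_FTP_URL <- "ftp://ftp.ensembl.org/pub/"

# Simplified copy of ftp_url_to_Ensembl_mysql() without the use.grch37 branch.
# Note that there is no 'host' argument: the base URL is fixed.
mysql_url_sketch <- function(release = NA) {
    if (is.na(release)) {
        pub_subdir <- "current_mysql"
    } else {
        pub_subdir <- paste0("release-", release, "/mysql")
    }
    paste0(.ENSEMBL.PUB_FTP_URL, pub_subdir, "/")
}

mysql_url_sketch()     # "ftp://ftp.ensembl.org/pub/current_mysql/"
mysql_url_sketch(87)   # "ftp://ftp.ensembl.org/pub/release-87/mysql/"
```

The first result is exactly the location named in the OP's error message.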

ADD REPLY • link 7.2 years ago James W. MacDonald 65k

0

Thanks a lot. So what exactly should I do to make makeTxDbFromBiomart work with host="plants.ensembl.org"?

ADD REPLY • link 7.2 years ago Didi ▴ 10

0

Was this ever fixed?

ADD REPLY • link 6.7 years ago maltethodberg ▴ 170

0

Yes. From ?useMart:

ensemblRedirect: By default when you access Ensembl BioMart it will
          redirect you to your local mirror, even if you have set a
          region specific mirror in the 'host' argument.  By setting
          this argument to 'FALSE' you can override this behaviour and
          force access to your specified 'host'. Defaults to 'TRUE' to
          prevent overwhelming the main Ensembl site. If you are
          accessing a BioMart instance other than Ensembl this should
          have no effect.

ADD REPLY • link 6.7 years ago James W. MacDonald 65k