Question: Getting errors with makeTxDbFromBiomart
Didi wrote, 2.8 years ago:

Hi,

I'm trying to make a TxDb object from BioMart.

When running this command:

txdb <- makeTxDbFromBiomart (biomart="plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org")

I get this:

Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... Error in .Ensembl_getMySQLCoreDir(dataset) : 
  found 0 or more than 1 subdir for "athaliana_eg_gene" dataset at ftp://ftp.ensembl.org/pub/current_mysql/

Can you please tell me what I am doing wrong?

I get the same error when running makeTxDb package.

Thanks a lot.

Question written 2.8 years ago by Didi • modified 2.8 years ago by Thomas Maurel
Answer: Getting errors with makeTxDbFromBiomart
Johannes Rainer (Italy) wrote, 2.8 years ago:

The problem here is that the function tries to extract the sequence (chromosome) lengths from the MySQL database dumps from Ensembl, but only ensemblgenomes provides these. Alternatively, you could use ensembldb to create an EnsDb database for this species (it has the same functionality as a TxDb). The code below assumes that you have downloaded the GTF file from ftp.ensemblgenomes.org/pub/plants/release-34/gtf:

library(ensembldb)
dbFile <- ensDbFromGtf("Arabidopsis_thaliana.TAIR10.34.gtf.gz")
## Load the database.
edb <- EnsDb(dbFile)
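
For completeness, the download step that the code above assumes could be sketched as follows. This is only a sketch: the helper name `fetch_gtf` is hypothetical (it is not part of ensembldb), and the URL is the release-34 location quoted further down in this thread. `download.file()` is base R, and `mode = "wb"` keeps the gzipped file intact on Windows.

```r
## Hypothetical helper (not part of ensembldb): download a GTF file once
## and return the local path, so that ensDbFromGtf() can read a local file.
fetch_gtf <- function(url, destfile = basename(url)) {
    if (!file.exists(destfile)) {
        ## Binary mode avoids corrupting the gzipped file on Windows.
        status <- download.file(url, destfile = destfile, mode = "wb")
        if (status != 0)
            stop("download failed for ", url)
    }
    destfile
}

## URL as quoted later in this thread (ensemblgenomes release 34).
gtf_url <- paste0(
    "ftp://ftp.ensemblgenomes.org/pub/plants/release-34/gtf/",
    "arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.34.gtf.gz")
```

With this in place, `gtf <- fetch_gtf(gtf_url)` followed by `ensDbFromGtf(gtf)` would stand in for the hard-coded file name used above.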

 

You will also get an error message saying that sequence lengths cannot be retrieved from Ensembl. They are retrieved from ensemblgenomes instead, as you can see if you look at the seqinfo (the function tries to find the corresponding files first on the Ensembl FTP server, then on the ensemblgenomes FTP server):

seqinfo(edb)
Seqinfo object with 7 sequences from TAIR10 genome:
  seqnames seqlengths isCircular genome
  1          30427671         NA TAIR10
  5          26975502         NA TAIR10
  3          23459830         NA TAIR10
  2          19698289         NA TAIR10
  4          18585056         NA TAIR10
  Mt           366924         NA TAIR10
  Pt           154478         NA TAIR10

 

Hope this helps.

cheers, jo

Answer written 2.8 years ago by Johannes Rainer

Thanks a lot, really helpful.

Is it normal to get these warnings?

Warning messages:
1: In readLines(gtf, n = 10) : line 1 appears to contain an embedded nul
2: In readLines(gtf, n = 10) : line 3 appears to contain an embedded nul
3: In readLines(gtf, n = 10) : line 5 appears to contain an embedded nul
4: In grep(tmp, pattern = "^#") : input string 1 is invalid in this locale
5: In grep(tmp, pattern = "^#") : input string 2 is invalid in this locale
6: In grep(tmp, pattern = "^#") : input string 3 is invalid in this locale
7: In grep(tmp, pattern = "^#") : input string 4 is invalid in this locale
8: In grep(tmp, pattern = "^#") : input string 5 is invalid in this locale
9: In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :
   I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!

Reply written 2.8 years ago by Didi • modified 2.8 years ago

These warnings are strange. The one about entrezid is fine: Entrez IDs are not provided in the GTF file, so the database column will be empty. The warnings about not fetching the sequence lengths should also be OK: the function first tries to get them from Ensembl and fails, but should then be able to fetch them from ensemblgenomes. Just check afterwards with seqinfo() whether you got sequence lengths (I did, using ensembldb from BioC 3.4).

Could you provide the output of sessionInfo()? And what exactly are you doing? Did you download the GTF file locally and use the ensDbFromGtf function?

Reply written 2.8 years ago by Johannes Rainer

Hi, 

I used this script.

dbFile <- ensDbFromGtf("ftp://ftp.ensemblgenomes.org/pub/plants/release-34/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.34.gtf.gz")
## Load the database.
edb <- EnsDb(dbFile)
seqinfo(edb)

And I've got this:

Importing GTF file...trying URL 'ftp://ftp.ensemblgenomes.org/pub/plants/release-34/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.34.gtf.gz'
downloaded 9.5 MB

OK
Processing metadata...OK
Processing genes...
 Attribute availability:
  o gene_id... OK
  o gene_name... OK
  o entrezid... Nope
  o gene_biotype... OK
OK
Processing transcripts...
 Attribute availability:
  o transcript_id... OK
  o gene_id... OK
  o transcript_biotype... OK
OK
Processing exons...OK
Processing chromosomes...Fetch seqlengths from ensembl, dataset athaliana_gene_ensembl version 34...Error in function (type, msg, asError = TRUE)  : 
  Server denied you to change to the given directory

Unable to get sequence lengths from Ensembl for dataset: athaliana_gene_ensembl. Error was: 

OK
OK
Generating index...OK
  -------------
Verifying validity of the information in the database:
Checking transcripts...OK
Checking exons...OK
Warning messages:
1: In readLines(gtf, n = 10) : line 1 appears to contain an embedded nul
2: In readLines(gtf, n = 10) : line 5 appears to contain an embedded nul
3: In readLines(gtf, n = 10) : line 6 appears to contain an embedded nul
4: In readLines(gtf, n = 10) : line 7 appears to contain an embedded nul
5: In readLines(gtf, n = 10) : line 8 appears to contain an embedded nul
6: In readLines(gtf, n = 10) : line 9 appears to contain an embedded nul
7: In readLines(gtf, n = 10) : line 10 appears to contain an embedded nul
8: In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :
   I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!

I want to make a TxDb object.

Thanks a lot.

D.

Reply written 2.8 years ago by Didi

I proposed this as an alternative to the TxDb object. EnsDb objects provide the same annotations, the same methods and the same functionality, but are specifically designed for Ensembl annotations. There is no way to convert an EnsDb to a TxDb, but you should be able to use the EnsDb as if it were a TxDb.
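
As a rough illustration of using the EnsDb where a TxDb would normally go: the usual accessors genes(), transcripts() and exonsBy() have methods for EnsDb objects in ensembldb. The sketch below assumes that `edb` is the object created with EnsDb(dbFile) earlier in this thread; the helper name `annotation_overview` is hypothetical and only bundles the calls.

```r
## Sketch only: `edb` is assumed to be an EnsDb object, e.g. from EnsDb(dbFile).
## annotation_overview is a hypothetical helper, not an ensembldb function.
annotation_overview <- function(edb) {
    list(
        ## GRanges with one range per gene
        genes = ensembldb::genes(edb),
        ## GRanges with one range per transcript
        transcripts = ensembldb::transcripts(edb),
        ## GRangesList with the exons grouped by transcript
        exons_by_tx = ensembldb::exonsBy(edb, by = "tx")
    )
}
```

Calling `annotation_overview(edb)` would return the three objects in a list; each call is the same one you would make on a TxDb.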

Reply written 2.8 years ago by Johannes Rainer

Yes, it works, thanks. Is there a way to add entrezid?

Thanks a lot.

D.

Reply written 2.8 years ago by Didi • modified 2.8 years ago

I can provide you with an EnsDb package for A. thaliana for ensemblgenomes-34 (corresponds to Ensembl 87), built from the MySQL database dumps using the Ensembl Perl API. But I checked, and there is no entrezid available there either. It seems NCBI does not provide annotations for plants?

Reply written 2.8 years ago by Johannes Rainer
Answer: Getting errors with makeTxDbFromBiomart
Thomas Maurel (United Kingdom) wrote, 2.8 years ago:

Hello,

I am afraid that only the MySQL files for the 69 vertebrate species and the 16 mouse strains can be found at this location: ftp://ftp.ensembl.org/pub/current_mysql/.

The other Ensembl divisions can be found on their respective FTP sites:

  1. Plants: ftp://ftp.ensemblgenomes.org/pub/release-34/plants/
  2. Metazoa: ftp://ftp.ensemblgenomes.org/pub/release-34/metazoa/
  3. Fungi: ftp://ftp.ensemblgenomes.org/pub/release-34/fungi/
  4. Protists: ftp://ftp.ensemblgenomes.org/pub/release-34/protists/
  5. Bacteria: ftp://ftp.ensemblgenomes.org/pub/release-34/bacteria/

I don't know if makeTxDbFromBiomart supports the 5 Ensembl divisions.

Kind Regards,

Thomas

Answer written 2.8 years ago by Thomas Maurel

makeTxDbFromBiomart() uses biomaRt under the hood so, yes, it supports these sister ensembl sites. You can specify different hosts with the 'host' argument, also see the 'biomart' and 'dataset' arguments on the ?makeTxDbFromBiomart man page.

> listMarts(host="plants.ensembl.org")
            biomart              version
1       plants_mart           Plant Mart
2 plants_variations Plant Variation Mart
> listMarts(host="fungi.ensembl.org")
            biomart               version
1       fungal_mart           Fungal Mart
2 fungal_variations Fungal Variation Mart

Valerie

Reply written 2.8 years ago by Valerie Obenchain

The OP did use the correct host and mart, but the underlying code is insensitive to the choice made. In other words, the base URI for the FTP site is set in Ensembl.utils.R of GenomicFeatures as

.ENSEMBL.PUB_FTP_URL <- "ftp://ftp.ensembl.org/pub/"

and in the function ftp_url_to_Ensembl_mysql, this is used to create the FTP URI:

ftp_url_to_Ensembl_mysql <- function (release = NA, use.grch37 = FALSE)
{
    if (is.na(release)) {
        if (use.grch37)
            pub_subdir <- "current/mysql"
        else pub_subdir <- "current_mysql"
    }
    else {
        pub_subdir <- paste0("release-", release, "/mysql")
    }
    if (use.grch37)
        pub_ftp_url <- .ENSEMBLGRCh37.PUB_FTP_URL
    else pub_ftp_url <- .ENSEMBL.PUB_FTP_URL
    paste0(pub_ftp_url, pub_subdir, "/")
}

So regardless of the host and mart chosen, you get the 'main' ftp site.
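
To make this concrete, here is a simplified re-statement of the logic quoted above, with the GRCh37 branch left out. The name `build_mysql_url` and its `host` argument are hypothetical; `host` is accepted only to show that it never enters the result.

```r
## Simplified re-statement of the quoted logic (GRCh37 branch omitted).
## build_mysql_url is a hypothetical name, not a GenomicFeatures function.
ensembl_pub_ftp <- "ftp://ftp.ensembl.org/pub/"

build_mysql_url <- function(release = NA, host = NULL) {
    ## `host` is deliberately unused: the base URL is a fixed constant.
    if (is.na(release)) {
        pub_subdir <- "current_mysql"
    } else {
        pub_subdir <- paste0("release-", release, "/mysql")
    }
    paste0(ensembl_pub_ftp, pub_subdir, "/")
}

build_mysql_url(host = "plants.ensembl.org")
## [1] "ftp://ftp.ensembl.org/pub/current_mysql/"
build_mysql_url(release = 34, host = "plants.ensembl.org")
## [1] "ftp://ftp.ensembl.org/pub/release-34/mysql/"
```

The first result is the same location that appears in the OP's error message.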

Reply written 2.8 years ago by James W. MacDonald

Thanks a lot. So what exactly do I need to do to make makeTxDbFromBiomart work with host="plants.ensembl.org"?

Reply written 2.8 years ago by Didi

Was this ever fixed?

Reply written 2.3 years ago by maltethodberg

Yes. From ?useMart:

ensemblRedirect: By default when you access Ensembl BioMart it will
          redirect you to your local mirror, even if you have set a
          region specific mirror in the 'host' argument.  By setting
          this argument to 'FALSE' you can override this behaviour and
          force access to your specified 'host'. Defaults to 'TRUE' to
          prevent overwhelming the main Ensembl site. If you are
          accessing a BioMart instance other than Ensembl this should
          have no effect.

 

Reply written 2.3 years ago by James W. MacDonald