I a trying hard to make a proper txdb object that I can ultimately use with gDNAx
package. I raised a separate issue here but I have realized that it's not gDNAx's problem but is actually about how the txdb object is created. I have downloaded the FASTA file and GTF file from ToxoDB from here and I have been trying hard to make a proper txdb object on which when I run seqlevelsStyle
and genomeStyles
should not throw an error. But it has proved to be a difficult task. I am not sure anymore what's the right way to do this so I am choosing to ask it here.
## Read files
gtf_file <- "ToxoDB-67_TgondiiME49.gtf"
fasta_file <- "ToxoDB-67_TgondiiME49_Genome.fasta"
## Making a txdb object
gff <- makeTxDbFromGFF(gtf_file,format = "gtf",organism = "Toxoplasma gondii",taxonomyId=508771, dataSource = "ToxoDB release 67")
## Maybe the chromosome information should be added
library(GenomicFeatures)
library(Biostrings)
library(Rsamtools)
fa <- FaFile(fasta_file)
fa_seqinfo <- seqinfo(fa)
seqinfo(gff) <- seqinfo(fa) ## Doesn't work
pruned_fa_seqinfo <- keepSeqlevels(fa_seqinfo, seqlevels(gff)) ## because the order of chromosome in GFF and FASTA file doesn't match. So to match that.
#seqinfo(gff) <- pruned_fa_seqinfo ## Doesn't work so let's rerun the makeTxDbFromGFF with this object pruned_fa_seqinfo
identical(seqlevels(gff), seqlevels(pruned_fa_seqinfo))
## now using to recreate toxodb object with chromosome information
gff <- makeTxDbFromGFF(gtf_file,format = "gtf",organism = "Toxoplasma gondii",taxonomyId=508771, dataSource = "ToxoDB release 67",chrominfo = pruned_fa_seqinfo)
seqlevelsStyle(gff)
Error in seqlevelsStyle(seqlevels) :
The style does not have a compatible entry for the species supported by Seqname. Please see
genomeStyles() for supported species/style
genomeStyles(gff)
Error in strsplit(organism, "_| ") : non-character argument
seqlevelsStyle(gff) <- "NCBI"
Error in .replace_seqlevels_style(x_seqlevels, value) :
found no sequence renaming map compatible with seqname style "NCBI" for this object
seqinfo(gff)
Seqinfo object with 436 sequences from an unspecified genome:
seqnames seqlengths isCircular genome
KE138841 35372 <NA> <NA>
KE138851 13341 <NA> <NA>
KE138856 1030 <NA> <NA>
KE138857 1121 <NA> <NA>
KE138861 3123 <NA> <NA>
... ... ... ...
TGME49_chrVIIa 4541629 <NA> <NA>
TGME49_chrVIIb 5069724 <NA> <NA>
TGME49_chrX 7486190 <NA> <NA>
TGME49_chrXI 6623461 <NA> <NA>
TGME49_chrXII 7094428 <NA> <NA>
Thanks Hervé! Very helpful. However, I believe that
keepStandardChromosomes
gets called bygDNAx
, and it appears not to work correctly.And it appears that 'Pltd' is non-nuclear. Can this be avoided by adding in the
chrominfo
?keepStandardChromosomes()
is not smart enough to know what to keep for this assembly. Note that whatkeepStandardChromosomes()
means by standard chromosome is what they call "assembled molecules" at NCBI:So in theory all the information is available and it should be possible to refactor
keepStandardChromosomes()
to do the right thing. As long as:genome
column in the Seqinfo object returned byseqinfo(txdb)
contains the correct assembly namegetChromInfoFromNCBI()
orgetChromInfoFromUCSC()
recognizes that name.This would be a significant refactor of
keepStandardChromosomes()
though and I don't really have the resources to work on this at the moment but PRs are welcome.Anyways, it's easy to drop the non-standard chromosomes of the txdb object "by hand":
Now gDNAx will still call
keepStandardChromosomes()
on this TxDb object and mess it up, but really the gDNAx folks should make this call optional.H.
Hi Herve
As I mentioned, for parasites we rely on VEuPathDB databases (such as ToxoDB here: https://toxodb.org/toxo/app/downloads) which procure genome sequences from NCBI but perform their annotation and are updated frequently. That's why I wish to use the below-given files. Now what you showed using
chrominfo <- getChromInfoFromNCBI("TGA4")
is useful but it doesn't solve theseqlevelsStyle(txdb)
error when using GTF file from other sources than NCBI, Ensembl, or UCSC. I tried some of your steps below to reproduce the error. Would it be possible to assign it something like thisseqlevelsStyle(txdb) <- "NCBI"
so that it doesn't complaint. I have requestedgDNAx
developers for a feature request to turn offseqlevelsStyle(txdb)
so that it is not checked if the txdb was built using GTF/GFF. But it would be really helpful ifGenomeInfoDb
can provide a way for users who are building the txdb objects using their own annotations while staying compatible with other packages.