I a trying hard to make a proper txdb object that I can ultimately use with gDNAx
package. I raised a separate issue here but I have realized that it's not gDNAx's problem but is actually about how the txdb object is created. I have downloaded the FASTA file and GTF file from ToxoDB from here and I have been trying hard to make a proper txdb object on which when I run seqlevelsStyle
and genomeStyles
should not throw an error. But it has proved to be a difficult task. I am not sure anymore what's the right way to do this so I am choosing to ask it here.
## Read files
gtf_file <- "ToxoDB-67_TgondiiME49.gtf"
fasta_file <- "ToxoDB-67_TgondiiME49_Genome.fasta"
## Making a txdb object
gff <- makeTxDbFromGFF(gtf_file,format = "gtf",organism = "Toxoplasma gondii",taxonomyId=508771, dataSource = "ToxoDB release 67")
## Maybe the chromosome information should be added
library(GenomicFeatures)
library(Biostrings)
library(Rsamtools)
fa <- FaFile(fasta_file)
fa_seqinfo <- seqinfo(fa)
seqinfo(gff) <- seqinfo(fa) ## Doesn't work
pruned_fa_seqinfo <- keepSeqlevels(fa_seqinfo, seqlevels(gff)) ## because the order of chromosome in GFF and FASTA file doesn't match. So to match that.
#seqinfo(gff) <- pruned_fa_seqinfo ## Doesn't work so let's rerun the makeTxDbFromGFF with this object pruned_fa_seqinfo
identical(seqlevels(gff), seqlevels(pruned_fa_seqinfo))
## now using to recreate toxodb object with chromosome information
gff <- makeTxDbFromGFF(gtf_file,format = "gtf",organism = "Toxoplasma gondii",taxonomyId=508771, dataSource = "ToxoDB release 67",chrominfo = pruned_fa_seqinfo)
seqlevelsStyle(gff)
Error in seqlevelsStyle(seqlevels) :
The style does not have a compatible entry for the species supported by Seqname. Please see
genomeStyles() for supported species/style
genomeStyles(gff)
Error in strsplit(organism, "_| ") : non-character argument
seqlevelsStyle(gff) <- "NCBI"
Error in .replace_seqlevels_style(x_seqlevels, value) :
found no sequence renaming map compatible with seqname style "NCBI" for this object
seqinfo(gff)
Seqinfo object with 436 sequences from an unspecified genome:
seqnames seqlengths isCircular genome
KE138841 35372 <NA> <NA>
KE138851 13341 <NA> <NA>
KE138856 1030 <NA> <NA>
KE138857 1121 <NA> <NA>
KE138861 3123 <NA> <NA>
... ... ... ...
TGME49_chrVIIa 4541629 <NA> <NA>
TGME49_chrVIIb 5069724 <NA> <NA>
TGME49_chrX 7486190 <NA> <NA>
TGME49_chrXI 6623461 <NA> <NA>
TGME49_chrXII 7094428 <NA> <NA>
Thanks Hervé! Very helpful. However, I believe that
keepStandardChromosomes
gets called bygDNAx
, and it appears not to work correctly.And it appears that 'Pltd' is non-nuclear. Can this be avoided by adding in the
chrominfo
?keepStandardChromosomes()
is not smart enough to know what to keep for this assembly. Note that whatkeepStandardChromosomes()
means by standard chromosome is what they call "assembled molecules" at NCBI:So in theory all the information is available and it should be possible to refactor
keepStandardChromosomes()
to do the right thing. As long as:genome
column in the Seqinfo object returned byseqinfo(txdb)
contains the correct assembly namegetChromInfoFromNCBI()
orgetChromInfoFromUCSC()
recognizes that name.This would be a significant refactor of
keepStandardChromosomes()
though and I don't really have the resources to work on this at the moment but PRs are welcome.Anyways, it's easy to drop the non-standard chromosomes of the txdb object "by hand":
Now gDNAx will still call
keepStandardChromosomes()
on this TxDb object and mess it up, but really the gDNAx folks should make this call optional.H.
Hi Herve
As I mentioned, for parasites we rely on VEuPathDB databases (such as ToxoDB here: https://toxodb.org/toxo/app/downloads) which procure genome sequences from NCBI but perform their annotation and are updated frequently. That's why I wish to use the below-given files. Now what you showed using
chrominfo <- getChromInfoFromNCBI("TGA4")
is useful but it doesn't solve theseqlevelsStyle(txdb)
error when using GTF file from other sources than NCBI, Ensembl, or UCSC. I tried some of your steps below to reproduce the error. Would it be possible to assign it something like thisseqlevelsStyle(txdb) <- "NCBI"
so that it doesn't complaint. I have requestedgDNAx
developers for a feature request to turn offseqlevelsStyle(txdb)
so that it is not checked if the txdb was built using GTF/GFF. But it would be really helpful ifGenomeInfoDb
can provide a way for users who are building the txdb objects using their own annotations while staying compatible with other packages.Yes, that's the right course of action.
seqlevelsStyle()
is inherently unreliable and will return an error if it cannot figure out the style. Same withkeepStandardChromosomes()
: the user should be able to turn this off too. But a better approach for packages that take a TxDb object as input is as follow:In the case of your TxDb object, the chromosome/scaffold names are a mixture of NCBI names and GenBank accessions. But
seqlevelsStyle(txdb)
is not smart enough to figure that out so it fails:As I explained earlier (see above), if gDNAx only cares about the chromosomes (a.k.a. assembled molecules) then you can get rid of all the other sequences (i.e. the scaffolds) yourself before passing your TxDb object to gDNAx:
Why wouln't gDNAx be happy with that TxDb object? Aren't those chromosome names compatible with the FASTA file? If so then why does gDNAx need to correct the seqlevels of the TxDb object with calls to
seqlevelsStyle()
orkeepStandardChromosomes()
?H.
Hi Hervé, thank you very much for your input into how to properly build and manage TxDb objects. gDNAx has two main inputs, one or more BAM files and an annotation, which should come in the form of a TxDb object. Because I work mainly with human data, where
seqlevelsStyle()
works fine, my perception was that to release the user from knowing how to make sequence names compatible between annotation and BAM file, the software could do something like:Removing whatever sequences were not common or had different lengths (as it happened with chrMT in between the two official versions of hg19). As you rightly show, this is not that easy outside human data, and I'll modify the software along the lines you suggest. By the way, gDNAx attempts to work with the "standard chromosomes" only by default for its calculations, but this can be disabled by setting
stdChrom=FALSE
in the call togDNAdx()
.Something I would like to confirm I'm understanding right is, if I would like to have a more complete Seqinfo object for the example above, i.e., filling up the
seqlengths
andgenome
columns, once I download the data withgetChromInfoFromNCBI()
, should one call againmakeTxDbFromGFF()
with thechrominfo
argument? or is there any other more direct way of updating those columns in theSeqinfo
slot of aTxDb
object?