Question

Issues with seqlevelsStyle when making custom txdb objects for genomes/annotations from ToxoDB

0

Entering edit mode

rohitsatyam102 ▴ 20

@rohitsatyam102-24390

Last seen 2 days ago

India

I a trying hard to make a proper txdb object that I can ultimately use with gDNAx package. I raised a separate issue here but I have realized that it's not gDNAx's problem but is actually about how the txdb object is created. I have downloaded the FASTA file and GTF file from ToxoDB from here and I have been trying hard to make a proper txdb object on which when I run seqlevelsStyle and genomeStyles should not throw an error. But it has proved to be a difficult task. I am not sure anymore what's the right way to do this so I am choosing to ask it here.

## Read files
gtf_file <- "ToxoDB-67_TgondiiME49.gtf"
fasta_file <- "ToxoDB-67_TgondiiME49_Genome.fasta"

## Making a txdb object
gff <- makeTxDbFromGFF(gtf_file,format = "gtf",organism = "Toxoplasma gondii",taxonomyId=508771, dataSource = "ToxoDB release 67")


## Maybe the chromosome information should be added
library(GenomicFeatures)
library(Biostrings)
library(Rsamtools)

fa <- FaFile(fasta_file)
fa_seqinfo <- seqinfo(fa)
seqinfo(gff) <- seqinfo(fa) ## Doesn't work
pruned_fa_seqinfo <- keepSeqlevels(fa_seqinfo, seqlevels(gff)) ## because the order of chromosome in GFF and FASTA file doesn't match. So to match that.
#seqinfo(gff) <- pruned_fa_seqinfo ## Doesn't work so let's rerun the makeTxDbFromGFF with this object pruned_fa_seqinfo
identical(seqlevels(gff), seqlevels(pruned_fa_seqinfo))  

## now using to recreate toxodb object with chromosome information
gff <- makeTxDbFromGFF(gtf_file,format = "gtf",organism = "Toxoplasma gondii",taxonomyId=508771, dataSource = "ToxoDB release 67",chrominfo = pruned_fa_seqinfo)

seqlevelsStyle(gff)
Error in seqlevelsStyle(seqlevels) : 
  The style does not have a compatible entry for the species supported by Seqname. Please see
  genomeStyles() for supported species/style

genomeStyles(gff)
Error in strsplit(organism, "_| ") : non-character argument

seqlevelsStyle(gff) <- "NCBI"
Error in .replace_seqlevels_style(x_seqlevels, value) : 
  found no sequence renaming map compatible with seqname style "NCBI" for this object

seqinfo(gff)
Seqinfo object with 436 sequences from an unspecified genome:
  seqnames       seqlengths isCircular genome
  KE138841            35372       <NA>   <NA>
  KE138851            13341       <NA>   <NA>
  KE138856             1030       <NA>   <NA>
  KE138857             1121       <NA>   <NA>
  KE138861             3123       <NA>   <NA>
  ...                   ...        ...    ...
  TGME49_chrVIIa    4541629       <NA>   <NA>
  TGME49_chrVIIb    5069724       <NA>   <NA>
  TGME49_chrX       7486190       <NA>   <NA>
  TGME49_chrXI      6623461       <NA>   <NA>
  TGME49_chrXII     7094428       <NA>   <NA>

txdbmaker GenomeInfoDb GenomicFeatures Biostrings gDNAx • 318 views

ADD COMMENT • link updated 3 hours ago by Robert Castelo ★ 3.4k • written 2 days ago by rohitsatyam102 ▴ 20

score 2 · Answer 1 · 2024-12-18

2

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Is this for genome assembly TGA4? https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000006565.2/

If that's the case then:

library(txdbmaker)
gtf <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/565/GCF_000006565.2_TGA4/GCF_000006565.2_TGA4_genomic.gtf.gz"
txdb <- makeTxDbFromGFF(gtf, organism="Toxoplasma gondii ME49", taxonomyId=508771)
genome(txdb) <- "TGA4"

seqinfo(txdb)
# Seqinfo object with 437 sequences from TGA4 genome; no seqlengths:
#   seqnames       seqlengths isCircular genome
#   NC_001799.1          <NA>       <NA>   TGA4
#   NC_031467.1          <NA>       <NA>   TGA4
#   NC_031468.1          <NA>       <NA>   TGA4
#   NC_031469.1          <NA>       <NA>   TGA4
#   NC_031470.1          <NA>       <NA>   TGA4
#   ...                   ...        ...    ...
#   NW_017383901.1       <NA>       <NA>   TGA4
#   NW_017383916.1       <NA>       <NA>   TGA4
#   NW_017383917.1       <NA>       <NA>   TGA4
#   NW_017383919.1       <NA>       <NA>   TGA4
#   NW_017383924.1       <NA>       <NA>   TGA4

The sequence names here are the RefSeq accessions. For some reason switching to the NCBI names doesn't work for this assembly (I would need to investigate why). But we can always do this renaming ourselves "by hand":

chrominfo <- getChromInfoFromNCBI("TGA4")

chrominfo[1:9 , c("RefSeqAccn", "SequenceName")]
#    RefSeqAccn   SequenceName
# 1 NC_031467.1   TGME49_chrIa
# 2 NC_031468.1   TGME49_chrIb
# 3 NC_031469.1   TGME49_chrII
# 4 NC_031470.1  TGME49_chrIII
# 5 NC_031471.1   TGME49_chrIV
# 6 NC_031472.1    TGME49_chrV
# 7 NC_031473.1   TGME49_chrVI
# 8 NC_031474.1 TGME49_chrVIIa
# 9 NC_031475.1 TGME49_chrVIIb

m <- match(seqlevels(txdb), chrominfo$RefSeqAccn)

seqlevels(txdb) <- chrominfo[m , "SequenceName"]

seqinfo(txdb)
# Seqinfo object with 437 sequences from TGA4 genome; no seqlengths:
#   seqnames      seqlengths isCircular genome
#   Pltd                <NA>       <NA>   TGA4
#   TGME49_chrIa        <NA>       <NA>   TGA4
#   TGME49_chrIb        <NA>       <NA>   TGA4
#   TGME49_chrII        <NA>       <NA>   TGA4
#   TGME49_chrIII       <NA>       <NA>   TGA4
#   ...                  ...        ...    ...
#   asmbl.1867          <NA>       <NA>   TGA4
#   asmbl.1884          <NA>       <NA>   TGA4
#   asmbl.1885          <NA>       <NA>   TGA4
#   asmbl.1889          <NA>       <NA>   TGA4
#   asmbl.1912          <NA>       <NA>   TGA4

seqlevelsStyle(txdb)
# [1] "NCBI"

Note that the chrominfo data.frame obtained with getChromInfoFromNCBI() contains a lot of information about the sequences including their lengths. Unfortunately you cannot add/modify the seqlengths of a TxDb object after it has been created. If you really want them in your TxDb object you need to pass that information at creation-time thru the chrominfo argument of the makeTxDbFromGFF() function.

Hope this helps,

H.

ADD COMMENT • link 2 days ago Hervé Pagès 16k

1

Entering edit mode

Thanks Hervé! Very helpful. However, I believe that keepStandardChromosomes gets called by gDNAx, and it appears not to work correctly.

> seqinfo(txdb)
Seqinfo object with 437 sequences from TGA4 genome; no seqlengths:
  seqnames      seqlengths isCircular
  Pltd                <NA>       <NA>
  TGME49_chrIa        <NA>       <NA>
  TGME49_chrIb        <NA>       <NA>
  TGME49_chrII        <NA>       <NA>
  TGME49_chrIII       <NA>       <NA>
  ...                  ...        ...
  asmbl.1867          <NA>       <NA>
  asmbl.1884          <NA>       <NA>
  asmbl.1885          <NA>       <NA>
  asmbl.1889          <NA>       <NA>
  asmbl.1912          <NA>       <NA>
                genome
  Pltd            TGA4
  TGME49_chrIa    TGA4
  TGME49_chrIb    TGA4
  TGME49_chrII    TGA4
  TGME49_chrIII   TGA4
  ...              ...
  asmbl.1867      TGA4
  asmbl.1884      TGA4
  asmbl.1885      TGA4
  asmbl.1889      TGA4
  asmbl.1912      TGA4

> newtxdb <- keepStandardChromosomes(txdb)
> seqinfo(newtxdb)
Seqinfo object with 1 sequence from TGA4 genome; no seqlengths:
  seqnames seqlengths isCircular
  Pltd             NA         NA
           genome
  Pltd       TGA4

And it appears that 'Pltd' is non-nuclear. Can this be avoided by adding in the chrominfo?

ADD REPLY • link 2 days ago James W. MacDonald 67k

0

Entering edit mode

keepStandardChromosomes() is not smart enough to know what to keep for this assembly. Note that what keepStandardChromosomes() means by standard chromosome is what they call "assembled molecules" at NCBI:

library(GenomeInfoDb)
chrominfo <- getChromInfoFromNCBI("TGA4")
chrominfo2 <- subset(chrominfo, SequenceRole=="assembled-molecule")
chrominfo2[, c("SequenceName", "SequenceRole", "RefSeqAccn", "SequenceLength")]
#      SequenceName       SequenceRole  RefSeqAccn SequenceLength
# 1    TGME49_chrIa assembled-molecule NC_031467.1        1859933
# 2    TGME49_chrIb assembled-molecule NC_031468.1        1955354
# 3    TGME49_chrII assembled-molecule NC_031469.1        2347032
# 4   TGME49_chrIII assembled-molecule NC_031470.1        2532871
# 5    TGME49_chrIV assembled-molecule NC_031471.1        2686605
# 6     TGME49_chrV assembled-molecule NC_031472.1        3331915
# 7    TGME49_chrVI assembled-molecule NC_031473.1        3646983
# 8  TGME49_chrVIIa assembled-molecule NC_031474.1        4541629
# 9  TGME49_chrVIIb assembled-molecule NC_031475.1        5069724
# 10 TGME49_chrVIII assembled-molecule NC_031476.1        6970285
# 11   TGME49_chrIX assembled-molecule NC_031477.1        6327655
# 12    TGME49_chrX assembled-molecule NC_031478.1        7486190
# 13   TGME49_chrXI assembled-molecule NC_031479.1        6623461
# 14  TGME49_chrXII assembled-molecule NC_031480.1        7094428
# 15           Pltd assembled-molecule NC_001799.1          34996

So in theory all the information is available and it should be possible to refactor keepStandardChromosomes() to do the right thing. As long as:

the genome column in the Seqinfo object returned by seqinfo(txdb) contains the correct assembly name
and getChromInfoFromNCBI() or getChromInfoFromUCSC() recognizes that name.

This would be a significant refactor of keepStandardChromosomes() though and I don't really have the resources to work on this at the moment but PRs are welcome.

Anyways, it's easy to drop the non-standard chromosomes of the txdb object "by hand":

seqlevels(txdb) <- chrominfo2$SequenceName
seqinfo(txdb)
# Seqinfo object with 15 sequences from TGA4 genome; no seqlengths:
#   seqnames      seqlengths isCircular genome
#   TGME49_chrIa        <NA>       <NA>   TGA4
#   TGME49_chrIb        <NA>       <NA>   TGA4
#   TGME49_chrII        <NA>       <NA>   TGA4
#   TGME49_chrIII       <NA>       <NA>   TGA4
#   TGME49_chrIV        <NA>       <NA>   TGA4
#   ...                  ...        ...    ...
#   TGME49_chrIX        <NA>       <NA>   TGA4
#   TGME49_chrX         <NA>       <NA>   TGA4
#   TGME49_chrXI        <NA>       <NA>   TGA4
#   TGME49_chrXII       <NA>       <NA>   TGA4
#   Pltd                <NA>       <NA>   TGA4

Now gDNAx will still call keepStandardChromosomes() on this TxDb object and mess it up, but really the gDNAx folks should make this call optional.

H.

ADD REPLY • link 2 days ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve

As I mentioned, for parasites we rely on VEuPathDB databases (such as ToxoDB here: https://toxodb.org/toxo/app/downloads) which procure genome sequences from NCBI but perform their annotation and are updated frequently. That's why I wish to use the below-given files. Now what you showed using chrominfo <- getChromInfoFromNCBI("TGA4") is useful but it doesn't solve the seqlevelsStyle(txdb) error when using GTF file from other sources than NCBI, Ensembl, or UCSC. I tried some of your steps below to reproduce the error. Would it be possible to assign it something like this seqlevelsStyle(txdb) <- "NCBI" so that it doesn't complaint. I have requested gDNAx developers for a feature request to turn off seqlevelsStyle(txdb) so that it is not checked if the txdb was built using GTF/GFF. But it would be really helpful if GenomeInfoDb can provide a way for users who are building the txdb objects using their own annotations while staying compatible with other packages.


gff<- "https://toxodb.org/common/downloads/release-68/TgondiiME49/gff/data/ToxoDB-68_TgondiiME49.gff"
fa<- "https://toxodb.org/common/downloads/release-68/TgondiiME49/fasta/data/ToxoDB-68_TgondiiME49_Genome.fasta"

library(txdbmaker)
gtf <- "https://toxodb.org/common/downloads/release-68/TgondiiME49/gff/data/ToxoDB-68_TgondiiME49.gff"
txdb <- makeTxDbFromGFF(gtf, organism="Toxoplasma gondii ME49", taxonomyId=508771)
genome(txdb) <- "TGA4"
seqinfo(txdb)

chrominfo <- getChromInfoFromNCBI("TGA4")
## Need to change the SequenceName because in out GTF the chromosome name is partially from "SequenceName" and from "GenBankAccn". let's use removeVersion to remove versions from the contig
chrominfo$SequenceName[!grepl("TGME49_",chrominfo$SequenceName)] <- GeneStructureTools::removeVersion(chrominfo$GenBankAccn[!grepl("TGME49_",chrominfo$SequenceName)] )
chrominfo[15:20 , c("RefSeqAccn", "SequenceName")]

m <- match(seqlevels(txdb), chrominfo$SequenceName)
seqlevels(txdb) <- chrominfo[m , "SequenceName"]
seqlevelsStyle(txdb)
Error in seqlevelsStyle(seqlevels) : 
  The style does not have a compatible entry for the species supported by Seqname. Please see
genomeStyles() for supported species/style

ADD REPLY • link 2 days ago rohitsatyam102 ▴ 20

0

Entering edit mode

I have requested gDNAx developers for a feature request to turn off seqlevelsStyle(txdb)

Yes, that's the right course of action. seqlevelsStyle() is inherently unreliable and will return an error if it cannot figure out the style. Same with keepStandardChromosomes(): the user should be able to turn this off too. But a better approach for packages that take a TxDb object as input is as follow:

First check that the supplied TxDb object has the correct/expected seqlevels.
If the supplied TxDb object does NOT have the correct/expected seqlevels, then they should raise an error with an informative error message, so that the user can correct the seqlevels.
The code in the package should not try to correct the seqlevels, but if they really want to do it anyway, they need to keep in mind that this won't always work. And when it fails, their code should handle the failure gracefully.

In the case of your TxDb object, the chromosome/scaffold names are a mixture of NCBI names and GenBank accessions. But seqlevelsStyle(txdb) is not smart enough to figure that out so it fails:

> seqlevelsStyle(txdb)
Error in seqlevelsStyle(seqlevels) : 
  The style does not have a compatible entry for the species supported by
  Seqname. Please see genomeStyles() for supported species/style

As I explained earlier (see above), if gDNAx only cares about the chromosomes (a.k.a. assembled molecules) then you can get rid of all the other sequences (i.e. the scaffolds) yourself before passing your TxDb object to gDNAx:

chrominfo <- getChromInfoFromNCBI("TGA4")
chrominfo2 <- subset(chrominfo, SequenceRole=="assembled-molecule")
seqlevels(txdb) <- intersect(seqlevels(txdb), chrominfo2$SequenceName)

seqinfo(txdb)
# Seqinfo object with 14 sequences from an unspecified genome; no seqlengths:
#   seqnames       seqlengths isCircular genome
#   TGME49_chrII         <NA>       <NA>   <NA>
#   TGME49_chrIII        <NA>       <NA>   <NA>
#   TGME49_chrIV         <NA>       <NA>   <NA>
#   TGME49_chrIX         <NA>       <NA>   <NA>
#   TGME49_chrIa         <NA>       <NA>   <NA>
#   ...                   ...        ...    ...
#   TGME49_chrVIIa       <NA>       <NA>   <NA>
#   TGME49_chrVIIb       <NA>       <NA>   <NA>
#   TGME49_chrX          <NA>       <NA>   <NA>
#   TGME49_chrXI         <NA>       <NA>   <NA>
#   TGME49_chrXII        <NA>       <NA>   <NA>

Why wouln't gDNAx be happy with that TxDb object? Aren't those chromosome names compatible with the FASTA file? If so then why does gDNAx need to correct the seqlevels of the TxDb object with calls to seqlevelsStyle() or keepStandardChromosomes()?

H.

ADD REPLY • link 1 day ago Hervé Pagès 16k

0

Entering edit mode

Hi Hervé, thank you very much for your input into how to properly build and manage TxDb objects. gDNAx has two main inputs, one or more BAM files and an annotation, which should come in the form of a TxDb object. Because I work mainly with human data, where seqlevelsStyle() works fine, my perception was that to release the user from knowing how to make sequence names compatible between annotation and BAM file, the software could do something like:

seqlevelsStyle(txdb) <- seqlevelsStyle(bam)

Removing whatever sequences were not common or had different lengths (as it happened with chrMT in between the two official versions of hg19). As you rightly show, this is not that easy outside human data, and I'll modify the software along the lines you suggest. By the way, gDNAx attempts to work with the "standard chromosomes" only by default for its calculations, but this can be disabled by setting stdChrom=FALSE in the call to gDNAdx().

Something I would like to confirm I'm understanding right is, if I would like to have a more complete Seqinfo object for the example above, i.e., filling up the seqlengths and genome columns, once I download the data with getChromInfoFromNCBI(), should one call again makeTxDbFromGFF() with the chrominfo argument? or is there any other more direct way of updating those columns in the Seqinfo slot of a TxDb object?

ADD REPLY • link 3 hours ago Robert Castelo ★ 3.4k

score 0 · Answer 2 · 2024-12-18

Your post title is misleading. You have no problem whatsoever generating the TxDb, the problem is using it with gDNAx, which attempts to subset to the standard chromosomes, which requires knowing the genome style, and there are only a handful of species for which that is possible:

> names(genomeStyles())
 [1] "Arabidopsis_thaliana"    
 [2] "Caenorhabditis_elegans"  
 [3] "Canis_familiaris"        
 [4] "Cyanidioschyzon_merolae" 
 [5] "Drosophila_melanogaster" 
 [6] "Gossypium_hirsutum"      
 [7] "Homo_sapiens"            
 [8] "Mus_musculus"            
 [9] "Oryza_sativa"            
[10] "Populus_trichocarpa"     
[11] "Rattus_norvegicus"       
[12] "Saccharomyces_cerevisiae"
[13] "Zea_mays"

Any species not on that list will not be amenable to subsetting to standard chromosomes or changing the genome style. There may be a simple workaround, and if so I imagine either Robert Castelo or Herve Pages will be along soon enough to point it out. But I personally have never found a way to specify the standard chromosomes or genome style directly.