Question

Issues with seqlevelsStyle when making custom txdb objects for genomes/annotations from ToxoDB

0

rohitsatyam102 ▴ 20

@rohitsatyam102-24390

India

I am trying to build a proper TxDb object that I can ultimately use with the gDNAx package. I raised a separate issue here, but I have since realized that the problem is not in gDNAx but in how the TxDb object is created. I downloaded the FASTA and GTF files from ToxoDB (from here) and have been trying to build a TxDb object on which seqlevelsStyle() and genomeStyles() run without throwing an error, but this has proved difficult. I am no longer sure what the right way to do this is, so I am asking here.

## Load packages first (makeTxDbFromGFF() is provided by txdbmaker in current Bioconductor releases)
library(txdbmaker)
library(GenomicFeatures)
library(Biostrings)
library(Rsamtools)

## Read files
gtf_file <- "ToxoDB-67_TgondiiME49.gtf"
fasta_file <- "ToxoDB-67_TgondiiME49_Genome.fasta"

## Making a txdb object
gff <- makeTxDbFromGFF(gtf_file, format = "gtf", organism = "Toxoplasma gondii", taxonomyId = 508771, dataSource = "ToxoDB release 67")

## Maybe the chromosome information should be added

fa <- FaFile(fasta_file)
fa_seqinfo <- seqinfo(fa)
seqinfo(gff) <- seqinfo(fa) ## Doesn't work
pruned_fa_seqinfo <- keepSeqlevels(fa_seqinfo, seqlevels(gff)) ## subset/reorder to match the GTF, because the chromosome order in the GTF and FASTA files differs
#seqinfo(gff) <- pruned_fa_seqinfo ## Doesn't work either, so rerun makeTxDbFromGFF() with pruned_fa_seqinfo
identical(seqlevels(gff), seqlevels(pruned_fa_seqinfo))  

## Now recreate the TxDb object with the chromosome information
gff <- makeTxDbFromGFF(gtf_file,format = "gtf",organism = "Toxoplasma gondii",taxonomyId=508771, dataSource = "ToxoDB release 67",chrominfo = pruned_fa_seqinfo)

seqlevelsStyle(gff)
Error in seqlevelsStyle(seqlevels) : 
  The style does not have a compatible entry for the species supported by Seqname. Please see
  genomeStyles() for supported species/style

genomeStyles(gff)
Error in strsplit(organism, "_| ") : non-character argument

seqlevelsStyle(gff) <- "NCBI"
Error in .replace_seqlevels_style(x_seqlevels, value) : 
  found no sequence renaming map compatible with seqname style "NCBI" for this object

seqinfo(gff)
Seqinfo object with 436 sequences from an unspecified genome:
  seqnames       seqlengths isCircular genome
  KE138841            35372       <NA>   <NA>
  KE138851            13341       <NA>   <NA>
  KE138856             1030       <NA>   <NA>
  KE138857             1121       <NA>   <NA>
  KE138861             3123       <NA>   <NA>
  ...                   ...        ...    ...
  TGME49_chrVIIa    4541629       <NA>   <NA>
  TGME49_chrVIIb    5069724       <NA>   <NA>
  TGME49_chrX       7486190       <NA>   <NA>
  TGME49_chrXI      6623461       <NA>   <NA>
  TGME49_chrXII     7094428       <NA>   <NA>

txdbmaker GenomeInfoDb GenomicFeatures Biostrings gDNAx • 85 views

ADD COMMENT • link 8 hours ago • updated 2 hours ago rohitsatyam102 ▴ 20

score 2 · Answer 1 · 2024-12-18

2

Hervé Pagès 16k

@herve-pages-1542

Seattle, WA, United States

Is this for genome assembly TGA4? https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000006565.2/

If that's the case then:

library(txdbmaker)
gtf <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/565/GCF_000006565.2_TGA4/GCF_000006565.2_TGA4_genomic.gtf.gz"
txdb <- makeTxDbFromGFF(gtf, organism="Toxoplasma gondii ME49", taxonomyId=508771)
genome(txdb) <- "TGA4"

seqinfo(txdb)
# Seqinfo object with 437 sequences from TGA4 genome; no seqlengths:
#   seqnames       seqlengths isCircular genome
#   NC_001799.1          <NA>       <NA>   TGA4
#   NC_031467.1          <NA>       <NA>   TGA4
#   NC_031468.1          <NA>       <NA>   TGA4
#   NC_031469.1          <NA>       <NA>   TGA4
#   NC_031470.1          <NA>       <NA>   TGA4
#   ...                   ...        ...    ...
#   NW_017383901.1       <NA>       <NA>   TGA4
#   NW_017383916.1       <NA>       <NA>   TGA4
#   NW_017383917.1       <NA>       <NA>   TGA4
#   NW_017383919.1       <NA>       <NA>   TGA4
#   NW_017383924.1       <NA>       <NA>   TGA4

The sequence names here are the RefSeq accessions. For some reason switching to the NCBI names doesn't work for this assembly (I would need to investigate why). But we can always do this renaming ourselves "by hand":

chrominfo <- getChromInfoFromNCBI("TGA4")

chrominfo[1:9 , c("RefSeqAccn", "SequenceName")]
#    RefSeqAccn   SequenceName
# 1 NC_031467.1   TGME49_chrIa
# 2 NC_031468.1   TGME49_chrIb
# 3 NC_031469.1   TGME49_chrII
# 4 NC_031470.1  TGME49_chrIII
# 5 NC_031471.1   TGME49_chrIV
# 6 NC_031472.1    TGME49_chrV
# 7 NC_031473.1   TGME49_chrVI
# 8 NC_031474.1 TGME49_chrVIIa
# 9 NC_031475.1 TGME49_chrVIIb

m <- match(seqlevels(txdb), chrominfo$RefSeqAccn)

seqlevels(txdb) <- chrominfo[m , "SequenceName"]

seqinfo(txdb)
# Seqinfo object with 437 sequences from TGA4 genome; no seqlengths:
#   seqnames      seqlengths isCircular genome
#   Pltd                <NA>       <NA>   TGA4
#   TGME49_chrIa        <NA>       <NA>   TGA4
#   TGME49_chrIb        <NA>       <NA>   TGA4
#   TGME49_chrII        <NA>       <NA>   TGA4
#   TGME49_chrIII       <NA>       <NA>   TGA4
#   ...                  ...        ...    ...
#   asmbl.1867          <NA>       <NA>   TGA4
#   asmbl.1884          <NA>       <NA>   TGA4
#   asmbl.1885          <NA>       <NA>   TGA4
#   asmbl.1889          <NA>       <NA>   TGA4
#   asmbl.1912          <NA>       <NA>   TGA4

seqlevelsStyle(txdb)
# [1] "NCBI"

Note that the chrominfo data.frame obtained with getChromInfoFromNCBI() contains a lot of information about the sequences including their lengths. Unfortunately you cannot add/modify the seqlengths of a TxDb object after it has been created. If you really want them in your TxDb object you need to pass that information at creation-time thru the chrominfo argument of the makeTxDbFromGFF() function.
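
To make the creation-time route concrete, here is a minimal sketch of reshaping NCBI-style chromosome information into the chrom/length/is_circular layout described in ?makeTxDb. The helper name ncbi_to_chrominfo() is hypothetical, the toy table stands in for the real getChromInfoFromNCBI("TGA4") result (names and lengths copied from the output in this thread), and the `circular` column name and its NA values are assumptions to check against the real table.

```r
# Toy stand-in for a few rows of getChromInfoFromNCBI("TGA4").
# Names and lengths are copied from this thread; `circular` is a placeholder.
ncbi <- data.frame(
    SequenceName   = c("TGME49_chrIa", "TGME49_chrIb", "Pltd"),
    RefSeqAccn     = c("NC_031467.1", "NC_031468.1", "NC_001799.1"),
    SequenceLength = c(1859933L, 1955354L, 34996L),
    circular       = c(NA, NA, NA),
    stringsAsFactors = FALSE
)

# Hypothetical helper: build the data frame expected by the `chrominfo`
# argument. `name_col` picks the naming scheme used by the GTF seqnames.
ncbi_to_chrominfo <- function(ncbi, name_col = "RefSeqAccn") {
    keep <- !is.na(ncbi[[name_col]])   # drop sequences lacking that name
    data.frame(
        chrom       = ncbi[[name_col]][keep],
        length      = ncbi$SequenceLength[keep],
        is_circular = ncbi$circular[keep],
        stringsAsFactors = FALSE
    )
}

chrominfo_df <- ncbi_to_chrominfo(ncbi)
print(chrominfo_df)
```

With the full NCBI table, passing chrominfo=ncbi_to_chrominfo(getChromInfoFromNCBI("TGA4")) to makeTxDbFromGFF() should carry the lengths into the TxDb, provided every seqname used in the GTF appears in the chrom column.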

Hope this helps,

H.

ADD COMMENT • link 4 hours ago Hervé Pagès 16k

1

Thanks Hervé! Very helpful. However, I believe that keepStandardChromosomes gets called by gDNAx, and it appears not to work correctly.

> seqinfo(txdb)
Seqinfo object with 437 sequences from TGA4 genome; no seqlengths:
  seqnames      seqlengths isCircular
  Pltd                <NA>       <NA>
  TGME49_chrIa        <NA>       <NA>
  TGME49_chrIb        <NA>       <NA>
  TGME49_chrII        <NA>       <NA>
  TGME49_chrIII       <NA>       <NA>
  ...                  ...        ...
  asmbl.1867          <NA>       <NA>
  asmbl.1884          <NA>       <NA>
  asmbl.1885          <NA>       <NA>
  asmbl.1889          <NA>       <NA>
  asmbl.1912          <NA>       <NA>
                genome
  Pltd            TGA4
  TGME49_chrIa    TGA4
  TGME49_chrIb    TGA4
  TGME49_chrII    TGA4
  TGME49_chrIII   TGA4
  ...              ...
  asmbl.1867      TGA4
  asmbl.1884      TGA4
  asmbl.1885      TGA4
  asmbl.1889      TGA4
  asmbl.1912      TGA4

> newtxdb <- keepStandardChromosomes(txdb)
> seqinfo(newtxdb)
Seqinfo object with 1 sequence from TGA4 genome; no seqlengths:
  seqnames seqlengths isCircular
  Pltd             NA         NA
           genome
  Pltd       TGA4

And it appears that 'Pltd' is non-nuclear. Can this be avoided by adding in the chrominfo?

ADD REPLY • link 4 hours ago James W. MacDonald 67k

0

keepStandardChromosomes() is not smart enough to know what to keep for this assembly. Note that what keepStandardChromosomes() means by standard chromosome is what they call "assembled molecules" at NCBI:

library(GenomeInfoDb)
chrominfo <- getChromInfoFromNCBI("TGA4")
chrominfo2 <- subset(chrominfo, SequenceRole=="assembled-molecule")
chrominfo2[, c("SequenceName", "SequenceRole", "RefSeqAccn", "SequenceLength")]
#      SequenceName       SequenceRole  RefSeqAccn SequenceLength
# 1    TGME49_chrIa assembled-molecule NC_031467.1        1859933
# 2    TGME49_chrIb assembled-molecule NC_031468.1        1955354
# 3    TGME49_chrII assembled-molecule NC_031469.1        2347032
# 4   TGME49_chrIII assembled-molecule NC_031470.1        2532871
# 5    TGME49_chrIV assembled-molecule NC_031471.1        2686605
# 6     TGME49_chrV assembled-molecule NC_031472.1        3331915
# 7    TGME49_chrVI assembled-molecule NC_031473.1        3646983
# 8  TGME49_chrVIIa assembled-molecule NC_031474.1        4541629
# 9  TGME49_chrVIIb assembled-molecule NC_031475.1        5069724
# 10 TGME49_chrVIII assembled-molecule NC_031476.1        6970285
# 11   TGME49_chrIX assembled-molecule NC_031477.1        6327655
# 12    TGME49_chrX assembled-molecule NC_031478.1        7486190
# 13   TGME49_chrXI assembled-molecule NC_031479.1        6623461
# 14  TGME49_chrXII assembled-molecule NC_031480.1        7094428
# 15           Pltd assembled-molecule NC_001799.1          34996

So in theory all the information is available and it should be possible to refactor keepStandardChromosomes() to do the right thing, as long as:

- the genome column in the Seqinfo object returned by seqinfo(txdb) contains the correct assembly name, and
- getChromInfoFromNCBI() or getChromInfoFromUCSC() recognizes that name.

This would be a significant refactor of keepStandardChromosomes() though and I don't really have the resources to work on this at the moment but PRs are welcome.

Anyways, it's easy to drop the non-standard chromosomes of the txdb object "by hand":

seqlevels(txdb) <- chrominfo2$SequenceName
seqinfo(txdb)
# Seqinfo object with 15 sequences from TGA4 genome; no seqlengths:
#   seqnames      seqlengths isCircular genome
#   TGME49_chrIa        <NA>       <NA>   TGA4
#   TGME49_chrIb        <NA>       <NA>   TGA4
#   TGME49_chrII        <NA>       <NA>   TGA4
#   TGME49_chrIII       <NA>       <NA>   TGA4
#   TGME49_chrIV        <NA>       <NA>   TGA4
#   ...                  ...        ...    ...
#   TGME49_chrIX        <NA>       <NA>   TGA4
#   TGME49_chrX         <NA>       <NA>   TGA4
#   TGME49_chrXI        <NA>       <NA>   TGA4
#   TGME49_chrXII       <NA>       <NA>   TGA4
#   Pltd                <NA>       <NA>   TGA4
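
The by-hand selection above can be wrapped in a small helper so it also copes with objects that lack some of the assembled molecules. This is only a sketch: standard_seqlevels() is a hypothetical name, the toy table imitates getChromInfoFromNCBI() output, and the "unplaced-scaffold" role given to asmbl.1867 is an assumption made for illustration.

```r
# Toy chromosome table; the SequenceRole of asmbl.1867 is assumed.
toy_chrominfo <- data.frame(
    SequenceName = c("TGME49_chrIa", "TGME49_chrIb", "Pltd", "asmbl.1867"),
    SequenceRole = c("assembled-molecule", "assembled-molecule",
                     "assembled-molecule", "unplaced-scaffold"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: names of the assembled molecules that are actually
# present in the object, in the order they appear in the chromosome table.
standard_seqlevels <- function(chrominfo, present) {
    is_std <- chrominfo$SequenceRole %in% "assembled-molecule"  # NA-safe
    std <- chrominfo$SequenceName[is_std]
    std[std %in% present]
}

present <- c("Pltd", "TGME49_chrIa", "asmbl.1867")
kept <- standard_seqlevels(toy_chrominfo, present)
print(kept)
```

On a real TxDb the result would be used as seqlevels(txdb) <- standard_seqlevels(chrominfo, seqlevels(txdb)).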

Now gDNAx will still call keepStandardChromosomes() on this TxDb object and mess it up, but really the gDNAx folks should make this call optional.

H.

ADD REPLY • link 3 hours ago Hervé Pagès 16k

0

Hi Herve

As I mentioned, for parasites we rely on VEuPathDB databases (such as ToxoDB here: https://toxodb.org/toxo/app/downloads), which obtain genome sequences from NCBI but perform their own annotation and are updated frequently. That's why I wish to use the files given below. What you showed using chrominfo <- getChromInfoFromNCBI("TGA4") is useful, but it doesn't solve the seqlevelsStyle(txdb) error when the GTF file comes from a source other than NCBI, Ensembl, or UCSC. I tried some of your steps below to reproduce the error. Would it be possible to assign the style directly, with something like seqlevelsStyle(txdb) <- "NCBI", so that it doesn't complain? I have asked the gDNAx developers for a feature to turn off the seqlevelsStyle(txdb) check when the TxDb was built from a GTF/GFF file. But it would be really helpful if GenomeInfoDb could provide a way for users who build TxDb objects from their own annotations to stay compatible with other packages.


gff<- "https://toxodb.org/common/downloads/release-68/TgondiiME49/gff/data/ToxoDB-68_TgondiiME49.gff"
fa<- "https://toxodb.org/common/downloads/release-68/TgondiiME49/fasta/data/ToxoDB-68_TgondiiME49_Genome.fasta"

library(txdbmaker)
gtf <- "https://toxodb.org/common/downloads/release-68/TgondiiME49/gff/data/ToxoDB-68_TgondiiME49.gff"
txdb <- makeTxDbFromGFF(gtf, organism="Toxoplasma gondii ME49", taxonomyId=508771)
genome(txdb) <- "TGA4"
seqinfo(txdb)

chrominfo <- getChromInfoFromNCBI("TGA4")
## Need to change the SequenceName because in our GTF the chromosome names come partly from "SequenceName" and partly from "GenBankAccn". Let's use removeVersion to strip the version suffix from the contig accessions
chrominfo$SequenceName[!grepl("TGME49_",chrominfo$SequenceName)] <- GeneStructureTools::removeVersion(chrominfo$GenBankAccn[!grepl("TGME49_",chrominfo$SequenceName)] )
chrominfo[15:20 , c("RefSeqAccn", "SequenceName")]

m <- match(seqlevels(txdb), chrominfo$SequenceName)
seqlevels(txdb) <- chrominfo[m , "SequenceName"]
seqlevelsStyle(txdb)
Error in seqlevelsStyle(seqlevels) : 
  The style does not have a compatible entry for the species supported by Seqname. Please see
genomeStyles() for supported species/style
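
One possible direction, sketched here under stated assumptions, is to translate the ToxoDB names into the NCBI SequenceName column instead of overwriting SequenceName, so that the TxDb ends up with the same seqlevels as one built from the NCBI GTF. The helper names are hypothetical, the toy table only imitates getChromInfoFromNCBI("TGA4") (its scaffold names and the ".1" version suffixes are made up), and none of this has been run against the ToxoDB files.

```r
# Base-R stand-in for GeneStructureTools::removeVersion():
# drop a trailing ".<digits>" version suffix from an accession.
strip_version <- function(accn) sub("\\.[0-9]+$", "", accn)

# Toy table: scaffold names and accession versions are invented.
toy_chrominfo <- data.frame(
    SequenceName = c("TGME49_chrIa", "toy_scaffold_1", "toy_scaffold_2"),
    GenBankAccn  = c(NA, "KE138841.1", "KE138851.1"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: map ToxoDB seqlevels to NCBI SequenceName.
# Chromosomes are matched by name, scaffolds by versionless GenBank accession.
toxodb_to_ncbi <- function(toxodb_names, chrominfo) {
    is_chr <- grepl("^TGME49_", chrominfo$SequenceName)
    key <- ifelse(is_chr, chrominfo$SequenceName,
                  strip_version(chrominfo$GenBankAccn))
    m <- match(toxodb_names, key)
    if (anyNA(m)) {
        stop("no NCBI match for: ",
             paste(toxodb_names[is.na(m)], collapse = ", "))
    }
    setNames(chrominfo$SequenceName[m], toxodb_names)
}

new_names <- toxodb_to_ncbi(c("KE138851", "TGME49_chrIa"), toy_chrominfo)
print(new_names)
```

On the real object this would become seqlevels(txdb) <- unname(toxodb_to_ncbi(seqlevels(txdb), chrominfo)), after which seqlevelsStyle(txdb) may behave as it does for the NCBI-derived TxDb in the answer above.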

ADD REPLY • link 2 hours ago rohitsatyam102 ▴ 20

score 0 · Answer 2 · 2024-12-18

Your post title is misleading. You have no problem whatsoever generating the TxDb; the problem is using it with gDNAx, which attempts to subset to the standard chromosomes. That requires knowing the genome style, and there are only a handful of species for which that is possible:

> names(genomeStyles())
 [1] "Arabidopsis_thaliana"    
 [2] "Caenorhabditis_elegans"  
 [3] "Canis_familiaris"        
 [4] "Cyanidioschyzon_merolae" 
 [5] "Drosophila_melanogaster" 
 [6] "Gossypium_hirsutum"      
 [7] "Homo_sapiens"            
 [8] "Mus_musculus"            
 [9] "Oryza_sativa"            
[10] "Populus_trichocarpa"     
[11] "Rattus_norvegicus"       
[12] "Saccharomyces_cerevisiae"
[13] "Zea_mays"

Any species not on that list will not be amenable to subsetting to standard chromosomes or changing the genome style. There may be a simple workaround, and if so I imagine either Robert Castelo or Herve Pages will be along soon enough to point it out. But I personally have never found a way to specify the standard chromosomes or genome style directly.
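
A quick pre-flight check along these lines can tell you in advance whether a style lookup has any chance of working for a given organism. This is a minimal base-R sketch: has_genome_style() is a hypothetical helper, and the `supported` vector is copied from the names(genomeStyles()) output above, which may differ in other GenomeInfoDb versions.

```r
# Species keys copied from the names(genomeStyles()) output in this thread.
supported <- c(
    "Arabidopsis_thaliana", "Caenorhabditis_elegans", "Canis_familiaris",
    "Cyanidioschyzon_merolae", "Drosophila_melanogaster",
    "Gossypium_hirsutum", "Homo_sapiens", "Mus_musculus", "Oryza_sativa",
    "Populus_trichocarpa", "Rattus_norvegicus",
    "Saccharomyces_cerevisiae", "Zea_mays"
)

# Hypothetical helper: reduce an organism string to "Genus_species"
# and test whether that key is among the supported species.
has_genome_style <- function(organism, supported) {
    words <- strsplit(organism, "_| ")[[1]]
    key <- paste(head(words, 2), collapse = "_")
    key %in% supported
}

print(has_genome_style("Toxoplasma gondii ME49", supported))
print(has_genome_style("Homo sapiens", supported))
```

With GenomeInfoDb loaded, supported <- names(genomeStyles()) would replace the hard-coded vector.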