error making txdb object from gtf file
@sekawaiwai2006-9531
Last seen 7.8 years ago

Hello guys,

I am having trouble building a TxDb object for S. lycopersicum.

GTF file link: http://www.ncbi.nlm.nih.gov/genome/?term=solanum+lycopersicum

The command I used:

gtffile <- file.path("~/Desktop/s.lyco gff/new one/GCF_000188115.3_SL2.50_genomic.gff")
txdb <- makeTxDbFromGFF(gtffile, format= "gtf",circ_seqs=character())

ERROR:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Warning message:
In .local(con, format, text, ...) :
  gff-version directive indicates version is 3, not 2
Error in ID[gene_IDX] %in% unlist(Parent[exon_with_gene_parent_IDX], use.names = FALSE) : 
  error in evaluating the argument 'table' in selecting a method for function '%in%': Error in unlist(Parent[exon_with_gene_parent_IDX], use.names = FALSE) : 
  error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) : 
  subscript contains NAs or out-of-bounds indices

 

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] TxDb.Athaliana.BioMart.plantsmart28_3.2.2
 [2] BiocInstaller_1.20.1                     
 [3] GenomicFeatures_1.22.8                   
 [4] AnnotationDbi_1.32.3                     
 [5] Biobase_2.30.0                           
 [6] GenomicRanges_1.22.3                     
 [7] GenomeInfoDb_1.6.1                       
 [8] IRanges_2.4.6                            
 [9] S4Vectors_0.8.7                          
[10] BiocGenerics_0.16.1                      

loaded via a namespace (and not attached):
 [1] XVector_0.10.0             zlibbioc_1.16.0           
 [3] GenomicAlignments_1.6.3    BiocParallel_1.4.3        
 [5] tools_3.2.3                SummarizedExperiment_1.0.2
 [7] DBI_0.3.1                  lambda.r_1.1.7            
 [9] futile.logger_1.4.1        rtracklayer_1.30.1        
[11] futile.options_1.0.0       bitops_1.0-6              
[13] RCurl_1.95-4.7             biomaRt_2.26.1            
[15] RSQLite_1.0.0              Biostrings_2.38.3         
[17] Rsamtools_1.22.0           XML_3.98-1.3 
genomicfeatures • 4.8k views
@michael-lawrence-3846
Last seen 2.4 years ago
United States

As the warning indicates, the file declares GFF version 3, not 2, so it is a GFF3 file rather than a GTF file. That agrees with the .gff file extension, so you probably shouldn't pass "gtf" as the format.
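If you want to confirm what the file itself declares before choosing a format, a minimal base-R sketch follows. The helper name `gff_major_version` and the toy file are made up for illustration; the only thing it relies on is the `##gff-version` directive mentioned in the warning.

```r
# Hypothetical helper (not part of any package): read the declared GFF major
# version from the "##gff-version" directive near the top of a file.
gff_major_version <- function(path, n = 50L) {
  header <- readLines(path, n = n, warn = FALSE)
  directive <- grep("^##gff-version", header, value = TRUE)
  if (length(directive) == 0L) return(NA_integer_)  # no directive found
  digits <- sub("^##gff-version[[:space:]]+([0-9]+).*$", "\\1", directive[1L])
  suppressWarnings(as.integer(digits))              # NA if not a number
}

# Toy file standing in for the real annotation (contents are invented).
toy <- tempfile(fileext = ".gff")
writeLines(c("##gff-version 3",
             "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene0"), toy)
gff_major_version(toy)
```

If this reports 3 for your file, passing `format = "gff3"` to `makeTxDbFromGFF()` (or leaving the format argument out) would be the consistent choice.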


Hi Michael, thanks for the reply. However, I got another error when running it like this:

txdb <- makeTxDbFromGFF(gtffile)

ERROR

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .merge_transcript_parts(transcripts) : 
  The following transcripts have multiple parts that cannot be merged
  because of incompatible strand: gene31004

Looks like a case of trans-splicing. That does not appear to be supported by GenomicFeatures yet. Herve?
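To see which features are involved, here is a rough base-R sketch that lists GFF3 IDs whose rows sit on more than one strand. It assumes the offending features show up as several rows sharing one ID, which has not been verified against the NCBI file. The helper `mixed_strand_ids` and the toy rows (including their coordinates) are invented for illustration.

```r
# Hypothetical helper: list GFF3 IDs whose rows sit on more than one strand.
mixed_strand_ids <- function(path) {
  gff <- read.delim(path, header = FALSE, comment.char = "#", quote = "",
                    colClasses = "character",
                    col.names = c("seqid", "source", "type", "start", "end",
                                  "score", "strand", "phase", "attributes"))
  # Keep only rows that carry an ID attribute, then pull the ID value out.
  has_id <- grepl("(^|;)ID=", gff$attributes)
  id <- sub("^(.*;)?ID=([^;]+).*$", "\\2", gff$attributes[has_id])
  # Count the distinct strands seen for each ID.
  n_strands <- vapply(split(gff$strand[has_id], id),
                      function(s) length(unique(s)), integer(1))
  as.character(names(n_strands)[n_strands > 1L])
}

# Toy GFF3: gene31004 has one part on each strand, gene1 is ordinary.
toy <- tempfile(fileext = ".gff")
writeLines(c("##gff-version 3",
             "chrC\tsrc\tgene\t100\t200\t.\t-\t.\tID=gene31004",
             "chrC\tsrc\tgene\t900\t950\t.\t+\t.\tID=gene31004",
             "chrC\tsrc\tgene\t300\t400\t.\t+\t.\tID=gene1"), toy)
mixed_strand_ids(toy)
```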

Johannes Rainer ★ 2.0k
@johannes-rainer-6987
Last seen 4 weeks ago
Italy

Out of curiosity I tried to build an EnsDb for that plant from a GTF from Ensembl. I downloaded the GTF from ftp.ensemblgenomes.org/pub/release-30/plants/gtf/solanum_lycopersicum

That's obviously not the same version you're looking for, but you might find one matching your version on Ensembl.

To build an EnsDb:

> library(ensembldb)

> gtfFile <- "Solanum_lycopersicum.GCA_000188115.2.30.chr.gtf.gz"

## Create a EnsDb database file from that GTF

> dbFile <- ensDbFromGtf(gtf=gtfFile, organism="Solanum_lycopersicum", genomeVersion="GCA_000188115.2", version=30)

> edb <- EnsDb(dbFile)
> edb
EnsDb for Ensembl:
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.0.1
|Creation time: Thu Jan 21 08:40:36 2016
|ensembl_version: 30
|ensembl_host: unknown
|Organism: Solanum_lycopersicum
|genome_build: GCA_000188115.2
|DBSCHEMAVERSION: 1.0
|source_file: Solanum_lycopersicum.GCA_000188115.2.30.chr.gtf.gz
| No. of genes: 38633.
| No. of transcripts: 38633.
> seqinfo(edb)
Seqinfo object with 12 sequences from GCA_000188115.2 genome:
  seqnames seqlengths isCircular          genome
  1          98543444       <NA> GCA_000188115.2
  9          72482091       <NA> GCA_000188115.2
  3          70787664       <NA> GCA_000188115.2
  7          68045021       <NA> GCA_000188115.2
  12         67145203       <NA> GCA_000188115.2
  ...             ...        ...             ...
  8          65866657       <NA> GCA_000188115.2
  10         65527505       <NA> GCA_000188115.2
  11         56302525       <NA> GCA_000188115.2
  2          55340444       <NA> GCA_000188115.2
  6          49751636       <NA> GCA_000188115.2

 

So it basically works. However, I do not have any test that would verify that the EnsDb is "correct". It just extracts exons, transcripts and genes from the GTF and stores them in the SQLite database. Trans-splicing would not be supported, i.e. a transcript and all of its exons are expected to have the same strand and seqname as the gene.
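A rough way to check that expectation on a GTF before building the database is sketched below in base R. It only groups rows by their `gene_id` attribute and reports genes seen on more than one strand or seqname. The helper `inconsistent_genes` and the toy rows are invented for illustration and are not part of ensembldb.

```r
# Hypothetical helper: report gene_ids whose GTF rows span more than one
# strand or more than one seqname.
inconsistent_genes <- function(path) {
  gtf <- read.delim(path, header = FALSE, comment.char = "#", quote = "",
                    colClasses = "character",
                    col.names = c("seqname", "source", "feature", "start",
                                  "end", "score", "strand", "frame",
                                  "attributes"))
  # Keep rows with a gene_id attribute and extract its quoted value.
  has_gene <- grepl('gene_id "', gtf$attributes, fixed = TRUE)
  gene_id <- sub('^.*gene_id "([^"]+)".*$', "\\1", gtf$attributes[has_gene])
  # One "seqname strand" label per row; a consistent gene has exactly one.
  location <- paste(gtf$seqname[has_gene], gtf$strand[has_gene])
  n_loc <- vapply(split(location, gene_id),
                  function(x) length(unique(x)), integer(1))
  as.character(names(n_loc)[n_loc > 1L])
}

# Toy GTF: geneB has an exon on the opposite strand from its gene row.
toy <- tempfile(fileext = ".gtf")
writeLines(c('1\tensembl\tgene\t100\t500\t.\t+\t.\tgene_id "geneA";',
             '1\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "txA";',
             '1\tensembl\tgene\t700\t900\t.\t-\t.\tgene_id "geneB";',
             '1\tensembl\texon\t700\t800\t.\t+\t.\tgene_id "geneB"; transcript_id "txB";'),
           toy)
inconsistent_genes(toy)
```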

best, jo

 


Can GTF support trans-splicing?


I am also having the same problem for Cannabis sativa:

TxDb <- makeTxDbFromGFF(file = "./GCF_900626175.2_cs10_genomic.gff")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .merge_transcript_parts(transcripts) : 
  The following transcripts have multiple parts that cannot be merged
  because of incompatible strand: gene-A5N79_gp01, gene-A5N79_gp24

Trans-splicing seems to be a legitimate biological process (https://en.wikipedia.org/wiki/Trans-splicing#:~:text=Trans%2Dsplicing%20is%20a%20special,half%2Dgenes%22%20for%20tRNAs.). Is it still not supported?


What should be done with GTF/GFF files in these cases? Remove the transcripts with trans-splicing?
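Nobody in this thread confirms a recommended fix. If dropping the affected features is acceptable for your analysis, one pragmatic option is to write a copy of the GFF3 without the IDs named in the error message and build the TxDb from that copy. The base-R sketch below is only that: the helper `drop_gff3_features` and the toy rows are invented for illustration, it removes only rows whose ID or direct Parent matches, and the dropped genes will of course be missing from the result.

```r
# Hypothetical helper: copy a GFF3 file, skipping rows whose ID or Parent
# is one of `ids`. Only direct children are dropped, not grandchildren.
drop_gff3_features <- function(in_path, out_path, ids) {
  lines <- readLines(in_path, warn = FALSE)
  # Column 9 holds the attributes; comment and short lines get "".
  attrs <- vapply(strsplit(lines, "\t", fixed = TRUE),
                  function(f) if (length(f) >= 9L) f[9L] else "",
                  character(1))
  # Extract the value of one attribute key, "" where the key is absent.
  attr_values <- function(key) {
    hit <- grepl(paste0("(^|;)", key, "="), attrs)
    out <- rep("", length(attrs))
    out[hit] <- sub(paste0("^(.*;)?", key, "=([^;]+).*$"), "\\2", attrs[hit])
    out
  }
  id <- attr_values("ID")
  parents <- strsplit(attr_values("Parent"), ",", fixed = TRUE)
  drop <- id %in% ids |
    vapply(parents, function(p) any(p %in% ids), logical(1))
  writeLines(lines[!drop], out_path)
  invisible(sum(drop))  # number of rows removed
}

# Toy GFF3 (invented rows); the ID to drop is taken from the error message.
toy <- tempfile(fileext = ".gff")
out <- tempfile(fileext = ".gff")
writeLines(c("##gff-version 3",
             "chrC\tsrc\tgene\t100\t200\t.\t-\t.\tID=gene31004",
             "chrC\tsrc\texon\t100\t200\t.\t-\t.\tID=id1;Parent=gene31004",
             "chrC\tsrc\tgene\t300\t400\t.\t+\t.\tID=gene1"), toy)
n_dropped <- drop_gff3_features(toy, out, ids = "gene31004")
readLines(out)
```

The filtered copy could then be passed to `makeTxDbFromGFF()` in place of the original file.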
