Question: makeTxDbFromGFF errors too many NAs and make.splicings
Karl Lundén (13 days ago) wrote:

 

Hi,

I'm trying to make TxDb objects from some GFF3 files for Picea abies from ConGenIE. Can you see any obvious reason for the errors below? Are these GFF3 files incompatible with makeTxDbFromGFF(), or are some updates needed?

Kind regards

Karl

 

> MYBtestTxDb<-makeTxDbFromGFF("ftp://plantgenie.org/Data/ConGenIE/Picea_abies/v1.0/GFF3/Gene_Prediction_Transcript_assemblies/Trinity_kmer10.gff3.gz")
Import genomic features from the file as a GRanges object ... trying URL 'ftp://plantgenie.org/Data/ConGenIE/Picea_abies/v1.0/GFF3/Gene_Prediction_Transcript_assemblies/Trinity_kmer10.gff3.gz'
Content type 'unknown' length 13520938 bytes (12.9 MB)
==================================================
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") : 
  solving row 58427: range cannot be determined from the supplied arguments (too many NAs)
> traceback()
11: .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges")
10: solveUserSEW0(start = start, end = end, width = width)
9: IRanges(ans_start, ans_end, names = ans_names)
8: makeGRangesFromDataFrame(df, seqnames.field = "seqid")
7: readGFFAsGRanges(con, version = version, colnames = colnames, 
       filter = list(type = feature.type), genome = genome, sequenceRegionsAsSeqinfo = sequenceRegionsAsSeqinfo, 
       speciesAsMetadata = TRUE)
6: .local(con, format, text, ...)
5: import(FileForFormat(con, format), ...)
4: import(FileForFormat(con, format), ...)
3: import(file, format = format, colnames = colnames, feature.type = GFF_FEATURE_TYPES)
2: import(file, format = format, colnames = colnames, feature.type = GFF_FEATURE_TYPES)
1: makeTxDbFromGFF("ftp://plantgenie.org/Data/ConGenIE/Picea_abies/v1.0/GFF3/Gene_Prediction_Transcript_assemblies/Trinity_kmer10.gff3.gz")

> txdB_gene2<- makeTxDbFromGFF("ftp://plantgenie.org/Data/ConGenIE/Picea_abies/v1.0/GFF3/Gene_Prediction_Transcript_assemblies/Pabies01b-gene.gff3.gz")
Import genomic features from the file as a GRanges object ... trying URL 'ftp://plantgenie.org/Data/ConGenIE/Picea_abies/v1.0/GFF3/Gene_Prediction_Transcript_assemblies/Pabies01b-gene.gff3.gz'
Content type 'unknown' length 5769965 bytes (5.5 MB)
==================================================
OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .make_splicings(exons, cds, stop_codons) : 
  some CDS cannot be mapped to an exon
> traceback()
4: stop(wmsg("some CDS cannot be mapped to an exon"))
3: .make_splicings(exons, cds, stop_codons)
2: makeTxDbFromGRanges(gr, metadata = metadata)
1: makeTxDbFromGFF("ftp://plantgenie.org/Data/ConGenIE/Picea_abies/v1.0/GFF3/Gene_Prediction_Transcript_assemblies/Pabies01b-gene.gff3.gz")

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BiocInstaller_1.30.0   GenomicFeatures_1.32.0 AnnotationDbi_1.42.1   Biobase_2.40.0         rtracklayer_1.40.3     GenomicRanges_1.32.3  
 [7] GenomeInfoDb_1.16.0    IRanges_2.14.10        S4Vectors_0.18.3       BiocGenerics_0.26.0   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17                compiler_3.5.0              XVector_0.20.0              prettyunits_1.0.2           bitops_1.0-6               
 [6] tools_3.5.0                 zlibbioc_1.26.0             progress_1.2.0              biomaRt_2.36.1              digest_0.6.15              
[11] bit_1.1-14                  RSQLite_2.1.1               memoise_1.1.0               lattice_0.20-35             pkgconfig_2.0.1            
[16] rlang_0.2.1                 Matrix_1.2-14               DelayedArray_0.6.1          DBI_1.0.0                   GenomeInfoDbData_1.1.0     
[21] httr_1.3.1                  stringr_1.3.1               Biostrings_2.48.0           hms_0.4.2                   bit64_0.9-7                
[26] grid_3.5.0                  R6_2.2.2                    XML_3.98-1.11               BiocParallel_1.14.1         magrittr_1.5               
[31] blob_1.1.1                  Rsamtools_1.32.0            matrixStats_0.53.1          GenomicAlignments_1.16.0    assertthat_0.2.0           
[36] SummarizedExperiment_1.10.1 stringi_1.2.3               RCurl_1.95-4.10             crayon_1.3.4               

 

 


Did you take a look at row 58427, as suggested in the error message? What did you see?

(Comment written 13 days ago by Malcolm Cook)

Hervé Pagès (4 days ago) wrote:

Hi Karl,

The issue with the 1st file (Trinity_kmer10.gff3.gz) is that it contains start/end values greater than 2^31-1. The problem was that, because these values cannot be stored in an R integer vector, makeTxDbFromGFF() was silently coercing them to NAs. I committed a change to rtracklayer (version 1.41.7, this is in BioC 3.8 only) so that makeTxDbFromGFF() now fails early with an informative error message when this happens:

library(rtracklayer)
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("Trinity_kmer10.gff3.gz")
# Import genomic features from the file as a GRanges object ... Error in
# readGFF(filepath, version = version, columns = columns, tags = tags,  : 
#   reading GFF file: line 58427 contains values greater than 2^31-1 
#   (= .Machine$integer.max) in column 4 (start) and/or 5 (end).
#   Bioconductor does not support such GFF files at the moment. Sorry!
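
To locate the offending lines yourself before importing, something like the following base R sketch may help. The helper name find_oversized_coords() and the toy file are hypothetical (they are not part of rtracklayer or GenomicFeatures), and the sketch assumes a tab-delimited GFF with no embedded FASTA section. It parses columns 4 and 5 as doubles, which avoids the integer overflow described above.

```r
# Values above .Machine$integer.max (2^31 - 1) cannot be stored as R integers.
overflow <- suppressWarnings(as.integer("3000000000"))  # NA, with a warning

# Hypothetical helper: report data lines whose start or end is too large.
find_oversized_coords <- function(path) {
  lines <- readLines(path)
  # Keep data lines only: skip blank lines and comment/pragma lines ("#...").
  is_data <- nzchar(lines) & !startsWith(lines, "#")
  fields <- strsplit(lines[is_data], "\t", fixed = TRUE)
  # Parse columns 4 (start) and 5 (end) as doubles rather than integers.
  start <- as.numeric(vapply(fields, `[`, character(1), 4))
  end   <- as.numeric(vapply(fields, `[`, character(1), 5))
  too_big <- start > .Machine$integer.max | end > .Machine$integer.max
  too_big <- !is.na(too_big) & too_big  # treat unparsable fields as "not too big"
  data.frame(line  = which(is_data)[too_big],  # line number within the file
             start = start[too_big],
             end   = end[too_big])
}

# Toy file (made up for illustration) with one oversized feature on line 3.
demo <- tempfile(fileext = ".gff3")
writeLines(c(
  "##gff-version 3",
  "contig1\tdemo\texon\t100\t200\t.\t+\t.\tID=e1",
  "contig1\tdemo\texon\t3000000000\t3000000100\t.\t+\t.\tID=e2"
), demo)
bad <- find_oversized_coords(demo)
print(bad)
```

For a real file, pass the path of a local copy of the GFF3 file in place of the toy file.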

The issue with the 2nd file (Pabies01b-gene.gff3) was that its CDS features have their Parent set to an exon instead of a transcript. Note that this is very unconventional and deviates from the well-established convention documented at: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
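
A quick way to see which feature type the CDS lines point to is to tabulate the type of each CDS's parent. The base R sketch below is only an illustration: parent_types_of() is a hypothetical helper, the toy file is made up, and the code assumes tab-delimited GFF3 where each feature has at most one Parent.

```r
# Hypothetical helper: tabulate the feature type of each child's Parent.
parent_types_of <- function(path, child_type = "CDS") {
  lines <- readLines(path)
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  fields <- strsplit(lines, "\t", fixed = TRUE)
  type  <- vapply(fields, `[`, character(1), 3)  # column 3: feature type
  attrs <- vapply(fields, `[`, character(1), 9)  # column 9: attributes
  # Extract the value of one tag (e.g. ID or Parent) from the attributes.
  get_tag <- function(tag) {
    has_tag <- grepl(paste0("(^|;)", tag, "="), attrs, perl = TRUE)
    value <- sub(paste0(".*(^|;)", tag, "=([^;]*).*"), "\\2", attrs, perl = TRUE)
    ifelse(has_tag, value, NA_character_)
  }
  id     <- get_tag("ID")
  parent <- get_tag("Parent")
  # Look up each child's parent by ID; a missing Parent never matches.
  idx <- match(parent[type == child_type], id, incomparables = NA)
  table(type[idx], useNA = "ifany")
}

# Toy file mimicking the unconventional layout: the CDS points to an exon.
demo <- tempfile(fileext = ".gff3")
writeLines(c(
  "##gff-version 3",
  "contig1\tdemo\tgene\t100\t200\t.\t+\t.\tID=g1",
  "contig1\tdemo\tmRNA\t100\t200\t.\t+\t.\tID=t1;Parent=g1",
  "contig1\tdemo\texon\t100\t200\t.\t+\t.\tID=e1;Parent=t1",
  "contig1\tdemo\tCDS\t100\t200\t.\t+\t0\tID=c1;Parent=e1"
), demo)
tab <- parent_types_of(demo)
print(tab)
```

In a conventionally structured file, the table would be expected to list transcript types such as mRNA rather than exon.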

I committed a change to GenomicFeatures (version 1.33.4, also in BioC 3.8 only) so that makeTxDbFromGFF() now supports such files:

txdb <- makeTxDbFromGFF("Pabies01b-gene.gff3.gz")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK

Note that in this file, the CDS and exons are actually the same (i.e. same genomic ranges):

> all(cds(txdb) == exons(txdb))
[1] TRUE

Both rtracklayer 1.41.7 and GenomicFeatures 1.33.4 should become available to BioC 3.8 users via BiocManager::install() in the next 24 hours or so.
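
To check whether an installation already includes these changes, a version comparison along the following lines should work. This is a minimal sketch: has_fix() is a hypothetical helper built on base R's package_version(), and the two "installed" versions in the example are the ones from the sessionInfo() in the question.

```r
# Minimum versions named in this answer.
required <- c(rtracklayer = "1.41.7", GenomicFeatures = "1.33.4")

# Hypothetical helper: TRUE if `installed` is at least `minimum`.
has_fix <- function(installed, minimum) {
  package_version(installed) >= package_version(minimum)
}

# Versions from the question's sessionInfo(): both predate the changes.
old_rtracklayer     <- has_fix("1.40.3", required[["rtracklayer"]])
old_genomicfeatures <- has_fix("1.32.0", required[["GenomicFeatures"]])

# Check the local library, skipping packages that are not installed.
for (pkg in names(required)) {
  if (nzchar(system.file(package = pkg))) {
    v <- as.character(packageVersion(pkg))
    status <- if (has_fix(v, required[[pkg]])) "includes" else "predates"
    message(pkg, " ", v, " ", status, " the change (needs >= ", required[[pkg]], ")")
  }
}
```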

Cheers,

H.

> sessionInfo()
R version 3.5.1 Patched (2018-08-01 r75051)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /home/hpages/R/R-3.5.r75051/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.5.r75051/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomicFeatures_1.33.4 AnnotationDbi_1.43.1   Biobase_2.41.2        
[4] rtracklayer_1.41.7     GenomicRanges_1.33.14  GenomeInfoDb_1.17.2   
[7] IRanges_2.15.18        S4Vectors_0.19.22      BiocGenerics_0.27.1   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19                compiler_3.5.1             
 [3] XVector_0.21.4              prettyunits_1.0.2          
 [5] bitops_1.0-6                tools_3.5.1                
 [7] zlibbioc_1.27.0             progress_1.2.0             
 [9] biomaRt_2.37.8              digest_0.6.18              
[11] bit_1.1-14                  RSQLite_2.1.1              
[13] memoise_1.1.0               lattice_0.20-35            
[15] pkgconfig_2.0.2             rlang_0.2.2                
[17] Matrix_1.2-14               DelayedArray_0.7.47        
[19] DBI_1.0.0                   GenomeInfoDbData_1.2.0     
[21] httr_1.3.1                  stringr_1.3.1              
[23] Biostrings_2.49.2           hms_0.4.2                  
[25] bit64_0.9-7                 grid_3.5.1                 
[27] R6_2.3.0                    XML_3.98-1.16              
[29] BiocParallel_1.15.15        magrittr_1.5               
[31] blob_1.1.1                  Rsamtools_1.99.0           
[33] matrixStats_0.54.0          GenomicAlignments_1.17.3   
[35] assertthat_0.2.0            SummarizedExperiment_1.11.6
[37] stringi_1.2.4               RCurl_1.95-4.11            
[39] crayon_1.3.4

 
