branchpointer: trouble reading GTF files

I am having trouble reading GTF files with branchpointer::gtfToExons. The supplied example file (gencode.v26.annotation.small.gtf) can be read, but my own GTF files, or any modification of the example file, lead to "Error: subscript contains invalid names". For example, keeping only the gene_id and transcript_id attributes from the example file renders it unreadable. I suspect that gtfToExons relies on specific attributes in the group/attribute field, but I cannot pinpoint which ones. I work with non-model organisms and can only provide transcript-exon information with non-public identifiers. Also, GFF3 files cannot be read.

An example of a minimal GTF file that cannot be read:

chr1    gmap    transcript      1       1000    .       +       .       transcript_id "tx1";
chr1    gmap    exon    100     900     .       +       0       transcript_id "tx1";

Any hints on how to construct my GTF files?

exons <- gtfToExons("minimal.gtf")

Error: subscript contains invalid names

sessionInfo():
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] branchpointer_1.18.0 caret_6.0-88         ggplot2_3.3.5       
[4] lattice_0.20-44     

loaded via a namespace (and not attached):
  [1] nlme_3.1-153                      matrixStats_0.60.1               
  [3] bitops_1.0-7                      lubridate_1.7.10                 
  [5] bit64_4.0.5                       filelock_1.0.2                   
  [7] progress_1.2.2                    httr_1.4.2                       
  [9] GenomeInfoDb_1.28.4               tools_4.1.0                      
 [11] utf8_1.2.2                        R6_2.5.1                         
 [13] rpart_4.1-15                      DBI_1.1.1                        
 [15] BiocGenerics_0.38.0               colorspace_2.0-2                 
 [17] nnet_7.3-16                       withr_2.4.2                      
 [19] gbm_2.1.8                         tidyselect_1.1.1                 
 [21] prettyunits_1.1.1                 bit_4.0.4                        
 [23] curl_4.3.2                        compiler_4.1.0                   
 [25] Biobase_2.52.0                    xml2_1.3.2                       
 [27] DelayedArray_0.18.0               rtracklayer_1.52.1               
 [29] scales_1.1.1                      rappdirs_0.3.3                   
 [31] Rsamtools_2.8.0                   stringr_1.4.0                    
 [33] digest_0.6.27                     XVector_0.32.0                   
 [35] pkgconfig_2.0.3                   parallelly_1.28.1                
 [37] MatrixGenerics_1.4.3              BSgenome_1.60.0                  
 [39] dbplyr_2.1.1                      fastmap_1.1.0                    
 [41] rlang_0.4.11                      rstudioapi_0.13                  
 [43] RSQLite_2.2.8                     BiocIO_1.2.0                     
 [45] generics_0.1.0                    BiocParallel_1.26.2              
 [47] dplyr_1.0.7                       ModelMetrics_1.2.2.2             
 [49] RCurl_1.98-1.5                    magrittr_2.0.1                   
 [51] GenomeInfoDbData_1.2.6            Matrix_1.3-4                     
 [53] Rcpp_1.0.7                        munsell_0.5.0                    
 [55] S4Vectors_0.30.0                  fansi_0.5.0                      
 [57] lifecycle_1.0.0                   yaml_2.2.1                       
 [59] stringi_1.7.4                     pROC_1.18.0                      
 [61] SummarizedExperiment_1.22.0       MASS_7.3-54                      
 [63] zlibbioc_1.38.0                   plyr_1.8.6                       
 [65] recipes_0.1.16                    BiocFileCache_2.0.0              
 [67] grid_4.1.0                        blob_1.2.2                       
 [69] parallel_4.1.0                    listenv_0.8.0                    
 [71] crayon_1.4.1                      cowplot_1.1.1                    
 [73] Biostrings_2.60.2                 splines_4.1.0                    
 [75] hms_1.1.0                         KEGGREST_1.32.0                  
 [77] BSgenome.Hsapiens.UCSC.hg38_1.4.3 pillar_1.6.2                     
 [79] GenomicRanges_1.44.0              rjson_0.2.20                     
 [81] future.apply_1.8.1                reshape2_1.4.4                   
 [83] codetools_0.2-18                  biomaRt_2.48.3                   
 [85] stats4_4.1.0                      XML_3.99-0.8                     
 [87] glue_1.4.2                        data.table_1.14.0                
 [89] png_0.1-7                         vctrs_0.3.8                      
 [91] foreach_1.5.1                     gtable_0.3.0                     
 [93] purrr_0.3.4                       kernlab_0.9-29                   
 [95] future_1.22.1                     assertthat_0.2.1                 
 [97] cachem_1.0.6                      gower_0.2.2                      
 [99] prodlim_2019.11.13                restfulr_0.0.13                  
[101] class_7.3-19                      survival_3.2-13                  
[103] timeDate_3043.102                 tibble_3.1.4                     
[105] iterators_1.0.13                  GenomicAlignments_1.28.0         
[107] AnnotationDbi_1.54.1              memoise_2.0.0                    
[109] IRanges_2.26.0                    lava_1.6.10                      
[111] globals_0.14.0                    ellipsis_0.3.2                   
[113] ipred_0.9-12
Hi Frank,

Your example GTF is missing a gene_id. In the old code we also required a transcript_type/transcript_biotype and a gene_type/gene_biotype. The code on GitHub (betsig/branchpointer) has been updated so these are no longer required.
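
As a rough illustration of the above, here is a base-R sketch that rewrites the two records from the question with the attributes named in this answer. The attribute values ("gene1", "protein_coding") and the output file name are placeholders chosen for the example, not real annotations, and I have not checked that this file is sufficient for every released version of gtfToExons.

```r
# Attribute column carrying the keys named in the answer.
# All values here are hypothetical placeholders.
attrs <- paste0(
  'gene_id "gene1"; ',
  'transcript_id "tx1"; ',
  'gene_type "protein_coding"; ',
  'transcript_type "protein_coding";'
)

# Columns 1-8 of the two records from the question.
fields <- rbind(
  c("chr1", "gmap", "transcript", "1",   "1000", ".", "+", "."),
  c("chr1", "gmap", "exon",       "100", "900",  ".", "+", "0")
)

# Join columns with tabs and append the attribute column as column 9.
gtf_lines <- paste(apply(fields, 1, paste, collapse = "\t"), attrs, sep = "\t")

# Hypothetical output location; adjust as needed.
out_file <- file.path(tempdir(), "minimal_fixed.gtf")
writeLines(gtf_lines, out_file)

# Quick check that every record carries each attribute key.
required <- c("gene_id", "transcript_id", "gene_type", "transcript_type")
has_all <- vapply(
  required,
  function(key) all(grepl(paste0(key, ' "'), gtf_lines, fixed = TRUE)),
  logical(1)
)
print(has_all)

# Not run here: requires branchpointer to be installed.
# exons <- branchpointer::gtfToExons(out_file)
#
# Installing the GitHub version mentioned above (assumes the 'remotes'
# package is available):
# remotes::install_github("betsig/branchpointer")
```

With the GitHub version, the gene_type and transcript_type entries should be droppable from `attrs`, going by the statement above that they are no longer required.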


