Dear all,
I have trouble reading GTF files with branchpointer::gtfToExons. While the supplied example file (gencode.v26.annotation.small.gtf) can be read, my own GTF files, or any modified version of the example file, lead to "Error: subscript contains invalid names". For example, keeping only the gene_id and transcript_id attributes from the example file renders it unreadable. I suspect that gtfToExons relies on specific attributes in the group/attribute field, but I cannot pinpoint which ones. I work with non-model organisms and can only provide transcript-exon information with non-public identifiers. GFF3 files cannot be read either.
An example of a minimal GTF file that cannot be read:
chr1 gmap transcript 1 1000 . + . transcript_id "tx1";
chr1 gmap exon 100 900 . + 0 transcript_id "tx1";
Any hints on how to construct my GTF files?
library(branchpointer)
exons <- gtfToExons("minimal.gtf")
Error: subscript contains invalid names
sessionInfo():
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] branchpointer_1.18.0 caret_6.0-88 ggplot2_3.3.5
[4] lattice_0.20-44
loaded via a namespace (and not attached):
[1] nlme_3.1-153 matrixStats_0.60.1
[3] bitops_1.0-7 lubridate_1.7.10
[5] bit64_4.0.5 filelock_1.0.2
[7] progress_1.2.2 httr_1.4.2
[9] GenomeInfoDb_1.28.4 tools_4.1.0
[11] utf8_1.2.2 R6_2.5.1
[13] rpart_4.1-15 DBI_1.1.1
[15] BiocGenerics_0.38.0 colorspace_2.0-2
[17] nnet_7.3-16 withr_2.4.2
[19] gbm_2.1.8 tidyselect_1.1.1
[21] prettyunits_1.1.1 bit_4.0.4
[23] curl_4.3.2 compiler_4.1.0
[25] Biobase_2.52.0 xml2_1.3.2
[27] DelayedArray_0.18.0 rtracklayer_1.52.1
[29] scales_1.1.1 rappdirs_0.3.3
[31] Rsamtools_2.8.0 stringr_1.4.0
[33] digest_0.6.27 XVector_0.32.0
[35] pkgconfig_2.0.3 parallelly_1.28.1
[37] MatrixGenerics_1.4.3 BSgenome_1.60.0
[39] dbplyr_2.1.1 fastmap_1.1.0
[41] rlang_0.4.11 rstudioapi_0.13
[43] RSQLite_2.2.8 BiocIO_1.2.0
[45] generics_0.1.0 BiocParallel_1.26.2
[47] dplyr_1.0.7 ModelMetrics_1.2.2.2
[49] RCurl_1.98-1.5 magrittr_2.0.1
[51] GenomeInfoDbData_1.2.6 Matrix_1.3-4
[53] Rcpp_1.0.7 munsell_0.5.0
[55] S4Vectors_0.30.0 fansi_0.5.0
[57] lifecycle_1.0.0 yaml_2.2.1
[59] stringi_1.7.4 pROC_1.18.0
[61] SummarizedExperiment_1.22.0 MASS_7.3-54
[63] zlibbioc_1.38.0 plyr_1.8.6
[65] recipes_0.1.16 BiocFileCache_2.0.0
[67] grid_4.1.0 blob_1.2.2
[69] parallel_4.1.0 listenv_0.8.0
[71] crayon_1.4.1 cowplot_1.1.1
[73] Biostrings_2.60.2 splines_4.1.0
[75] hms_1.1.0 KEGGREST_1.32.0
[77] BSgenome.Hsapiens.UCSC.hg38_1.4.3 pillar_1.6.2
[79] GenomicRanges_1.44.0 rjson_0.2.20
[81] future.apply_1.8.1 reshape2_1.4.4
[83] codetools_0.2-18 biomaRt_2.48.3
[85] stats4_4.1.0 XML_3.99-0.8
[87] glue_1.4.2 data.table_1.14.0
[89] png_0.1-7 vctrs_0.3.8
[91] foreach_1.5.1 gtable_0.3.0
[93] purrr_0.3.4 kernlab_0.9-29
[95] future_1.22.1 assertthat_0.2.1
[97] cachem_1.0.6 gower_0.2.2
[99] prodlim_2019.11.13 restfulr_0.0.13
[101] class_7.3-19 survival_3.2-13
[103] timeDate_3043.102 tibble_3.1.4
[105] iterators_1.0.13 GenomicAlignments_1.28.0
[107] AnnotationDbi_1.54.1 memoise_2.0.0
[109] IRanges_2.26.0 lava_1.6.10
[111] globals_0.14.0 ellipsis_0.3.2
[113] ipred_0.9-12
Hi Frank,
Your example GTF is missing a gene_id attribute. The old code also required a transcript_type/transcript_biotype attribute and a gene_type/gene_biotype attribute. The code on GitHub (betsig/branchpointer) has been updated so that these are no longer required.
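If you need to stay on the released version, one possible workaround is to add the missing attributes to the GTF yourself before calling gtfToExons. The following is only a rough sketch in base R, not part of branchpointer: add_gtf_attributes is a hypothetical helper, it assumes one gene per transcript (gene_id is copied from transcript_id when absent), and the "protein_coding" biotype is a placeholder you would need to replace with something sensible for your data. I have not checked it against every version of the package.

```r
# Hypothetical helper (not part of branchpointer): adds gene_id, gene_type
# and transcript_type to the attribute column of GTF lines that lack them.
# Assumptions: columns are tab-separated, each transcript is its own gene,
# and `biotype` is a placeholder value chosen by the user.
add_gtf_attributes <- function(gtf_lines, biotype = "protein_coding") {
  has_key <- function(attrs, key_pattern) {
    grepl(paste0("(^|;[[:space:]]*)", key_pattern, " "), attrs)
  }
  vapply(gtf_lines, function(line) {
    # Leave comment and empty lines untouched.
    if (!nzchar(line) || startsWith(line, "#")) return(line)
    fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
    if (length(fields) < 9) stop("Expected 9 tab-separated columns: ", line)
    attrs <- trimws(fields[9])
    if (!endsWith(attrs, ";")) attrs <- paste0(attrs, ";")
    # Derive gene_id from transcript_id when it is missing.
    if (!has_key(attrs, "gene_id")) {
      tx <- regmatches(attrs, regexec('transcript_id "([^"]+)"', attrs))[[1]]
      if (length(tx) < 2) stop("No transcript_id to derive gene_id from: ", line)
      attrs <- paste0('gene_id "', tx[2], '"; ', attrs)
    }
    # Append placeholder biotypes unless a *_type or *_biotype key exists.
    if (!has_key(attrs, "gene_(type|biotype)")) {
      attrs <- paste0(attrs, ' gene_type "', biotype, '";')
    }
    if (!has_key(attrs, "transcript_(type|biotype)")) {
      attrs <- paste0(attrs, ' transcript_type "', biotype, '";')
    }
    fields[9] <- attrs
    paste(fields, collapse = "\t")
  }, character(1), USE.NAMES = FALSE)
}

# Demo on the two lines from the question (written here with real tabs).
minimal <- c(
  "chr1\tgmap\ttranscript\t1\t1000\t.\t+\t.\ttranscript_id \"tx1\";",
  "chr1\tgmap\texon\t100\t900\t.\t+\t0\ttranscript_id \"tx1\";"
)
cat(add_gtf_attributes(minimal), sep = "\n")
```

To apply it to a file, you could read it with readLines("minimal.gtf"), pass the lines through the helper, write the result with writeLines to a new file, and then call gtfToExons on that new file. Note that GTF columns must be separated by tabs, not spaces.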