Dear All,
I want to prepare a TxDb object from a gff3 file using the makeTxDbFromGFF function. The gff3 file contains the following lines:
###
Pt Araport11 gene 293 1522 . - . ID=ATCG00020;Name=psbA
Pt Araport11 mRNA 293 1522 . - . ID=ATCG00020.1;Parent=ATCG00020;Name=psbA
Pt Araport11 CDS 383 1444 . - 0 ID=ATCG00020:CDS:1;Parent=ATCG00020.1;Name=psbA:CDS:1
Pt Araport11 exon 293 1522 . - . ID=ATCG00020:exon:1;Parent=ATCG00020.1;Name=psbA:exon:1
Pt Araport11 protein 383 1444 . - . ID=ATCG00020.1-Protein;Name=D1;Derives_from=ATCG00020.1
###
Pt Araport11 gene 6853 7758 . + . ID=ATCG00070;Name=psbK
Pt Araport11 gene 6853 7758 . + . ID=ATCG00080;Name=psbI
Pt Araport11 mRNA 6853 7758 . + . ID=ATCG00070.1-ATCG00080.1;Parent=ATCG00070,ATCG00080;Name=psbK-psbI
Pt Araport11 exon 6853 7758 . + . ID=ATCG00070:exon:1;Parent=ATCG00070.1-ATCG00080.1;Name=psbK-psbI
Pt Araport11 CDS 7017 7202 . + 0 ID=ATCG00070:CDS:1;Parent=ATCG00070.1-ATCG00080.1;Name=psbK:CDS:1
Pt Araport11 CDS 7583 7693 . + 0 ID=ATCG00080:CDS:1;Parent=ATCG00070.1-ATCG00080.1;Name=psbI:CDS:1
###
There are two mRNAs. The first encodes one CDS, while the second contains two CDSs (a bicistronic transcript). The gff3 file (especially the second mRNA with two CDSs) was prepared according to the GFF3 specification at https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
When I run the following commands:
library(rtracklayer); library(GenomicFeatures)
txdb <- makeTxDbFromGFF("path_to_gff3_file.gff", format = "gff3")
the following message appears:
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... The following transcripts have exons that contain more than one CDS (only the first CDS was kept for each exon): ATCG00070.1-ATCG00080.1
OK
And indeed, the resulting TxDb contains only one CDS for the second transcript (ATCG00070.1-ATCG00080.1).
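As a cross-check that the second CDS is dropped while the TxDb is built rather than while the file is read, here is a minimal sketch that parses the bicistronic locus with rtracklayer alone and counts the CDS records per parent transcript. It is not a solution to the TxDb problem. The helper `gff_line`, the temporary file, and the added `##gff-version 3` header are assumptions made only for this illustration; the records themselves are copied from the file above.

```r
library(rtracklayer)

# Hypothetical helper for this sketch: build one tab-delimited GFF3 record
# on seqid "Pt" with source "Araport11".
gff_line <- function(type, start, end, strand, phase, attrs) {
  paste("Pt", "Araport11", type, start, end, ".", strand, phase, attrs,
        sep = "\t")
}

# The bicistronic locus from the file above, written to a temporary file.
records <- c(
  "##gff-version 3",
  gff_line("gene", 6853, 7758, "+", ".", "ID=ATCG00070;Name=psbK"),
  gff_line("gene", 6853, 7758, "+", ".", "ID=ATCG00080;Name=psbI"),
  gff_line("mRNA", 6853, 7758, "+", ".",
           "ID=ATCG00070.1-ATCG00080.1;Parent=ATCG00070,ATCG00080;Name=psbK-psbI"),
  gff_line("exon", 6853, 7758, "+", ".",
           "ID=ATCG00070:exon:1;Parent=ATCG00070.1-ATCG00080.1;Name=psbK-psbI"),
  gff_line("CDS", 7017, 7202, "+", "0",
           "ID=ATCG00070:CDS:1;Parent=ATCG00070.1-ATCG00080.1;Name=psbK:CDS:1"),
  gff_line("CDS", 7583, 7693, "+", "0",
           "ID=ATCG00080:CDS:1;Parent=ATCG00070.1-ATCG00080.1;Name=psbI:CDS:1")
)
gff_path <- tempfile(fileext = ".gff3")
writeLines(records, gff_path)

# Read the raw features; no TxDb is involved at this point.
gff <- import(gff_path, format = "gff3")
cds <- gff[gff$type == "CDS"]

# Parent is a list-like column; this sketch assumes one parent per CDS.
stopifnot(all(elementNROWS(cds$Parent) == 1))
cds_parent <- unlist(cds$Parent, use.names = FALSE)

# Group the CDS ranges by their parent transcript and count them.
cds_by_tx <- split(cds, cds_parent)
print(elementNROWS(cds_by_tx))
```

If both CDS records are listed under ATCG00070.1-ATCG00080.1 here, the file itself is read correctly and the loss happens in the exon-to-CDS assignment reported in the message above.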
So my question is: is it possible to represent polycistronic transcripts in a TxDb object, and if so, how should the gff file be prepared or loaded?
Thank you for your help!
Piotr
sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=pl_PL.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=pl_PL.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=pl_PL.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomicFeatures_1.28.4 AnnotationDbi_1.38.1 Biobase_2.36.2 BSgenome_1.44.0 rtracklayer_1.36.4 Biostrings_2.44.1 XVector_0.16.0
[8] GenomicRanges_1.28.4 GenomeInfoDb_1.12.2 IRanges_2.10.2 S4Vectors_0.14.3 BiocGenerics_0.22.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.12 compiler_3.4.2 bitops_1.0-6 tools_3.4.2 zlibbioc_1.22.0 biomaRt_2.32.1
[7] digest_0.6.12 bit_1.1-12 RSQLite_2.0 memoise_1.1.0 tibble_1.3.4 lattice_0.20-35
[13] pkgconfig_2.0.1 rlang_0.1.2 Matrix_1.2-11 DelayedArray_0.2.7 DBI_0.7 yaml_2.1.14
[19] GenomeInfoDbData_0.99.0 knitr_1.17 bit64_0.9-7 grid_3.4.2 XML_3.98-1.9 BiocParallel_1.10.1
[25] blob_1.1.0 Rsamtools_1.28.0 matrixStats_0.52.2 GenomicAlignments_1.12.1 SummarizedExperiment_1.6.3 RCurl_1.95-4.8