The GenomicFeatures package (> v1.20) provides a "tx_type" column in the transcript table of TxDb objects.
I want to read a GTF file that includes the transcript_biotype attribute. As an example, I downloaded and unzipped a GTF from Ensembl: ftp://ftp.ensembl.org/pub/release-82/gtf/homo_sapiens/Homo_sapiens.GRCh38.82.gtf.gz .
Here is an extract:
1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";
However, I don't get a tx_type such as mRNA or snoRNA. Instead, the tx_type column is filled with the word "transcript".
My example:
> txdb <- GenomicFeatures::makeTxDbFromGFF("~/data/Homo_sapiens.GRCh38.82.gtf", format="gtf")
> tx <- GenomicFeatures::transcripts(txdb, columns=c("tx_name", "tx_type"))
> head(tx)
GRanges object with 6 ranges and 2 metadata columns:
      seqnames         ranges strand |         tx_name     tx_type
         <Rle>      <IRanges>  <Rle> |     <character> <character>
  [1]        1 [11869, 14409]      + | ENST00000456328  transcript
  [2]        1 [12010, 13670]      + | ENST00000450305  transcript
  [3]        1 [29554, 31097]      + | ENST00000473358  transcript
  [4]        1 [30267, 31109]      + | ENST00000469289  transcript
  [5]        1 [30366, 30503]      + | ENST00000607096  transcript
  [6]        1 [52473, 53312]      + | ENST00000606857  transcript
  -------
  seqinfo: 59 sequences (1 circular) from an unspecified genome; no seqlengths
Looking at the code:
rtracklayer::import is used to read the GTF, but only the columns "type", "gene_id", "transcript_id" and "exon_id" are returned. Here "type" refers to the third column of the GTF, i.e. the feature type. Maybe I am wrong, but this column never contains transcript biotype information; in my file the biotype is stored as the transcript_biotype attribute in the ninth column.
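To illustrate what I mean, here is a minimal base-R sketch that takes one record (shortened from the extract above) and compares the third column with the transcript_biotype attribute. The helper get_attribute is a name I made up for this illustration; it is not part of GenomicFeatures or rtracklayer, and the tab-separated line is rebuilt by hand rather than read from the file.

```r
# One GTF record, shortened from the Ensembl extract; GTF fields are tab-separated.
gtf_line <- paste(
  "1", "havana", "transcript", "11869", "14409", ".", "+", ".",
  paste0('gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; ',
         'transcript_biotype "processed_transcript";'),
  sep = "\t"
)

fields <- strsplit(gtf_line, "\t", fixed = TRUE)[[1]]

# Column 3 is the feature type (gene, transcript, exon, CDS, ...).
feature_type <- fields[3]

# Hypothetical helper: pull the value of one key out of the attribute
# column (column 9), where entries look like: key "value";
get_attribute <- function(attributes, key) {
  pattern <- paste0('(^|;[[:space:]]*)', key, ' "([^"]*)"')
  hit <- regmatches(attributes, regexec(pattern, attributes))[[1]]
  # hit holds the full match plus the two capture groups when the key is found
  if (length(hit) < 3) NA_character_ else hit[3]
}

transcript_biotype <- get_attribute(fields[9], "transcript_biotype")

print(feature_type)
print(transcript_biotype)
```

For this record the third column only says "transcript", while the biotype I am interested in sits in the attribute column.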
My questions:
1) Is there something wrong with the way I make TxDbs from GTF, or did I misunderstand tx_type?
2) Why are only predefined tx_types accepted?
Thanks, Karolin
__________
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.32.0 XVector_0.10.0 GenomicRanges_1.22.1 BiocGenerics_0.16.1
[5] zlibbioc_1.16.0 GenomicAlignments_1.6.1 IRanges_2.4.4 BiocParallel_1.4.0
[9] GenomeInfoDb_1.6.1 tools_3.2.2 SummarizedExperiment_1.0.1 parallel_3.2.2
[13] Biobase_2.30.0 DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4.1
[17] rtracklayer_1.30.1 S4Vectors_0.8.3 futile.options_1.0.0 bitops_1.0-6
[21] RCurl_1.95-4.7 biomaRt_2.26.1 RSQLite_1.0.0 GenomicFeatures_1.22.5
[25] Biostrings_2.38.2 Rsamtools_1.22.0 stats4_3.2.2 XML_3.98-1.3