Question

Wrong tx_type when using GenomicFeatures::makeTxDbFromGFF on a GTF file

Karolin Wiedemann • 0

@karolin-wiedemann-9303

Last seen 10.2 years ago

Germany

The GenomicFeatures package (> v1.20) provides a "tx_type" column in the transcript table of TxDb objects.
I want to read a GTF file that includes the transcript_biotype attribute. As an example, I downloaded and unzipped a GTF from Ensembl: ftp://ftp.ensembl.org/pub/release-82/gtf/homo_sapiens/Homo_sapiens.GRCh38.82.gtf.gz .
Here is an extract:
1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

However, I don't get a tx_type such as mRNA, snoRNA, etc. Instead, the tx_type column is filled with the word "transcript".

My example:

> txdb <- GenomicFeatures::makeTxDbFromGFF("~/data/Homo_sapiens.GRCh38.82.gtf",format="gtf")
> tx <- GenomicFeatures::transcripts(txdb,column=c("tx_name","tx_type"))
> head(tx)
GRanges object with 6 ranges and 2 metadata columns:
      seqnames         ranges strand |         tx_name     tx_type
         <Rle>      <IRanges>  <Rle> |     <character> <character>
  [1]        1 [11869, 14409]      + | ENST00000456328  transcript
  [2]        1 [12010, 13670]      + | ENST00000450305  transcript
  [3]        1 [29554, 31097]      + | ENST00000473358  transcript
  [4]        1 [30267, 31109]      + | ENST00000469289  transcript
  [5]        1 [30366, 30503]      + | ENST00000607096  transcript
  [6]        1 [52473, 53312]      + | ENST00000606857  transcript
  -------
  seqinfo: 59 sequences (1 circular) from an unspecified genome; no seqlengths

Looking at the code:

rtracklayer::import is used to read the GTF, but only the columns "type", "gene_id", "transcript_id" and "exon_id" are returned. Here "type" corresponds to the third column of the GTF. Maybe I am wrong, but this column never contains transcript type information; in the extract above, the biotype is stored in the transcript_biotype attribute of the ninth column.
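To illustrate the difference between the two fields, here is a minimal base-R sketch that splits one GTF record (shortened from the extract above) into its columns and parses the attribute string. The helper parse_gtf_attributes() is a hypothetical name used only for this illustration; it is not part of GenomicFeatures or rtracklayer.

```r
# One GTF record, shortened from the extract above. The nine fields are
# tab-separated in the real file.
gtf_line <- paste(
  "1", "havana", "transcript", "11869", "14409", ".", "+", ".",
  paste('gene_id "ENSG00000223972";',
        'transcript_id "ENST00000456328";',
        'transcript_biotype "processed_transcript";'),
  sep = "\t"
)

fields <- strsplit(gtf_line, "\t", fixed = TRUE)[[1]]
feature_type <- fields[3]  # third column: the feature type
attr_field   <- fields[9]  # ninth column: key "value"; pairs

# Hypothetical helper: turn the attribute string into a named character vector.
parse_gtf_attributes <- function(attr_string) {
  # Each pair looks like: key "value"
  pairs  <- regmatches(attr_string, gregexpr('[^ ;]+ "[^"]*"', attr_string))[[1]]
  keys   <- sub(' ".*$', "", pairs)
  values <- sub('^[^ ]+ "(.*)"$', "\\1", pairs)
  setNames(values, keys)
}

attrs <- parse_gtf_attributes(attr_field)

feature_type                    # "transcript"
attrs[["transcript_biotype"]]   # "processed_transcript"
```

The third column only says that the record is a transcript, while the biotype has to be taken from the attributes.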

My questions:
1) Is there something wrong in the way I make TxDbs from GTF, or have I misunderstood what tx_type means?

2) Why are only predefined tx_types accepted?
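For reference, a possible workaround is sketched below. It assumes that rtracklayer::import returns the GTF attributes (including transcript_biotype) as metadata columns, and that tx_name in the TxDb matches transcript_id in the file. read_tx_biotype() is a hypothetical helper name, not an existing function.

```r
# Sketch: look up transcript_biotype per transcript directly from the GTF.
# read_tx_biotype() is a hypothetical helper, not part of any package.
read_tx_biotype <- function(gtf_path) {
  gtf <- rtracklayer::import(gtf_path, format = "gtf")
  # Keep only the "transcript" records (third GTF column).
  gtf <- gtf[gtf$type == "transcript"]
  # Named vector: transcript_id -> transcript_biotype.
  setNames(as.character(gtf$transcript_biotype), gtf$transcript_id)
}

# Usage with the objects from my example (not run here):
# biotype <- read_tx_biotype("~/data/Homo_sapiens.GRCh38.82.gtf")
# tx$tx_biotype <- unname(biotype[tx$tx_name])
```

This keeps the TxDb untouched and only adds the biotype as an extra metadata column on the transcripts.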

Thanks, Karolin

__________

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.32.0       XVector_0.10.0             GenomicRanges_1.22.1       BiocGenerics_0.16.1
 [5] zlibbioc_1.16.0            GenomicAlignments_1.6.1    IRanges_2.4.4              BiocParallel_1.4.0
 [9] GenomeInfoDb_1.6.1         tools_3.2.2                SummarizedExperiment_1.0.1 parallel_3.2.2
[13] Biobase_2.30.0             DBI_0.3.1                  lambda.r_1.1.7             futile.logger_1.4.1
[17] rtracklayer_1.30.1         S4Vectors_0.8.3            futile.options_1.0.0       bitops_1.0-6
[21] RCurl_1.95-4.7             biomaRt_2.26.1             RSQLite_1.0.0              GenomicFeatures_1.22.5
[25] Biostrings_2.38.2          Rsamtools_1.22.0           stats4_3.2.2               XML_3.98-1.3

genomicfeatures maketxdbfromgff tx_type gtf • 1.5k views
