The package GenomicFeatures (> v1.20) provides the "tx_type" column in the transcript table of TxDb objects.
I want to read a GTF file that includes the transcript_biotype. As an example, I downloaded and unzipped a GTF from Ensembl: ftp://ftp.ensembl.org/pub/release-82/gtf/homo_sapiens/Homo_sapiens.GRCh38.82.gtf.gz .
Here is an extract:
1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";
However, I don't get a tx_type like mRNA, snoRNA, etc. Instead, the tx_type column is filled with the word "transcript".
My example:
> txdb <- GenomicFeatures::makeTxDbFromGFF("~/data/Homo_sapiens.GRCh38.82.gtf", format="gtf")
> tx <- GenomicFeatures::transcripts(txdb, columns=c("tx_name","tx_type"))
> head(tx)
GRanges object with 6 ranges and 2 metadata columns:
seqnames ranges strand | tx_name tx_type
<Rle> <IRanges> <Rle> | <character> <character>
[1] 1 [11869, 14409] + | ENST00000456328 transcript
[2] 1 [12010, 13670] + | ENST00000450305 transcript
[3] 1 [29554, 31097] + | ENST00000473358 transcript
[4] 1 [30267, 31109] + | ENST00000469289 transcript
[5] 1 [30366, 30503] + | ENST00000607096 transcript
[6] 1 [52473, 53312] + | ENST00000606857 transcript
-------
seqinfo: 59 sequences (1 circular) from an unspecified genome; no seqlengths
Looking at the code:
rtracklayer::import is used to read the GTF, but only the columns "type", "gene_id", "transcript_id" and "exon_id" are returned. Here "type" corresponds to the 3rd column of the GTF (the feature type). Maybe I am wrong, but this column never contains transcript type information.
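Since the biotype sits in the attribute field (column 9) rather than in column 3, one possible workaround is to pull it out of the GTF separately and match it to the transcripts by ID. The following is only a minimal sketch in base R, not what makeTxDbFromGFF does internally; gtf_attribute() and read_tx_biotypes() are hypothetical helper names, and the sketch assumes Ensembl-style attributes of the form key "value";.

```r
# Extract the value of one attribute key (e.g. transcript_biotype) from
# GTF column-9 strings. Returns NA where the key is absent.
gtf_attribute <- function(attributes, key) {
  # Anchor the key at the start of the string or right after a ';' so that
  # e.g. "transcript_id" cannot match inside a longer key name.
  pattern <- paste0('(^|; *)', key, ' "([^"]*)"')
  hits <- regmatches(attributes, regexec(pattern, attributes))
  vapply(hits, function(h) if (length(h) == 3) h[3] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: named character vector transcript_id -> transcript_biotype,
# built from the rows whose feature type (column 3) is "transcript".
read_tx_biotypes <- function(gtf_path) {
  gtf <- read.delim(gtf_path, header = FALSE, sep = "\t", quote = "",
                    comment.char = "#", stringsAsFactors = FALSE)
  tx_rows <- gtf[gtf[[3]] == "transcript", ]
  setNames(gtf_attribute(tx_rows[[9]], "transcript_biotype"),
           gtf_attribute(tx_rows[[9]], "transcript_id"))
}

# Usage, with `tx` obtained from transcripts(txdb, columns = "tx_name"):
# biotypes <- read_tx_biotypes("~/data/Homo_sapiens.GRCh38.82.gtf")
# tx$tx_biotype <- unname(biotypes[tx$tx_name])
```

Reading the whole file with read.delim is slow for a full Ensembl GTF; rtracklayer::import also exposes the GTF attributes as metadata columns, so the same ID-based matching should work on its result.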
My questions:
1) Is there something wrong with the way I make TxDbs from GTF, or did I misunderstand tx_type?
2) Why are only predefined tx_types accepted?
Thanks, Karolin
__________
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.32.0 XVector_0.10.0 GenomicRanges_1.22.1 BiocGenerics_0.16.1
[5] zlibbioc_1.16.0 GenomicAlignments_1.6.1 IRanges_2.4.4 BiocParallel_1.4.0
[9] GenomeInfoDb_1.6.1 tools_3.2.2 SummarizedExperiment_1.0.1 parallel_3.2.2
[13] Biobase_2.30.0 DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4.1
[17] rtracklayer_1.30.1 S4Vectors_0.8.3 futile.options_1.0.0 bitops_1.0-6
[21] RCurl_1.95-4.7 biomaRt_2.26.1 RSQLite_1.0.0 GenomicFeatures_1.22.5
[25] Biostrings_2.38.2 Rsamtools_1.22.0 stats4_3.2.2 XML_3.98-1.3
Did you load the libraries first? I mean
library(ensembldb)
and possibly also
library(GenomicFeatures)
(although this should come along with the ensembldb package). I don't see these two packages attached in your sessionInfo.
In fact, the function also loads the sequence lengths (length of the chromosomes in nt) from Ensembl through a call to the Ensembl FTP server, so yes, to get all the information you need an internet connection. Without one you will still get an EnsDb database, but without the seqinfo.
cheers, jo
Oh, it's a shame! That is the reason. Thanks a lot !!
Johannes -- I think the could not find function "seqlengths" error means that your package is missing an import(GenomeInfoDb) or importFrom(GenomeInfoDb, seqlengths) in its NAMESPACE. This is needed both for the (valid) case illustrated in the post and for when a package extends your package by importing functionality.
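For reference, a sketch of what the NAMESPACE entry could look like. Only one of the two directives is needed, and the package name comes first in importFrom. This assumes GenomeInfoDb is also listed under Imports: in the package's DESCRIPTION.

```r
# NAMESPACE: import the whole GenomeInfoDb namespace ...
import(GenomeInfoDb)

# ... or, more selectively, only the generic that was reported missing
importFrom(GenomeInfoDb, seqlengths)
```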
Thanks Martin. I did indeed forget to import this method. It is now fixed in the release and devel versions and should be available tomorrow (I guess).