Question

Get wrong tx_type when using GenomicFeatures::makeTxDbFromGTF

0

Entering edit mode

Karolin Wiedemann • 0

@karolin-wiedemann-9303

Last seen 8.4 years ago

Germany

The package GenomicFeatures (>v1.20) provides the "tx_type" column in the transcript table of TranscriptDBs.
I want to read a GTF file, that includes the transcript_biotype. As example, I downloaded and unziped an GTF from Ensembl: ftp://ftp.ensembl.org/pub/release-82/gtf/homo_sapiens/Homo_sapiens.GRCh38.82.gtf.gz .
Here an extract:

1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

However, I don't get the a tx_type like mRNA, snoRNA,... . Instead the tx_type column is filled with the word "transcript".

My example:

> txdb <- GenomicFeatures::makeTxDbFromGFF("~/data/Homo_sapiens.GRCh38.82.gtf",format="gtf")
> tx <- GenomicFeatures::transcripts(txdb,column=c("tx_name","tx_type")) > head(tx) GRanges object with 6 ranges and 2 metadata columns: seqnames ranges strand | tx_name tx_type <Rle> <IRanges> <Rle> | <character> <character> [1] 1 [11869, 14409] + | ENST00000456328 transcript [2] 1 [12010, 13670] + | ENST00000450305 transcript [3] 1 [29554, 31097] + | ENST00000473358 transcript [4] 1 [30267, 31109] + | ENST00000469289 transcript [5] 1 [30366, 30503] + | ENST00000607096 transcript [6] 1 [52473, 53312] + | ENST00000606857 transcript ------- seqinfo: 59 sequences (1 circular) from an unspecified genome; no seqlengths

Looking at the code:

rtracklayer::import is used to read the GTF, while only the columns "type","gene_id","transcript_id" and "exon_id" are returned. Thereby "type" describes the 3.column in the GTF. Maybe I am wrong, but this column never includes transcript_type information.

My questions:
1) Is there something wrong in the way I make TxDbs from GTF or did I understand the tx_type incorrectly?

2) Why are only a predefined tx_types excapted ?

Thanks, Karolin

__________

R version 3.2.2 (2015-08-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS

locale: [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] AnnotationDbi_1.32.0 XVector_0.10.0 GenomicRanges_1.22.1 BiocGenerics_0.16.1 [5] zlibbioc_1.16.0 GenomicAlignments_1.6.1 IRanges_2.4.4 BiocParallel_1.4.0 [9] GenomeInfoDb_1.6.1 tools_3.2.2 SummarizedExperiment_1.0.1 parallel_3.2.2 [13] Biobase_2.30.0 DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4.1 [17] rtracklayer_1.30.1 S4Vectors_0.8.3 futile.options_1.0.0 bitops_1.0-6 [21] RCurl_1.95-4.7 biomaRt_2.26.1 RSQLite_1.0.0 GenomicFeatures_1.22.5 [25] Biostrings_2.38.2 Rsamtools_1.22.0 stats4_3.2.2 XML_3.98-1.3

genomicfeatures maketxdbfromgff tx_type gtf • 1.9k views

ADD COMMENT • link updated 8.4 years ago by Hervé Pagès 16k • written 8.4 years ago by Karolin Wiedemann • 0

score 0 · Answer 1 · 2015-12-02

It may not answer your question directly... but for Ensembl based annotations you might also consider to use the ensembldb package instead. That way you can create EnsDb objects/databases/packages that have essentially the same functionality than the TxDb object, just that they are tailored to Ensembl based annotations (you'll have an explicit column tx_biotype and gene_biotype available).

Basically, you could use the ensDbFromGtf function if you have the GTF locally. Alternatively, as explained in the vignette, you could use the AnnotationHub package to fetch a GRanges object representing the GTF data from Ensembl and build the database from that (unfortunateyl, Ensembl 82 is not yet available in AnnotationHub).

Let me know if you run into problems or something in unclear.

cheers, jo

PS: the working example:

> library(ensembldb)
> gtff <- "~/data/Homo_sapiens.GRCh38.82.gtf"
> DB <- ensDbFromGtf(gtff, verbose=TRUE)
importing gtf file...done
processing metadata...OK
processing genes...OK
processing transcripts...OK
processing exons...OK
processing chromosomes...fetch seqlenghts from ensembl, dataset  hsapiens_gene_ensembl  version  82 ...OK
generating index...OK
Verifying validity of the information in the database:
Checking transcripts...OK
Checking exons...OK

> edb <- EnsDb(DB)
> transcripts(edb)
GRanges object with 198634 ranges and 5 metadata columns:
                  seqnames               ranges strand   |           tx_id
                     <Rle>            <IRanges>  <Rle>   |     <character>
  ENST00000456328        1       [11869, 14409]      +   | ENST00000456328
  ENST00000450305        1       [12010, 13670]      +   | ENST00000450305
  ENST00000488147        1       [14404, 29570]      -   | ENST00000488147
  ENST00000619216        1       [17369, 17436]      -   | ENST00000619216
  ENST00000473358        1       [29554, 31097]      +   | ENST00000473358
              ...      ...                  ...    ... ...             ...
  ENST00000420810        Y [26549425, 26549743]      +   | ENST00000420810
  ENST00000456738        Y [26586642, 26591601]      -   | ENST00000456738
  ENST00000435945        Y [26594851, 26634652]      -   | ENST00000435945
  ENST00000435741        Y [26626520, 26627159]      -   | ENST00000435741
  ENST00000431853        Y [56855244, 56855488]      +   | ENST00000431853
                                          tx_biotype tx_cds_seq_start
                                         <character>        <numeric>
  ENST00000456328               processed_transcript             <NA>
  ENST00000450305 transcribed_unprocessed_pseudogene             <NA>
  ENST00000488147             unprocessed_pseudogene             <NA>
  ENST00000619216                              miRNA             <NA>
  ENST00000473358                            lincRNA             <NA>
              ...                                ...              ...
  ENST00000420810               processed_pseudogene             <NA>
  ENST00000456738             unprocessed_pseudogene             <NA>
  ENST00000435945             unprocessed_pseudogene             <NA>
  ENST00000435741               processed_pseudogene             <NA>
  ENST00000431853               processed_pseudogene             <NA>
                  tx_cds_seq_end         gene_id
                       <numeric>     <character>
  ENST00000456328           <NA> ENSG00000223972
  ENST00000450305           <NA> ENSG00000223972
  ENST00000488147           <NA> ENSG00000227232
  ENST00000619216           <NA> ENSG00000278267
  ENST00000473358           <NA> ENSG00000243485
              ...            ...             ...
  ENST00000420810           <NA> ENSG00000224240
  ENST00000456738           <NA> ENSG00000227629
  ENST00000435945           <NA> ENSG00000237917
  ENST00000435741           <NA> ENSG00000231514
  ENST00000431853           <NA> ENSG00000235857
  -------
  seqinfo: 59 sequences from GRCh38 genome

score 0 · Answer 2 · 2015-12-03

Quite nice! Theoretically, it could help to solve some more problems I have. Unfortunately, I get an error:

> db <- ensembldb::ensDbFromGtf("~/data/Homo_sapiens.GRCh38.82.gtf",verbose=TRUE) importing gtf file...done Error in ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism, : could not find function "seqlengths"

I guess,the function tries to fetch the data through biomart or something like this. Right?
I am connected to the WWW. Nevertheless, I need a solution without additional data.
> sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS

locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=de_DE.UTF-8 [6] LC_MESSAGES=en_GB.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] Rcpp_0.12.2 AnnotationDbi_1.32.0 magrittr_1.5 AnnotationHub_2.2.2 XVector_0.10.0 [6] GenomicRanges_1.22.1 BiocGenerics_0.16.1 zlibbioc_1.16.0 GenomicAlignments_1.6.1 IRanges_2.4.4 [11] BiocParallel_1.4.0 xtable_1.8-0 R6_2.1.1 stringr_1.0.0 httr_1.0.0 [16] ensembldb_1.2.0 GenomeInfoDb_1.6.1 tools_3.2.2 SummarizedExperiment_1.0.1 parallel_3.2.2 [21] Biobase_2.30.0 DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4.1 htmltools_0.2.6 [26] digest_0.6.8 interactiveDisplayBase_1.8.0 shiny_0.12.2 rtracklayer_1.30.1 S4Vectors_0.8.3 [31] futile.options_1.0.0 bitops_1.0-6 RCurl_1.95-4.7 biomaRt_2.26.1 mime_0.4 [36] RSQLite_1.0.0 stringi_1.0-1 BiocInstaller_1.20.1 GenomicFeatures_1.22.5 Biostrings_2.38.2 [41] Rsamtools_1.22.0 stats4_3.2.2 XML_3.98-1.3 httpuv_1.3.3

score 0 · Answer 3 · 2015-12-04

Hi,

makeTxDbFromGFF() gets the tx_type information from the 3rd column (type column) of the GFF/GTF file, which for a GTF file always happens to be set to transcript for transcripts (unlike for a GFF file, where it can be set to various things). I agree this is not very informative and we should probably address this.

FWIW note that you can also fetch a TxDb object from Ensembl with makeTxDbFromBiomart() (also from the GenomicFeatures package). This will populate the tx_type column with Ensembl transcript_biotype attribute.

Cheers,

H.