Get wrong tx_type when using GenomicFeatures::makeTxDbFromGTF
Entering edit mode
Last seen 6.8 years ago

The package GenomicFeatures (>v1.20) provides the "tx_type" column in the transcript table of TranscriptDBs.
I want to read a GTF file, that includes the transcript_biotype. As example, I downloaded and unziped an GTF from Ensembl: .
Here an extract:

1       havana  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

However, I don't get the a tx_type like mRNA, snoRNA,... .  Instead the tx_type column is filled with the word "transcript".

My example:

> txdb <- GenomicFeatures::makeTxDbFromGFF("~/data/Homo_sapiens.GRCh38.82.gtf",format="gtf")
> tx <- GenomicFeatures::transcripts(txdb,column=c("tx_name","tx_type"))
> head(tx)
GRanges object with 6 ranges and 2 metadata columns:
      seqnames         ranges strand |         tx_name     tx_type
         <Rle>      <IRanges>  <Rle> |     <character> <character>
  [1]        1 [11869, 14409]      + | ENST00000456328  transcript
  [2]        1 [12010, 13670]      + | ENST00000450305  transcript
  [3]        1 [29554, 31097]      + | ENST00000473358  transcript
  [4]        1 [30267, 31109]      + | ENST00000469289  transcript
  [5]        1 [30366, 30503]      + | ENST00000607096  transcript
  [6]        1 [52473, 53312]      + | ENST00000606857  transcript
  seqinfo: 59 sequences (1 circular) from an unspecified genome; no seqlengths


Looking at the code:

rtracklayer::import is used to read the GTF, while only the columns "type","gene_id","transcript_id" and "exon_id" are returned. Thereby "type" describes the 3.column in the GTF. Maybe I am wrong, but this column never includes transcript_type information.


My questions:
1) Is there something wrong in the way I make TxDbs from GTF or did I understand the tx_type incorrectly?

2) Why are only a predefined tx_types excapted ?



Thanks, Karolin


R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.32.0       XVector_0.10.0             GenomicRanges_1.22.1       BiocGenerics_0.16.1       
 [5] zlibbioc_1.16.0            GenomicAlignments_1.6.1    IRanges_2.4.4              BiocParallel_1.4.0        
 [9] GenomeInfoDb_1.6.1         tools_3.2.2                SummarizedExperiment_1.0.1 parallel_3.2.2            
[13] Biobase_2.30.0             DBI_0.3.1                  lambda.r_1.1.7             futile.logger_1.4.1       
[17] rtracklayer_1.30.1         S4Vectors_0.8.3            futile.options_1.0.0       bitops_1.0-6              
[21] RCurl_1.95-4.7             biomaRt_2.26.1             RSQLite_1.0.0              GenomicFeatures_1.22.5    
[25] Biostrings_2.38.2          Rsamtools_1.22.0           stats4_3.2.2               XML_3.98-1.3

genomicfeatures maketxdbfromgff tx_type gtf • 1.4k views
Entering edit mode
Johannes Rainer ★ 2.0k
Last seen 8 weeks ago

It may not answer your question directly... but for Ensembl based annotations you might also consider to use the ensembldb package instead. That way you can create EnsDb objects/databases/packages that have essentially the same functionality than the TxDb object, just that they are tailored to Ensembl based annotations (you'll have an explicit column tx_biotype and gene_biotype available).

Basically, you could use the ensDbFromGtf function if you have the GTF locally. Alternatively, as explained in the vignette, you could use the AnnotationHub package to fetch a GRanges object representing the GTF data from Ensembl and build the database from that (unfortunateyl, Ensembl 82 is not yet available in AnnotationHub).

Let me know if you run into problems or something in unclear.

cheers, jo

PS: the working example:

> library(ensembldb)
> gtff <- "~/data/Homo_sapiens.GRCh38.82.gtf"
> DB <- ensDbFromGtf(gtff, verbose=TRUE)
importing gtf file...done
processing metadata...OK
processing genes...OK
processing transcripts...OK
processing exons...OK
processing chromosomes...fetch seqlenghts from ensembl, dataset  hsapiens_gene_ensembl  version  82 ...OK
generating index...OK
Verifying validity of the information in the database:
Checking transcripts...OK
Checking exons...OK

> edb <- EnsDb(DB)
> transcripts(edb)
GRanges object with 198634 ranges and 5 metadata columns:
                  seqnames               ranges strand   |           tx_id
                     <Rle>            <IRanges>  <Rle>   |     <character>
  ENST00000456328        1       [11869, 14409]      +   | ENST00000456328
  ENST00000450305        1       [12010, 13670]      +   | ENST00000450305
  ENST00000488147        1       [14404, 29570]      -   | ENST00000488147
  ENST00000619216        1       [17369, 17436]      -   | ENST00000619216
  ENST00000473358        1       [29554, 31097]      +   | ENST00000473358
              ...      ...                  ...    ... ...             ...
  ENST00000420810        Y [26549425, 26549743]      +   | ENST00000420810
  ENST00000456738        Y [26586642, 26591601]      -   | ENST00000456738
  ENST00000435945        Y [26594851, 26634652]      -   | ENST00000435945
  ENST00000435741        Y [26626520, 26627159]      -   | ENST00000435741
  ENST00000431853        Y [56855244, 56855488]      +   | ENST00000431853
                                          tx_biotype tx_cds_seq_start
                                         <character>        <numeric>
  ENST00000456328               processed_transcript             <NA>
  ENST00000450305 transcribed_unprocessed_pseudogene             <NA>
  ENST00000488147             unprocessed_pseudogene             <NA>
  ENST00000619216                              miRNA             <NA>
  ENST00000473358                            lincRNA             <NA>
              ...                                ...              ...
  ENST00000420810               processed_pseudogene             <NA>
  ENST00000456738             unprocessed_pseudogene             <NA>
  ENST00000435945             unprocessed_pseudogene             <NA>
  ENST00000435741               processed_pseudogene             <NA>
  ENST00000431853               processed_pseudogene             <NA>
                  tx_cds_seq_end         gene_id
                       <numeric>     <character>
  ENST00000456328           <NA> ENSG00000223972
  ENST00000450305           <NA> ENSG00000223972
  ENST00000488147           <NA> ENSG00000227232
  ENST00000619216           <NA> ENSG00000278267
  ENST00000473358           <NA> ENSG00000243485
              ...            ...             ...
  ENST00000420810           <NA> ENSG00000224240
  ENST00000456738           <NA> ENSG00000227629
  ENST00000435945           <NA> ENSG00000237917
  ENST00000435741           <NA> ENSG00000231514
  ENST00000431853           <NA> ENSG00000235857
  seqinfo: 59 sequences from GRCh38 genome



Entering edit mode
Last seen 6.8 years ago

Quite nice! Theoretically, it could help to solve some more problems I have. Unfortunately, I get an error:

> db <- ensembldb::ensDbFromGtf("~/data/Homo_sapiens.GRCh38.82.gtf",verbose=TRUE)
importing gtf file...done
Error in ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :
  could not find function "seqlengths"

I guess,the function tries to fetch the data through biomart or something like this. Right?
I am connected to the WWW. Nevertheless, I need a solution without additional data.
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=de_DE.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.2                  AnnotationDbi_1.32.0         magrittr_1.5                 AnnotationHub_2.2.2          XVector_0.10.0              
 [6] GenomicRanges_1.22.1         BiocGenerics_0.16.1          zlibbioc_1.16.0              GenomicAlignments_1.6.1      IRanges_2.4.4               
[11] BiocParallel_1.4.0           xtable_1.8-0                 R6_2.1.1                     stringr_1.0.0                httr_1.0.0                  
[16] ensembldb_1.2.0              GenomeInfoDb_1.6.1           tools_3.2.2                  SummarizedExperiment_1.0.1   parallel_3.2.2              
[21] Biobase_2.30.0               DBI_0.3.1                    lambda.r_1.1.7               futile.logger_1.4.1          htmltools_0.2.6             
[26] digest_0.6.8                 interactiveDisplayBase_1.8.0 shiny_0.12.2                 rtracklayer_1.30.1           S4Vectors_0.8.3             
[31] futile.options_1.0.0         bitops_1.0-6                 RCurl_1.95-4.7               biomaRt_2.26.1               mime_0.4                    
[36] RSQLite_1.0.0                stringi_1.0-1                BiocInstaller_1.20.1         GenomicFeatures_1.22.5       Biostrings_2.38.2           
[41] Rsamtools_1.22.0             stats4_3.2.2                 XML_3.98-1.3                 httpuv_1.3.3 

Entering edit mode

Did you load the libraries before?

I mean library(ensembldb) and eventually also library(GenomicFeatures) (although this should come along the ensembldb package). I don't see these two packages being attached in your sessionInfo.

In fact, the function does also load the sequence lengths (length of the chromosomes in nt) from Ensembl by a call to the Ensembl ftp server, so, yes, to get all information you need internet connection. If you don't have that you will still get an EnsDb database, but without the seqinfo.

cheers, jo

Entering edit mode

Oh, it's a shame! That is the reason. Thanks a lot !!


Entering edit mode

Johannes -- I think the could not find function "seqlengths" error means that your package is missing an import(GenomeInfoDb) or importFrom(seqlengths, GenomeInfoDb) in it's NAMESPACE. This is needed both for the (valid) case illustrated in post and for when a package extends your package via import-ing functionality.

Entering edit mode

Thanks Martin. I indeed forgot to import this method. Is now fixed in release and devel version and should be populated tomorrow (I guess).

Entering edit mode
Last seen 4 hours ago
Seattle, WA, United States


makeTxDbFromGFF() gets the tx_type information from the 3rd column (type column) of the GFF/GTF file, which for a GTF file always happens to be set to transcript for transcripts (unlike for a GFF file, where it can be set to various things). I agree this is not very informative and we should probably address this.

FWIW note that you can also fetch a TxDb object from Ensembl with makeTxDbFromBiomart() (also from the GenomicFeatures package). This will populate the tx_type column with Ensembl transcript_biotype attribute.




Login before adding your answer.

Traffic: 309 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6