makeTxDbFromGFF not capturing/displaying 'gene' records from GFF3 file.
1
0
Entering edit mode
pterry • 0
@pterry-6902
Last seen 2.3 years ago
United States

The **goal** for my question is to follow the procedure in the 'C. Differential Expression' presentation, April 06-07, 2015, bioconductor workshop, section '5. Reduction, Setup, part c', to prepare a TxDb object from a gff3 file for the subsequent 'Counting' section (preparation of SummarizedExperiment) and for section '6. Analysis using DESeq2'.

The Problem: makeTxDbFromGFF is producing a TxDb file with a
GRangesList object of length 0:
when trying to list exons by gene.

for the TxDb file produced from the gff3 file at
ftp://ftp.ensemblgenomes.org/pub/plants/release-26/gff3/setaria_italica/

> gff3file <- file.path(path, "Setaria_italica.JGIv2.0.26.gff3")
> txdbSitalica <- makeTxDbFromGFF(gff3file, format="gff3", circ_seqs=character())
Prepare the 'metadata' data frame ... metadata: OK

> genesSitalica <- exonsBy(txdbSitalica, by="gene")
> genesSitalica
GRangesList object of length 0:
<0 elements>

-------
seqinfo: 76 sequences from an unspecified genome; no seqlengths

I note this gff3 file has 35,471 records/rows with 'gene' in column 3, of its gff3 format.

By way of comparison, I can display 'exons by transcripts' from the txdbSitalica variable class TxDb,
grlsitalica <- exonsBy(txdbSitalica, "tx", use.names=TRUE)  
...
grlsitalica  ## 'GRangesList object of length 40599:'
## So, in txdbSitalica, can find all the records/rows with 'transcript' in column 3 of the gff3 file.
## that is, 40599 records with 'transcript' in column 3 of gff3 file MATCHES the number in grlsitalica.
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
[2] GenomicFeatures_1.20.0                 
[3] AnnotationDbi_1.30.1                   
[4] Biobase_2.28.0                         
[5] GenomicRanges_1.20.3                   
[6] GenomeInfoDb_1.4.0                     
[7] IRanges_2.2.1                          
[8] S4Vectors_0.6.0                        
[9] BiocGenerics_0.14.0                    

loaded via a namespace (and not attached):
 [1] XML_3.98-1.1            Rsamtools_1.20.1        Biostrings_2.36.0      
 [4] GenomicAlignments_1.4.1 bitops_1.0-6            futile.options_1.0.0   
 [7] DBI_0.3.1               RSQLite_1.0.0           zlibbioc_1.14.0        
[10] XVector_0.8.0           futile.logger_1.4.1     lambda.r_1.1.7         
[13] BiocParallel_1.2.1      tools_3.2.0             biomaRt_2.24.0         
[16] RCurl_1.95-4.6          rtracklayer_1.28.2     
>


Thanks for comments,

 

genomicfeatures makeTxDbFromGFF • 2.2k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 17 hours ago
Seattle, WA, United States

Hi Philip,

Thanks for providing the link to the GFF3 file. Like it's often the case with GFF3 files, it seems that the authors of this file have their own interpretation of the GFF3 official spec, and in particular how the gene ids/names should be stored in the file. This was preventing makeTxDbFromGFF() from importing this information into the TxDb object. I fixed this in the release and devel versions of the GenomicFeatures package (versions 1.20.1 and 1.21.2 respectively). To get the new version, just run biocLite() (with no parenthesis) in 24 hours or so. That will also keep your installation up-to-date by automatically updating any other package for which a newer version is available.

Anyway, I would recommend that you use the GTF file here:

ftp://ftp.ensemblgenomes.org/pub/plants/release-26/gtf/setaria_italica/

instead of the big GFF3 file that you found there:

ftp://ftp.ensemblgenomes.org/pub/plants/release-26/gff3/setaria_italica/

The GTF file is much lighter and can be imported with:

txdb <- makeTxDbFromGFF("Setaria_italica.JGIv2.0.26.gtf.gz", format="gtf")

Then:

ex_by_gn <- exonsBy(txdb, by="gene")
ex_by_gn
# GRangesList object of length 35471:
# $Si000001m.g 
# GRanges object with 30 ranges and 2 metadata columns:
#        seqnames               ranges strand   |   exon_id   exon_name
#           <Rle>            <IRanges>  <Rle>   | <integer> <character>
#    [1]        5 [31915664, 31916200]      -   |     92734        <NA>
#    [2]        5 [31916360, 31916492]      -   |     92735        <NA>
#    [3]        5 [31916572, 31917090]      -   |     92736        <NA>
#    [4]        5 [31917958, 31918191]      -   |     92737        <NA>
#    [5]        5 [31918358, 31918637]      -   |     92738        <NA>
#    ...      ...                  ...    ... ...       ...         ...
#   [26]        5 [31930593, 31930671]      -   |     92759        <NA>
#   [27]        5 [31930981, 31931025]      -   |     92760        <NA>
#   [28]        5 [31931878, 31932035]      -   |     92761        <NA>
#   [29]        5 [31933861, 31933983]      -   |     92762        <NA>
#   [30]        5 [31934064, 31934132]      -   |     92763        <NA>
#
# ...
# <35470 more elements>
# -------
# seqinfo: 76 sequences from an unspecified genome; no seqlengths

As a sanity check, let's check that all the exons in txdb belong to at least 1 gene. 2 ways to do this:

## 1st way
## =======
ex <- exons(txdb, columns="gene_id")
table(elementLengths(mcols(ex)$gene_id))
#      1      2 
# 166338     27 

## ==> 166338 exons belong to 1 gene, 27 exons belong to 2 genes

## 2nd way
## =======
all(ex %in% unlist(ex_by_gn))
# [1] TRUE

Hope this helps,

H.

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomicFeatures_1.20.1 AnnotationDbi_1.30.1   Biobase_2.28.0        
[4] GenomicRanges_1.20.3   GenomeInfoDb_1.4.0     IRanges_2.2.1         
[7] S4Vectors_0.6.0        BiocGenerics_0.14.0   

loaded via a namespace (and not attached):
 [1] XML_3.98-1.1            Rsamtools_1.20.1        Biostrings_2.36.0      
 [4] GenomicAlignments_1.4.1 bitops_1.0-6            futile.options_1.0.0   
 [7] DBI_0.3.1               RSQLite_1.0.0           zlibbioc_1.14.0        
[10] XVector_0.8.0           futile.logger_1.4.1     lambda.r_1.1.7         
[13] BiocParallel_1.2.1      tools_3.2.0             biomaRt_2.24.0         
[16] RCurl_1.95-4.6          rtracklayer_1.28.2     
ADD COMMENT
0
Entering edit mode

One more thing: you can also obtain the TxDb for Setaria italica from the "plants mart":

library(GenomicFeatures)
txdb <- makeTxDbFromBiomart("plants_mart_26", "sitalica_eg_gene")

See ?makeTxDbFromBiomart for more information.

Unfortunately the above code fails with the current version of GenomicFeatures but today I committed a change to makeTxDbFromBiomart() to address the problem. This change is also available in the release and devel versions of the GenomicFeatures package that will become available tomorrow (versions 1.20.1 and 1.21.2 respectively).

Note that the 3 methods (use of makeTxDbFromGFF() on Setaria_italica.JGIv2.0.26.gtf.gz or Setaria_italica.JGIv2.0.26.gff3.gz, or use of makeTxDbFromBiomart() to get the sitalica_eg_gene dataset from the plants mart) produce the same TxDb objects (e.g. same set of transcripts and genes, and same transcript "geometries").

H.

ADD REPLY

Login before adding your answer.

Traffic: 948 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6