create a txdb using makeTxDbFromGFF
@naderaryamanesh-14634

Hi,

I am trying to make a TxDb for Arabidopsis lyrata. The annotation file can be downloaded here:

ftp://ftp.ensemblgenomes.org/pub/plants/release-30/gff3/arabidopsis_lyrata/Arabidopsis_lyrata.v.1.0.30.chr.gff3.gz

I am using the following command to create the TxDb:

txdb <- makeTxDbFromGFF(file="/PATH/rawdata/annotations/Arabidopsis_lyrata.v.1.0.30.chrb.gff3",
                        format=c("auto", "gff3", "gtf"),
                        dataSource="gtf file for Arabidopsis lyrata",
                        organism="Arabidopsis lyrata")

The command above creates the TxDb, with the following output:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK

> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: gtf file for Arabidopsis lyrata
# Organism: Arabidopsis lyrata
# Taxonomy ID: 59689
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 31478
# exon_nrow: 170022
# cds_nrow: 154686
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2017-12-14 14:59:40 +0200 (Thu, 14 Dec 2017)
# GenomicFeatures version at creation time: 1.28.4
# RSQLite version at creation time: 2.0
# DBSCHEMAVERSION: 1.1

However, when I call seqinfo(txdb), the sequence lengths are missing:

> seqinfo(txdb)
Seqinfo object with 8 sequences from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  chr1             NA         NA   <NA>
  chr2             NA         NA   <NA>
  chr3             NA         NA   <NA>
  chr4             NA         NA   <NA>
  chr5             NA         NA   <NA>
  chr6             NA         NA   <NA>
  chr7             NA         NA   <NA>
  chr8             NA         NA   <NA>

I expected it to look similar to this:

> library("BSgenome.Alyrata.JGI.v1")

> seqinfo(Alyrata)
Seqinfo object with 8 sequences from Assembly V1.0 genome:
  seqnames seqlengths isCircular        genome
  chr1       33132539      FALSE Assembly V1.0
  chr2       19320864      FALSE Assembly V1.0
  chr3       24464547      FALSE Assembly V1.0
  chr4       23328337      FALSE Assembly V1.0
  chr5       21221946      FALSE Assembly V1.0
  chr6       25113588      FALSE Assembly V1.0
  chr7       24649197      FALSE Assembly V1.0
  chr8       22951293      FALSE Assembly V1.0

I would really appreciate it if you could pinpoint the problem, or tell me whether there is a better way to make the TxDb.

Kind regards,

Nader

 

 

sessionInfo() please! Your version of Bioconductor seems outdated.

Note that makeTxDbFromGFF() uses rtracklayer::import.gff3() internally as a first step, to import the GFF3 file as a GRanges object. Even though the sequence lengths are present in the file, for some reason rtracklayer::import.gff3() fails to import them:

library(rtracklayer)
gr <- import.gff3("Arabidopsis_lyrata.v.1.0.30.chr.gff3.gz")
seqinfo(gr)
# Seqinfo object with 8 sequences from an unspecified genome; no seqlengths:
#   seqnames seqlengths isCircular genome
#   1                NA         NA   <NA>
#   2                NA         NA   <NA>
#   3                NA         NA   <NA>
#   4                NA         NA   <NA>
#   5                NA         NA   <NA>
#   6                NA         NA   <NA>
#   7                NA         NA   <NA>
#   8                NA         NA   <NA>

You could either ask a new question on this site with tag rtracklayer and focus on the rtracklayer::import.gff3() issue, or open an issue on GitHub: https://github.com/lawremi/rtracklayer/issues
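In the meantime, one possible workaround (a sketch, not something tested against this exact file) is to hand the sequence lengths to makeTxDbFromGFF() yourself through its chrominfo argument, which is passed on to makeTxDb(). The lengths below are copied from the seqinfo(Alyrata) output in the question. The sketch assumes the seqnames in the GFF3 file are chr1 to chr8, as in the TxDb shown in the question; if the file uses 1 to 8 instead, the chrom column has to be changed to match. The variable names chrominfo and gff_file are only illustrative.

```r
# Chromosome lengths copied from seqinfo(Alyrata) in the question.
# The 'chrom' values are assumed to match the seqnames used in the GFF3 file.
chrominfo <- data.frame(
    chrom = paste0("chr", 1:8),
    length = c(33132539L, 19320864L, 24464547L, 23328337L,
               21221946L, 25113588L, 24649197L, 22951293L),
    is_circular = FALSE,
    stringsAsFactors = FALSE
)

# Path taken from the question; adjust to your own location.
gff_file <- "/PATH/rawdata/annotations/Arabidopsis_lyrata.v.1.0.30.chrb.gff3"

# Only attempt the build when the package and the file are both available.
if (requireNamespace("GenomicFeatures", quietly = TRUE) && file.exists(gff_file)) {
    txdb <- GenomicFeatures::makeTxDbFromGFF(
        file = gff_file,
        format = "gff3",
        dataSource = "gtf file for Arabidopsis lyrata",
        organism = "Arabidopsis lyrata",
        chrominfo = chrominfo
    )
    # The seqlengths column should now be filled in rather than NA.
    print(GenomicFeatures::seqinfo(txdb))
}
```

If BSgenome.Alyrata.JGI.v1 is installed, the same data frame could be built from seqinfo(Alyrata) instead of typing the lengths by hand.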

Thanks,

H.

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /home/hpages/R/R-3.4.3/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.4.3/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base   

other attached packages:
[1] rtracklayer_1.38.2   GenomicRanges_1.30.0 GenomeInfoDb_1.14.0 
[4] IRanges_2.12.0       S4Vectors_0.16.0     BiocGenerics_0.24.0 

loaded via a namespace (and not attached):
 [1] lattice_0.20-35            matrixStats_0.52.2        
 [3] XML_3.98-1.9               Rsamtools_1.30.0          
 [5] Biostrings_2.46.0          GenomicAlignments_1.14.1  
 [7] bitops_1.0-6               grid_3.4.3                
 [9] zlibbioc_1.24.0            XVector_0.18.0            
[11] Matrix_1.2-12              BiocParallel_1.12.0       
[13] tools_3.4.3                Biobase_2.38.0            
[15] RCurl_1.95-4.8             DelayedArray_0.4.1        
[17] compiler_3.4.3             SummarizedExperiment_1.8.0
[19] GenomeInfoDbData_0.99.1   