TxDb, GenomicFeatures load from GTF file (Mmul_8)
1
0
Entering edit mode
Brian Smith ▴ 120
@brian-smith-6197
Last seen 4.3 years ago
United States

Hi,

I wanted to get the gene coordinates for Mmul8.0.1, and download the file from ensembl (ftp://ftp.ensembl.org/pub/release-97/gtf/macacamulatta/Macacamulatta.Mmul8.0.1.97.chr.gtf.gz).

When I load the package to txdb object using GenomicFeatures, I don't seem to be getting the genes. The transcripts, cds appears to be ok, but not genes. What am I doing wrong?

thanks for your help!

> library(GenomicFeatures)
> txdbmm <- makeTxDbFromGFF(file = paste0(mmdir,'Macaca_mulatta.Mmul_8.0.1.97.chr.gtf'))

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored.
> seqlevelsStyle(txdbmm) <- "UCSC"

> xx <- transcripts(txdbmm)
> yy <- cds(txdbmm) 
> zz <- genes(txdbmm)

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'genes' for signature '"TxDb"'

> txdbmm
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: /Users/Mmul_8.0.1/Macaca_mulatta.Mmul_8.0.1.97.chr.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 55075
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2020-07-11 15:51:11 -0400 (Sat, 11 Jul 2020)
# GenomicFeatures version at creation time: 1.40.1
# RSQLite version at creation time: 2.2.0
# DBSCHEMAVERSION: 1.2

My sessionInfo() is:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] circlize_0.4.10                      annotatr_1.14.0                      dplyr_1.0.0                         
 [4] regioneR_1.20.1                      Gviz_1.32.0                          org.Mmu.eg.db_3.11.4                
 [7] pvclust_2.2-0                        zoo_1.8-8                            data.table_1.12.8                   
[10] foreign_0.8-80                       gtools_3.8.2                         BSgenome.Hsapiens.UCSC.hg38_1.4.3   
[13] BSgenome.Mmulatta.UCSC.rheMac8_1.4.2 BSgenome_1.56.0                      rtracklayer_1.47.0                  
[16] Biostrings_2.56.0                    XVector_0.28.0                       biomaRt_2.44.1                      
[19] topGO_2.40.0                         SparseM_1.78                         GO.db_3.11.4                        
[22] graph_1.66.0                         org.Hs.eg.db_3.11.4                  reutils_0.2.3                       
[25] XML_3.99-0.4                         DSS_2.36.0                           BiocParallel_1.22.0                 
[28] bsseq_1.24.4                         SummarizedExperiment_1.18.2          DelayedArray_0.14.0                 
[31] matrixStats_0.56.0                   GenomicFeatures_1.40.1               AnnotationDbi_1.50.1                
[34] Biobase_2.48.0                       GenomicRanges_1.40.0                 GenomeInfoDb_1.24.2                 
[37] IRanges_2.22.2                       S4Vectors_0.26.1                     BiocGenerics_0.34.0                 

loaded via a namespace (and not attached):
  [1] backports_1.1.8               Hmisc_4.4-0                   AnnotationHub_2.20.0          BiocFileCache_1.12.0         
  [5] plyr_1.8.6                    lazyeval_0.2.2                splines_4.0.2                 ggplot2_3.3.2                
  [9] digest_0.6.25                 ensembldb_2.12.1              htmltools_0.5.0               magrittr_1.5                 
 [13] checkmate_2.0.0               memoise_1.1.0                 cluster_2.1.0                 limma_3.44.3                 
 [17] readr_1.3.1                   R.utils_2.9.2                 askpass_1.1                   prettyunits_1.1.1            
 [21] jpeg_0.1-8.1                  colorspace_1.4-1              blob_1.2.1                    rappdirs_0.3.1               
 [25] xfun_0.15                     crayon_1.3.4                  RCurl_1.98-1.2                survival_3.2-3               
 [29] VariantAnnotation_1.34.0      glue_1.4.1                    gtable_0.3.0                  zlibbioc_1.34.0              
 [33] Rhdf5lib_1.10.1               shape_1.4.4                   HDF5Array_1.16.1              scales_1.1.1                 
 [37] DBI_1.1.0                     Rcpp_1.0.5                    xtable_1.8-4                  progress_1.2.2               
 [41] htmlTable_2.0.1               bit_1.1-15.2                  Formula_1.2-3                 htmlwidgets_1.5.1            
 [45] httr_1.4.1                    RColorBrewer_1.1-2            acepack_1.4.1                 ellipsis_0.3.1               
 [49] pkgconfig_2.0.3               R.methodsS3_1.8.0             nnet_7.3-14                   dbplyr_1.4.4                 
 [53] locfit_1.5-9.4                reshape2_1.4.4                tidyselect_1.1.0              rlang_0.4.6                  
 [57] later_1.1.0.1                 munsell_0.5.0                 BiocVersion_3.11.1            tools_4.0.2                  
 [61] generics_0.0.2                RSQLite_2.2.0                 fastmap_1.0.1                 stringr_1.4.0                
 [65] yaml_2.2.1                    knitr_1.29                    bit64_0.9-7                   purrr_0.3.4                  
 [69] AnnotationFilter_1.12.0       mime_0.9                      R.oo_1.23.0                   compiler_4.0.2               
 [73] rstudioapi_0.11               curl_4.3                      png_0.1-7                     interactiveDisplayBase_1.26.3
 [77] tibble_3.0.2                  stringi_1.4.6                 lattice_0.20-41               ProtGenerics_1.20.0          
 [81] Matrix_1.2-18                 permute_0.9-5                 vctrs_0.3.1                   pillar_1.4.6                 
 [85] lifecycle_0.2.0               BiocManager_1.30.10           GlobalOptions_0.1.2           bitops_1.0-6                 
 [89] httpuv_1.5.4                  R6_2.4.1                      latticeExtra_0.6-29           promises_1.1.1               
 [93] gridExtra_2.3                 dichromat_2.0-0               assertthat_0.2.1              rhdf5_2.32.2                 
 [97] openssl_1.4.2                 GenomicAlignments_1.24.0      Rsamtools_2.4.0               GenomeInfoDbData_1.2.3       
[101] hms_0.5.3                     rpart_4.1-15                  DelayedMatrixStats_1.10.1     biovizBase_1.36.0            
[105] shiny_1.5.0                   base64enc_0.1-3
TxDb GenomicFeatures • 2.5k views
ADD COMMENT
1
Entering edit mode

Can you try this from a clean session start? I cannot confirm. It seems your genes() method is being clobbered by some other package. Disambiguating with GenomicFeatures::genes might work better.

[Previously saved workspace restored]

> library(GenomicFeatures)
10/55 packages newly attached/loaded, see sessionInfo() for details.
> txdbmm <- makeTxDbFromGFF("Macaca_mulatta.Mmul_8.0.1.97.chr.gtf.gz")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
> seqlevelsStyle(txdbmm) <- "UCSC"
> genes(txdbmm)
GRanges object with 30807 ranges and 1 metadata column:
                     seqnames              ranges strand |            gene_id
                        <Rle>           <IRanges>  <Rle> |        <character>
  ENSMMUG00000000001    chr20     2454421-2488376      + | ENSMMUG00000000001
  ENSMMUG00000000002    chr20     2488547-2493824      + | ENSMMUG00000000002
  ENSMMUG00000000005    chr10   84164852-84175055      - | ENSMMUG00000000005
  ENSMMUG00000000006    chr13   92226913-92233643      - | ENSMMUG00000000006
  ENSMMUG00000000007     chr3 119362812-119376267      - | ENSMMUG00000000007
                 ...      ...                 ...    ... .                ...
  ENSMMUG00000049259    chr13   90967186-90967812      - | ENSMMUG00000049259
  ENSMMUG00000049260     chr5 161054566-161057347      - | ENSMMUG00000049260
  ENSMMUG00000049261     chr3   43415682-43451824      + | ENSMMUG00000049261
  ENSMMUG00000049262    chr10   25768534-25771024      - | ENSMMUG00000049262
  ENSMMUG00000049263    chr16   47201015-47206607      - | ENSMMUG00000049263
  -------
  seqinfo: 23 sequences (1 circular) from an unspecified genome; no seqlengths
> sessionInfo()
R version 4.0.2 Patched (2020-07-06 r78792)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /udd/stvjc/VM/R-40-rex3-dist/lib64/R/lib/libRblas.so
LAPACK: /udd/stvjc/VM/R-40-rex3-dist/lib64/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomicFeatures_1.40.1 AnnotationDbi_1.50.1   Biobase_2.48.0        
[4] GenomicRanges_1.40.0   GenomeInfoDb_1.24.2    IRanges_2.22.2        
[7] S4Vectors_0.26.1       BiocGenerics_0.34.0   

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.18.2 progress_1.2.2             
 [3] tidyselect_1.1.0            purrr_0.3.4                
 [5] lattice_0.20-41             vctrs_0.3.1                
 [7] generics_0.0.2              BiocFileCache_1.12.0       
 [9] rtracklayer_1.48.0          blob_1.2.1                 
[11] XML_3.99-0.4                rlang_0.4.7                
[13] pillar_1.4.6                glue_1.4.1                 
[15] DBI_1.1.0                   BiocParallel_1.22.0        
[17] rappdirs_0.3.1              bit64_0.9-7                
[19] dbplyr_1.4.4                matrixStats_0.56.0         
[21] GenomeInfoDbData_1.2.3      lifecycle_0.2.0            
[23] stringr_1.4.0               zlibbioc_1.34.0            
[25] Biostrings_2.56.0           memoise_1.1.0              
[27] biomaRt_2.44.1              curl_4.3                   
[29] Rcpp_1.0.5                  openssl_1.4.2              
[31] DelayedArray_0.14.0         XVector_0.28.0             
[33] bit_1.1-15.2                Rsamtools_2.4.0            
[35] hms_0.5.3                   askpass_1.1                
[37] digest_0.6.25               stringi_1.4.6              
[39] dplyr_1.0.0                 grid_4.0.2                 
[41] tools_4.0.2                 bitops_1.0-6               
[43] magrittr_1.5                RCurl_1.98-1.2             
[45] RSQLite_2.2.0               tibble_3.0.3               
[47] crayon_1.3.4                pkgconfig_2.0.3            
[49] Matrix_1.2-18               ellipsis_0.3.1             
[51] prettyunits_1.1.1           assertthat_0.2.1           
[53] httr_1.4.1                  R6_2.4.1                   
[55] GenomicAlignments_1.24.0    compiler_4.0.2             
> 
ADD REPLY
0
Entering edit mode

Hi Vincent, Sorry!! I totally missed your reply! You are, of course, absolutely right about genes() getting clobbered. I did as you suggested and everything is working out! Thanks for your help!

ADD REPLY
2
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States

Good news! The heavy lifting has been done for you -- check out the EnsDb resources on AnnotationHub and the work done in the ensembldb package

library(AnnotationHub)
library(ensembldb)
hub = AnnotationHub()

and then

> query(hub, c("EnsDb", "Macaca", "97"))
AnnotationHub with 3 records
# snapshotDate(): 2020-06-18
# $dataprovider: Ensembl
# $species: Macaca nemestrina, Macaca mulatta, Macaca fascicularis
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH73898"]]'

            title
  AH73898 | Ensembl 97 EnsDb for Macaca fascicularis
  AH73903 | Ensembl 97 EnsDb for Macaca mulatta
  AH73906 | Ensembl 97 EnsDb for Macaca nemestrina
> edb = hub[["AH73903"]]
downloading 1 resources
  |======================================================================| 100%

loading from cache
>

and then

> genes(edb)
GRanges object with 32386 ranges and 8 metadata columns:
                     seqnames            ranges strand |            gene_id
                        <Rle>         <IRanges>  <Rle> |        <character>
  ENSMMUG00000005947        1       25432-42232      + | ENSMMUG00000005947
...
ADD COMMENT
0
Entering edit mode

Hi Martin,

Thanks for your reply! I did as you suggested and got the edb object:

> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.4
|Creation time: Sat Jul  6 13:44:20 2019
|ensembl_version: 97
|ensembl_host: localhost
|Organism: Macaca mulatta
|taxonomy_id: 9544
|genome_build: Mmul_8.0.1
|DBSCHEMAVERSION: 2.1
| No. of genes: 32386.
| No. of transcripts: 56748.
|Protein data available.

However, some functions seem to be off (not sure if fault is on my end!):

> seqnames(edb)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'seqnames' for signature '"EnsDb"'

> keepSeqlevels(edb,"2") 
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'seqinfo<-' for signature '"EnsDb"'

> cds(edb)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'cds' for signature '"EnsDb"'

genes(edb) and transcripts(edb) work, albeit with a few warnings.

ADD REPLY
0
Entering edit mode

I think that these methods have not been implemented on the 'EnsDb' object itself, but they (other than cds()) are still available on the returned objects, e.g., seqnames(genes(edb)); perhaps other experts on the ensembldb package will respond.

Also I guess you want to do more than in your original question, so maybe a little more detail about overall objective would be helpful...

Returning to your original question and the error

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'genes' for signature '"TxDb"'

The most likely problem is that another package has defined a genes() generic, masking the generic used by GenomicFeatures (and other Bioconductor packages). So GenomicFeatures::genes(txdbm) might get you what you want.

The other possibility (less likely here) is that you have incompatible versions of Bioconductor packages; make sure BiocManager::valid() returns TRUE.

ADD REPLY
0
Entering edit mode

Hi Martin,

Thanks for your prompt reply! I have now updated all the packages.

I wanted to create some circos plots and needed the transcript and exon level data for any individual gene. For example, something like:

gene   start     end        transcript exon
TP73 3644042 3644334 ENST00000604566.1    1
TP73 3644693 3644781 ENST00000604566.1    2
TP73 3645891 3646012 ENST00000604566.1    3
TP73 3647491 3647629 ENST00000604566.1    4
TP73 3648027 3648120 ENST00000604566.1    5
TP73 3649311 3649931 ENST00000604566.1    6

I think I am almost there with:

mydata <- exonsBy(edb, by="gene",columns = c('gene_name','gene_id','entrezid','tx_id','exon_idx'))

Thanks for your help! I have it now!

ADD REPLY

Login before adding your answer.

Traffic: 851 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6