Question: ensembldb from M.musculus NCBI37 v67 gtf
0
gravatar for floriandeckert
22 months ago by
floriandeckert30 wrote:

Hello 

I am trying to perform a integrated analysis (RNAseq, ChIP, WGBS). To do so, I want to construct a custom annotation base which consists of GRange Objects with extra meta data columns for filtering (number of overlapping genes, methylation data, promoter status etc.). I am quite successful but I still have not fully sorted out issues about the assembly version of the genome, related annotations, and packages available to proceed those. 

I want to use the Ensembl Mus musculus NCBI37 build version 67. To my knowledge, this is the last annotation version (v67) of the NCBI37 genome assembly which corresponds to the mm9 UCSC assembly. I decided to use the NCBI37 build for two reasons. First, the annotation if more complete compared to GRCm38 builds. Second, I find more useful external data which was created with the NCBI37 build. 

Said that, I became quite frustrated realizing, that many bioconductor packages for ensembl annotations are only suited for GRCm38 builds. Generally, I can not access any ensembl data via bioMart when trying to get if for NCBI37 v67. Alternatively I downloaded the gtf/gff files from the ensembl archive and tried ensDbFromGtf(). But the module fails to recognize exon ids (to my knowledge the problem with the entrenzids is irrelevant).

I read about several problems related to mine with different bioconductor packages (e.g. cummRbund). But it seems to me that this is a general problem that BiomRt can not access old assembly versions. Therefore all packages which are based on access to ensembl databases fail when trying to get NCBI37 v67. Is that correct? 

I highly appreciate any help and hope my question is not to general but points to the right issue. If I am right and there is no suited workaround to get enseDB for NCBI37 v67, I would also like to ask if someone knows how to modify a gtf file so it will be recognized by ensDbFromGtf().

Thank you very much! 

Importing GTF file ... OK
Processing metadata ... OK
Processing genes ... 
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... OK
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o source ... OK
OK
Processing exons ... Error: subscript contains invalid names
In addition: Warning message:
In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :

   I'm missinoblem og column(s): 'entrezid'. The corresponding database column(s) will be empty! 

ensembl ensembldb ncbi37 v67 • 939 views
ADD COMMENTlink modified 22 months ago • written 22 months ago by floriandeckert30
Answer: ensembldb from M.musculus NCBI37 v67 gtf
1
gravatar for Johannes Rainer
22 months ago by
Johannes Rainer1.5k
Italy
Johannes Rainer1.5k wrote:

Instead of ensembldb you could use the GenomicFeatures package to create a TxDb object from the GFF (using makeTxDbFromGFF) - this should provide you the same information than an EnsDb.

To replicate (and fix) the problem you encounter in ensembldb it would be nice if you could provide the output of your sessionInfo() and the exact name of the GTF that you downloaded.

ADD COMMENTlink written 22 months ago by Johannes Rainer1.5k
Answer: ensembldb from M.musculus NCBI37 v67 gtf
1
gravatar for Johannes Rainer
22 months ago by
Johannes Rainer1.5k
Italy
Johannes Rainer1.5k wrote:

I could replicate your error. Apparently, the GTF file from Ensembl version 67 does not contain a field with the exon ID. Thus the function failed once it tried to extract these IDs. I fixed by generating artificial exon IDs in such cases (by concatenating the transcript ID and the index of the exon in this transcript). This fix will be in version 2.2.1 which I pushed to Bioconductor. It might however take some time until this version will be available (assuming you have the current Bioconductor version 3.6). I suggest you update ensembldb in the next days using BiocInstaller::biocLite("ensembldb").

cheers, jo

ADD COMMENTlink written 22 months ago by Johannes Rainer1.5k

Thank you very much! 

ADD REPLYlink written 22 months ago by floriandeckert30
Answer: ensembldb from M.musculus NCBI37 v67 gtf
1
gravatar for floriandeckert
22 months ago by
floriandeckert30 wrote:

Hi Johannes, 

Thank you very much for your quick response! 

Right now I am working with makeTxDbFromGFF. The problem is, that makeTxDB drops many of the metadata columns e.g. biotype. As far as I am concerned, it is not possible to add columns to makeTxDB since GenomicFeatures functions like transcriptsby will not recognize/work with them. 

*> input_gtf="Ensembl/Mus_musculus.NCBIM37.67.gtf"

> gtfTxDB=makeTxDbFromGFF(input_gtf, file="gtf")

Import genomic features from the file as a GRanges object ... OK

Prepare the 'metadata' data frame ... OK

Make the TxDb object ... OK

Warning message:

In .get_cds_IDX(type, phase) :

  The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored.

> keytypes(gtfTxDB)

[1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME" *

To get more information for my TxDB derived GRanges I am using a GRange object from import.gff

*> input_gtf="Ensembl/Mus_musculus.NCBIM37.67.gtf"

> gffGR=import.gff(input_gtf)*

That way I get all the information needed but have to combine and filter them on my own. Possible, but of course I would like to take advantages of the cool ensembldb Filter and Functions. 

The gtf File I am using is ftp://ftp.ensembl.org/pub/release-67/gtf/mus_musculus/Mus_musculus.NCBIM37.67.gtf.gz

ADD COMMENTlink written 22 months ago by floriandeckert30
1

Thanks for the sessionInfo. In a couple of days you'll be able to install ensembldb version 2.2.1 which would allow you to import the GTF file. Let me know if you run into other troubles.

ADD REPLYlink written 22 months ago by Johannes Rainer1.5k

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] AnnotationHub_2.10.1   dplyr_0.7.4            rtracklayer_1.38.3     ensembldb_2.2.0        AnnotationFilter_1.2.0
 [6] GenomicFeatures_1.30.2 AnnotationDbi_1.40.0   Biobase_2.38.0         GenomicRanges_1.30.1   GenomeInfoDb_1.14.0   
[11] IRanges_2.12.0         S4Vectors_0.16.0       BiocGenerics_0.24.0  

ADD REPLYlink written 22 months ago by floriandeckert30

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.8.1    progress_1.1.2                lattice_0.20-35               htmltools_0.3.6              
 [5] yaml_2.1.16                   interactiveDisplayBase_1.16.0 blob_1.1.0                    XML_3.98-1.9                 
 [9] rlang_0.1.6                   pillar_1.1.0                  glue_1.2.0                    DBI_0.7                      
[13] BiocParallel_1.12.0           bit64_0.9-7                   bindrcpp_0.2                  bindr_0.1                    
[17] matrixStats_0.53.0            GenomeInfoDbData_1.0.0        ProtGenerics_1.10.0           stringr_1.2.0                
[21] zlibbioc_1.24.0               Biostrings_2.46.0             memoise_1.1.0                 biomaRt_2.34.2               
[25] httpuv_1.3.5                  BiocInstaller_1.28.0          curl_3.1                      Rcpp_0.12.15                 
[29] xtable_1.8-2                  DelayedArray_0.4.1            XVector_0.18.0                mime_0.5                     
[33] bit_1.1-12                    Rsamtools_1.30.0              RMySQL_0.10.13                digest_0.6.15                
[37] stringi_1.1.6                 shiny_1.0.5                   grid_3.4.3                    tools_3.4.3                  
[41] bitops_1.0-6                  magrittr_1.5                  RCurl_1.95-4.10               lazyeval_0.2.1               
[45] tibble_1.4.2                  RSQLite_2.0                   pkgconfig_2.0.1               Matrix_1.2-12                
[49] prettyunits_1.0.2             assertthat_0.2.0              httr_1.3.1                    R6_2.2.2                     
[53] GenomicAlignments_1.14.1      compiler_3.4.3    

ADD REPLYlink written 22 months ago by floriandeckert30
Answer: ensembldb from M.musculus NCBI37 v67 gtf
1
gravatar for floriandeckert
22 months ago by
floriandeckert30 wrote:

Thank you very much for your support Johannes! I followed your suggsetion and added exon_id to a GRange Object loaded from the gtf file. Then I used ensDBFromGRanges: 

DB=ensDbFromGRanges(gtfGRmanip, 
                    outfile="Ensembl/manipulated/Mus_musculus.NCBIM37.67.splite", 
                    organism="Mus_musculus",
                    version="67", 
                    genomeVersion="NCBIM37"
                    )
ensDB=EnsDb(DB)

Everything works smoothly except that ensDbFromGRanges can not access the seqlength from ensembl. So I added them to the GRange object manually but ensDbFromGRanges does not recognize them. Any suggestions? 

Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL

Warning messages:
1: In ensDbFromGRanges(gtfGRmanip, outfile = "Ensembl/manipulated/Mus_musculus.NCBIM37.67.splite",  :
   I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
2: In download.file(url = paste0(base_url, "/", file_name), destfile = tmp_file,  :
  InternetOpenUrl failed: 'The login request was denied
'
3: In tryGetSeqinfoFromEnsembl(organism, version, seqnames = chroms$seq_name) :
  Unable to retrieve sequence lengths from Ensembl.
4: closing unused connection 3 (ftp://ftp.ensembl.org/pub/release-67/mysql/

ADD COMMENTlink written 22 months ago by floriandeckert30

Very nice. Regarding the problem to fetch the seqlengths - it worked for me - I guess your problem might be related to some network problems?
 

The warning also suggests this:

2: In download.file(url = paste0(base_url, "/", file_name), destfile = tmp_file,  :
  InternetOpenUrl failed: 'The login request was denied

 

ADD REPLYlink modified 22 months ago • written 22 months ago by Johannes Rainer1.5k
Answer: ensembldb from M.musculus NCBI37 v67 gtf
1
gravatar for floriandeckert
22 months ago by
floriandeckert30 wrote:

Hi Johannes, thank you very much for taking the time helping me! Everything works super smoothly now. Awesome and thank you again! 

ADD COMMENTlink written 22 months ago by floriandeckert30

Thanks for reporting!

ADD REPLYlink written 22 months ago by Johannes Rainer1.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 147 users visited in the last hour