Question

ensembldb from M.musculus NCBI37 v67 gtf

0

Entering edit mode

floriandeckert ▴ 30

@floriandeckert-13733

Last seen 11 weeks ago

Austria

Hello

I am trying to perform a integrated analysis (RNAseq, ChIP, WGBS). To do so, I want to construct a custom annotation base which consists of GRange Objects with extra meta data columns for filtering (number of overlapping genes, methylation data, promoter status etc.). I am quite successful but I still have not fully sorted out issues about the assembly version of the genome, related annotations, and packages available to proceed those.

I want to use the Ensembl Mus musculus NCBI37 build version 67. To my knowledge, this is the last annotation version (v67) of the NCBI37 genome assembly which corresponds to the mm9 UCSC assembly. I decided to use the NCBI37 build for two reasons. First, the annotation if more complete compared to GRCm38 builds. Second, I find more useful external data which was created with the NCBI37 build.

Said that, I became quite frustrated realizing, that many bioconductor packages for ensembl annotations are only suited for GRCm38 builds. Generally, I can not access any ensembl data via bioMart when trying to get if for NCBI37 v67. Alternatively I downloaded the gtf/gff files from the ensembl archive and tried ensDbFromGtf(). But the module fails to recognize exon ids (to my knowledge the problem with the entrenzids is irrelevant).

I read about several problems related to mine with different bioconductor packages (e.g. cummRbund). But it seems to me that this is a general problem that BiomRt can not access old assembly versions. Therefore all packages which are based on access to ensembl databases fail when trying to get NCBI37 v67. Is that correct?

I highly appreciate any help and hope my question is not to general but points to the right issue. If I am right and there is no suited workaround to get enseDB for NCBI37 v67, I would also like to ask if someone knows how to modify a gtf file so it will be recognized by ensDbFromGtf().

Thank you very much!

Importing GTF file ... OK
Processing metadata ... OK
Processing genes ...
Attribute availability:
o gene_id ... OK
o gene_name ... OK
o entrezid ... Nope
o gene_biotype ... OK
OK
Processing transcripts ...
Attribute availability:
o transcript_id ... OK
o gene_id ... OK
o source ... OK
OK
Processing exons ... Error: subscript contains invalid names
In addition: Warning message:
In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism, :

I'm missinoblem og column(s): 'entrezid'. The corresponding database column(s) will be empty!

ensembldb NCBI37 v67 ensembl • 3.7k views

ADD COMMENT • link 6.9 years ago floriandeckert ▴ 30

score 1 · Answer 1 · 2018-02-01

Instead of ensembldb you could use the GenomicFeatures package to create a TxDb object from the GFF (using makeTxDbFromGFF) - this should provide you the same information than an EnsDb.

To replicate (and fix) the problem you encounter in ensembldb it would be nice if you could provide the output of your sessionInfo() and the exact name of the GTF that you downloaded.

score 1 · Answer 2 · 2018-02-01

1

Entering edit mode

Johannes Rainer ★ 2.1k

@johannes-rainer-6987

Last seen 8 weeks ago

Italy

I could replicate your error. Apparently, the GTF file from Ensembl version 67 does not contain a field with the exon ID. Thus the function failed once it tried to extract these IDs. I fixed by generating artificial exon IDs in such cases (by concatenating the transcript ID and the index of the exon in this transcript). This fix will be in version 2.2.1 which I pushed to Bioconductor. It might however take some time until this version will be available (assuming you have the current Bioconductor version 3.6). I suggest you update ensembldb in the next days using BiocInstaller::biocLite("ensembldb").

cheers, jo

ADD COMMENT • link 6.9 years ago Johannes Rainer ★ 2.1k

0

Entering edit mode

Thank you very much!

ADD REPLY • link 6.9 years ago floriandeckert ▴ 30

score 1 · Answer 3 · 2018-02-01

1

Entering edit mode

floriandeckert ▴ 30

@floriandeckert-13733

Last seen 11 weeks ago

Austria

Hi Johannes,

Thank you very much for your quick response!

Right now I am working with makeTxDbFromGFF. The problem is, that makeTxDB drops many of the metadata columns e.g. biotype. As far as I am concerned, it is not possible to add columns to makeTxDB since GenomicFeatures functions like transcriptsby will not recognize/work with them.

*> input_gtf="Ensembl/Mus_musculus.NCBIM37.67.gtf"

> gtfTxDB=makeTxDbFromGFF(input_gtf, file="gtf")

Import genomic features from the file as a GRanges object ... OK

Prepare the 'metadata' data frame ... OK

Make the TxDb object ... OK

Warning message:

In .get_cds_IDX(type, phase) :

The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored.

> keytypes(gtfTxDB)

[1] "CDSID" "CDSNAME" "EXONID" "EXONNAME" "GENEID" "TXID" "TXNAME" *

To get more information for my TxDB derived GRanges I am using a GRange object from import.gff

*> input_gtf="Ensembl/Mus_musculus.NCBIM37.67.gtf"

> gffGR=import.gff(input_gtf)*

That way I get all the information needed but have to combine and filter them on my own. Possible, but of course I would like to take advantages of the cool ensembldb Filter and Functions.

The gtf File I am using is ftp://ftp.ensembl.org/pub/release-67/gtf/mus_musculus/Mus_musculus.NCBIM37.67.gtf.gz

ADD COMMENT • link 6.9 years ago floriandeckert ▴ 30

1

Entering edit mode

Thanks for the sessionInfo. In a couple of days you'll be able to install ensembldb version 2.2.1 which would allow you to import the GTF file. Let me know if you run into other troubles.

ADD REPLY • link 6.9 years ago Johannes Rainer ★ 2.1k

0

Entering edit mode

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] AnnotationHub_2.10.1 dplyr_0.7.4 rtracklayer_1.38.3 ensembldb_2.2.0 AnnotationFilter_1.2.0
[6] GenomicFeatures_1.30.2 AnnotationDbi_1.40.0 Biobase_2.38.0 GenomicRanges_1.30.1 GenomeInfoDb_1.14.0
[11] IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0

ADD REPLY • link 6.9 years ago floriandeckert ▴ 30

0

Entering edit mode

loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.8.1 progress_1.1.2 lattice_0.20-35 htmltools_0.3.6
[5] yaml_2.1.16 interactiveDisplayBase_1.16.0 blob_1.1.0 XML_3.98-1.9
[9] rlang_0.1.6 pillar_1.1.0 glue_1.2.0 DBI_0.7
[13] BiocParallel_1.12.0 bit64_0.9-7 bindrcpp_0.2 bindr_0.1
[17] matrixStats_0.53.0 GenomeInfoDbData_1.0.0 ProtGenerics_1.10.0 stringr_1.2.0
[21] zlibbioc_1.24.0 Biostrings_2.46.0 memoise_1.1.0 biomaRt_2.34.2
[25] httpuv_1.3.5 BiocInstaller_1.28.0 curl_3.1 Rcpp_0.12.15
[29] xtable_1.8-2 DelayedArray_0.4.1 XVector_0.18.0 mime_0.5
[33] bit_1.1-12 Rsamtools_1.30.0 RMySQL_0.10.13 digest_0.6.15
[37] stringi_1.1.6 shiny_1.0.5 grid_3.4.3 tools_3.4.3
[41] bitops_1.0-6 magrittr_1.5 RCurl_1.95-4.10 lazyeval_0.2.1
[45] tibble_1.4.2 RSQLite_2.0 pkgconfig_2.0.1 Matrix_1.2-12
[49] prettyunits_1.0.2 assertthat_0.2.0 httr_1.3.1 R6_2.2.2
[53] GenomicAlignments_1.14.1 compiler_3.4.3

ADD REPLY • link 6.9 years ago floriandeckert ▴ 30

score 1 · Answer 4 · 2018-02-01

Thank you very much for your support Johannes! I followed your suggsetion and added exon_id to a GRange Object loaded from the gtf file. Then I used ensDBFromGRanges:

DB=ensDbFromGRanges(gtfGRmanip,
outfile="Ensembl/manipulated/Mus_musculus.NCBIM37.67.splite",
organism="Mus_musculus",
version="67",
genomeVersion="NCBIM37"
)
ensDB=EnsDb(DB)

Everything works smoothly except that ensDbFromGRanges can not access the seqlength from ensembl. So I added them to the GRange object manually but ensDbFromGRanges does not recognize them. Any suggestions?

Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL

Warning messages:
1: In ensDbFromGRanges(gtfGRmanip, outfile = "Ensembl/manipulated/Mus_musculus.NCBIM37.67.splite", :
I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
2: In download.file(url = paste0(base_url, "/", file_name), destfile = tmp_file, :
InternetOpenUrl failed: 'The login request was denied
'
3: In tryGetSeqinfoFromEnsembl(organism, version, seqnames = chroms$seq_name) :
Unable to retrieve sequence lengths from Ensembl.
4: closing unused connection 3 (ftp://ftp.ensembl.org/pub/release-67/mysql/)

score 1 · Answer 5 · 2018-02-14

1

Entering edit mode

floriandeckert ▴ 30

@floriandeckert-13733

Last seen 11 weeks ago

Austria

Hi Johannes, thank you very much for taking the time helping me! Everything works super smoothly now. Awesome and thank you again!