I am trying to perform an integrated analysis (RNA-seq, ChIP, WGBS). To do so, I want to build a custom annotation base consisting of GRanges objects with extra metadata columns for filtering (number of overlapping genes, methylation data, promoter status, etc.). I have been fairly successful, but I still have not fully sorted out the issues around the genome assembly version, the related annotations, and the packages available to process them.
I want to use the Ensembl Mus musculus NCBI37 build, version 67. To my knowledge, v67 is the last Ensembl annotation release for the NCBI37 genome assembly, which corresponds to the UCSC mm9 assembly. I decided to use the NCBI37 build for two reasons. First, the annotation is more complete than for the GRCm38 builds. Second, I find more useful external data that was generated against the NCBI37 build.
That said, I became quite frustrated when I realized that many Bioconductor packages for Ensembl annotations only support GRCm38 builds. In general, I cannot retrieve any Ensembl data via biomaRt when I try to get it for NCBI37 v67. As an alternative, I downloaded the GTF/GFF files from the Ensembl archive and tried ensDbFromGtf(). But the function fails to recognize the exon IDs (to my knowledge, the problem with the entrezid column is irrelevant).
I have read about several problems related to mine with different Bioconductor packages (e.g. cummeRbund). But it seems to me that the general problem is that biomaRt cannot access old assembly versions, so all packages that rely on access to the Ensembl databases fail when trying to get NCBI37 v67. Is that correct?
I would highly appreciate any help, and I hope my question is not too general but points to the right issue. If I am right and there is no suitable workaround to get an EnsDb for NCBI37 v67, I would also like to ask whether someone knows how to modify a GTF file so that it is recognized by ensDbFromGtf().
Thank you very much!
Importing GTF file ... OK
Processing metadata ... OK
Processing genes ...
o gene_id ... OK
o gene_name ... OK
o entrezid ... Nope
o gene_biotype ... OK
Processing transcripts ...
o transcript_id ... OK
o gene_id ... OK
o source ... OK
Processing exons ... Error: subscript contains invalid names
In addition: Warning message:
In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism, :
I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
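
To narrow down the exon error above, it might help to check which attributes the exon rows of the GTF actually carry. Below is a minimal Python sketch for that (it is not part of ensembldb). The function names and the two example rows are made up for illustration, and I am only assuming that a missing exon_id attribute is what triggers the error.

```python
import re
from collections import Counter

# Matches key "value" pairs in the 9th (attribute) column of a GTF line.
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_attributes(field):
    """Turn a GTF attribute string into a dict of key -> value."""
    return dict(ATTR_RE.findall(field))


def count_exon_attributes(lines):
    """Return (number of exon rows, Counter of attribute keys on those rows)."""
    counts = Counter()
    n_exons = 0
    for line in lines:
        # Skip comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # A GTF row has 9 tab-separated columns; column 3 is the feature type.
        if len(cols) < 9 or cols[2] != "exon":
            continue
        n_exons += 1
        counts.update(parse_attributes(cols[8]).keys())
    return n_exons, counts


# Illustrative rows only (hypothetical IDs and coordinates, not copied
# from the real Ensembl file).
example = [
    '1\tprotein_coding\texon\t1000\t1200\t.\t+\t.\t'
    'gene_id "GENE0001"; transcript_id "TX0001"; exon_number "1"; '
    'gene_name "Abc1";\n',
    '1\tprotein_coding\texon\t1500\t1700\t.\t+\t.\t'
    'gene_id "GENE0001"; transcript_id "TX0001"; exon_number "2"; '
    'gene_name "Abc1";\n',
]

n_exons, counts = count_exon_attributes(example)
for key in ("gene_id", "transcript_id", "exon_id", "exon_number"):
    print(f"{key}: present on {counts[key]} of {n_exons} exon rows")
```

For the real file, the list of lines could be replaced by an open file handle, e.g. `with open(path) as fh: count_exon_attributes(fh)`, where `path` points to the unpacked GTF. If exon_id turns out to be absent from every exon row, that would at least be consistent with the "Processing exons" failure, although I have not confirmed that this is the cause.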