I am getting an error when I try to create an EnsDb object using a GENCODE .gtf annotation file for GRCh38 downloaded directly from the GENCODE website.
> gtffile <- "/user/Downloads/gencode.v26.annotation.gtf"
> DB <- ensDbFromGtf(gtffile)
Error in `colnames<-`(`*tmp*`, value = c("name", "value")) :
  length of 'dimnames' [2] not equal to array extent
I tried downloading the .gtf from Ensembl instead, in case the issue was the file, but I get a different set of errors:
> gtffile <- "/Users/Bongani/Downloads/Homo_sapiens.GRCh38.88.gtf.gz"
> DB <- ensDbFromGtf(gtffile)
Error in checkValidEnsDb(EnsDb(dbname), verbose = verbose) :
  Provided exon index in transcript does not match with ordering of the exons by chromosomal coordinates for 1 of the 101167 transcripts encoded on the + strand!
In addition: Warning messages:
1: In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
2: In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
3: In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
4: In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
5: In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
6: In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
Is there a package I am missing? The errors seem to point to a formatting problem, but I have not modified the .gtf files in any way before passing them to ensDbFromGtf.
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] ensembldb_1.2.2         GenomicFeatures_1.22.13 AnnotationDbi_1.32.3
 [4] Biobase_2.30.0          GenomicRanges_1.22.4    GenomeInfoDb_1.6.3
 [7] IRanges_2.4.8           S4Vectors_0.8.11        AnnotationHub_2.2.5
[10] BiocGenerics_0.16.1
I checked the problematic transcript: ENST00000639671 was assigned to gene ENSG00000141198 in Ensembl release 88, but was re-assigned to ENSG00000166260 in Ensembl 90. Both genes are encoded in roughly the same region of chromosome 17, but on opposite strands. The order of the exons of ENST00000639671 was (and is) that of a transcript encoded on the reverse strand. With the transcript assigned to the + strand gene ENSG00000141198, this order no longer matched the expected order of the chromosomal start coordinates. It seems to me that this might have been a bug in the Ensembl annotation pipeline. It has been fixed since Ensembl 90.
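To illustrate the kind of consistency check that fails here, below is a minimal base R sketch. It is not the actual code behind checkValidEnsDb; the function name exon_order_ok and the coordinates are made up for illustration. The assumption is that exon indices should follow increasing start coordinates on the + strand and decreasing start coordinates on the - strand.

```r
# Hypothetical helper (not part of ensembldb): check whether the exon index
# of a single transcript agrees with the chromosomal order of its exons.
exon_order_ok <- function(exon_idx, exon_start, strand) {
  # Put the exon start coordinates into exon-index order.
  starts <- exon_start[order(exon_idx)]
  if (strand == "+") {
    # Forward strand: exon 1 is the left-most exon.
    all(diff(starts) > 0)
  } else {
    # Reverse strand: exon 1 is the right-most exon.
    all(diff(starts) < 0)
  }
}

# Made-up coordinates for a transcript whose exons are numbered in
# reverse-strand order.
idx   <- 1:3
start <- c(3000, 2000, 1000)

ok_minus <- exon_order_ok(idx, start, "-")  # consistent with the - strand
ok_plus  <- exon_order_ok(idx, start, "+")  # inconsistent with the + strand
```

A transcript with reverse-strand exon numbering passes when treated as a - strand transcript and fails when its gene is annotated on the + strand, which is the situation described above.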
Thanks for the feedback!
1) I was able to construct the package using Ensembl 90 without a problem:
This created a folder (containing "inst", "man" and "R" subfolders as well as the NAMESPACE and DESCRIPTION files), which I installed in R using devtools::install. I haven't tested it on a dataset yet, but it looks good:

2) With regard to fetching EnsDb databases from AnnotationHub, this was the first thing I tried, but I kept getting back an empty query:

I just updated to R version 3.4.1 and I am using Bioconductor version 3.5 (BiocInstaller 1.26.1), but the result is still the same.
It is strange that you don't get any results for the query in AnnotationHub. Could you try to delete the .AnnotationHub folder in your home directory and re-run the commands above? Possibly the AnnotationHub database is still the one from your previous installation.

I deleted the .AnnotationHub folder in /Library/Frameworks/R.framework/Versions/3.4/Resources/library/AnnotationHub and the result is unchanged. I still get 0 results after reinstalling the AnnotationHub database using BiocInstaller.

Mthabisi
Sorry, I was unclear. I didn't mean to remove the AnnotationHub folder from the R library. AnnotationHub creates a hidden folder called ".AnnotationHub" (note the dot before AnnotationHub), usually in the user's home directory. The cache as well as the database is stored there.

You should re-install AnnotationHub with BiocInstaller::biocLite("AnnotationHub"), remove the hidden ".AnnotationHub" folder and try again.

I missed that, thanks for the catch! Deleting the .AnnotationHub folder cleared the cache and did the trick.

I ran ensembldb using both the EnsDb I made from the Ensembl GRCh38 .gtf annotation file (v90) and the one I downloaded through AnnotationHub, and both worked well. The only point I would make is with regard to the annotation files from GENCODE. They append the gene/transcript version number to the Ensembl gene IDs. This results in 0 matches when using mapIds(). I fixed this by removing the version numbers from the gene IDs on the command line after doing the counts. I will probably just download all the fasta and annotation files from Ensembl in future.

Mthabisi
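For anyone who prefers to do the same clean-up inside R rather than on the command line, here is a minimal base R sketch. The function name strip_version and the example IDs are only for illustration, and the version numbers shown are invented. It assumes the only dot in each ID is the one that introduces the version suffix.

```r
# Hypothetical helper: drop the ".<version>" suffix that GENCODE appends
# to Ensembl gene/transcript IDs, e.g. "ENSG00000141198.10".
strip_version <- function(ids) {
  # Remove the first dot and everything after it; IDs without a dot
  # are returned unchanged.
  sub("\\..*$", "", ids)
}

# Example IDs; the version suffixes here are made up.
gencode_ids <- c("ENSG00000141198.10", "ENST00000639671.1", "ENSG00000166260")
plain_ids   <- strip_version(gencode_ids)
```

The stripped IDs could then be used as the keys passed to mapIds(), or as row names of the count matrix.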