While using the TxDb.Mmusculus.UCSC.mm10.ensGene package (installed with biocLite()), I noticed that the IDs appear to be neither Ensembl IDs nor Entrez IDs. I believe those gene IDs are wrong. (I am using Bioconductor 3.5 on R 3.4, and all packages have just been updated; see sessionInfo() below.)

    > library(TxDb.Mmusculus.UCSC.mm10.ensGene)
    > features <- exonsBy(TxDb.Mmusculus.UCSC.mm10.ensGene)
    > features <- keepStandardChromosomes(features, pruning.mode="fine")
    > TxDb.Mmusculus.UCSC.mm10.ensGene
    TxDb object:
    # Db type: TxDb
    # Supporting package: GenomicFeatures
    # Data source: UCSC
    # Genome: mm10
    # Organism: Mus musculus
    # Taxonomy ID: 10090
    # UCSC Table: ensGene
    # UCSC Track: Ensembl Genes
    # Resource URL: http://genome.ucsc.edu/
    # Type of Gene ID: Ensembl gene ID
    # Full dataset: yes
    # miRBase build ID: NA
    # transcript_nrow: 94647
    # exon_nrow: 348801
    # cds_nrow: 226312
    # Db created by: GenomicFeatures package from Bioconductor
    # Creation time: 2016-09-29 04:15:25 +0000 (Thu, 29 Sep 2016)
    # GenomicFeatures version at creation time: 1.25.17
    # RSQLite version at creation time: 1.0.0
    # DBSCHEMAVERSION: 1.1
    > tail(names(features))
    [1] "94642" "94643" "94644" "94645" "94646" "94647"
    > head(names(features))
    [1] "1" "2" "3" "4" "5" "6"

Then I tried to build the TxDb myself using makeTxDbFromUCSC(). I got a warning raised in .extractCdsLocsFromUCSCTxTable() about a UCSC data anomaly. The gene IDs that are supposed to be Ensembl IDs are again replaced by a sequence of integers starting from 1, so the IDs are still wrong.

    > tablename <- "ensGene"
    > txdb <- makeTxDbFromUCSC(genome="mm10",
    +                          tablename="ensGene")
    Download the ensGene table ... OK
    Extract the 'transcripts' data frame ... OK
    Extract the 'splicings' data frame ... OK
    Download and preprocess the 'chrominfo' data frame ... OK
    Prepare the 'metadata' data frame ... OK
    Make the TxDb object ... OK
    Warning message:
    In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
      UCSC data anomaly in 6805 transcript(s): the cds cumulative length is
      not a multiple of 3 for transcripts ‘ENSMUST00000155084’
      ‘ENSMUST00000141339’ ‘ENSMUST00000138768’ ‘ENSMUST00000138182’
      ‘ENSMUST00000134301’ ‘ENSMUST00000145077’ ‘ENSMUST00000147158’
      ‘ENSMUST00000145280’ ‘ENSMUST00000151888’ ‘ENSMUST00000027294’
      ‘ENSMUST00000142789’ ‘ENSMUST00000134238’ ‘ENSMUST00000160023’
      ‘ENSMUST00000161066’ ‘ENSMUST00000156887’ ‘ENSMUST00000147758’
      ‘ENSMUST00000135246’ ‘ENSMUST00000178156’ ‘ENSMUST00000178226’
      ‘ENSMUST00000177679’ ‘ENSMUST00000114423’ ‘ENSMUST00000172068’
      ‘ENSMUST00000167971’ ‘ENSMUST00000141545’ ‘ENSMUST00000142009’
      ‘ENSMUST00000135075’ ‘ENSMUST00000156636’ ‘ENSMUST00000123647’
      ‘ENSMUST00000134947’ ‘ENSMUST00000149732’ ‘ENSMUST00000152111’
      ‘ENSMUST00000150059’ ‘ENSMUST00000173235’ ‘ENSMUST00000166055’
    > features <- exonsBy(txdb)
    > head(names(features))
    [1] "1" "2" "3" "4" "5" "6"

Thanks,
Chao-Jen

------------------

    > sessionInfo()
    R Under development (unstable) (2017-01-15 r71979)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 14.04.3 LTS

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats4    parallel  stats     graphics  grDevices utils     datasets
    [8] methods   base

    other attached packages:
     [1] rtracklayer_1.35.6
     [2] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0
     [3] GenomicFeatures_1.27.9
     [4] AnnotationDbi_1.37.1
     [5] Biobase_2.35.0
     [6] GenomicRanges_1.27.23
     [7] GenomeInfoDb_1.11.9
     [8] IRanges_2.9.18
     [9] S4Vectors_0.13.15
    [10] BiocGenerics_0.21.3
    [11] BiocInstaller_1.25.3

    loaded via a namespace (and not attached):
     [1] Rcpp_0.12.9                XVector_0.15.2
     [3] GenomicAlignments_1.11.9   zlibbioc_1.21.0
     [5] BiocParallel_1.9.5         lattice_0.20-34
     [7] tools_3.4.0                grid_3.4.0
     [9] SummarizedExperiment_1.5.7 DBI_0.5-1
    [11] matrixStats_0.51.0         digest_0.6.12
    [13] Matrix_1.2-8               GenomeInfoDbData_0.99.0
    [15] bitops_1.0-6               RCurl_1.95-4.8
    [17] biomaRt_2.31.4             memoise_1.0.0
    [19] RSQLite_1.1-2              DelayedArray_0.1.7
    [21] compiler_3.4.0             Biostrings_2.43.4
    [23] Rsamtools_1.27.12          XML_3.98-1.5

FYI, I used makeTxDbFromBiomart() instead and got the right gene IDs and transcripts. However, the metadata was wrong due to an error detected by .Ensembl_getMySQLCoreDir(), which is not a big deal. sessionInfo() is the same as above. Thanks.

Wrong metadata

Isn't this `found 0 or more than 1 subdir for "mmusculus_gene_ensembl" dataset` error something to fix too? It seems to be due to the multiple mouse strains in Ensembl. The shortnames code in .Ensembl_getMySQLCoreDir should be:

    shortnames <- sub("(\\w)\\w*?_(\\w+?)_core_\\S+", "\\1\\2", core_dirs, perl=TRUE)

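To illustrate what the proposed pattern does, here is a minimal sketch that applies it to a few directory names. The entries in `core_dirs` are illustrative assumptions modeled on Ensembl's `<species>_core_<release>_<assembly>` naming, not a real FTP listing, and this is not the actual GenomicFeatures code.

```r
# Hypothetical core directory names (assumed for illustration only).
core_dirs <- c(
  "mus_musculus_core_88_38",            # reference mouse
  "mus_musculus_129s1svimj_core_88_1",  # a strain-specific database
  "canis_familiaris_core_88_31"
)

# Proposed pattern: first letter of the genus, then everything between the
# first underscore and "_core_". Because \w also matches "_", the lazy group
# keeps growing until it reaches "_core_", so the strain suffix is retained.
shortnames <- sub("(\\w)\\w*?_(\\w+?)_core_\\S+", "\\1\\2",
                  core_dirs, perl = TRUE)

print(shortnames)
# [1] "mmusculus"            "mmusculus_129s1svimj" "cfamiliaris"

# A lookup for the dataset's short name now hits exactly one directory.
hits <- core_dirs[shortnames == "mmusculus"]
print(hits)
```

With these inputs the reference mouse database and the strain database get distinct short names, so matching on "mmusculus" no longer finds more than one subdirectory.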
Hi Matt,

Sorry I missed this bug report about .Ensembl_getMySQLCoreDir(). Should be fixed in GenomicFeatures 1.28.1 (release) and 1.29.2 (devel). Please allow between 24h and 48h for these new versions to become available via biocLite().

Cheers,
H.

Ensembl just released 89 and it broke this again:
contains some .gz files named the same as the core dirs. I'm not sure if those are temporary or not, but non-directories should be excluded from Ensembl_listMySQLCoreDirs. Each line of the FTP listing starts with 'd' if the entry is a directory.

Hi Matt,

Thanks for keeping a diligent eye on this :-) Should be fixed in GenomicFeatures 1.28.2 (release) and 1.29.5 (devel). Please allow again between 24h and 48h for these new versions to become available via biocLite().

H.
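
For readers following along, the directory check suggested earlier in the thread (keep only FTP listing lines that start with 'd') can be sketched roughly as below. The `listing` lines are made-up examples in the usual Unix-style FTP LIST format, and the variable names are hypothetical; this is not the fix that was committed to GenomicFeatures.

```r
# Hypothetical FTP LIST-style lines (assumed for illustration only).
listing <- c(
  "drwxr-sr-x  2 ftp ftp   4096 May 10 2017 mus_musculus_core_89_38",
  "-rw-r--r--  1 ftp ftp 123456 May 10 2017 mus_musculus_core_89_38.gz",
  "drwxr-sr-x  2 ftp ftp   4096 May 10 2017 homo_sapiens_core_89_38"
)

# Directories are the entries whose permission string starts with "d".
is_dir <- startsWith(listing, "d")

# The entry name is the last whitespace-separated field of each line.
entries <- sub("^.*[[:space:]]", "", listing[is_dir])

# Keep only the core database directories.
core_dirs <- grep("_core_", entries, value = TRUE, fixed = TRUE)
print(core_dirs)
```

Filtering on the leading 'd' drops the `.gz` file that shares its name with a core directory, so only real directories remain as candidates.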