Question: Possible error in TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0?
8 months ago by
USA/Seattle/Fred Hutchinson Cancer Research Center
Chao-Jen Wong30 wrote:

While using the TxDb.Mmusculus.UCSC.mm10.ensGene package installed with biocLite(), I noticed that the IDs appear to be neither Ensembl IDs nor Entrez IDs. I believe those gene IDs are wrong. (I am using Bioconductor 3.5 on R 3.4, and all packages have just been updated; see sessionInfo() below.)

> library(TxDb.Mmusculus.UCSC.mm10.ensGene)
> features <- exonsBy(TxDb.Mmusculus.UCSC.mm10.ensGene)

> features <- keepStandardChromosomes(features, pruning.mode="fine")

> TxDb.Mmusculus.UCSC.mm10.ensGene
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: mm10
# Organism: Mus musculus
# Taxonomy ID: 10090
# UCSC Table: ensGene
# UCSC Track: Ensembl Genes
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Ensembl gene ID
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 94647
# exon_nrow: 348801
# cds_nrow: 226312
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2016-09-29 04:15:25 +0000 (Thu, 29 Sep 2016)
# GenomicFeatures version at creation time: 1.25.17
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1
> tail(names(features))
[1] "94642" "94643" "94644" "94645" "94646" "94647"
> head(names(features))
[1] "1" "2" "3" "4" "5" "6"

 

Then I tried to build the TxDb myself using makeTxDbFromUCSC() and got a warning message reporting a data anomaly detected in .extractCdsLocsFromUCSCTxTable. The gene IDs that are supposed to be Ensembl IDs are replaced by a sequence of integers starting from 1. The IDs are still wrong:

> tablename <- "ensGene"
> txdb <- makeTxDbFromUCSC(genome="mm10",
+                          tablename="ensGene")
Download the ensGene table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
  UCSC data anomaly in 6805 transcript(s): the cds cumulative length is
  not a multiple of 3 for transcripts ‘ENSMUST00000155084’
  ‘ENSMUST00000141339’ ‘ENSMUST00000138768’ ‘ENSMUST00000138182’
  ‘ENSMUST00000134301’ ‘ENSMUST00000145077’ ‘ENSMUST00000147158’
  ‘ENSMUST00000145280’ ‘ENSMUST00000151888’ ‘ENSMUST00000027294’
  ‘ENSMUST00000142789’ ‘ENSMUST00000134238’ ‘ENSMUST00000160023’
  ‘ENSMUST00000161066’ ‘ENSMUST00000156887’ ‘ENSMUST00000147758’
  ‘ENSMUST00000135246’ ‘ENSMUST00000178156’ ‘ENSMUST00000178226’
  ‘ENSMUST00000177679’ ‘ENSMUST00000114423’ ‘ENSMUST00000172068’
  ‘ENSMUST00000167971’ ‘ENSMUST00000141545’ ‘ENSMUST00000142009’
  ‘ENSMUST00000135075’ ‘ENSMUST00000156636’ ‘ENSMUST00000123647’
  ‘ENSMUST00000134947’ ‘ENSMUST00000149732’ ‘ENSMUST00000152111’
  ‘ENSMUST00000150059’ ‘ENSMUST00000173235’ ‘ENSMUST00000166055’

> features <- exonsBy(txdb)
> head(names(features))
[1] "1" "2" "3" "4" "5" "6"

Thanks,
Chao-Jen

------------------
> sessionInfo()
R Under development (unstable) (2017-01-15 r71979)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] rtracklayer_1.35.6
 [2] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0
 [3] GenomicFeatures_1.27.9
 [4] AnnotationDbi_1.37.1
 [5] Biobase_2.35.0
 [6] GenomicRanges_1.27.23
 [7] GenomeInfoDb_1.11.9
 [8] IRanges_2.9.18
 [9] S4Vectors_0.13.15
[10] BiocGenerics_0.21.3
[11] BiocInstaller_1.25.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9                XVector_0.15.2
 [3] GenomicAlignments_1.11.9   zlibbioc_1.21.0
 [5] BiocParallel_1.9.5         lattice_0.20-34
 [7] tools_3.4.0                grid_3.4.0
 [9] SummarizedExperiment_1.5.7 DBI_0.5-1
[11] matrixStats_0.51.0         digest_0.6.12
[13] Matrix_1.2-8               GenomeInfoDbData_0.99.0
[15] bitops_1.0-6               RCurl_1.95-4.8
[17] biomaRt_2.31.4             memoise_1.0.0
[19] RSQLite_1.1-2              DelayedArray_0.1.7
[21] compiler_3.4.0             Biostrings_2.43.4
[23] Rsamtools_1.27.12          XML_3.98-1.5
written 8 months ago by Chao-Jen Wong • modified 8 months ago by James W. MacDonald

FYI, I used makeTxDbFromBiomart() instead and got the right gene IDs and transcripts. However, the metadata is wrong because of an error raised by .Ensembl_getMySQLCoreDir(), which is not a big deal. sessionInfo() is the same as above. Thanks.

 

> txdb <- makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL",
+                             dataset="mmusculus_gene_ensembl")
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped)
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... Error in .Ensembl_getMySQLCoreDir(dataset) : 
  found 0 or more than 1 subdir for "mmusculus_gene_ensembl" dataset at ftp://ftp.ensembl.org/pub/current_mysql/
> head(names(f))
[1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
[4] "ENSMUSG00000000031" "ENSMUSG00000000037" "ENSMUSG00000000049"

 

Wrong metadata

> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: mm10
# Organism: Mus musculus
# Taxonomy ID: 10090
# UCSC Table: ensGene
# UCSC Track: Ensembl Genes
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Ensembl gene ID
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 94647
# exon_nrow: 348801
# cds_nrow: 226312
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2017-03-06 11:07:38 -0800 (Mon, 06 Mar 2017)
# GenomicFeatures version at creation time: 1.27.9
# RSQLite version at creation time: 1.1-2
# DBSCHEMAVERSION: 1.1
Reply written 8 months ago by Chao-Jen Wong

Isn't this `found 0 or more than 1 subdir for "mmusculus_gene_ensembl" dataset` error something to fix too? It seems to be due to the multiple mouse strains in Ensembl. The shortnames code in .Ensembl_getMySQLCoreDir should be:

shortnames <- sub("(\\w)\\w*?_(\\w+?)_core_\\S+", "\\1\\2", core_dirs, perl=TRUE)
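
To see what the proposed pattern does, here is a small sketch applied to two directory names. The names in `core_dirs` are made up for illustration (patterned after Ensembl core database directories, not copied from the FTP listing), and the `dataset`/`prefix` matching step is my own assumption about how the short names would be used; only the `sub()` call comes from the line above.

```r
# Hypothetical core directory names: the reference mouse genome and one strain.
core_dirs <- c("mus_musculus_core_88_38",
               "mus_musculus_129s1svimj_core_88_1")

# Proposed pattern: keep the first letter of the genus, then everything
# between the first underscore and "_core_" (species plus optional strain).
shortnames <- sub("(\\w)\\w*?_(\\w+?)_core_\\S+", "\\1\\2", core_dirs, perl=TRUE)

# Assumed lookup step: compare against the dataset name without its suffix.
dataset <- "mmusculus_gene_ensembl"
prefix <- sub("_gene_ensembl$", "", dataset)
matches <- core_dirs[shortnames == prefix]

print(shortnames)
print(matches)
```

With these inputs the strain directory gets its own short name, so exactly one directory is left for the dataset.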

 

Reply written 5 months ago by matt.chambers4210

Hi Matt,

Sorry I missed this bug report about .Ensembl_getMySQLCoreDir(). Should be fixed in GenomicFeatures 1.28.1 (release) and 1.29.2 (devel). Please allow between 24h and 48h for these new versions to become available via biocLite().

Cheers,

H.

Reply written 5 months ago by Hervé Pagès

Ensembl just released 89 and it broke this again:

ftp://ftp.ensembl.org/pub/current_mysql/

contains some .gz files named the same as the core dirs. I'm not sure whether those are temporary, but non-directories should be excluded from Ensembl_listMySQLCoreDirs. A line in the FTP listing starts with 'd' if the entry is a directory.
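
A rough sketch of that filter, assuming a Unix-style long listing. The two `listing` lines are invented for illustration, and this is not the actual GenomicFeatures code:

```r
# Hypothetical lines from a long-format FTP listing: one directory, one file.
listing <- c(
  "drwxr-sr-x    2 ftp  ftp  4096 May 10  2017 mus_musculus_core_89_38",
  "-rw-r--r--    1 ftp  ftp  1234 May 10  2017 mus_musculus_core_89_38.gz"
)

# Keep only entries whose permission string starts with 'd' (directories).
is_dir <- grepl("^d", listing)

# The entry name is the last whitespace-separated field of each line.
core_dirs <- sub(".*[[:space:]]", "", listing[is_dir])
print(core_dirs)
```

Taking the last field as the name assumes directory names contain no spaces, which holds for the core dirs discussed here.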

Reply written 5 months ago by matt.chambers4210

Hi Matt,

Thanks for keeping a diligent eye on this :-) Should be fixed in GenomicFeatures 1.28.2 (release) and 1.29.5 (devel). Please allow again between 24h and 48h for these new versions to become available via biocLite().

H.

Reply written 5 months ago by Hervé Pagès
8 months ago by
United States
James W. MacDonald45k wrote:
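You have to specify if you want the names or not. By default exonsBy() groups exons by transcript and names the list elements with the internal transcript IDs (the integers you are seeing), not with the Ensembl IDs.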

> exonsBy(TxDb.Mmusculus.UCSC.mm10.ensGene, use.names = TRUE)
GRangesList object of length 94647:
$ENSMUST00000160944
GRanges object with 1 range and 3 metadata columns:
      seqnames             ranges strand |   exon_id   exon_name exon_rank
         <Rle>          <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     chr1 [3054233, 3054733]      + |         1        <NA>         1

$ENSMUST00000082908
GRanges object with 1 range and 3 metadata columns:
      seqnames             ranges strand | exon_id exon_name exon_rank
  [1]     chr1 [3102016, 3102125]      + |       2      <NA>         1

$ENSMUST00000161581
GRanges object with 2 ranges and 3 metadata columns:
      seqnames             ranges strand | exon_id exon_name exon_rank
  [1]     chr1 [3466587, 3466687]      + |       3      <NA>         1
  [2]     chr1 [3513405, 3513553]      + |       4      <NA>         2

...

> exonsBy(TxDb.Mmusculus.UCSC.mm10.ensGene)
GRangesList object of length 94647:
$1
GRanges object with 1 range and 3 metadata columns:
      seqnames             ranges strand |   exon_id   exon_name exon_rank
         <Rle>          <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     chr1 [3054233, 3054733]      + |         1        <NA>         1

$2
GRanges object with 1 range and 3 metadata columns:
      seqnames             ranges strand | exon_id exon_name exon_rank
  [1]     chr1 [3102016, 3102125]      + |       2      <NA>         1

$3
GRanges object with 2 ranges and 3 metadata columns:
      seqnames             ranges strand | exon_id exon_name exon_rank
  [1]     chr1 [3466587, 3466687]      + |       3      <NA>         1
  [2]     chr1 [3513405, 3513553]      + |       4      <NA>         2

...
<94644 more elements>
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
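
As a further sketch (not part of the output above): if gene-level IDs are what you are after, exonsBy() also takes a `by` argument. This assumes the TxDb.Mmusculus.UCSC.mm10.ensGene package is installed.

```r
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene

# Group exons by transcript and name each list element by transcript name.
by_tx <- exonsBy(txdb, by = "tx", use.names = TRUE)

# Group exons by gene; the list names are then the gene IDs.
by_gene <- exonsBy(txdb, by = "gene")

head(names(by_tx))
head(names(by_gene))
```

Which grouping you want depends on whether downstream steps (counting reads, for example) work per transcript or per gene.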
Answer written 8 months ago by James W. MacDonald

Thanks a lot. I did not know that. ha......

Reply written 8 months ago by Chao-Jen Wong