Question

Possible error in TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0?

Chao-Jen Wong ▴ 40

While using the TxDb.Mmusculus.UCSC.mm10.ensGene package, installed with biocLite(), I noticed that the IDs appear to be neither Ensembl IDs nor Entrez IDs. I believe those gene IDs are wrong. (I am using Bioconductor 3.5 on R 3.4, and all packages are up to date; see sessionInfo() below.)

> library(TxDb.Mmusculus.UCSC.mm10.ensGene)
> features <- exonsBy(TxDb.Mmusculus.UCSC.mm10.ensGene)

> features <- keepStandardChromosomes(features, pruning.mode="fine")

> TxDb.Mmusculus.UCSC.mm10.ensGene
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: mm10
# Organism: Mus musculus
# Taxonomy ID: 10090
# UCSC Table: ensGene
# UCSC Track: Ensembl Genes
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Ensembl gene ID
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 94647
# exon_nrow: 348801
# cds_nrow: 226312
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2016-09-29 04:15:25 +0000 (Thu, 29 Sep 2016)
# GenomicFeatures version at creation time: 1.25.17
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1
> tail(names(features))
[1] "94642" "94643" "94644" "94645" "94646" "94647"
> head(names(features))
[1] "1" "2" "3" "4" "5" "6"

I then tried to build the TxDb myself with makeTxDbFromUCSC() and got a warning about a data anomaly detected in .extractCdsLocsFromUCSCTxTable. The gene IDs, which are supposed to be Ensembl IDs, are replaced by a sequence of integers starting from 1, so the IDs are still wrong.

> tablename <- "ensGene"
> txdb <- makeTxDbFromUCSC(genome="mm10",
+                          tablename="ensGene")
Download the ensGene table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
  UCSC data anomaly in 6805 transcript(s): the cds cumulative length is
  not a multiple of 3 for transcripts ‘ENSMUST00000155084’
  ‘ENSMUST00000141339’ ‘ENSMUST00000138768’ ‘ENSMUST00000138182’
  ‘ENSMUST00000134301’ ‘ENSMUST00000145077’ ‘ENSMUST00000147158’
  ‘ENSMUST00000145280’ ‘ENSMUST00000151888’ ‘ENSMUST00000027294’
  ‘ENSMUST00000142789’ ‘ENSMUST00000134238’ ‘ENSMUST00000160023’
  ‘ENSMUST00000161066’ ‘ENSMUST00000156887’ ‘ENSMUST00000147758’
  ‘ENSMUST00000135246’ ‘ENSMUST00000178156’ ‘ENSMUST00000178226’
  ‘ENSMUST00000177679’ ‘ENSMUST00000114423’ ‘ENSMUST00000172068’
  ‘ENSMUST00000167971’ ‘ENSMUST00000141545’ ‘ENSMUST00000142009’
  ‘ENSMUST00000135075’ ‘ENSMUST00000156636’ ‘ENSMUST00000123647’
  ‘ENSMUST00000134947’ ‘ENSMUST00000149732’ ‘ENSMUST00000152111’
  ‘ENSMUST00000150059’ ‘ENSMUST00000173235’ ‘ENSMUST00000166055’
> features <- exonsBy(txdb)
> head(names(features))
[1] "1" "2" "3" "4" "5" "6"

Thanks,
Chao-Jen

------------------
> sessionInfo()
R Under development (unstable) (2017-01-15 r71979)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] rtracklayer_1.35.6
 [2] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0
 [3] GenomicFeatures_1.27.9
 [4] AnnotationDbi_1.37.1
 [5] Biobase_2.35.0
 [6] GenomicRanges_1.27.23
 [7] GenomeInfoDb_1.11.9
 [8] IRanges_2.9.18
 [9] S4Vectors_0.13.15
[10] BiocGenerics_0.21.3
[11] BiocInstaller_1.25.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9                XVector_0.15.2
 [3] GenomicAlignments_1.11.9   zlibbioc_1.21.0
 [5] BiocParallel_1.9.5         lattice_0.20-34
 [7] tools_3.4.0                grid_3.4.0
 [9] SummarizedExperiment_1.5.7 DBI_0.5-1
[11] matrixStats_0.51.0         digest_0.6.12
[13] Matrix_1.2-8               GenomeInfoDbData_0.99.0
[15] bitops_1.0-6               RCurl_1.95-4.8
[17] biomaRt_2.31.4             memoise_1.0.0
[19] RSQLite_1.1-2              DelayedArray_0.1.7
[21] compiler_3.4.0             Biostrings_2.43.4
[23] Rsamtools_1.27.12          XML_3.98-1.5
>

ADD COMMENT • link updated 8.8 years ago by James W. MacDonald 68k • written 8.8 years ago by Chao-Jen Wong ▴ 40

FYI, I used makeTxDbFromBiomart() instead and got the right gene IDs and transcripts. However, the metadata is wrong because of an error raised by .Ensembl_getMySQLCoreDir(), which is not a big deal. sessionInfo() is the same as above. Thanks.

> txdb <- makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL",
+                             dataset="mmusculus_gene_ensembl")
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped)
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... Error in .Ensembl_getMySQLCoreDir(dataset) :
  found 0 or more than 1 subdir for "mmusculus_gene_ensembl" dataset at ftp://ftp.ensembl.org/pub/current_mysql/
> head(names(f))
[1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
[4] "ENSMUSG00000000031" "ENSMUSG00000000037" "ENSMUSG00000000049"

Wrong metadata:

> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: mm10
# Organism: Mus musculus
# Taxonomy ID: 10090
# UCSC Table: ensGene
# UCSC Track: Ensembl Genes
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Ensembl gene ID
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 94647
# exon_nrow: 348801
# cds_nrow: 226312
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2017-03-06 11:07:38 -0800 (Mon, 06 Mar 2017)
# GenomicFeatures version at creation time: 1.27.9
# RSQLite version at creation time: 1.1-2
# DBSCHEMAVERSION: 1.1

ADD REPLY • link 8.8 years ago Chao-Jen Wong ▴ 40

Isn't this `found 0 or more than 1 subdir for "mmusculus_gene_ensembl" dataset` error something to fix too? It seems to be caused by the multiple mouse strains now in Ensembl. The shortnames code in .Ensembl_getMySQLCoreDir should be:

shortnames <- sub("(\\w)\\w*?_(\\w+?)_core_\\S+", "\\1\\2", core_dirs, perl=TRUE)
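To see what the suggested expression does, here is a minimal base-R sketch. The directory names in `core_dirs` are made-up examples in the style of the Ensembl FTP listing, and the step that strips `_gene_ensembl` from the dataset name is an assumption about how the comparison might be done, not a copy of the GenomicFeatures internals.

```r
# Hypothetical core directory names (illustrative only, not a real listing).
core_dirs <- c(
  "mus_musculus_core_89_38",
  "mus_musculus_129s1svimj_core_89_1",
  "homo_sapiens_core_89_38"
)

# The expression proposed above: keep the first letter of the genus and
# everything between the first underscore and "_core_".
shortnames <- sub("(\\w)\\w*?_(\\w+?)_core_\\S+", "\\1\\2", core_dirs, perl = TRUE)
print(shortnames)

# Assumed lookup step: derive the short name from the BioMart dataset name
# and keep the directories whose short name matches exactly.
dataset <- "mmusculus_gene_ensembl"
wanted <- sub("_gene_ensembl$", "", dataset)
hit <- core_dirs[shortnames == wanted]
print(hit)
```

Because the strain suffix stays in the short name, the strain directory no longer collides with the reference mouse directory, so only one entry matches the dataset.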

ADD REPLY • link 8.6 years ago matt.chambers42 ▴ 10

Hi Matt,

Sorry I missed this bug report about .Ensembl_getMySQLCoreDir(). Should be fixed in GenomicFeatures 1.28.1 (release) and 1.29.2 (devel). Please allow between 24h and 48h for these new versions to become available via biocLite().

Cheers,

H.

ADD REPLY • link 8.6 years ago Hervé Pagès 16k

Ensembl just released 89 and it broke this again:

ftp://ftp.ensembl.org/pub/current_mysql/

contains some .gz files with the same names as the core dirs. I'm not sure whether those are temporary, but non-directories should be excluded from Ensembl_listMySQLCoreDirs. A line of the FTP listing starts with 'd' if the entry is a directory.
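The directory filter could look roughly like the base-R sketch below. The `listing` lines are invented, Unix-style examples (the real server output may be formatted differently), and none of the variable names come from GenomicFeatures.

```r
# Hypothetical long-format FTP listing lines (illustrative only).
listing <- c(
  "drwxr-sr-x    2 ftp      ftp        4096 May 10  2017 mus_musculus_core_89_38",
  "-rw-r--r--    1 ftp      ftp      123456 May 10  2017 mus_musculus_core_89_38.gz",
  "drwxr-sr-x    2 ftp      ftp        4096 May 10  2017 homo_sapiens_core_89_38"
)

# Keep only the entries whose permission string starts with 'd' (directories).
is_dir <- grepl("^d", listing)

# The entry name is the last whitespace-separated field of each line.
entry_names <- sub("^.*\\s", "", listing[is_dir])

# Of the remaining directories, keep the core databases.
core_dirs <- grep("_core_", entry_names, value = TRUE, fixed = TRUE)
print(core_dirs)
```

With this filter the `.gz` file that shares its prefix with a core directory is dropped before any name matching happens.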

ADD REPLY • link 8.6 years ago matt.chambers42 ▴ 10

Hi Matt,

Thanks for keeping a diligent eye on this :-) Should be fixed in GenomicFeatures 1.28.2 (release) and 1.29.5 (devel). Please allow again between 24h and 48h for these new versions to become available via biocLite().

H.

ADD REPLY • link 8.6 years ago Hervé Pagès 16k

score 2 · Answer 1 · 2017-03-06

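You have to specify whether you want the names. By default exonsBy() groups exons by transcript (by = "tx") and names the list elements with the internal transcript IDs, which are integers; with use.names = TRUE you get the transcript names instead.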

> exonsBy(TxDb.Mmusculus.UCSC.mm10.ensGene, use.names = TRUE)
GRangesList object of length 94647:
$ENSMUST00000160944
GRanges object with 1 range and 3 metadata columns:
      seqnames             ranges strand |   exon_id   exon_name exon_rank
         <Rle>          <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     chr1 [3054233, 3054733]      + |         1        <NA>         1

$ENSMUST00000082908
GRanges object with 1 range and 3 metadata columns:
      seqnames             ranges strand | exon_id exon_name exon_rank
  [1]     chr1 [3102016, 3102125]      + |       2      <NA>         1

$ENSMUST00000161581
GRanges object with 2 ranges and 3 metadata columns:
      seqnames             ranges strand | exon_id exon_name exon_rank
  [1]     chr1 [3466587, 3466687]      + |       3      <NA>         1
  [2]     chr1 [3513405, 3513553]      + |       4      <NA>         2

...

> exonsBy(TxDb.Mmusculus.UCSC.mm10.ensGene)
GRangesList object of length 94647:
$1
GRanges object with 1 range and 3 metadata columns:
      seqnames             ranges strand |   exon_id   exon_name exon_rank
         <Rle>          <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     chr1 [3054233, 3054733]      + |         1        <NA>         1

$2
GRanges object with 1 range and 3 metadata columns:
      seqnames             ranges strand | exon_id exon_name exon_rank
  [1]     chr1 [3102016, 3102125]      + |       2      <NA>         1

$3
GRanges object with 2 ranges and 3 metadata columns:
      seqnames             ranges strand | exon_id exon_name exon_rank
  [1]     chr1 [3466587, 3466687]      + |       3      <NA>         1
  [2]     chr1 [3513405, 3513553]      + |       4      <NA>         2

...
<94644 more elements>
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
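
If what you are after is one list element per gene rather than per transcript, grouping with by = "gene" should name the elements by gene ID. The sketch below puts the three naming schemes side by side; `exon_group_names` is a hypothetical helper written for this post, and the final block only runs when the TxDb.Mmusculus.UCSC.mm10.ensGene package is installed.

```r
# Hypothetical helper: return the first few list names under each grouping.
exon_group_names <- function(txdb, n = 3) {
  list(
    # Default: grouped by transcript, named by internal transcript ID.
    tx_internal_id = head(names(GenomicFeatures::exonsBy(txdb, by = "tx")), n),
    # Grouped by transcript, named by transcript name.
    tx_name = head(names(GenomicFeatures::exonsBy(txdb, by = "tx", use.names = TRUE)), n),
    # Grouped by gene, named by gene ID.
    gene_id = head(names(GenomicFeatures::exonsBy(txdb, by = "gene")), n)
  )
}

# Only run the comparison when the annotation package is available.
if (requireNamespace("TxDb.Mmusculus.UCSC.mm10.ensGene", quietly = TRUE)) {
  txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
  print(exon_group_names(txdb))
}
```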