Oh. My. God. UCSC is off their rocker these days. They completely switched up the knownGene table for GRCh38 to use Ensembl IDs and now they are mixing and matching things for GRCm38? When I built these I didn't think to look at the transcript IDs, given that they are using NCBI Gene IDs for the genes.
So no, you didn't miss anything and thanks for pointing that out. In the past UCSC did this weird thing (that you note) where they made up their own IDs for transcripts because evidently they weren't familiar with RefSeq? Maybe there was a reasonable rationale for that, but mixing and matching IDs from two annotation services that haven't even managed to reconcile their IDs with each other for human just boggles the mind.
Anyway, that's enough ranting. For GRCm39 I used the refGene table, which so far seems to be what you might expect it to be, and you can easily do the same for yourself by using makeTxDbPackageFromUCSC, which is in the GenomicFeatures package:
> makeTxDbPackageFromUCSC("0.01", "me <email@example.com>", "me", ".", "Artistic","mm10","refGene", circ_seqs="chrM")
Download the refGene table ... OK
Download the hgFixed.refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Creating package in ./TxDb.Mmusculus.UCSC.mm10.refGene
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: mm10
# Organism: Mus musculus
# Taxonomy ID: 10090
# UCSC Table: refGene
# UCSC Track: NCBI RefSeq
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 47382
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2021-01-13 10:52:30 -0500 (Wed, 13 Jan 2021)
# GenomicFeatures version at creation time: 1.42.1
# RSQLite version at creation time: 2.2.1
# DBSCHEMAVERSION: 1.2
Warning message:
In .extract_cds_locs_from_UCSC_txtable(ucsc_txtable) :
UCSC data anomaly in 119 transcript(s): the cds cumulative length is
not a multiple of 3 for transcripts 'NM_011633' 'NM_198024'
'NM_001160424' 'NM_009268' 'NM_001190454' 'NM_001290729' 'NM_025576'
'NM_001177397' 'NM_001081960' 'NM_010974' 'NM_001128086'
'NM_001142737' 'NM_001289428' 'NM_001267808' 'NM_001301307'
'NM_001109684' 'NM_021466' 'NM_025988' 'NM_016901' 'NM_001347054'
'NM_011261' 'NM_001142760' 'NM_011022' 'NM_008848' 'NM_024470'
'NM_010707' 'NM_001346422' 'NM_001301034' 'NM_001301737' 'NM_010039'
'NM_008264' 'NM_010646' 'NM_001347053' 'NM_001206926' 'NM_001177396'
'NM_009046' 'NM_207683' 'NM_146484' 'NM_001277980' 'NM_001114347'
'NM_001277958' 'NM_001130175' 'NM_001277959' 'NM_144531' 'NM_181398'
'NM_001177416' 'NM_001033980' 'NM_001358490' 'NM_008653' 'NM_009485'
'NM_011154' 'NM_010115' 'NM_001142742' 'NM_008710' 'NM_001159419'
'NM_001286602' 'NM_001177398' 'NM_148413' 'NM_010846' 'NM_011413'
'NM_001163415' 'NM_001142739' 'NM_001271586' 'NM_00127 [... truncated]
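The positional call at the top of this session is terse, so here is the same thing with the arguments named. Treat it as a sketch: it assumes the argument names of makeTxDbPackageFromUCSC as they are in the GenomicFeatures version used above (1.42.1). `build_args` and `run_build` are hypothetical names I made up for illustration, and the build is switched off by default because it downloads from UCSC.

```r
# Same values as the positional call above, spelled out by argument name.
build_args <- list(
  version    = "0.01",
  maintainer = "me <email@example.com>",
  author     = "me",
  destDir    = ".",          # where the package directory is written
  license    = "Artistic",
  genome     = "mm10",       # UCSC genome build
  tablename  = "refGene",    # UCSC table to pull transcripts from
  circ_seqs  = "chrM"        # sequences to flag as circular
)

# Hypothetical switch: set to TRUE to actually download and build.
run_build <- FALSE
if (run_build && requireNamespace("GenomicFeatures", quietly = TRUE)) {
  do.call(GenomicFeatures::makeTxDbPackageFromUCSC, build_args)
}
```

Swapping `tablename` is the only change needed to build from a different UCSC table, which is the whole point of doing it yourself.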
## I am on Windows so have to specify the type
> install.packages("TxDb.Mmusculus.UCSC.mm10.refGene/", repos = NULL, type = "source")
Installing package into 'C:/Users/jmacdon/AppData/Roaming/R/win-library/4.0'
(as 'lib' is unspecified)
* installing *source* package 'TxDb.Mmusculus.UCSC.mm10.refGene' ...
** using staged installation
** byte-compile and prepare package for lazy loading
*** installing help indices
converting help for package 'TxDb.Mmusculus.UCSC.mm10.refGene'
finding HTML links ... done
** building package indices
** testing if installed package can be loaded from temporary location
*** arch - i386
*** arch - x64
** testing if installed package can be loaded from final location
*** arch - i386
*** arch - x64
** testing if installed package keeps a record of temporary installation path
* DONE (TxDb.Mmusculus.UCSC.mm10.refGene)
> head(keys(TxDb.Mmusculus.UCSC.mm10.refGene,"TXNAME" ))
 "NM_001355712" "NM_008866" "NM_001159750" "NM_011541" "NM_001159751"
> head(keys(TxDb.Mmusculus.UCSC.mm10.refGene ))
 "100009600" "100009609" "100009614" "100009664" "100012" "100017"
So you won't be able to get those made-up IDs that UCSC used to use, but you can get RefSeq IDs, which, like, correspond to something in the real world.
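If you want to check what flavor of transcript ID a given TxDb is actually handing you (which is how this whole thread started), a quick pattern check on the TXNAME keys will do. This is only a sketch in base R: `classify_tx_id` is a hypothetical helper, not part of GenomicFeatures, the patterns are my assumption of what RefSeq, Ensembl and legacy UCSC transcript IDs look like, and the Ensembl and UCSC IDs in the example are there purely for illustration.

```r
# Hypothetical helper: guess the naming scheme of transcript IDs from
# their shape alone. Anything that matches no pattern is "other".
classify_tx_id <- function(ids) {
  ids <- as.character(ids)
  out <- rep("other", length(ids))
  # RefSeq transcripts: NM_/NR_/XM_/XR_ + digits, optional .version
  out[grepl("^[NX][MR]_[0-9]+(\\.[0-9]+)?$", ids)] <- "RefSeq"
  # Ensembl transcripts: ENS + optional species code + T + 11 digits
  out[grepl("^ENS[A-Z]*T[0-9]{11}(\\.[0-9]+)?$", ids)] <- "Ensembl"
  # Legacy UCSC knownGene IDs, e.g. uc007aeu.1
  out[grepl("^uc[0-9]{3}[a-z]{3}\\.[0-9]+$", ids)] <- "UCSC"
  out
}

# Two RefSeq IDs from the output above plus two illustrative IDs
ids <- c("NM_001355712", "NM_008866", "ENSMUST00000193812.1", "uc007aeu.1")
table(classify_tx_id(ids))
```

With the package built above loaded, `table(classify_tx_id(keys(TxDb.Mmusculus.UCSC.mm10.refGene, "TXNAME")))` should tell you at a glance whether a table is mixing ID types.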