Error in TxDb mm10 coordinates?
Mark ▴ 10
@mark-11758
Imperial College London

Hi, 

The TxDb.Mmusculus.UCSC.mm10.knownGene package appears to be giving me strange co-ordinates for certain genes, making them huge. For example, this microRNA stretches about 35 Mb according to the TxDb package but is only 66 bp according to UCSC. Any suggestions, or is there something I'm overlooking?

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
genes <- genes(txdb)
subset(genes, width(genes) > 35000000)

Output:

GRanges object with 1 range and 1 metadata column:
            seqnames               ranges strand |     gene_id
               <Rle>            <IRanges>  <Rle> | <character>
  102465114    chr19 [24942236, 60774397]      - |   102465114
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

 

Session info:

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BiocInstaller_1.24.0                    
 [2] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0
 [3] GenomicFeatures_1.26.0                  
 [4] AnnotationDbi_1.36.0                    
 [5] Biobase_2.34.0                          
 [6] GenomicRanges_1.26.1                    
 [7] GenomeInfoDb_1.10.1                     
 [8] IRanges_2.8.1                           
 [9] S4Vectors_0.12.0                        
[10] BiocGenerics_0.20.0                     

loaded via a namespace (and not attached):
 [1] XVector_0.14.0             zlibbioc_1.20.0           
 [3] GenomicAlignments_1.10.0   BiocParallel_1.8.1        
 [5] lattice_0.20-33            tools_3.3.1               
 [7] SummarizedExperiment_1.4.0 grid_3.3.1                
 [9] DBI_0.5-1                  Matrix_1.2-7.1            
[11] rtracklayer_1.34.1         bitops_1.0-6              
[13] RCurl_1.95-4.8             biomaRt_2.30.0            
[15] RSQLite_1.1                Biostrings_2.42.0         
[17] Rsamtools_1.26.1           XML_3.98-1.5   

Thanks,

Mark

@james-w-macdonald-5106
United States

By definition, the gene extent runs from the start of the 'first' transcript to the end of the 'last' transcript. For non-coding RNA species, which may be found in multiple places on a chromosome, this has the unintended effect of returning a really long gene that doesn't really exist. If you run

txs <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene)

txs["102465114"]

You will see that there are two transcripts for this miRNA, spaced quite far apart on chr19.
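To make the comparison explicit, here is a minimal sketch that puts the per-transcript widths next to the overall span for this gene. It assumes the same package version as in the question; the object names `tx` and `span` are just illustrative, and the exact numbers may differ with other annotation releases.

```r
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(GenomicFeatures)

txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene

# Transcripts grouped by Entrez gene ID (by = "gene" is the default grouping)
txs <- transcriptsBy(txdb, by = "gene")

# Transcripts annotated to the miRNA gene from the question
tx <- txs[["102465114"]]

# Width of each individual transcript (these should be short for a miRNA)
width(tx)

# range() collapses the transcripts into one span per chromosome/strand,
# which is essentially what genes() reports as the gene extent
span <- range(tx)
width(span)
```

If the individual widths are small while the span is tens of megabases, the large "gene" is just the gap between distant copies, so for this kind of feature it is probably safer to work with the transcript-level ranges.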


Ah yeah, should have checked that. Thanks!
