Question

Problem with TxDb.Mmusculus.UCSC.mm9.knownGene

0

Entering edit mode

nathan.harmston • 0

@nathanharmston-7289

Last seen 9.2 years ago

United Kingdom

Within the gene annotations provided in TxDb.Mmusculus.UCSC.mm9.knownGene there appears to be a snoRNA which is almost 36Mb long, however in reality it is only 56bp long. I've used makeTranscriptFromUCSC as well and the same thing happens.

Any ideas why this is happening? Or is there a fix for this?

> library(TxDb.Mmusculus.UCSC.mm9.knownGene)

> ucsc.genes = genes(TxDb.Mmusculus.UCSC.mm9.knownGene)

> ucsc.genes[which(width(ucsc.genes) == max(width(ucsc.genes)))]

GRanges object with 1 range and 1 metadata column:

seqnames ranges strand | gene_id

100217439 chr19 [25016728, 60850288] - | 100217439

> sessionInfo()

R version 3.1.1 (2014-07-10)

Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:

[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:

[1] stats4 parallel stats graphics grDevices utils datasets

[8] methods base

other attached packages:

[1] XVector_0.6.0

[2] TxDb.Mmusculus.UCSC.mm9.knownGene_3.0.0

[3] GenomicFeatures_1.18.3

[4] AnnotationDbi_1.28.1

[5] Biobase_2.26.0

[6] GenomicRanges_1.18.4

[7] GenomeInfoDb_1.2.4

[8] IRanges_2.0.1

[9] S4Vectors_0.4.0

[10] BiocGenerics_0.12.1

loaded via a namespace (and not attached):

[1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8

[4] BiocParallel_1.0.0 biomaRt_2.22.0 Biostrings_2.34.1

[7] bitops_1.0-6 brew_1.0-6 checkmate_1.5.1

[10] codetools_0.2-9 DBI_0.3.1 digest_0.6.8

[13] fail_1.2 foreach_1.4.2 GenomicAlignments_1.2.1

[16] iterators_1.0.7 RCurl_1.95-4.5 Rsamtools_1.18.2

[19] RSQLite_1.0.0 rtracklayer_1.26.2 sendmailR_1.2-1

[22] stringr_0.6.2 tools_3.1.1 XML_3.98-1.1

[25] zlibbioc_1.12.0

software error txdb.mmusculus.ucsc.mm9.knowngene genomicfeatures • 1.7k views

ADD COMMENT • link updated 9.2 years ago by Hervé Pagès 16k • written 9.2 years ago by nathan.harmston • 0

score 2 · Answer 1 · 2015-01-26

Hi Nathan,

The code you show us is looking at genes, not transcripts. What it tells us is that the 100217439 gene (Snora19) is almost 36Mb long. However, if we look at the transcripts for this gene:

> txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene

> transcriptsBy(txdb, by="gene")[["100217439"]]
GRanges object with 2 ranges and 2 metadata columns:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]    chr19 [25016728, 25016781]      - |     52655  uc012bjt.1
  [2]    chr19 [60850233, 60850288]      - |     53038  uc012bog.1
  -------
  seqinfo: 35 sequences (1 circular) from mm9 genome

we see that 2 transcripts are linked to it. The individual lengths of these transcripts are:

> width(transcriptsBy(txdb, by="gene")[["100217439"]])
[1] 54 56

which is what we expect.

It's important to keep in mind that the genomic range reported by the genes() extractor for a given gene is inferred from the ranges of the transcripts that are linked to that gene. More precisely, the gene is considered to go from the first transcript start to the last transcript end (when moving in the 5' to 3' direction along the chromosome). This seems like a reasonable choice because most of the time, when more than one transcript is linked to a gene, it's because we are in a situation of alternate splicing. So this simple way of inferring the gene range is somewhat meaningful and can be useful for a wide range of downstream analysis.

However, the case of Snora19 is special:

http://genome.ucsc.edu/cgi-bin/hgc?g=refGene&i=NR_034047&c=chr19&o=25016727&db=mm9

Here we are in the presence of a small nucleolar RNA which location on the reference genome is ambiguous. Hence the 2 transcripts reported by UCSC (uc012bjt.1 and uc012bog.1), which are actually the 2 genomic locations that were obtained by aligning the sequence of this small nucleolar RNA to the reference genome. In that case, it seems that having the genes() extractor report a gene that is 36Mb long is maybe wrong, or at least not that useful or pertinent. I don't know if there is anything we should do about this. Suggestions are welcome.

Cheers,

H.

score 1 · Answer 2 · 2015-01-26

The reason is that there are two transcripts for this snoRNA:

> tx <- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, "gene")
> tx["100217439"]
GRangesList object of length 1:
$100217439
GRanges object with 2 ranges and 2 metadata columns:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]    chr19 [25016728, 25016781]      - |     52655  uc012bjt.1
  [2]    chr19 [60850233, 60850288]      - |     53038  uc012bog.1

And usually a gene spans the extent of all the transcripts. In other words, for a translated gene, you can pretty much just go from the smallest genomic position to the largest, and be assured you have captured the entirety of the gene. And since all the transcripts are in pretty much the same place, this makes sense. But with untranslated content where there usually aren't any exons, this breaks down.

I would bet there are other edge cases like this, and they are likely to proliferate as we discover more of these short untranslated RNAs. How to reliably determine the start and stop for a gene looks to be getting pretty complicated.