Question

Transcripts in TxDb.Hsapiens.UCSC.hg38.knownGene but not in ucsc genome brower

1

Entering edit mode

shilin.zhao ▴ 40

@shilinzhao-7822

Last seen 7.1 years ago

United States

Hello,

I am going to do some mapping from gene to their chromosome regions. And I found some gene with different chromosome regions as in UCSC genome browser. For example, gene 4499 (MT1M) is in chr16:56632233-56633986 in UCSC genome browser. But in R, it is in chr16:56617461-56633986, because it has two transcripts in R but only one transcript in UCSC genome browser. "uc002ejn.3" is the same in R as UCSC genome browser, but I can't find "uc010vhe.2" in UCSC genome browser. Any suggestion for it? Thanks!

Here is the R code:

require(TxDb.Hsapiens.UCSC.hg38.knownGene)

require(GenomicRanges)

geneDb=TxDb.Hsapiens.UCSC.hg38.knownGene

allGeneRange<-genes(geneDb)

allGeneRange["4499"]
#GRanges object with 1 range and 1 metadata column:
# seqnames ranges strand | gene_id
# <Rle> <IRanges> <Rle> | <character>
# 4499 chr16 [56617461, 56633986] + | 4499
# -------
# seqinfo: 455 sequences (1 circular) from hg38 genome

txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)

txs["4499"]

#GRangesList object of length 1:
#$4499
#GRanges object with 2 ranges and 2 metadata columns:
# seqnames ranges strand | tx_id tx_name
# <Rle> <IRanges> <Rle> | <integer> <character>
# [1] chr16 [56617461, 56633986] + | 68646 uc010vhe.2
# [2] chr16 [56632622, 56633986] + | 68649 uc002ejn.3

#-------
#seqinfo: 455 sequences (1 circular) from hg38 genome

sessionInfo()

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.1.3
[2] GenomicFeatures_1.24.5
[3] AnnotationDbi_1.34.4
[4] Biobase_2.32.0
[5] GenomicRanges_1.24.3
[6] GenomeInfoDb_1.8.7
[7] IRanges_2.6.1
[8] S4Vectors_0.10.3
[9] BiocGenerics_0.18.0

loaded via a namespace (and not attached):
[1] XML_3.98-1.4 Rsamtools_1.24.0
[3] Biostrings_2.40.2 GenomicAlignments_1.8.4
[5] bitops_1.0-6 DBI_0.5-1
[7] RSQLite_1.0.0 zlibbioc_1.18.0
[9] XVector_0.12.1 BiocParallel_1.6.6
[11] tools_3.3.1 biomaRt_2.28.0
[13] RCurl_1.95-4.8 rtracklayer_1.32.2
[15] SummarizedExperiment_1.2.3

annotation txdb.hsapiens.ucsc.hg38.knowngene bug • 3.1k views

ADD COMMENT • link 8.7 years ago shilin.zhao ▴ 40

score 1 · Answer 1 · 2016-10-13

I've find the reason for the first question. It related to the version. The latest version of TxDb.Hsapiens.UCSC.hg38.knownGene was build in 2015. And when I changed to the development version of TxDb.Hsapiens.UCSC.hg38.knownGene, which was build in 2016/09, the result is correct.

But I found another question in the development version, maybe it is a bug?

require(TxDb.Hsapiens.UCSC.hg38.knownGene)
require(GenomicRanges)

geneDb=TxDb.Hsapiens.UCSC.hg38.knownGene
allGeneRange<-genes(geneDb)
allGeneRange["875"]
txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)
txs["875"]

We can find CBS gene (txs["875"]) has 25 transcripts, from two regions: chr21 [6444869, 6467509] and chr21 [43075107, 43076288]

1. CBS gene ("875") was only in chr21 [43075107, 43076288]. The region of chr21 [6444869, 6467509] was CBSL gene ("102724560"). But CBSL was not in the database, and its transcripts were recorded in CBS.

2. The gene region of CBS gene (allGeneRange["875"]) was in chr21 [6444869, 43076943], which included all the region between 6444869-43076943. But it is not correct as they were two separate regions.

score 1 · Answer 2 · 2016-10-17

Hi Shilin,

About 1.

These TxDb objects were made with the makeTxDbFromUCSC() function. This function looks at the knownToLocusLink table to get the mapping between transcripts in the "GENCODE v24" track and their corresponding genes. See https://genome.ucsc.edu/cgi-bin/hgTables?hgta_doSchemaDb=hg38&hgta_doSchemaTable=knownToLocusLink for the details about this table.

Let's fetch this table from R:

library(rtracklayer)
session <- browserSession()
genome(session) <- "hg38"
query <- ucscTableQuery(session, "GENCODE v24", table="knownToLocusLink")
knownToLocusLink <- getTable(query)

knownToLocusLink[knownToLocusLink$value == "875", ]
#             name value
# 12819 uc032prg.2   875
# 12820 uc032prf.2   875
# 12821 uc032prh.1   875
# ...
# 13453 uc062anf.1   875
# 13454 uc062ang.1   875
# 13455 uc062anh.1   875

These 25 transcripts linked to Entrez Gene 875 are exactly the same as those I get when using transcriptsBy() on TxDb.Hsapiens.UCSC.hg38.knownGene (with soon-to-be-released BioC 3.4):

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
tx875 <- transcriptsBy(txdb, by="gene")[["875"]]
tx875
# GRanges object with 25 ranges and 2 metadata columns:
#        seqnames               ranges strand |     tx_id     tx_name
#           <Rle>            <IRanges>  <Rle> | <integer> <character>
#    [1]    chr21   [6444869, 6467509]      - |    171660  uc032prg.2
#    [2]    chr21   [6444869, 6467532]      - |    171661  uc032prf.2
#    [3]    chr21   [6444869, 6468040]      - |    171662  uc032prh.1
#    ...      ...                  ...    ... .       ...         ...
#   [23]    chr21 [43068321, 43075864]      - |    172476  uc062anf.1
#   [24]    chr21 [43068404, 43073508]      - |    172477  uc062ang.1
#   [25]    chr21 [43075107, 43076288]      - |    172478  uc062anh.1
#   -------
#   seqinfo: 455 sequences (1 circular) from hg38 genome

They cover 2 regions on chr21:

reduce(tx875)
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames               ranges strand
#          <Rle>            <IRanges>  <Rle>
#   [1]    chr21 [ 6444869,  6468040]      -
#   [2]    chr21 [43053191, 43076943]      -
#   -------
#   seqinfo: 455 sequences (1 circular) from hg38 genome

More precisely, the first 9 transcripts cover region chr21:6444869-6468040:

reduce(tx875[1:9])
# GRanges object with 1 range and 0 metadata columns:
#       seqnames             ranges strand
#          <Rle>          <IRanges>  <Rle>
#   [1]    chr21 [6444869, 6468040]      -
#   -------
#   seqinfo: 455 sequences (1 circular) from hg38 genome

and the 16 other transcripts cover region chr21:43053191-43076943:

reduce(tx875[10:25])
# GRanges object with 1 range and 0 metadata columns:
#       seqnames               ranges strand
#          <Rle>            <IRanges>  <Rle>
#   [1]    chr21 [43053191, 43076943]      -
#   -------
#   seqinfo: 455 sequences (1 circular) from hg38 genome

About 2

The genes() extractor considers that the region of a gene goes from the start of its first transcript to the end of its last transcript (when moving in the 5' to 3' direction along the chromosome). It's a very simple and somewhat naive semantic, but of course not perfect. Its obvious advantage being that it returns 1 range per gene. Note that this was discussed here Problem with TxDb.Mmusculus.UCSC.mm9.knownGene almost 2 years ago. Maybe a better semantic would be to do something like this:

genes2 <- function(txdb)
{
    tx_by_gene <- transcriptsBy(txdb, by="gene")
    genes <- reduce(tx_by_gene)
    sort(unlist(genes))
}

With this semantic, a gene can be associated with more than 1 range:

gn2 <- genes2(txdb)
gn2[names(gn2) == "875"]
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames               ranges strand
#          <Rle>            <IRanges>  <Rle>
#   875    chr21 [ 6444869,  6468040]      -
#   875    chr21 [43053191, 43076943]      -
#   -------
#   seqinfo: 455 sequences (1 circular) from hg38 genome

Maybe that's a more sensible semantic?

Hope this helps.

H.

score 0 · Answer 3 · 2016-10-17

0