Search
Question: Missing genes from TxDb.Hsapiens.UCSC.hg19.knownGene
0
14 months ago by
izzy.yichao.cai10 wrote:

Hi all,

I am trying to construct a gene track using TxDb. Hsapiens.UCSC.hg19.knownGene in Gviz package.

I found a region that does not return any gene while when I look up in the TxDb there is actually transcripts.

Please see the following code that I used:

library(Gviz)
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)

## Wired region
chrm <- "chr6"
chrstart <- 30244097
chrend <- 31444097
geneName <- "DDR1"

## One of the good ones
#chrm <- "chr2"
#chrstart <- 176357518
#chrend <- 177557518
#geneName <- "HOXD13"

## Generate gene track in a given region
grtrack <- GeneRegionTrack(TxDb.Hsapiens.UCSC.hg19.knownGene,
chromosome = chrm,start = chrstart, end =chrend, name = "\nGenes\n",showTitle=FALSE, collapseTranscripts=TRUE,
shape="arrow",fontsize.group=50,transcriptAnnotation = "symbol",fill="black",fontcolor.group="black",
fontsize=30,cex.title=1,col.title="black",background.title="white"
)
## Map the entrez ID with symbol for genes returned in grtrack
symbols <- unlist(mapIds(org.Hs.eg.db, gene(grtrack), "SYMBOL", "ENTREZID", multiVals = "first"))


When using the wired region, the grtrack actually return nothing:

> gene(grtrack)
character(0)


If I do a look up in the TxDb, there are genes in this wired region(the gene I am interested here is DDR1, entrezID 780):

> select(TxDb.Hsapiens.UCSC.hg19.knownGene, keys = "780", keytype = "GENEID", columns = c("CDSCHROM", "CDSEND", "CDSSTART"))
'select()' returned 1:many mapping between keys and columns
GENEID       CDSCHROM CDSSTART   CDSEND
1      780           chr6 30856507 30856591
2      780           chr6 30856685 30856787
3      780           chr6 30856979 30857207
4      780           chr6 30858750 30858897
5      780           chr6 30859157 30859256
6      780           chr6 30859779 30859965
7      780           chr6 30860073 30860319
8      780           chr6 30860845 30860940


This is the only exception that I found. Other regions work perfectly fine.

Anybody got a hint? Thanks for helping!

modified 14 months ago by James W. MacDonald48k • written 14 months ago by izzy.yichao.cai10
0
14 months ago by
United States
James W. MacDonald48k wrote:

You have only shown the very top of the output from your call to select. I'll show something different:

> txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txs["780"]
GRangesList object of length 1:
\$780
GRanges object with 72 ranges and 2 metadata columns:
seqnames               ranges strand |     tx_id     tx_name
<Rle>            <IRanges>  <Rle> | <integer> <character>
[1]          chr6 [30850694, 30867933]      + |     24187  uc010jse.3
[2]          chr6 [30851861, 30867933]      + |     24188  uc003nrq.3
[3]          chr6 [30852315, 30867933]      + |     24189  uc003nrr.3
[4]          chr6 [30852315, 30867933]      + |     24190  uc003nrs.3
[5]          chr6 [30852315, 30867933]      + |     24191  uc003nrt.3
...           ...                  ...    ... .       ...         ...
[68] chr6_qbl_hap6   [2149452, 2152954]      + |     81650  uc011iku.1
[69] chr6_qbl_hap6   [2149452, 2160922]      + |     81651  uc011ikv.2
[70] chr6_qbl_hap6   [2149452, 2160922]      + |     81652  uc011ikx.2
[71] chr6_qbl_hap6   [2149452, 2160922]      + |     81653  uc011iky.2
[72] chr6_qbl_hap6   [2152144, 2159073]      + |     81655  uc011ikw.1

Or alternatively

> unique(as.character(seqnames(txs["780"][[1]])))
[1] "chr6"           "chr6_cox_hap2"  "chr6_dbb_hap3"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"

For genes like this, it's hard to define what the 'gene' is, when you are defining it by genomic position. Normally one would define it as the entire range of the coding sequences for all transcripts. But if the transcripts are found in six different places, how does one define such a thing? Since it's not clear how to define what the gene is in this context, it gets dropped.

Thank you, James! Sorry for the incomplete output. I've noticed DDR1 has different transcripts on multi chromosomes.

What I thought previously is that Gviz would collect all the transcripts in a specified region and collapse them into a single gene. However, it turns out that it would instead look at all transcripts of a 'gene' if the 'gene' is found in that specific region and try to collapse all of them into a single gene. In this case, DDR1 has different positions and thus get dropped.