Question

Missing genes from TxDb.Hsapiens.UCSC.hg19.knownGene

0

Entering edit mode

izzy.yichao.cai ▴ 10

@izzyyichaocai-12853

Last seen 6.7 years ago

Hi all,

I am trying to construct a gene track using TxDb. Hsapiens.UCSC.hg19.knownGene in Gviz package.

I found a region that does not return any gene while when I look up in the TxDb there is actually transcripts.

Please see the following code that I used:

library(Gviz)
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)

## Wired region
chrm <- "chr6"
chrstart <- 30244097
chrend <- 31444097
geneName <- "DDR1"

## One of the good ones
#chrm <- "chr2"
#chrstart <- 176357518
#chrend <- 177557518
#geneName <- "HOXD13"

## Generate gene track in a given region
grtrack <- GeneRegionTrack(TxDb.Hsapiens.UCSC.hg19.knownGene, 
                           chromosome = chrm,start = chrstart, end =chrend, name = "\nGenes\n",showTitle=FALSE, collapseTranscripts=TRUE,
                           shape="arrow",fontsize.group=50,transcriptAnnotation = "symbol",fill="black",fontcolor.group="black",
                           fontsize=30,cex.title=1,col.title="black",background.title="white"
                          )
## Map the entrez ID with symbol for genes returned in grtrack
symbols <- unlist(mapIds(org.Hs.eg.db, gene(grtrack), "SYMBOL", "ENTREZID", multiVals = "first"))

When using the wired region, the grtrack actually return nothing:

> gene(grtrack)
character(0)

If I do a look up in the TxDb, there are genes in this wired region(the gene I am interested here is DDR1, entrezID 780):

> select(TxDb.Hsapiens.UCSC.hg19.knownGene, keys = "780", keytype = "GENEID", columns = c("CDSCHROM", "CDSEND", "CDSSTART"))
'select()' returned 1:many mapping between keys and columns
    GENEID       CDSCHROM CDSSTART   CDSEND
1      780           chr6 30856507 30856591
2      780           chr6 30856685 30856787
3      780           chr6 30856979 30857207
4      780           chr6 30858750 30858897
5      780           chr6 30859157 30859256
6      780           chr6 30859779 30859965
7      780           chr6 30860073 30860319
8      780           chr6 30860845 30860940

This is the only exception that I found. Other regions work perfectly fine.

Anybody got a hint? Thanks for helping!

txdb txdb.hsapiens.ucsc.hg19.knowngene gviz • 1.8k views

ADD COMMENT • link updated 7.5 years ago by James W. MacDonald 68k • written 7.5 years ago by izzy.yichao.cai ▴ 10

score 0 · Answer 1 · 2017-10-11

You have only shown the very top of the output from your call to select. I'll show something different:

> txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txs["780"]
GRangesList object of length 1:
$780
GRanges object with 72 ranges and 2 metadata columns:
            seqnames               ranges strand |     tx_id     tx_name
               <Rle>            <IRanges>  <Rle> | <integer> <character>
   [1]          chr6 [30850694, 30867933]      + |     24187  uc010jse.3
   [2]          chr6 [30851861, 30867933]      + |     24188  uc003nrq.3
   [3]          chr6 [30852315, 30867933]      + |     24189  uc003nrr.3
   [4]          chr6 [30852315, 30867933]      + |     24190  uc003nrs.3
   [5]          chr6 [30852315, 30867933]      + |     24191  uc003nrt.3
   ...           ...                  ...    ... .       ...         ...
  [68] chr6_qbl_hap6   [2149452, 2152954]      + |     81650  uc011iku.1
  [69] chr6_qbl_hap6   [2149452, 2160922]      + |     81651  uc011ikv.2
  [70] chr6_qbl_hap6   [2149452, 2160922]      + |     81652  uc011ikx.2
  [71] chr6_qbl_hap6   [2149452, 2160922]      + |     81653  uc011iky.2
  [72] chr6_qbl_hap6   [2152144, 2159073]      + |     81655  uc011ikw.1

Or alternatively

> unique(as.character(seqnames(txs["780"][[1]])))
[1] "chr6"           "chr6_cox_hap2"  "chr6_dbb_hap3"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"

For genes like this, it's hard to define what the 'gene' is, when you are defining it by genomic position. Normally one would define it as the entire range of the coding sequences for all transcripts. But if the transcripts are found in six different places, how does one define such a thing? Since it's not clear how to define what the gene is in this context, it gets dropped.