Difference in TxDb.Hsapiens.UCSC.hg19.knownGene and UCSC knownGene


I wanted a list of all exons in the human genome (hg19) along with their coordinates. For this purpose, I downloaded knownGene.txt.gz from UCSC, extracted the exon coordinates for all the transcripts, and removed the duplicates. This left ~3 million unique exons with coordinates. Besides this, I also used the TxDb.Hsapiens.UCSC.hg19.knownGene package to extract a list of exons using the command:

exon_list <- as.data.frame(exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx"))

which listed ~7 million unique exon coordinates, most of which are not found in the list generated from UCSC. The following is a chunk from the file I prepared:

chr    start    end    strand
chr1    11874    12227    +
chr1    12613    12721    +
chr1    13221    14409    +
chr1    12595    12721    +
chr1    13403    14409    +  

I am confused about which is the reliable way to obtain all exons, and why there is a difference between the two sources. Thanks in advance.




You will have to give more details about how you got your data from UCSC. In addition:

> ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")

> sum(elementLengths(ex))
[1] 742493

So there are not even 1M exons, even before removing duplicates! In other words, about 742K exons are listed across all the transcripts in knownGene. Many of these are listed more than once, because two transcripts of the same gene are likely to share one or more identical exons (although obviously not all of their exons are identical).
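To see how many of those per-transcript exons are repeats, one option is to flatten the list and drop identical ranges. This is only a sketch: it assumes the TxDb.Hsapiens.UCSC.hg19.knownGene package is installed, the variable names are mine, and it uses elementNROWS(), the current name of elementLengths() in recent Bioconductor releases.

```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# One GRanges of exons per transcript, so shared exons appear repeatedly
ex_by_tx <- exonsBy(txdb, by = "tx")

# Total number of exon entries across all transcripts (duplicates included)
n_listed <- sum(elementNROWS(ex_by_tx))

# Flatten to a single GRanges and keep each distinct range once;
# unique() compares chromosome, start, end and strand, not the metadata columns
all_exons <- unlist(ex_by_tx, use.names = FALSE)
unique_exons <- unique(all_exons)

# Plain table of coordinates, in the same layout as the chunk in the question
unique_df <- as.data.frame(unique_exons)[, c("seqnames", "start", "end", "strand")]

n_listed
length(unique_exons)
```

The second number can never exceed the first, and the gap between them is the number of repeated exon entries.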

But it is easier to just get the exons (each distinct exon listed once, if I am not mistaken) using exons():

> ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene)
> length(ex)
[1] 289969

which is an order of magnitude fewer exons than you say you got from the direct download.
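For comparison, here is a rough base R sketch of how unique exons could be pulled out of knownGene-style rows. The two rows are illustrative: the transcript names tx1 and tx2 are made up, only the first ten columns of the table are included, and the coordinates were chosen to reproduce the chunk in the question. It assumes the usual UCSC table convention of 0-based exon starts, so 1 is added to each start to get the 1-based coordinates that the TxDb package reports.

```r
# Two illustrative knownGene-style rows (tab-separated, first ten columns).
# tx1 and tx2 are hypothetical transcript names.
known_gene_lines <- c(
  "tx1\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12612,13220,\t12227,12721,14409,",
  "tx2\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12594,13402,\t12227,12721,14409,"
)

kg <- read.delim(
  text = known_gene_lines, header = FALSE, stringsAsFactors = FALSE,
  col.names = c("name", "chrom", "strand", "txStart", "txEnd",
                "cdsStart", "cdsEnd", "exonCount", "exonStarts", "exonEnds")
)

# Turn a comma-separated coordinate string into an integer vector
split_coords <- function(x) as.integer(strsplit(x, ",", fixed = TRUE)[[1]])

# One row per exon per transcript, so shared exons are repeated here
exon_rows <- do.call(rbind, lapply(seq_len(nrow(kg)), function(i) {
  data.frame(
    chr    = kg$chrom[i],
    start  = split_coords(kg$exonStarts[i]) + 1L,  # 0-based start to 1-based
    end    = split_coords(kg$exonEnds[i]),
    strand = kg$strand[i],
    stringsAsFactors = FALSE
  )
}))

# Drop exons that share chromosome, start, end and strand
unique_exons <- unique(exon_rows)

nrow(exon_rows)     # exon entries across both transcripts
nrow(unique_exons)  # distinct exons
```

With these two rows there are six exon entries but five distinct exons, since the first exon is shared. If a count from the raw file comes out far larger than the per-transcript total from the TxDb, the parsing step is the first place to look.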

