Difference between TxDb.Hsapiens.UCSC.hg19.knownGene and UCSC knownGene

Hi,

I wanted a list of all exons in the human genome (hg19) along with their coordinates. For this purpose, I downloaded knownGene.txt.gz from UCSC, extracted the exon coordinates for all transcripts, and removed the duplicates. This gave ~3 million unique exons with coordinates. Besides this, I also used the TxDb.Hsapiens.UCSC.hg19.knownGene package to extract a list of exons using the command:

exon_list <- as.data.frame(exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx"))

which listed ~7 million unique exon coordinates, most of which are not found in the list generated from UCSC. The following is a chunk from the file I prepared:

chr    start    end    strand
chr1    11874    12227    +
chr1    12613    12721    +
chr1    13221    14409    +
chr1    12595    12721    +
chr1    13403    14409    +

I am confused about which is the reliable way to obtain all exons, and why there is a difference between the two sources. Thanks in advance.

regards,


You will have to give more details about how you got your data from UCSC. In addition:

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")
> sum(elementNROWS(ex))  # elementLengths() in older Bioconductor releases
[1] 742493

So there are not even 1M exons when duplicates are counted! In other words, there are about 742K exon entries summed over all the transcripts in knownGene. Many of these entries are identical, because two transcripts of the same gene are likely to share one or more exons (although obviously not all of their exons).
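To make the distinction between exon entries and unique exons concrete, here is a minimal base R sketch. The transcript names `txA` and `txB` and the grouping of exons into two transcripts are made up for illustration; only the coordinates come from the chunk in the question.

```r
# Hypothetical per-transcript exon lists; each exon is written as chr:start-end:strand.
exons_by_tx <- list(
  txA = c("chr1:11874-12227:+", "chr1:12613-12721:+", "chr1:13221-14409:+"),
  txB = c("chr1:11874-12227:+", "chr1:12595-12721:+", "chr1:13403-14409:+")
)

# Exon entries summed over transcripts (the analogue of summing per-transcript lengths).
n_total <- sum(lengths(exons_by_tx))

# Distinct exons: the first exon is shared by both transcripts and is counted once.
n_unique <- length(unique(unlist(exons_by_tx)))
```

Here `n_total` is 6 while `n_unique` is 5, so the unique count can never exceed the summed count.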

But it is easier to just get the unique exons (one entry per distinct genomic range, if I am not mistaken) using exons():

> ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene)
> length(ex)
[1] 289969

which is an order of magnitude fewer exons than you say you got from the direct download.
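Since the counts depend on how the UCSC file was processed, here is a hedged sketch of one way to expand knownGene-style rows into unique exons in base R. It assumes the usual UCSC table layout (comma-separated exonStarts/exonEnds, with 0-based starts, so 1 is added to match the 1-based coordinates that TxDb objects report). The two rows and the transcript names `txA` and `txB` are hypothetical, and only the first ten columns are included.

```r
# Two tab-separated rows in the knownGene.txt column layout (hypothetical names).
lines <- c(
  "txA\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12612,13220,\t12227,12721,14409,",
  "txB\tchr1\t+\t11873\t14409\t12189\t13639\t3\t11873,12594,13402,\t12227,12721,14409,"
)
kg <- read.delim(text = lines, header = FALSE, stringsAsFactors = FALSE,
                 col.names = c("name", "chrom", "strand", "txStart", "txEnd",
                               "cdsStart", "cdsEnd", "exonCount",
                               "exonStarts", "exonEnds"))

# Turn "a,b,c," into an integer vector (strsplit drops the trailing empty field).
split_coords <- function(x) as.integer(strsplit(x, ",", fixed = TRUE)[[1]])

# One row per exon per transcript.
exon_rows <- do.call(rbind, lapply(seq_len(nrow(kg)), function(i) {
  data.frame(chr    = kg$chrom[i],
             start  = split_coords(kg$exonStarts[i]) + 1L,  # 0-based -> 1-based
             end    = split_coords(kg$exonEnds[i]),
             strand = kg$strand[i],
             stringsAsFactors = FALSE)
}))

# Drop exons that appear in more than one transcript.
unique_exons <- unique(exon_rows)
```

With these two rows, `exon_rows` has 6 entries and `unique_exons` has 5, matching the five coordinates shown in the question. A unique count that exceeds the total number of exon entries would suggest a problem in the splitting or de-duplication step.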