TxDb.Hsapiens.UCSC.hg19.knownGene: exons that are not part of any gene
@aliaksei-holik-4992

Dear Bioconductors,

This is a bit of a curiosity question. I have been working with the TxDb.Hsapiens.UCSC.hg19.knownGene package and noticed that there are some exons that do not seem to be part of any gene.

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> # get all the genes
> genic.regions <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
> # get all the exons
> exonic.regions <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene)
> # find the overlaps between the genes and exons
> findOverlaps(genic.regions, exonic.regions)
Hits object with 270213 hits and 0 metadata columns:
         queryHits subjectHits
         <integer>   <integer>
     [1]         1      250809
     [2]         1      250810
     [3]         1      250811
     [4]         1      250812
     [5]         1      250813
     ...       ...         ...
[270209]     23056      266961
[270210]     23056      266962
[270211]     23056      266963
[270212]     23056      266964
[270213]     23056      266965
-------
queryLength: 23056
subjectLength: 289969

As you can see, there are nearly 290000 exons, but only about 270000 overlap with any of the genes. I can see this very clearly when I plot the genes and exons overlapping a fragment of a chromosome: there are a few exons (marked by the green triangle) that do not appear to be part of any gene. So my question is, what might they be, and how should I deal with them if, for instance, I'm trying to get the coordinates of the intronic or intergenic regions?

I don't think your images are showing, if you have any.

Thanks, fixed it.

@james-w-macdonald-5106

There are undoubtedly many reasons that the exons and genes don't all line up. One likely reason is the question of what is considered a gene. If you look at the names of the GRanges object you get when you do genes(TxDb), those are all Entrez Gene IDs, which is in one sense the list of all the 'genes'.

But there are any number of 'genes' that don't (yet) have Entrez Gene IDs. There are lots of lincRNA, piRNA, and probably even miRNA sequences that are not in the Gene database. For example, if we get all naive and stuff, we can check this out.

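> # exons grouped by Entrez Gene ID (only gene-assigned exons appear here)
> ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
> # all exons, whether or not their transcript has a gene
> exns <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene)

> # keep the exons that overlap no gene-assigned exon
> exns[!exns %over% unlist(ex),]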
GRanges object with 19332 ranges and 1 metadata column:
              seqnames           ranges strand   |   exon_id
                 <Rle>        <IRanges>  <Rle>   | <integer>
    [1]           chr1 [321084, 321115]      +   |         8
    [2]           chr1 [321146, 321207]      +   |         9
    [3]           chr1 [420206, 420296]      +   |        20
    [4]           chr1 [420992, 421258]      +   |        21
    [5]           chr1 [421396, 421839]      +   |        22
    ...            ...              ...    ... ...       ...
[19328] chrUn_gl000241   [35706, 35859]      -   |    289965
[19329] chrUn_gl000241   [36711, 36875]      -   |    289966
[19330] chrUn_gl000243   [11501, 11530]      +   |    289967
[19331] chrUn_gl000243   [13608, 13637]      +   |    289968
[19332] chrUn_gl000247   [ 5787,  5816]      -   |    289969
-------
seqinfo: 93 sequences (1 circular) from hg19 genome

So like you say, about 20K exons without a corresponding gene. The first two are piRNAs, and the next three are clone images. So with a sample size of 5 out of 20K, I would venture to guess it's probably a combination of all sorts of things that have been reported by someone somewhere, that have not yet become 'real' enough to make it into the Gene database.
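For anyone who wants to look up more than the first five, here is a rough sketch of one way to pull the UCSC transcript names for those exons so they can be checked by hand in the UCSC browser. The variable names `txdb`, `gene_exons` and `unassigned` are just illustrative, and this is only one possible approach, not necessarily how the five above were looked up.

```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene  # shorthand used only in this sketch

# All exons, annotated with the name(s) of the transcript(s) they belong to
exns <- exons(txdb, columns = c("exon_id", "tx_name"))

# Exons grouped by Entrez Gene ID, flattened into a single GRanges
gene_exons <- unlist(exonsBy(txdb, by = "gene"))

# Same filter as above: exons that overlap no gene-assigned exon
unassigned <- exns[!exns %over% gene_exons]

# tx_name is a list column (an exon can belong to several transcripts),
# so flatten it to get UCSC transcript IDs to look up
head(unique(unlist(unassigned$tx_name)))
```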

FWIW I'll just add to Jim's excellent answer that, from a pure data structure point of view, the genes in a TxDb object are the parents of the transcripts and the transcripts are the parents of the exons. In other words: exons are linked to transcripts (in a child-to-parent relationship) and transcripts are linked to genes (also in a child-to-parent relationship). There is NO direct link from exons to genes. Furthermore every exon in a TxDb object is guaranteed to be linked to (at least) one transcript (i.e., to have at least one parent), but not every transcript is guaranteed to be linked to a gene. Said otherwise: there can be orphan transcripts but NO orphan exons. So the exons that you see without a gene are exons that are linked to a transcript without a gene.
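The parent-child relationships described here can be inspected directly by asking the extractors for the `gene_id` and `tx_name` columns. The following is a minimal sketch; the names `txdb`, `orphan_tx` and `orphan_ex` are made up for illustration. Note that it selects exons by their transcript-to-gene link rather than by coordinate overlap, so the count need not match the overlap-based number shown earlier in the thread.

```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene  # shorthand used only in this sketch

# Transcripts with the gene they are linked to; gene_id is a list column
# that is empty for a transcript without a gene
tx <- transcripts(txdb, columns = c("tx_name", "gene_id"))
orphan_tx <- tx[lengths(tx$gene_id) == 0]
length(orphan_tx)

# Exons with their parent transcripts and, through them, their genes
ex <- exons(txdb, columns = c("exon_id", "tx_name", "gene_id"))

# Every exon should have at least one parent transcript
all(lengths(ex$tx_name) >= 1)

# Exons whose parent transcripts have no gene
orphan_ex <- ex[lengths(ex$gene_id) == 0]
length(orphan_ex)
```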

Now Jim gave many of the reasons why some transcripts in the UCSC knownGene table (i.e. UCSC Genes track) are not associated with an Entrez Gene ID. Note however that this is in the end a question for the UCSC folks. It's important to realize that we don't curate their data, we just take what they provide and store it as-is in the TxDb.Hsapiens.UCSC.hg19.knownGene package.
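As for the intronic and intergenic coordinates asked about in the original question, one possible approach (a sketch under my own assumptions, not an official recipe) is to build them from the transcripts rather than from the genes, so that transcripts without an Entrez Gene ID are still counted as transcribed. Whether that is the right definition depends on the analysis. The variable names are illustrative, and strand is ignored throughout.

```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene  # shorthand used only in this sketch

# Footprint of every transcript, with or without a gene, merged across strands
tx_regions <- reduce(transcripts(txdb), ignore.strand = TRUE)

# Intergenic: everything not covered by any transcript.
# gaps() also returns whole-chromosome ranges for the "+" and "-" strands,
# so keep only the unstranded ("*") gaps
intergenic <- gaps(tx_regions)
intergenic <- intergenic[strand(intergenic) == "*"]

# Intronic: inside some transcript but not inside any exon
exonic <- reduce(exons(txdb), ignore.strand = TRUE)
intronic <- setdiff(tx_regions, exonic)
```

Swapping `transcripts(txdb)` for `genes(txdb)` in the first step would instead treat the gene-less transcripts as intergenic.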

Hope this helps and that I didn't confuse you too much.

Cheers,

H.

That's very clear. Thank you James and Hervé!