Question: TxDb.Hsapiens.UCSC.hg19.knownGene Exons that are not part of any gene
13 months ago by
Spain/Barcelona/Centre for Genomic Regulation
Aliaksei Holik wrote:

Dear Bioconductors,

This is a bit of a curiosity question. I have been working with the TxDb.Hsapiens.UCSC.hg19.knownGene package and noticed that there are some exons that do not seem to be part of any gene.

> # get all the genes
> genic.regions <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
> # get all the exons
> exonic.regions <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene)
> # Find the overlaps between the genes and exons
> findOverlaps(genic.regions, exonic.regions)
Hits object with 270213 hits and 0 metadata columns:
           queryHits subjectHits
           <integer>   <integer>
       [1]         1      250809
       [2]         1      250810
       [3]         1      250811
       [4]         1      250812
       [5]         1      250813
       ...       ...         ...
  [270209]     23056      266961
  [270210]     23056      266962
  [270211]     23056      266963
  [270212]     23056      266964
  [270213]     23056      266965
  -------
  queryLength: 23056
  subjectLength: 289969

As you can see, there are nearly 290000 exons, but only about 270000 overlap with any of the genes. I can see it very clearly if I plot the genes and exons overlapping a fragment of a chromosome. There are a few exons (marked by the green triangle) that do not appear to be part of any gene. So my question is: what might they be, and how should I deal with them if, for instance, I'm trying to get the coordinates of the intronic or intergenic regions?
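A side note on reading the output above: the number reported by findOverlaps() is the number of overlapping (gene, exon) pairs, which is not necessarily the number of distinct exons that overlap a gene, because one exon can hit several genes. The sketch below illustrates the difference in base R with made-up coordinates (the `genes` and `exons` tables and their values are hypothetical, single chromosome and strand assumed). With the real objects, something like `length(unique(subjectHits(...)))` or `sum(exonic.regions %over% genic.regions)` should give the per-exon count.

```r
# Hypothetical toy intervals on one chromosome and one strand.
genes <- data.frame(gene = c("g1", "g2"),
                    start = c(100, 150), end = c(300, 400))
exons <- data.frame(exon = c("e1", "e2", "e3"),
                    start = c(120, 200, 500), end = c(140, 250, 550))

# Logical matrix: does gene g (row) overlap exon e (column)?
ov <- outer(seq_len(nrow(genes)), seq_len(nrow(exons)),
            function(g, e) genes$start[g] <= exons$end[e] &
                           exons$start[e] <= genes$end[g])

# One row per overlapping pair, analogous to a Hits object.
hits <- which(ov, arr.ind = TRUE)
colnames(hits) <- c("queryHits", "subjectHits")

n_hits <- nrow(hits)                                  # overlapping pairs
n_exons_hit <- length(unique(hits[, "subjectHits"]))  # distinct exons with a hit
n_exons_without_gene <- nrow(exons) - n_exons_hit     # exons hitting no gene
```

Here exon e2 falls inside both genes, so the pair count is larger than the count of exons that overlap at least one gene.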

[Figure: Discrepancy between genes and exons in TxDb.Hsapiens.UCSC.hg19.knownGene]


I don't think your images are showing, if you have any.

written 13 months ago by Aaron Lun

Thanks, fixed it.

written 13 months ago by Aliaksei Holik
13 months ago by
United States
James W. MacDonald wrote:

There are undoubtedly many reasons that the exons and genes don't all line up. One reason is likely the question of what is considered a gene. If you look at the names of the GRanges object you get when you do genes(TxDb), those are all Entrez Gene IDs, which is in one sense the list of all the 'genes'.

But there are any number of 'genes' that don't (yet) have Entrez Gene IDs. There are lots of lincRNA, piRNA, and probably even miRNA sequences that are not in the Gene database. For example, if we get all naive and stuff, we can check this out.

> ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
> exns <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene)

> exns[!exns %over% unlist(ex),]
GRanges object with 19332 ranges and 1 metadata column:
                seqnames           ranges strand   |   exon_id
                   <Rle>        <IRanges>  <Rle>   | <integer>
      [1]           chr1 [321084, 321115]      +   |         8
      [2]           chr1 [321146, 321207]      +   |         9
      [3]           chr1 [420206, 420296]      +   |        20
      [4]           chr1 [420992, 421258]      +   |        21
      [5]           chr1 [421396, 421839]      +   |        22
      ...            ...              ...    ... ...       ...
  [19328] chrUn_gl000241   [35706, 35859]      -   |    289965
  [19329] chrUn_gl000241   [36711, 36875]      -   |    289966
  [19330] chrUn_gl000243   [11501, 11530]      +   |    289967
  [19331] chrUn_gl000243   [13608, 13637]      +   |    289968
  [19332] chrUn_gl000247   [ 5787,  5816]      -   |    289969
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

So like you say, about 20K exons without a corresponding gene. The first two are piRNAs, and the next three are clone images. So with a sample size of 5 out of 20K, I would venture to guess it's probably a combination of all sorts of things that have been reported by someone somewhere, that have not yet become 'real' enough to make it into the Gene database.


FWIW I'll just add to Jim's excellent answer that, from a pure data structure point of view, the genes in a TxDb object are the parents of the transcripts and the transcripts are the parents of the exons. In other words: exons are linked to transcripts (in a child-to-parent relationship) and transcripts are linked to genes (also in a child-to-parent relationship). There is NO direct link from exons to genes. Furthermore every exon in a TxDb object is guaranteed to be linked to (at least) one transcript (i.e., to have at least one parent), but not every transcript is guaranteed to be linked to a gene. Said otherwise: there can be orphan transcripts but NO orphan exons. So the exons that you see without a gene are exons that are linked to a transcript without a gene.
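To make the parent/child picture concrete, here is a minimal base R sketch with made-up tables. The `transcripts` and `exon_tx` data frames and all IDs in them are hypothetical stand-ins, not the actual TxDb schema. It only shows that an exon reaches a gene through its transcript(s), so an exon ends up "without a gene" exactly when none of its parent transcripts has one.

```r
# Hypothetical toy tables mimicking the exon -> transcript -> gene links.
transcripts <- data.frame(tx_id = c(1, 2, 3),
                          gene_id = c("geneA", "geneA", NA),  # tx 3 is an orphan
                          stringsAsFactors = FALSE)
exon_tx <- data.frame(exon_id = c(10, 11, 11, 12),  # exon 11 belongs to two transcripts
                      tx_id = c(1, 1, 2, 3))

# Every exon has at least one parent transcript (no orphan exons).
all_exons_have_tx <- all(exon_tx$tx_id %in% transcripts$tx_id)

# Exons reach genes only via their transcripts.
linked <- merge(exon_tx, transcripts, by = "tx_id")

# An exon has a gene if any of its parent transcripts has one.
has_gene <- tapply(!is.na(linked$gene_id), linked$exon_id, any)
exons_without_gene <- as.numeric(names(has_gene)[!has_gene])
```

In this toy case only exon 12 is left without a gene, because its single parent transcript (tx 3) has no gene ID.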

Now Jim gave many of the reasons why some transcripts in the UCSC knownGene table (i.e. UCSC Genes track) are not associated with an Entrez Gene ID. Note however that this is in the end a question for the UCSC folks. It's important to realize that we don't curate their data; we just take what they provide and store it as-is in the TxDb.Hsapiens.UCSC.hg19.knownGene package.

Hope this helps and that I didn't confuse you too much.

Cheers,

H.

written 13 months ago by Hervé Pagès

That's very clear. Thank you James and Hervé!

written 12 months ago by Aliaksei Holik