Question

Get canonical exon locations for gene

dmccabe • 0


I'm trying to get the positions of exons in the DMD gene using reference assembly hg19. All information I've found indicates that DMD has 79 exons. Yet I get more than 90 variously overlapping exons using GenomicFeatures:

library(GenomicRanges)
library(Homo.sapiens)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
dmd_exbytx <- exonsBy(txdb, "gene")[["1756"]]

> dmd_exbytx
GRanges object with 90 ranges and 2 metadata columns:
       seqnames               ranges strand |   exon_id   exon_name
          <Rle>            <IRanges>  <Rle> | <integer> <character>
   [1]     chrX [31137345, 31140047]      - |    272767        <NA>
   [2]     chrX [31144759, 31144790]      - |    272768        <NA>
   [3]     chrX [31152219, 31152311]      - |    272769        <NA>
   [4]     chrX [31164408, 31164531]      - |    272770        <NA>
   [5]     chrX [31165392, 31165635]      - |    272771        <NA>
   ...      ...                  ...    ... .       ...         ...
  [86]     chrX [33038256, 33038317]      - |    272852        <NA>
  [87]     chrX [33146180, 33146544]      - |    272853        <NA>
  [88]     chrX [33146264, 33146545]      - |    272854        <NA>
  [89]     chrX [33229399, 33229673]      - |    272855        <NA>
  [90]     chrX [33357376, 33357726]      - |    272856        <NA>
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

What's going on here? Maybe I'm not understanding the biology, which would make sense since I'm not a biologist. I just want the start and end positions for the 79 exons in DMD, but hours of internet searching have gotten me nowhere.

genomicfeatures r bioconductor • 1.2k views

updated 6.1 years ago by James W. MacDonald • written 6.1 years ago by dmccabe

score 2 · Accepted Answer · 2018-03-13

There isn't really any such thing as a 'canonical' set of exons, for any gene. If you look at the UCSC genome browser, you can see that there are lots of different possible transcripts, each of which is made up of some combination of the 90 or so different exonic regions. In addition, what we know about a gene is based on everything that has been submitted to the different groups that are responsible for saying what is and isn't part of the genome. Unsurprisingly, the two main groups at NCBI and EBI/EMBL don't always agree about things, so if you were to use, say, the EnsDb.Hsapiens v90 database from AnnotationHub and naively count the exons, you would get even more:

> library(AnnotationHub)
> library(ensembldb)
> hub <- AnnotationHub()
> ensdb <- hub[["AH57757"]]
> ex2 <- exonsBy(ensdb, "gene")
> length(ex2[["ENSG00000198947"]])
[1] 152

And if we reduce the exons, we still have more than UCSC says:

> length(reduce(ex2[["ENSG00000198947"]]))
[1] 94
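To make it clearer what reduce() is doing there, here is a small sketch on made-up ranges (the coordinates are invented for illustration and have nothing to do with DMD). Overlapping or abutting ranges on the same sequence and strand are merged into one, which is why the count drops:

```r
library(GenomicRanges)

# Three toy "exons": the first two overlap, the third stands alone.
# These coordinates are hypothetical, chosen only to show the merge.
toy <- GRanges(
  seqnames = "chrX",
  ranges   = IRanges(start = c(100, 150, 400), end = c(200, 250, 500)),
  strand   = "-"
)

length(toy)            # 3 ranges before merging

# reduce() collapses overlapping/adjacent ranges into single ranges
merged <- reduce(toy)
length(merged)         # 2 ranges after merging
start(merged)          # 100 400
end(merged)            # 250 500
```

So the reduced count is a count of distinct exonic regions, not of exons in any one transcript.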

But this is a naive way of looking at genes, which are not really a thing. There are transcripts, which are made up of the various combinations of exons that may be used to generate a protein. A gene is made up of all the exons that may be used to generate transcripts, but for many genes there is no transcript that is made up of all the exons, so it's a bit of an artificial construct. If we count the exons in each of the transcripts that make up the DMD gene, the numbers are highly variable:

> ex3 <- exonsBy(ensdb, "tx")
> enstr <- mapIds(ensdb, "ENSG00000198947", "TXID", "GENEID", multiVals="list")
> sapply(ex3[enstr[[1]]], length)
ENST00000481143 ENST00000378723 ENST00000358062 ENST00000378677 ENST00000357033 
              2              17              30              79              79 
ENST00000378702 ENST00000474231 ENST00000361471 ENST00000378680 ENST00000378705 
             18              35              16              13               8 
ENST00000475732 ENST00000469142 ENST00000634285 ENST00000634315 ENST00000445312 
              3               6               1               2               5 
ENST00000471779 ENST00000488902 ENST00000493412 ENST00000420596 ENST00000448370 
              3               5               4               4               3 
ENST00000288447 ENST00000480751 ENST00000447523 ENST00000472681 ENST00000472266 
             18               3               5               5               2 
ENST00000463609 ENST00000378707 ENST00000343523 ENST00000541735 ENST00000359836 
              2              36              25              32              34 
ENST00000619831 ENST00000620040 
             81              81

There are two transcripts for this gene that have 79 exons, and you might call those 'canonical', but they are just two of the many possible isoforms, and may be canonical for a particular set of tissues, but are probably not in any sense 'the gene'.
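If you do decide that one particular transcript is the one you care about, a sketch along the following lines should give you its exon coordinates on hg19, using the same TxDb as in the question. Note that the Ensembl 90 database above is built on GRCh38, so its coordinates are not hg19 coordinates. I am assuming here that transcript names in this TxDb are unique, and I have not checked which (if any) UCSC transcripts of DMD have exactly 79 exons, so the last step may return an empty list:

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# Transcripts annotated to Entrez Gene ID 1756 (DMD)
tx <- transcriptsBy(txdb, "gene")[["1756"]]

# Exons grouped by transcript, named by transcript name;
# within each transcript the exons are ordered by exon rank
ex_by_tx <- exonsBy(txdb, "tx", use.names = TRUE)[tx$tx_name]

# Number of exons in each DMD transcript
n_exons <- lengths(ex_by_tx)
n_exons

# Keep only the transcripts with 79 exons (possibly none)
ex79 <- ex_by_tx[n_exons == 79]
ex79
```

Each element of ex79 is a GRanges, so start() and end() on it give the exon start and end positions for that transcript.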