Problem with extracting information about exons from TxDb.Hsapiens.UCSC.hg19.knownGene: incorrect number of exons and overlapping exons
1
0
Entering edit mode
olavur ▴ 10
@olavur-16765
Last seen 5.4 years ago

I'm trying to obtain exon intervals for a particular gene, but am running into two problems. First, the TxDb query returns a GRanges object with 12 ranges, but I know the gene has 10 exons. Second, two of the exons overlap (as I will show later).

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# UCSC/Entrez ID of SLC22A5.
 gene_id <- '6584'

# Get exons in SLC22A5.
exons_by_gene <- exonsBy(txdb, by='gene')
exons <- exons_by_gene[[gene_id]]

The result:

> as.data.frame(exons_gene)
   seqnames     start       end width strand exon_id exon_name
1      chr5 131705401 131706057   657      +   77618      <NA>
2      chr5 131713856 131713927    72      +   77619      <NA>
3      chr5 131714070 131714173   104      +   77620      <NA>
4      chr5 131719839 131719993   155      +   77621      <NA>
5      chr5 131721020 131721191   172      +   77622      <NA>
6      chr5 131722717 131722843   127      +   77623      <NA>
7      chr5 131724613 131724713   101      +   77624      <NA>
8      chr5 131726382 131726596   215      +   77625      <NA>
9      chr5 131726468 131726596   129      +   77626      <NA>
10     chr5 131728125 131728307   183      +   77627      <NA>
11     chr5 131729368 131729503   136      +   77628      <NA>
12     chr5 131729877 131731306  1430      +   77629      <NA>

Exons with IDs 77625 and 77626 overlap:

> ov <- findOverlapPairs(exons_gene, exons_gene)
> first(ov)$exon_id
 [1] 77618 77619 77620 77621 77622 77623 77624 77625 77625 77626 77626 77627 77628 77629
> second(ov)$exon_id
 [1] 77618 77619 77620 77621 77622 77623 77624 77625 77626 77625 77626 77627 77628 77629
txdb txdb.hsapiens.ucsc.hg19.knowngene granges exons ucsc • 1.4k views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 5 hours ago
United States

We just process the data available at UCSC for these packages, so any errors are theirs not ours. But to check...

> library(RMySQL)
Loading required package: DBI
> con <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.soe.ucsc.edu", dbname="hg19")
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txids <- select(TxDb.Hsapiens.UCSC.hg19.knownGene, "6584", "TXNAME", "GENEID")
'select()' returned 1:many mapping between keys and columns
> txids
  GENEID     TXNAME
1   6584 uc003kww.4
2   6584 uc003kwx.4
3   6584 uc010jdr.1

> z <- dbGetQuery(con, paste0("select exonStarts exonEnds from knownGene where name in ('", paste(txids[,2], collapse = "','"), "');"))

> starts <- do.call(c, sapply(z[,1], strsplit, split = ","))

> ends <- do.call(c, sapply(z[,2], strsplit, split = ","))

> names(starts) <- NULL
> names(ends) <- NULL
> starts <- as.numeric(starts)
> ends <- as.numeric(ends)
> d.f <- data.frame(starts = starts, ends = ends, concat = paste(starts, ends, sep = "-"), width = ends - starts)
> d.f <- d.f[!duplicated(d.f$concat),]
> d.f
      starts      ends              concat width
1  131705400 131706057 131705400-131706057   657
2  131714069 131714173 131714069-131714173   104
3  131719838 131719993 131719838-131719993   155
4  131721019 131721191 131721019-131721191   172
5  131722716 131722843 131722716-131722843   127
6  131724612 131724713 131724612-131724713   101
7  131726381 131726596 131726381-131726596   215
8  131728124 131728307 131728124-131728307   183
9  131729367 131729503 131729367-131729503   136
10 131729876 131731306 131729876-131731306  1430
12 131713855 131713927 131713855-131713927    72
25 131726467 131726596 131726467-131726596   129
> dim(d.f)
[1] 12  4

Which looks pretty identical to what you have. Additionally,

> dbGetQuery(con, paste0("select exonCount from knownGene where name in ('", paste(txids[,2], collapse = "','"), "');"))
  exonCount
1        10
2        11
3         6

UCSC says that one transcript has 11 exons, so I'm not sure why you are so sure there are only 10.

ADD COMMENT
0
Entering edit mode

Also, why do you think that two overlapping exons should only be counted as one? If they have different lengths, they are different, no?

ADD REPLY
0
Entering edit mode

Well, I thought I had reliable sources that the gene has 10 exons, but as you showed that depends on which transcript you're looking at. What I needed to do for my analysis to make sense is to pick a transcript, then there are no overlapping exons. Thanks for the help.

So my solution is:

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# UCSC/Entrez ID of SLC22A5.
gene_id <- '6584'

# Get transcripts corresponding to the gene.
tx <- transcriptsBy(txdb, by='gene')  # All transcripts by gene.
tx <- tx[[gene_id]]

# Get exons corresponding to one of the transcripts in the gene.
exons <- exonsBy(txdb, by='tx')  # All exons by transcript.
tx_name <- tx$tx_name[1]  # Pick the first of the transcripts in the gene.
tx_id <- tx$tx_id[1]
exons <- exons[[tx_id]]
ADD REPLY
0
Entering edit mode

An alternative to that is to generate a fake transcript that consists of the full extent of all known exons. The obvious tradeoff being that in your method you are saying 'this is the important transcript' and in my method you are saying 'these are all the genomic regions that can be part of any transcript'. But that all depends on what your analysis might be. Anyway:

## after running your code

> exons <- exons[[tx_id]]
> exons
GRanges object with 10 ranges and 3 metadata columns:
       seqnames              ranges strand |   exon_id   exon_name exon_rank
          <Rle>           <IRanges>  <Rle> | <integer> <character> <integer>
   [1]     chr5 131705401-131706057      + |     77618        <NA>         1
   [2]     chr5 131714070-131714173      + |     77620        <NA>         2
   [3]     chr5 131719839-131719993      + |     77621        <NA>         3
   [4]     chr5 131721020-131721191      + |     77622        <NA>         4
   [5]     chr5 131722717-131722843      + |     77623        <NA>         5
   [6]     chr5 131724613-131724713      + |     77624        <NA>         6
   [7]     chr5 131726382-131726596      + |     77625        <NA>         7
   [8]     chr5 131728125-131728307      + |     77627        <NA>         8
   [9]     chr5 131729368-131729503      + |     77628        <NA>         9
  [10]     chr5 131729877-131731306      + |     77629        <NA>        10
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome


## And alternatively
> exons2 <- reduce(exonsBy(txdb, "gene"))
> exons2[[gene_id]]
GRanges object with 11 ranges and 0 metadata columns:
       seqnames              ranges strand
          <Rle>           <IRanges>  <Rle>
   [1]     chr5 131705401-131706057      +
   [2]     chr5 131713856-131713927      +
   [3]     chr5 131714070-131714173      +
   [4]     chr5 131719839-131719993      +
   [5]     chr5 131721020-131721191      +
   [6]     chr5 131722717-131722843      +
   [7]     chr5 131724613-131724713      +
   [8]     chr5 131726382-131726596      +
   [9]     chr5 131728125-131728307      +
  [10]     chr5 131729368-131729503      +
  [11]     chr5 131729877-131731306      +
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

 

ADD REPLY

Login before adding your answer.

Traffic: 818 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6