Question

Introns at transcript expression level

0

Entering edit mode

cookie ▴ 20

@cookie-7508

Last seen 9.7 years ago

USA

Hi,

I am trying to get the average number of introns per transcripts in the human genome. I am new to this, but I concluded that I have to use the package GenomicFeatures or Iranges.

Thus far I have made this code chuck:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene)

introns=intronsByTranscript(TxDb.Hsapiens.UCSC.hg38.knownGene)

And that's it. I am confused about what to do next, but I am quite sure I missing something here. I tried to do something with the findOverlaps function, but it seems strange to overlap the introns with the txdb object again, as I extracted the
the introns from transcripts. What is the intronsByTranscripts function really doing here?

What is the solution to this? Any help is appreciated.

Thanks!

genomicfeatures iranges • 2.0k views

ADD COMMENT • link updated 9.7 years ago by Hervé Pagès 16k • written 9.7 years ago by cookie ▴ 20

1

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 2 days ago

Seattle, WA, United States

Why the average number of introns per transcript and not the average number of exons?

Note that in a perfect world the latter should be equal to the former plus 1. In practice it's not:

ex_by_tx <- exonsBy(txdb, use.names=TRUE)
int_by_tx <- intronsByTranscript(txdb, use.names=TRUE)

ex_by_tx and int_by_tx are parallel GRangesList objects:

length(ex_by_tx)
# [1] 82960
length(int_by_tx)
# [1] 82960
identical(names(ex_by_tx), names(int_by_tx))
# [1] TRUE

The number of introns in a transcript is at most the number of exons minus 1:

all(elementLengths(int_by_tx) <= elementLengths(ex_by_tx) - 1)
# [1] TRUE

However, it's not always equal to the number of exons minus 1:

table(elementLengths(int_by_tx) == elementLengths(ex_by_tx) - 1)
# FALSE  TRUE 
#    33 82927

It's a rare situation though. After some investigation, this seems to happen for transcripts that UCSC reports as having adjacent exons. Like for example in transcript uc001sem.3:

ex_by_tx$uc001sem.3
# GRanges object with 3 ranges and 3 metadata columns:
#       seqnames               ranges strand |   exon_id   exon_name exon_rank
#          <Rle>            <IRanges>  <Rle> | <integer> <character> <integer>
#   [1]    chr12 [54366910, 54367707]      + |    162570        <NA>         1
#   [2]    chr12 [54368965, 54369667]      + |    162571        <NA>         2
#   [3]    chr12 [54369668, 54370203]      + |    162572        <NA>         3
#   -------
#   seqinfo: 93 sequences (1 circular) from hg19 genome

int_by_tx$uc001sem.3
# GRanges object with 1 range and 0 metadata columns:
#       seqnames               ranges strand
#          <Rle>            <IRanges>  <Rle>
#   [1]    chr12 [54367708, 54368964]      +
#   -------
#   seqinfo: 93 sequences (1 circular) from hg19 genome

Don't know what that means exactly from a biological point of view (one would need to ask to the UCSC people). Anyway, just thought I should mention it if you're going to count the number of introns per transcript. (Should probably add this to the man page for intronsByTranscript too.)

Cheers,

H.

ADD COMMENT • link 9.7 years ago Hervé Pagès 16k

0

Entering edit mode

This is quite interesting. Is there a paper on this? I am interested how it is possible exon +1 is equal introns.

ADD REPLY • link 9.7 years ago cookie ▴ 20

score 2 · Accepted Answer · 2015-03-24

The intronsByTranscript() function is creating a GRangesList item:

> intr <- intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene)
> intr
GRangesList object of length 82960:
$1
GRanges object with 2 ranges and 0 metadata columns:
      seqnames         ranges strand
         <Rle>      <IRanges>  <Rle>
  [1]     chr1 [12228, 12612]      +
  [2]     chr1 [12722, 13220]      +

$2
GRanges object with 2 ranges and 0 metadata columns:
      seqnames         ranges strand
  [1]     chr1 [12228, 12594]      +
  [2]     chr1 [12722, 13402]      +

$3
GRanges object with 2 ranges and 0 metadata columns:
      seqnames         ranges strand
  [1]     chr1 [12228, 12645]      +
  [2]     chr1 [12698, 13220]      +

...
<82957 more elements>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome

To find out more about this, and what you can do with it, see ?GRangesList. But to answer your question, you want elementLengths():

> lens <- elementLengths(intr)
> summary(lens)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
   0.000    2.000    6.000    7.948   11.000 4910.000