Why the average number of introns per transcript and not the average number of exons?
Note that in a perfect world the latter should be equal to the former plus 1. In practice it's not:
ex_by_tx <- exonsBy(txdb, use.names=TRUE)
int_by_tx <- intronsByTranscript(txdb, use.names=TRUE)
ex_by_tx
and int_by_tx
are parallel GRangesList objects:
length(ex_by_tx)
# [1] 82960
length(int_by_tx)
# [1] 82960
identical(names(ex_by_tx), names(int_by_tx))
# [1] TRUE
The number of introns in a transcript is at most the number of exons minus 1:
all(elementLengths(int_by_tx) <= elementLengths(ex_by_tx) - 1)
# [1] TRUE
However, it's not always equal to the number of exons minus 1:
table(elementLengths(int_by_tx) == elementLengths(ex_by_tx) - 1)
# FALSE TRUE
# 33 82927
It's a rare situation though. After some investigation, this seems to happen for transcripts that UCSC reports as having adjacent exons. Like for example in transcript uc001sem.3
:
ex_by_tx$uc001sem.3
# GRanges object with 3 ranges and 3 metadata columns:
# seqnames ranges strand | exon_id exon_name exon_rank
# <Rle> <IRanges> <Rle> | <integer> <character> <integer>
# [1] chr12 [54366910, 54367707] + | 162570 <NA> 1
# [2] chr12 [54368965, 54369667] + | 162571 <NA> 2
# [3] chr12 [54369668, 54370203] + | 162572 <NA> 3
# -------
# seqinfo: 93 sequences (1 circular) from hg19 genome
int_by_tx$uc001sem.3
# GRanges object with 1 range and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr12 [54367708, 54368964] +
# -------
# seqinfo: 93 sequences (1 circular) from hg19 genome
Don't know what that means exactly from a biological point of view (one would need to ask to the UCSC people). Anyway, just thought I should mention it if you're going to count the number of introns per transcript. (Should probably add this to the man page for intronsByTranscript
too.)
Cheers,
H.
Thanks a lot!