Question

Extract exon features from overlapping genes

0

Entering edit mode

Emilie • 0

@emilie-7807

Last seen 8.9 years ago

France

Hi,

I would like to extract exon features for specific transcripts. I am using the "exons" function from GenomicFeatures to retrieve exon ranks corresponding to a transcript ID. However, this doesn't work for overlapping genes where exons are strictly identical. Below is one example, but I got the same result for different transcript IDs, and gtf files from various sources. So my question is: how can I extract a list of exon ranks for a specific transcript when genes are overlapping ?

Thank you very much in advance,

Emilie

> library("GenomicFeatures")
> txdb2 <- makeTranscriptDbFromUCSC(genome="hg19", tablename = "ensGene")
> exons(txdb2,vals=list(tx_name="ENST00000336097"),columns=c("gene_id","exon_rank"))
GRanges object with 6 ranges and 2 metadata columns:
seqnames ranges strand | gene_id exon_rank
[1] chr17 [49230937, 49231023] + | ENSG00000239672 1
[2] chr17 [49231586, 49231805] + | ENSG00000011052,ENSG00000239672,ENSG00000243678 2
[3] chr17 [49233012, 49233141] + | ENSG00000011052,ENSG00000239672,ENSG00000243678 2,3
[4] chr17 [49237341, 49237442] + | ENSG00000011052,ENSG00000239672,ENSG00000243678 3,4
[5] chr17 [49238521, 49238633] + | ENSG00000011052,ENSG00000239672,ENSG00000243678 4,5
[6] chr17 [49239089, 49239422] + | ENSG00000239672 6
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
> exons(txdb2,vals=list(tx_name="ENST00000336097"),columns=c("exon_rank"))$exon_rank
IntegerList of length 6
[[1]] 1
[[2]] 2
[[3]] 2 3
[[4]] 3 4
[[5]] 4 5
[[6]] 6

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C 

attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods base other 

attached packages:
[1] GenomicFeatures_1.18.6 AnnotationDbi_1.28.2 Biobase_2.26.0
[4] BSgenome.Hsapiens.UCSC.hg19_1.4.0 gridExtra_0.9.1 plyr_1.8.1
[7] ggplot2_1.0.1 BSgenome_1.34.1 rtracklayer_1.26.3
[10] GenomicRanges_1.18.4 GenomeInfoDb_1.2.4 Biostrings_2.34.1
[13] XVector_0.6.0 IRanges_2.0.1 S4Vectors_0.4.0
[16] BiocGenerics_0.12.1 RSQLite_1.0.0 DBI_0.3.1

loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.6 BBmisc_1.9 BiocParallel_1.0.3
[5] biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6 checkmate_1.5.2
[9] codetools_0.2-11 colorspace_1.2-6 digest_0.6.8 fail_1.2
[13] foreach_1.4.2 GenomicAlignments_1.2.2 gtable_0.1.2 iterators_1.0.7
[17] MASS_7.3-40 munsell_0.4.2 proto_0.3-10 Rcpp_0.11.5
[21] RCurl_1.95-4.5 reshape2_1.4.1 Rsamtools_1.18.3 scales_0.2.4
[25] sendmailR_1.2-1 stringr_0.6.2 tools_3.1.3 XML_3.98-1.1
[29] zlibbioc_1.12.0

GenomicFeatures exons • 1.3k views

ADD COMMENT • link updated 8.9 years ago by Hervé Pagès 16k • written 8.9 years ago by Emilie • 0

score 0 · Answer 1 · 2015-05-28

Salut Emilie,

Indeed, sometimes an exon belongs to more than 1 transcript, and, in that case, exons() reports more than 1 rank for that exon. This is why the exon_rank metadata column returned by your call to exons() is an IntegerList (and not just an integer vector), so more than 1 rank can be specified for each exon. However, I can see that this IntegerList representation is problematic here because it doesn't tell you which transcript these ranks are supposed to be relative to.

Anyway the recommended way to get the exon structure of some (or all) transcripts is to use exonsBy():

ex_by_tx <- exonsBy(txdb2, by="tx", use.names=TRUE)
ex_by_tx[["ENST00000336097"]]
# GRanges object with 6 ranges and 3 metadata columns:
#       seqnames               ranges strand |   exon_id   exon_name exon_rank
#          <Rle>            <IRanges>  <Rle> | <integer> <character> <integer>
#   [1]    chr17 [49230937, 49231023]      + |    443106        <NA>         1
#   [2]    chr17 [49231586, 49231805]      + |    443116        <NA>         2
#   [3]    chr17 [49233012, 49233141]      + |    443119        <NA>         3
#   [4]    chr17 [49237341, 49237442]      + |    443122        <NA>         4
#   [5]    chr17 [49238521, 49238633]      + |    443126        <NA>         5
#   [6]    chr17 [49239089, 49239422]      + |    443130        <NA>         6
#   -------
#   seqinfo: 93 sequences (1 circular) from hg19 genome

Hope this helps,
H.