Extract exon features from overlapping genes
1
0
Entering edit mode
Emilie • 0
@emilie-7807
Last seen 8.9 years ago
France

Hi,

I would like to extract exon features for specific transcripts. I am using the "exons" function from GenomicFeatures to retrieve exon ranks corresponding to a transcript ID. However, this doesn't work for overlapping genes where exons are strictly identical. Below is one example, but I got the same result for different transcript IDs, and gtf files from various sources. So my question is: how can I extract a list of exon ranks for a specific transcript when genes are overlapping ?

Thank you very much in advance,

Emilie

> library("GenomicFeatures")
> txdb2 <- makeTranscriptDbFromUCSC(genome="hg19", tablename = "ensGene")
> exons(txdb2,vals=list(tx_name="ENST00000336097"),columns=c("gene_id","exon_rank"))
GRanges object with 6 ranges and 2 metadata columns:
seqnames ranges strand | gene_id exon_rank
[1] chr17 [49230937, 49231023] + | ENSG00000239672 1
[2] chr17 [49231586, 49231805] + | ENSG00000011052,ENSG00000239672,ENSG00000243678 2
[3] chr17 [49233012, 49233141] + | ENSG00000011052,ENSG00000239672,ENSG00000243678 2,3
[4] chr17 [49237341, 49237442] + | ENSG00000011052,ENSG00000239672,ENSG00000243678 3,4
[5] chr17 [49238521, 49238633] + | ENSG00000011052,ENSG00000239672,ENSG00000243678 4,5
[6] chr17 [49239089, 49239422] + | ENSG00000239672 6
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
> exons(txdb2,vals=list(tx_name="ENST00000336097"),columns=c("exon_rank"))$exon_rank
IntegerList of length 6
[[1]] 1
[[2]] 2
[[3]] 2 3
[[4]] 3 4
[[5]] 4 5
[[6]] 6

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C 

attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods base other 

attached packages:
[1] GenomicFeatures_1.18.6 AnnotationDbi_1.28.2 Biobase_2.26.0
[4] BSgenome.Hsapiens.UCSC.hg19_1.4.0 gridExtra_0.9.1 plyr_1.8.1
[7] ggplot2_1.0.1 BSgenome_1.34.1 rtracklayer_1.26.3
[10] GenomicRanges_1.18.4 GenomeInfoDb_1.2.4 Biostrings_2.34.1
[13] XVector_0.6.0 IRanges_2.0.1 S4Vectors_0.4.0
[16] BiocGenerics_0.12.1 RSQLite_1.0.0 DBI_0.3.1

loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.6 BBmisc_1.9 BiocParallel_1.0.3
[5] biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6 checkmate_1.5.2
[9] codetools_0.2-11 colorspace_1.2-6 digest_0.6.8 fail_1.2
[13] foreach_1.4.2 GenomicAlignments_1.2.2 gtable_0.1.2 iterators_1.0.7
[17] MASS_7.3-40 munsell_0.4.2 proto_0.3-10 Rcpp_0.11.5
[21] RCurl_1.95-4.5 reshape2_1.4.1 Rsamtools_1.18.3 scales_0.2.4
[25] sendmailR_1.2-1 stringr_0.6.2 tools_3.1.3 XML_3.98-1.1
[29] zlibbioc_1.12.0

 

GenomicFeatures exons • 1.3k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Salut Emilie,

Indeed, sometimes an exon belongs to more than 1 transcript, and, in that case, exons() reports more than 1 rank for that exon. This is why the exon_rank metadata column returned by your call to exons() is an IntegerList (and not just an integer vector), so more than 1 rank can be specified for each exon. However, I can see that this IntegerList representation is problematic here because it doesn't tell you which transcript these ranks are supposed to be relative to.

Anyway the recommended way to get the exon structure of some (or all) transcripts is to use exonsBy():

ex_by_tx <- exonsBy(txdb2, by="tx", use.names=TRUE)
ex_by_tx[["ENST00000336097"]]
# GRanges object with 6 ranges and 3 metadata columns:
#       seqnames               ranges strand |   exon_id   exon_name exon_rank
#          <Rle>            <IRanges>  <Rle> | <integer> <character> <integer>
#   [1]    chr17 [49230937, 49231023]      + |    443106        <NA>         1
#   [2]    chr17 [49231586, 49231805]      + |    443116        <NA>         2
#   [3]    chr17 [49233012, 49233141]      + |    443119        <NA>         3
#   [4]    chr17 [49237341, 49237442]      + |    443122        <NA>         4
#   [5]    chr17 [49238521, 49238633]      + |    443126        <NA>         5
#   [6]    chr17 [49239089, 49239422]      + |    443130        <NA>         6
#   -------
#   seqinfo: 93 sequences (1 circular) from hg19 genome

Hope this helps,
H.

ADD COMMENT

Login before adding your answer.

Traffic: 670 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6