alessandro.pastore
I would like to generate a GRangesList of all gene introns, with names. I can build the exon list, but I do not see an elegant way to get the introns. Any suggestions?
Thanks!
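library(AnnotationHub)
library(ensembldb)
# Fetch the Ensembl 90 EnsDb for human from AnnotationHub
edb <- query(AnnotationHub(), c("Ensembl 90 EnsDb", "Homo sapiens"))[[1]]
# One row per exon/transcript pair, with all transcript columns plus the gene name
exons.Grange <- exons(edb, columns = c(listColumns(edb , "tx"), "gene_name"))
# Keep rows whose exon_id was already seen (note: duplicated() drops the first occurrence)
exons.Grange <- exons.Grange[duplicated(exons.Grange$exon_id),]
exons.Grange <- split(exons.Grange, exons.Grange$exon_id)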
> exons.Grange
GRangesList object of length 221795:
$ENSE00000327880
GRanges object with 5 ranges and 11 metadata columns:
seqnames ranges strand | tx_id tx_biotype tx_seq_start tx_seq_end tx_cds_seq_start tx_cds_seq_end
<Rle> <IRanges> <Rle> | <character> <character> <integer> <integer> <integer> <integer>
ENSE00000327880 1 [27732603, 27732657] + | ENST00000419687 protein_coding 27725996 27761473 27726081 27760581
ENSE00000327880 1 [27732603, 27732657] + | ENST00000530324 protein_coding 27726028 27759764 27726081 27759657
ENSE00000327880 1 [27732603, 27732657] + | ENST00000234549 protein_coding 27726028 27760581 27726081 27760581
ENSE00000327880 1 [27732603, 27732657] + | ENST00000373949 protein_coding 27726028 27761964 27726081 27760581
ENSE00000327880 1 [27732603, 27732657] + | ENST00000010299 protein_coding 27726057 27760581 27726081 27760581
gene_id tx_support_level tx_name gene_name exon_id
<character> <integer> <character> <character> <character>
ENSE00000327880 ENSG00000009780 2 ENST00000419687 FAM76A ENSE00000327880
ENSE00000327880 ENSG00000009780 1 ENST00000530324 FAM76A ENSE00000327880
ENSE00000327880 ENSG00000009780 1 ENST00000234549 FAM76A ENSE00000327880
ENSE00000327880 ENSG00000009780 2 ENST00000373949 FAM76A ENSE00000327880
ENSE00000327880 ENSG00000009780 1 ENST00000010299 FAM76A ENSE00000327880
$ENSE00000328922
GRanges object with 2 ranges and 11 metadata columns:
seqnames ranges strand | tx_id tx_biotype tx_seq_start tx_seq_end tx_cds_seq_start
ENSE00000328922 3 [131018506, 131018716] - | ENST00000264992 protein_coding 131013875 131026802 131014057
ENSE00000328922 3 [131018506, 131018716] - | ENST00000507978 nonsense_mediated_decay 131013982 131026854 131017000
tx_cds_seq_end gene_id tx_support_level tx_name gene_name exon_id
ENSE00000328922 131025306 ENSG00000034533 1 ENST00000264992 ASTE1 ENSE00000328922
ENSE00000328922 131025306 ENSG00000034533 2 ENST00000507978 ASTE1 ENSE00000328922
$ENSE00000329326
GRanges object with 2 ranges and 11 metadata columns:
seqnames ranges strand | tx_id tx_biotype tx_seq_start tx_seq_end tx_cds_seq_start tx_cds_seq_end
ENSE00000329326 8 [132583694, 132583779] - | ENST00000250173 protein_coding 132572201 132675559 132578498 132675493
ENSE00000329326 8 [132583694, 132583779] - | ENST00000618342 protein_coding 132571953 132661667 132572306 132661667
gene_id tx_support_level tx_name gene_name exon_id
ENSE00000329326 ENSG00000129295 1 ENST00000250173 LRRC6 ENSE00000329326
ENSE00000329326 ENSG00000129295 5 ENST00000618342 LRRC6 ENSE00000329326
...
<221792 more elements>
-------
seqinfo: 388 sequences from GRCh38 genome

I'd say your approach seems to be pretty OK. There is no intron ID stored in the database, so you can't get that from an EnsDb.
Thanks! I thought it would be nice to keep some kind of mcols information...
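Since the database stores no intron IDs, one common way to get introns is to subtract each transcript's exons from the transcript's overall span. The following is only a sketch on toy data, not the method proposed in the thread: the transcripts (`tx1`, `tx2`), genes (`geneA`, `geneB`), coordinates, and the `<tx_id>_intron<n>` naming scheme are all made up for illustration. With a real EnsDb the exons would come from a call such as `exons(edb, columns = c("tx_id", "gene_name"))`.

```r
library(GenomicRanges)

# Toy exons for two hypothetical transcripts (all values are illustrative)
exons_gr <- GRanges(
  seqnames = c("1", "1", "1", "3", "3"),
  ranges = IRanges(start = c(100, 300, 500, 1000, 1200),
                   end   = c(200, 400, 600, 1100, 1300)),
  strand = c("+", "+", "+", "-", "-"),
  tx_id = c("tx1", "tx1", "tx1", "tx2", "tx2"),
  gene_name = c("geneA", "geneA", "geneA", "geneB", "geneB")
)

# Group exons by transcript
exons_by_tx <- split(exons_gr, exons_gr$tx_id)

# Introns = full span of each transcript minus its exons
tx_span <- unlist(range(exons_by_tx))
introns_by_tx <- psetdiff(tx_span, exons_by_tx)

# Flatten and carry over some metadata (mcols) from the exons
introns_gr <- unlist(introns_by_tx)
introns_gr$tx_id <- names(introns_gr)
introns_gr$gene_name <- exons_gr$gene_name[match(introns_gr$tx_id, exons_gr$tx_id)]

# Hypothetical intron names: <tx_id>_intron<n>, numbered in genomic order
intron_idx <- ave(seq_along(introns_gr), introns_gr$tx_id, FUN = seq_along)
names(introns_gr) <- paste0(introns_gr$tx_id, "_intron", intron_idx)

# Named GRangesList of introns, one element per gene
introns_by_gene <- split(introns_gr, introns_gr$gene_name)
```

Note that the numbering follows genomic coordinates, so for minus-strand transcripts intron 1 is the 3'-most intron. Introns shared by several transcripts of the same gene appear once per transcript; `unique()` on the flattened ranges could collapse them if that is not wanted.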