Question

How to get introns with the ensembldb package?


alessandro.pastore ▴ 20


I would like to generate a GRangesList of all gene introns, with names. I can make the exon list, but I do not see an elegant way to get the introns. Any suggestions?

Thanks!

library(AnnotationHub)

edb <- query(AnnotationHub(), c("Ensembl 90 EnsDb", "Homo sapiens"))[[1]]

exons.Grange <- exons(edb, columns = c(listColumns(edb , "tx"), "gene_name"))

exons.Grange <- exons.Grange[duplicated(exons.Grange$exon_id),]

exons.Grange <- split(exons.Grange, exons.Grange$exon_id)

> exons.Grange
GRangesList object of length 221795:
$ENSE00000327880 
GRanges object with 5 ranges and 11 metadata columns:
                  seqnames               ranges strand |           tx_id     tx_biotype tx_seq_start tx_seq_end tx_cds_seq_start tx_cds_seq_end
                     <Rle>            <IRanges>  <Rle> |     <character>    <character>    <integer>  <integer>        <integer>      <integer>
  ENSE00000327880        1 [27732603, 27732657]      + | ENST00000419687 protein_coding     27725996   27761473         27726081       27760581
  ENSE00000327880        1 [27732603, 27732657]      + | ENST00000530324 protein_coding     27726028   27759764         27726081       27759657
  ENSE00000327880        1 [27732603, 27732657]      + | ENST00000234549 protein_coding     27726028   27760581         27726081       27760581
  ENSE00000327880        1 [27732603, 27732657]      + | ENST00000373949 protein_coding     27726028   27761964         27726081       27760581
  ENSE00000327880        1 [27732603, 27732657]      + | ENST00000010299 protein_coding     27726057   27760581         27726081       27760581
                          gene_id tx_support_level         tx_name   gene_name         exon_id
                      <character>        <integer>     <character> <character>     <character>
  ENSE00000327880 ENSG00000009780                2 ENST00000419687      FAM76A ENSE00000327880
  ENSE00000327880 ENSG00000009780                1 ENST00000530324      FAM76A ENSE00000327880
  ENSE00000327880 ENSG00000009780                1 ENST00000234549      FAM76A ENSE00000327880
  ENSE00000327880 ENSG00000009780                2 ENST00000373949      FAM76A ENSE00000327880
  ENSE00000327880 ENSG00000009780                1 ENST00000010299      FAM76A ENSE00000327880

$ENSE00000328922 
GRanges object with 2 ranges and 11 metadata columns:
                  seqnames                 ranges strand |           tx_id              tx_biotype tx_seq_start tx_seq_end tx_cds_seq_start
  ENSE00000328922        3 [131018506, 131018716]      - | ENST00000264992          protein_coding    131013875  131026802        131014057
  ENSE00000328922        3 [131018506, 131018716]      - | ENST00000507978 nonsense_mediated_decay    131013982  131026854        131017000
                  tx_cds_seq_end         gene_id tx_support_level         tx_name gene_name         exon_id
  ENSE00000328922      131025306 ENSG00000034533                1 ENST00000264992     ASTE1 ENSE00000328922
  ENSE00000328922      131025306 ENSG00000034533                2 ENST00000507978     ASTE1 ENSE00000328922

$ENSE00000329326 
GRanges object with 2 ranges and 11 metadata columns:
                  seqnames                 ranges strand |           tx_id     tx_biotype tx_seq_start tx_seq_end tx_cds_seq_start tx_cds_seq_end
  ENSE00000329326        8 [132583694, 132583779]      - | ENST00000250173 protein_coding    132572201  132675559        132578498      132675493
  ENSE00000329326        8 [132583694, 132583779]      - | ENST00000618342 protein_coding    132571953  132661667        132572306      132661667
                          gene_id tx_support_level         tx_name gene_name         exon_id
  ENSE00000329326 ENSG00000129295                1 ENST00000250173     LRRC6 ENSE00000329326
  ENSE00000329326 ENSG00000129295                5 ENST00000618342     LRRC6 ENSE00000329326

...
<221792 more elements>
-------
seqinfo: 388 sequences from GRCh38 genome

granges ensembldb GenomicFeatures • 1.4k views


score 1 · Answer 1 · 2018-01-19


alessandro.pastore ▴ 20


I can generate a GRangesList of introns, but the names are lost.

intron.Grange <- transcripts(edb, columns = c(listColumns(edb , "tx"), "gene_name"), 
                             filter = list(GeneBiotypeFilter("protein_coding") ))

intron.Grange <- setdiff(intron.Grange, exons.Grange)

intron.Grange$intron_id <- paste0("intron_id", seq_along(intron.Grange))

intron.Grange <- split(intron.Grange, intron.Grange$intron_id)



I'd say your approach seems to be pretty OK. There is no intron ID stored in the database, so you can't get that from an EnsDb.
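
Since intron IDs have to be made up anyway, one way to keep meaningful names is to compute introns per transcript and build the ID from the transcript ID. The sketch below is only an illustration on toy data, not a feature of ensembldb: the transcript names (tx1, tx2), the coordinates, and the "_intron" naming scheme are all invented. With a real EnsDb, the exon list would presumably come from exonsBy(edb, by = "tx") instead of the hand-built GRangesList.

```r
library(GenomicRanges)

# Toy exons grouped by transcript (hypothetical names and coordinates).
# With an EnsDb this would be something like: exonsBy(edb, by = "tx")
exons_by_tx <- GRangesList(
  tx1 = GRanges("1", IRanges(c(100, 300, 500), c(200, 400, 600)), strand = "+"),
  tx2 = GRanges("1", IRanges(c(1000, 1300), c(1100, 1400)), strand = "-")
)

# One range per transcript, spanning its first to last exon
tx_span <- unlist(range(exons_by_tx))

# Introns = transcript span minus its own exons, computed transcript by transcript
introns_by_tx <- psetdiff(tx_span, exons_by_tx)
names(introns_by_tx) <- names(exons_by_tx)

# Flatten and attach the transcript ID plus a made-up intron ID.
# Note: the numbering follows genomic order, not 5'->3' order on the minus strand.
n_introns <- lengths(introns_by_tx)
introns <- unlist(introns_by_tx, use.names = FALSE)
introns$tx_id <- rep(names(introns_by_tx), n_introns)
introns$intron_id <- paste0(introns$tx_id, "_intron", sequence(n_introns))

introns
```

Splitting the result with split(introns, introns$tx_id) gives a GRangesList again, now carrying the transcript ID and the constructed intron ID. Gene names could be added the same way by matching tx_id against the metadata columns returned by transcripts(edb).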