Question

Problems with "exonsBy" function from GenomicFeatures package

0

Entering edit mode

alcs417 • 0

@alcs417-14149

Last seen 7.0 years ago

Hi there,

I was using the R scripts written by Irsan(https://www.biostars.org/p/83901/) to obtain the exonic length for each protein-coding gene. However, the length information of many genes were not included in the final output. I randomly picked some of them and found that if I manually get the ensembl_transcript_id from biomaRt then the exonsBy() function works well. Could anybody tell me why? The scripts are pasted below:

library(biomaRt)
library(GenomicFeatures)
txdb2 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl")
exons.list.per.gene <- exonsBy(txdb2, by="gene")   
exonic.gene.length <- lapply(exons.list.per.gene, function(x){sum(width(reduce(x)))})

### The results obtained from the code above do not contain length information for "ENSG00000281887" (randomly picked as an example). But I can manually run to get the exonic length. ###

mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
res <- getBM(attributes=c("ensembl_gene_id", "ensembl_gene_id_version", "ensembl_transcript_id",
                    "hgnc_symbol"), filters="ensembl_gene_id", values="ENSG00000281887", mart=mart)
txdb2 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl", transcript_ids=res$ensembl_transcipt_id)
exons.list.per.gene <- exonsBy(txdb2, by="gene")   
exonic.gene.length <- lapply(exons.list.per.gene, function(x){sum(width(reduce(x)))})

I know I could run the script in a loop to get the gene length information for each gene, but it takes too much time since the "makeTxDbFromBiomart" is extremely slow. Any suggestions would be really appreciated.

exonsBy; GenomicFeatures; • 1.3k views

ADD COMMENT • link 7.0 years ago alcs417 • 0

0

Entering edit mode

note that an lapply() with GenomicRanges is usually a red flag for inefficient code. Try simply sum(width(reduce(exons.list.per.gene))), which runs in a fraction of the time and returns a vector (desired) rather than list.

ADD REPLY • link 7.0 years ago Martin Morgan 25k

0

Entering edit mode

alcs417 • 0

@alcs417-14149

Last seen 7.0 years ago

Thanks very much! It perfectly solved my problem!

ADD COMMENT • link 7.0 years ago alcs417 • 0

score 2 · Accepted Answer · 2017-10-10

Making the TxDb when there is already one available is unnecessary effort duplication. Johannes Rainer has already made EnsDb packages for human, and put them on the AnnotationHub.

> library(AnnotationHub)

Attaching package: 'AnnotationHub'

The following object is masked from 'package:Biobase':

    cache

> hub <- AnnotationHub()
snapshotDate(): 2017-04-25
> query(hub, c("ensdb","homo sapiens"))
AnnotationHub with 2 records
# snapshotDate(): 2017-04-25
# $dataprovider: Ensembl
# $species: Homo Sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'

            title                            
  AH53211 | Ensembl 87 EnsDb for Homo Sapiens
  AH53715 | Ensembl 88 EnsDb for Homo Sapiens
> z <- hub[["AH53715"]]
require("ensembldb")
downloading from 'https://annotationhub.bioconductor.org/fetch/60453'
retrieving 1 resource
  |======================================================================| 100%
> exs <- exonsBy(z, "gene")
exs[["ENSG00000281887"]]
> GRanges object with 6 ranges and 1 metadata column:
      seqnames                 ranges strand |         exon_id
         <Rle>              <IRanges>  <Rle> |     <character>
  [1]        7 [150716668, 150716690]      + | ENSE00003762946
  [2]        7 [150719042, 150719090]      + | ENSE00003720568
  [3]        7 [150720048, 150720406]      + | ENSE00003728945
  [4]        7 [150737505, 150737708]      + | ENSE00003747063
  [5]        7 [150740879, 150740927]      + | ENSE00003728722
  [6]        7 [150742183, 150743646]      + | ENSE00003751629
  -------
  seqinfo: 389 sequences from GRCh38 genome
> length(exs)
[1] 64592