Hi there,
I was using the R scripts written by Irsan(https://www.biostars.org/p/83901/) to obtain the exonic length for each protein-coding gene. However, the length information of many genes were not included in the final output. I randomly picked some of them and found that if I manually get the ensembl_transcript_id from biomaRt then the exonsBy() function works well. Could anybody tell me why? The scripts are pasted below:
library(biomaRt) library(GenomicFeatures) txdb2 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl") exons.list.per.gene <- exonsBy(txdb2, by="gene") exonic.gene.length <- lapply(exons.list.per.gene, function(x){sum(width(reduce(x)))})
### The results obtained from the code above do not contain length information for "ENSG00000281887" (randomly picked as an example). But I can manually run to get the exonic length. ###
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl") res <- getBM(attributes=c("ensembl_gene_id", "ensembl_gene_id_version", "ensembl_transcript_id", "hgnc_symbol"), filters="ensembl_gene_id", values="ENSG00000281887", mart=mart) txdb2 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl", transcript_ids=res$ensembl_transcipt_id) exons.list.per.gene <- exonsBy(txdb2, by="gene") exonic.gene.length <- lapply(exons.list.per.gene, function(x){sum(width(reduce(x)))})
I know I could run the script in a loop to get the gene length information for each gene, but it takes too much time since the "makeTxDbFromBiomart" is extremely slow. Any suggestions would be really appreciated.
note that an lapply() with GenomicRanges is usually a red flag for inefficient code. Try simply
sum(width(reduce(exons.list.per.gene)))
, which runs in a fraction of the time and returns a vector (desired) rather than list.