Hi,
I have total RNA-seq data and would like to compute TPM and FPKM values for a read count matrix generated for lncRNAs using the annotation file gencode.v28.long_noncoding_RNAs.gtf. To compute TPM and FPKM from the counts I first need the lengths of the lncRNA genes, which I obtain by summing their exon lengths accounting for overlaps using the R package GenomicFeatures. Then I compute effective length as gene length minus average fragment length + 1 (average fragment length was average insert length for any sample as provided to me by the sequencing lab). I used the following code:
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("/home/inah/MESA_mRNAseq_New/gencode.v28.long_noncoding_RNAs.gtf", format="gtf",organism="Homo sapiens")
exons.list.per.gene <- exonsBy(txdb,by="gene")
exonic.gene.sizes <- lapply(exons.list.per.gene,function(x){sum(width(reduce(x)))})
exonic.gene.lengths <- as.numeric(exonic.gene.sizes)
names(exonic.gene.lengths) <- names(exonic.gene.sizes)
summary(exonic.gene.lengths)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 68 499 734 1440 1775 205012
35 of the 15779 exonic.gene.lengths values are less than the average fragment length of 134 (for one sample). So something must be going wrong here. So the effective gene length would be negative for these 35 lncRNAs.
Does any GenomicFeatures (or other) expert have an idea of what is going on here? Am I calculating the lengths correctly? Is this maybe a problem with lncRNAs?
Thank you ... Ina
Update:
I just did the same for protein-coding genes using mRNA-seq data and am also getting gene lengths that are shorter than the fragment lengths, so something may be wrong with my gene length calculation? See below.
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("/home/inah/MESA_mRNAseq_New/Homo_sapiens.GRCh38.85.gtf", format="gtf",organism="Homo sapiens")
exons.list.per.gene <- exonsBy(txdb,by="gene")
exonic.gene.sizes <- lapply(exons.list.per.gene,function(x){sum(width(reduce(x)))})
exonic.gene.lengths <- as.numeric(exonic.gene.sizes)
names(exonic.gene.lengths) <- names(exonic.gene.sizes)
summary(exonic.gene.lengths)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 8 407 950 2328 3148 205012
sum(exonic.gene.lengths<134) # 5889