Question

how to calculate gene length to be used in rpkm() in edgeR

shirley zhang ★ 1.0k

@shirley-zhang-2038

Dear List,

I've been using edgeR for differential expression analysis of data generated from the same tissue under different conditions.

Now I have an RNA-seq dataset A (n=20) and would like to compare it with another RNA-seq dataset B (n=1,000 across different tissues). Since dataset B consists of normalized, batch-effect-adjusted RPKM values, I need to generate RPKM values for my own dataset A.

I already have a count table and would like to use rpkm() in edgeR, but first I need a gene length vector. My question is how to compute gene lengths from an "Ensembl.gtf" file, taking the following into account:

1. Gene 1 is much longer than Gene 2 if both exons and introns are included. But Gene 1 has only 3 exons and Gene 2 has 10 exons, so in terms of transcript length, Gene 2 > Gene 1.

2. The same gene can have more than one transcript isoform, and different isoforms will be expressed in different tissues.

Many thanks,
Shirley

RNASeq edgeR rpkm • 17k views

updated 2.8 years ago by Gordon Smyth 53k • written 11.8 years ago by shirley zhang ★ 1.0k

score 3 · Answer 1 · 2014-05-04

Hi Ryan and Shirley,

The appropriate gene length should match the method and annotation that was used to count the reads.

I'm assuming that the counting method and annotation used for the new data A might differ from that used for data B, so the appropriate gene lengths might not be the same.

The software used to count the reads should also return the appropriate gene length. For example, here is a case study showing how gene lengths are returned by the featureCounts function and used to compute rpkm in edgeR:

https://subread.sourceforge.net/RNAseqCaseStudy.html

In the latest version of edgeR, rpkm() will even find the gene lengths automatically in the DGEList object. In this case study, the gene length is defined to be the total length of all exons in the gene, including the 3'UTR, because featureCounts counts all reads that overlap any exon.
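To make the "total length of all exons in the gene" definition concrete, here is a minimal Python sketch (not edgeR or featureCounts code) that sums the bases covered by at least one exon of each gene, so that exons shared between isoforms are counted once. It assumes tab-separated GTF lines with 1-based inclusive coordinates and a `gene_id "..."` attribute; the function name `union_exon_lengths` and the demo records are hypothetical. Counting overlapping bases only once is my assumption about how this definition is applied, so in practice the lengths reported by the counting software should take precedence.

```python
from collections import defaultdict


def union_exon_lengths(gtf_lines):
    """Return {gene_id: number of bases covered by at least one exon}."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records contribute to the length
        start, end = int(fields[3]), int(fields[4])  # GTF: 1-based, inclusive
        gene_id = None
        # Attribute column looks like: gene_id "G1"; transcript_id "T1";
        for attr in fields[8].split(";"):
            attr = attr.strip()
            if attr.startswith("gene_id "):
                gene_id = attr.split(" ", 1)[1].strip('"')
                break
        if gene_id is not None:
            exons[gene_id].append((start, end))

    lengths = {}
    for gene_id, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                # Overlapping or adjacent exon: extend the current block
                cur_end = max(cur_end, end)
            else:
                # Gap (intron): close the current block and start a new one
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene_id] = total
    return lengths


# Hypothetical records: gene G1 has two isoforms with overlapping exons.
demo = [
    'chr1\tdemo\texon\t100\t199\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tdemo\texon\t150\t249\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr1\tdemo\texon\t300\t349\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tdemo\texon\t1000\t1049\t.\t-\t.\tgene_id "G2"; transcript_id "T3";',
]
demo_lengths = union_exon_lengths(demo)
```

For G1 the first two exons merge into one 150-base block (100 to 249) and the third adds 50 bases, so the intron between them does not contribute to the length.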

Best wishes
Gordon

Gordon Smyth · Answer 2 · 2014-05-02

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Icahn School of Medicine at Mount Sinai…

Hi Shirley,

The appropriate gene length to use is whatever gene length was used to compute RPKM values for data set B. If you don't have that information, then I don't see how you can compute comparable RPKM values for your data.
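A small Python sketch of the textbook RPKM arithmetic (reads per kilobase of gene per million mapped reads) shows why the length definition matters. This is an illustration only, not edgeR's rpkm() implementation, which may also apply normalization factors to the library size; the function name `rpkm_value` and all numbers are hypothetical.

```python
def rpkm_value(count, gene_length_bp, library_size):
    """Reads per kilobase of gene per million reads in the library."""
    return count * 1e9 / (gene_length_bp * library_size)


count = 500               # reads assigned to the gene (made-up value)
library_size = 20_000_000  # total reads in the library (made-up value)

# Same gene, same count, two different length definitions:
exonic = rpkm_value(count, 2_000, library_size)    # exons only
genomic = rpkm_value(count, 50_000, library_size)  # exons plus introns
```

With these made-up numbers the exon-only length gives 12.5 and the exon-plus-intron length gives 0.5, a 25-fold difference for identical counts, which is why RPKM values computed under different length definitions are not directly comparable.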

-Ryan

updated 11.3 years ago by Gordon Smyth 53k • written 11.8 years ago by Ryan C. Thompson ★ 7.9k