how to calculate gene length to be used in rpkm() in edgeR
shirley zhang

Dear List,

I've been using edgeR for differential expression analysis of data generated from the same tissue under different conditions.

Now I have an RNA-seq data set A (n=20) and would like to compare it with another RNA-seq data set B (n=1,000 across different tissues). Since data B consists of normalized, batch-effect-adjusted RPKM values, I need to generate RPKM values for my own data A.

I already have a count table and would like to use rpkm() in edgeR, but first I need a gene length vector. My question is how to compute gene lengths from an "Ensembl.gtf" file, taking the following into account:

1. Gene 1 is much longer than Gene 2 when both exons and introns are included. But Gene 1 has only 3 exons while Gene 2 has 10 exons, so at the transcript level Gene 2 > Gene 1.

2. The same gene can have more than one transcript isoform, and different isoforms may be expressed in different tissues.

Many thanks,

RNASeq edgeR rpkm • 14k views
WEHI, Melbourne, Australia

Hi Ryan and Shirley,

The appropriate gene length should match the method and annotation that was used to count the reads.

I'm assuming that the counting method and annotation used for the new data A might differ from that used for data B, so the appropriate gene lengths might not be the same.

The software used to count the reads should also return the appropriate gene length. For example, here is a case study showing how gene lengths are returned by the featureCounts function and used to compute rpkm in edgeR:

In the latest version of edgeR, rpkm() will even find the gene lengths automatically in the DGEList object. In this case study, the gene length is defined to be the total length of all exons in the gene, including the 3'UTR, because featureCounts counts all reads that overlap any exon.
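To make "total length of all exons in the gene" concrete, here is a minimal Python sketch of that definition. It is only an illustration, not the code used by featureCounts or edgeR. The function name `exon_union_lengths` and the toy GTF records are made up for this example, and it assumes that every exon line carries a `gene_id` attribute and that each gene sits on a single chromosome. Introns are excluded, and bases shared by overlapping exons of different isoforms are counted once.

```python
import re
from collections import defaultdict


def exon_union_lengths(gtf_lines):
    """Return {gene_id: number of bases covered by at least one exon}."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records contribute to the length
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match is None:
            continue
        # GTF coordinates are 1-based and inclusive.
        exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                # Overlapping or adjacent exon: extend the current block.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene] = total
    return lengths


# Hypothetical toy annotation: geneA has two isoforms with overlapping exons.
toy_gtf = [
    '1\ttoy\texon\t100\t199\t.\t+\t.\tgene_id "geneA"; transcript_id "tA1";',
    '1\ttoy\texon\t150\t249\t.\t+\t.\tgene_id "geneA"; transcript_id "tA2";',
    '1\ttoy\texon\t400\t449\t.\t+\t.\tgene_id "geneA"; transcript_id "tA1";',
    '1\ttoy\tgene\t1000\t9999\t.\t-\t.\tgene_id "geneB";',
    '1\ttoy\texon\t1000\t1099\t.\t-\t.\tgene_id "geneB"; transcript_id "tB1";',
]
lengths = exon_union_lengths(toy_gtf)
print(lengths)
```

In the toy data, geneA's first two exons overlap and merge into one block of 150 bases, so its length is 150 + 50 = 200, while geneB's length is 100 even though its `gene` record spans 9,000 bases. Whatever tool is used, the lengths should come from the same annotation that was used for counting.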

Best wishes

Scripps Research, La Jolla, CA

Hi Shirley,

The appropriate gene length to use is whatever gene length was used to compute RPKM values for data set B. If you don't have that information, then I don't see how you can compute comparable RPKM values for your data.
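A small Python sketch of why the length definition matters, using the usual RPKM definition (reads per kilobase of gene length per million reads in the library). The numbers and the helper name `rpkm_value` are hypothetical, and this is not edgeR's rpkm() implementation.

```python
def rpkm_value(count, gene_length_bp, library_size):
    """Reads per kilobase of gene length per million library reads."""
    return count / (gene_length_bp / 1e3) / (library_size / 1e6)


# Hypothetical gene: 500 reads in a library of 20 million reads.
count, library_size = 500, 20_000_000

# Same counts, two different length definitions for the same gene.
rpkm_short = rpkm_value(count, 2_000, library_size)  # e.g. one isoform
rpkm_long = rpkm_value(count, 4_000, library_size)   # e.g. union of all exons
print(rpkm_short, rpkm_long)
```

Doubling the assumed gene length halves the RPKM, so values computed with a different length definition from the one used for data set B will differ by a gene-specific factor.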


