how to calculate gene length to be used in rpkm() in edgeR
shirley zhang ★ 1.0k
@shirley-zhang-2038
Last seen 7.0 years ago

Dear List,

I've been using edgeR for differential expression analysis of data generated from the same tissue under different conditions.

Now I have an RNA-seq data set A (n=20) that I would like to compare with another RNA-seq data set B (n=1,000 across different tissues). Since data B consists of normalized, batch-effect-adjusted RPKM values, I need to generate RPKM values for my own data A.

I already have a count table and would like to use rpkm() in edgeR, but first I need a gene length vector. My question is how to compute gene lengths from an "Ensembl.gtf" file, taking the following into account:

1. Gene 1 is much longer than Gene 2 if both exons and introns are included. But Gene 1 has only 3 exons and Gene 2 has 10 exons, so at the transcript level Gene 2 > Gene 1.

2. The same gene can have more than one transcript isoform, and different isoforms will be expressed in different tissues.

Many thanks,
Shirley

RNASeq edgeR rpkm • 13k views
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia

Hi Ryan and Shirley,

The appropriate gene length should match the method and annotation that was used to count the reads.

I'm assuming that the counting method and annotation used for the new data A might differ from that used for data B, so the appropriate gene lengths might not be the same.

The software used to count the reads should also return the appropriate gene length. For example, here is a case study showing how gene lengths are returned by the featureCounts function and used to compute rpkm in edgeR:

  http://bioinf.wehi.edu.au/RNAseqCaseStudy

In the latest version of edgeR, rpkm() will even find the gene lengths automatically in the DGEList object. In this case study, the gene length is defined to be the total length of all exons in the gene, including the 3'UTR, because featureCounts counts all reads that overlap any exon.
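To make the "total length of all exons" definition concrete, here is a minimal Python sketch of the arithmetic. It is not the featureCounts or edgeR implementation. The function names, the toy exon records, and the library size are all hypothetical, and it assumes that exons shared or overlapping between isoforms are merged so each base is counted once, that coordinates are 1-based and inclusive as in a GTF file, and that each gene sits on a single chromosome. It also uses the plain RPKM formula with a raw library size, with no normalization factors applied.

```python
from collections import defaultdict


def union_exon_lengths(exons):
    """Return {gene_id: length in bp} from (gene_id, start, end) exon records.

    Coordinates are assumed 1-based and inclusive (GTF convention).
    Overlapping or adjacent exons of the same gene are merged first,
    so bases shared between isoforms are counted only once.
    """
    by_gene = defaultdict(list)
    for gene_id, start, end in exons:
        by_gene[gene_id].append((start, end))

    lengths = {}
    for gene_id, intervals in by_gene.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                # Overlaps or touches the current block: extend it.
                cur_end = max(cur_end, end)
            else:
                # Gap (an intron): close the current block, start a new one.
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene_id] = total
    return lengths


def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase of gene length per million reads in the library."""
    return count / (gene_length_bp / 1e3) / (library_size / 1e6)


# Hypothetical toy annotation: geneA has two isoforms sharing part of an exon.
toy_exons = [
    ("geneA", 101, 200),   # isoform 1, exon 1
    ("geneA", 151, 300),   # isoform 2, exon 1 (overlaps the exon above)
    ("geneA", 501, 600),   # exon shared by both isoforms
    ("geneB", 1, 1000),    # single-exon gene
]
lengths = union_exon_lengths(toy_exons)
print(lengths)  # {'geneA': 300, 'geneB': 1000}

# Hypothetical count and library size, for illustration only.
value = rpkm(count=300, gene_length_bp=lengths["geneA"], library_size=2_000_000)
```

In the toy data, geneA spans positions 101 to 600 but only 300 bp are exonic, so introns do not inflate its length, and the overlap between the two isoforms is not double counted. Whatever definition is used, it should match how the reads were actually assigned to genes.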

Best wishes
Gordon 

@ryan-c-thompson-5618
Last seen 12 months ago
Scripps Research, La Jolla, CA

Hi Shirley,

The appropriate gene length to use is whatever gene length was used to compute RPKM values for data set B. If you don't have that information, then I don't see how you can compute comparable RPKM values for your data.

-Ryan
