Question: how to calculate gene length to be used in rpkm() in edgeR
shirley zhang wrote (3.6 years ago):

Dear List,

I've been using edgeR for differential expression analysis of data generated from the same tissue under different conditions.

Now I have an RNA-seq data set A (n=20) and would like to compare it with another RNA-seq data set B (n=1,000 across different tissues). Since data B consists of normalized, batch-effect-adjusted RPKM values, I need to generate RPKM values for my own data A.

I already have a count table and would like to use rpkm() in edgeR, but first I need a gene length vector. My question is how to compute gene lengths from an "Ensembl.gtf" file, taking the following into account:

1. Gene 1 is much longer than Gene 2 when both exons and introns are included, but Gene 1 has only 3 exons while Gene 2 has 10 exons --> at the transcript level, Gene 2 > Gene 1.

2. The same gene can have more than one transcript isoform, and different isoforms will be expressed in different tissues.

Many thanks,
Shirley

Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote (3.5 years ago):

Hi Ryan and Shirley,

The appropriate gene length should match the method and annotation that was used to count the reads.

I'm assuming that the counting method and annotation used for the new data A might differ from that used for data B, so the appropriate gene lengths might not be the same.

The software used to count the reads should also return the appropriate gene length. For example, here is a case study showing how gene lengths are returned by the featureCounts function and used to compute rpkm in edgeR:

  http://bioinf.wehi.edu.au/RNAseqCaseStudy

In the latest version of edgeR, rpkm() will even find the gene lengths automatically in the DGEList object. In this case study, the gene length is defined to be the total length of all exons in the gene, including the 3'UTR, because featureCounts counts all reads that overlap any exon.
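To make the "total length of all exons" definition concrete, here is a minimal Python sketch that computes a union-exon length per gene from GTF-style lines. It is only an illustration, not the featureCounts or edgeR code: the function name union_exon_lengths is hypothetical, and it assumes tab-separated GTF records with a gene_id attribute, with overlapping exons from different isoforms merged so that each base is counted once.

```python
from collections import defaultdict


def union_exon_lengths(gtf_lines):
    """Return {gene_id: number of bases covered by at least one exon}.

    Hypothetical helper for illustration; assumes standard 9-column,
    tab-separated GTF records with 1-based inclusive coordinates.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records contribute to the length
        start, end = int(fields[3]), int(fields[4])
        gene_id = None
        for attr in fields[8].split(";"):
            attr = attr.strip()
            if attr.startswith("gene_id "):
                gene_id = attr.split(" ", 1)[1].strip('"')
                break
        if gene_id is not None:
            exons[gene_id].append((start, end))

    lengths = {}
    for gene_id, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                # overlapping or adjacent exon: extend the current block
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene_id] = total
    return lengths
```

With this definition, a gene with many short exons can be longer than a gene that spans more of the genome but has few exons, and isoforms that share exons do not inflate the length.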

Best wishes
Gordon 

Ryan C. Thompson (The Scripps Research Institute, La Jolla, CA) wrote (3.6 years ago):

Hi Shirley,

The appropriate gene length to use is whatever gene length was used to compute RPKM values for data set B. If you don't have that information, then I don't see how you can compute comparable RPKM values for your data.
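To illustrate why the choice of gene length matters, here is a small Python sketch of the textbook RPKM formula (reads per kilobase of gene per million mapped reads). It is not edgeR's rpkm() implementation, which may also apply library-size normalization factors; the function name rpkm_value and the numbers are made up for the example.

```python
def rpkm_value(count, gene_length_bp, library_size):
    """Textbook RPKM: count / (length in kb) / (library size in millions).

    Hypothetical helper for illustration only.
    """
    return count * 1e9 / (gene_length_bp * library_size)


# Same read count and library size, two different length definitions
# (made-up values): doubling the assumed length halves the RPKM.
short_def = rpkm_value(500, 2000, 20_000_000)
long_def = rpkm_value(500, 4000, 20_000_000)
```

So two data sets computed with different length definitions can differ by a gene-specific factor even when the underlying expression is identical.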

-Ryan
