Question

edgeR rpkm function

1

Entering edit mode

Makis Motakis ▴ 30

@makis-motakis-5507

Last seen 8.7 years ago

Japan

Hello all,

I want to calculate RPKM values of my data and, following previous posts, I use the function rpkm() of edgeR. Initially, I checked how the function works on the hypothetical data of http://blog.nextgenetics.net/?e=51 (Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Wagner GP, Kin K, Lynch VJ. Theory Biosci. 2012) and I got as output the "inconsistent" values presented at the second table of "Inconsistency with RPKM" paragraph of the above webpage. Could someone please advice if there is actually a problem with the rpkm() function in edgeR? if yes, do I have to recalculate the values manually or is there an updated function?

Thank you

Mike

edger rpkm • 20k views

ADD COMMENT • link updated 13 months ago by Gordon Smyth 50k • written 9.3 years ago by Makis Motakis ▴ 30

score 2 · Answer 1 · 2015-02-01

2

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 9 hours ago

WEHI, Melbourne, Australia

In edgeR, you should run calcNormFactors() before running rpkm(), for example:

y <- DGEList(counts=counts,genes=data.frame(Length=GeneLength))
y <- calcNormFactors(y)
RPKM <- rpkm(y)

Then rpkm will use the normalized effective library sizes to compute rpkm instead of the raw library sizes. This solves the problem pointed out by Wagner et al.

We view the edgeR approach as better than either raw rpkm or the fix suggested by the paper that you cite.

If you try it out, note though calcNormFactors() is designed to work on real data sets with many genes. It won't necessarily give good results on a toy hypothetical dataset of just a few genes.

ADD COMMENT • link 9.2 years ago Gordon Smyth 50k

0

Entering edit mode

Hi Gordon, would you tell me how to get GeneLength for Length parameter? Thank you!

ADD REPLY • link 13 months ago Chris • 0

0

Entering edit mode

See how to calculate gene length to be used in rpkm() in edgeR. If you still need help then please open a new question with some context about your situation and the data you have. That is far better than adding comments to 8-year-old answers.

ADD REPLY • link 13 months ago Gordon Smyth 50k

score 1 · Answer 2 · 2015-02-01

There is no problem with the rpkm function in edgeR. It does exactly what it says on the tin, i.e., it computes the reads per kilobase per million for each gene in each sample. The link you provided suggests an adjustment to the RPKMs to avoid the problem of "inconsistency" between samples, but these adjusted values are not RPKMs anymore. If you want this adjustment, you'll just have to do it yourself:

t(t(rpkms)/colSums(rpkms))

for a matrix of rpkms. Personally, I think that these adjusted RPKMs are more difficult to interpret. Consider the example below:

Gene	Length (kbp)	*Sample A (reads, RPKM)*	*Sample B (reads, RPKM)*
1	10	10 (33333333)	10 (33333333)
2	10	10 (33333333)	10 (33333333)
3	20	10 (16666667)	0 (0)
4	5	0 (0)	10 (66666667)

If you compared RPKMs directly between samples A and B, genes 1 and 2 will not be DE (which is the correct state of affairs). However, if you performed the adjustment, you would divide all RPKM values in sample A by 83333333, and those in sample B by 133333333. This would introduce a spurious difference of 60% between A and B for genes 1 and 2, which is not ideal.