Question

The difference between the three methods in calcNormFactors() in edgeR

Zhan Tianyu ▴ 40

@zhan-tianyu-6632

Hello all,

I have a question concerning calcNormFactors() in edgeR. There are three methods that I could choose from: "TMM", "RLE", and "upperquartile". How should I decide which one to use?

For example, consider a simple case like this: there are 10 genes in total and 4 individuals in each of two groups. The count data is therefore a 10 x 8 matrix, where each row is a gene, each column is an individual, columns 1-4 are the first group, and columns 5-8 are the second group. Among the 10 genes, 60% are differentially expressed: the counts of genes No. 3, 4, 5, 6, 8 and 9 are doubled in the first group, while the others are the same in both groups. Please see the attachment for this count data.

Then I generated the "group" factor via this command:
> grp <- as.factor(rep(0:1, each = 8/2))

After that, I generated the DGEList by:
> d <- DGEList(counts = counts, group = grp )

Then I calculated the normalization factor by edgeR:
> n <- calcNormFactors(d)

By default, this function uses the "TMM" method. However, the normalization factors look like this:

        group lib.size    norm.factors
Sample1     0  5062446 1.1195829383593
Sample2     0  5062340 0.8154739771400
Sample3     0  5062444 1.1195827474525
Sample4     0  5062466 1.1403164060313
Sample5     1  3000123 0.9624162935534
Sample6     1  2999992 0.9624163157255
Sample7     1  2999977 0.9624169648716
Sample8     1  3000156 0.9624160077253

I think this is odd, because the normalization factors for individuals 1 and 2 are quite different (1.11958 and 0.81547), yet their counts are nearly the same (please see the attachment for the count data).

Then I tried the RLE method:
> n <- calcNormFactors(d, method = "RLE")

The results are:

$samples
        group lib.size    norm.factors
Sample1     0  5062446 1.0886765699045
Sample2     0  5062340 1.0886508565338
Sample3     0  5062444 1.0886766741626
Sample4     0  5062466 1.0886750099086
Sample5     1  3000123 0.9185446848068
Sample6     1  2999992 0.9185578680804
Sample7     1  2999977 0.9185624609049
Sample8     1  3000156 0.9185437155777

I think the results are more reasonable this time. My question is: how do I decide which method to use? And why does TMM give a weird result?

Thank you.

Best regards,
sewen67

Normalization edgeR • 3.3k views

updated 10.2 years ago by Gordon Smyth 52k • written 10.4 years ago by Zhan Tianyu ▴ 40

score 1 · Answer 1 · 2014-07-06

Dear Zhan Tianyu,

The edgeR authors obviously recommend TMM. It is the default and is used in all the edgeR examples and case studies.

I don't know of any published comparative study showing better performance for the other methods.

TMM is not, however, designed to work well with very small numbers of genes (such as your toy example with 10 genes). Actually, your toy example does not fit the assumptions of any of the normalization methods, because the majority of the genes (all but four, in fact) are differentially expressed. I don't think you can learn much about the performance of the different methods on real data from this example.
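
To illustrate why a majority of differentially expressed genes breaks the assumption, here is a minimal base R sketch. It does not use edgeR and is not the code behind calcNormFactors(); it applies a simplified median-of-ratios (RLE-style) size factor to a hypothetical, noise-free version of the toy design. The baseline counts in `mu` are invented for illustration, since the original attachment is not available.

```r
# Hypothetical toy data: 10 genes, 8 samples, no sampling noise.
mu   <- seq(100, 1000, by = 100)            # assumed baseline count per gene
de   <- c(3, 4, 5, 6, 8, 9)                 # genes doubled in group 1
fold <- ifelse(seq_along(mu) %in% de, 2, 1)

# Columns 1-4 are group 1 (DE genes doubled), columns 5-8 are group 2.
counts <- cbind(matrix(rep(mu * fold, 4), nrow = 10),
                matrix(rep(mu, 4), nrow = 10))
colnames(counts) <- paste0("Sample", 1:8)

# Simplified median-of-ratios size factors:
# ratio of each sample to the per-gene geometric mean, then the median.
gm <- exp(rowMeans(log(counts)))
sf <- apply(counts, 2, function(u) median(u / gm))

# Scale the counts and compute the apparent log2 fold change per gene.
norm <- sweep(counts, 2, sf, "/")
lfc  <- log2(rowMeans(norm[, 1:4]) / rowMeans(norm[, 5:8]))

round(sf, 3)
round(lfc, 3)
```

Because six of the ten genes are doubled, the median ratio tracks the doubled genes. The size factors differ by a factor of 2 between the groups, so after scaling the six truly changed genes show a log2 fold change of 0 and the four unchanged genes show -1. The method has treated the changed majority as the baseline, which is the opposite of the truth in this design.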

If you think that TMM has given an incorrect result for a real dataset then I suggest that you send your data example offline to the TMM author, Mark Robinson, so that he can trouble-shoot.

There was no attachment with your email, and I don't think that you have examined the right thing to judge which is the better normalization.
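
The reply does not say what should be examined instead. One candidate, assumed here rather than taken from the thread, is the effective library size, which edgeR defines as lib.size multiplied by norm.factors. The sketch below uses only the values posted in the question and plain base R; the variable names are chosen for illustration.

```r
# Library sizes and normalization factors copied from the question.
lib.size <- c(5062446, 5062340, 5062444, 5062466,
              3000123, 2999992, 2999977, 3000156)
tmm <- c(1.1195829383593, 0.8154739771400, 1.1195827474525, 1.1403164060313,
         0.9624162935534, 0.9624163157255, 0.9624169648716, 0.9624160077253)
rle <- c(1.0886765699045, 1.0886508565338, 1.0886766741626, 1.0886750099086,
         0.9185446848068, 0.9185578680804, 0.9185624609049, 0.9185437155777)

# Effective library size = raw library size times normalization factor.
eff <- cbind(TMM = lib.size * tmm, RLE = lib.size * rle)
rownames(eff) <- paste0("Sample", 1:8)
round(eff)

# Within each method the factors multiply to approximately 1,
# so only their relative sizes carry information.
c(TMM = prod(tmm), RLE = prod(rle))
```

Comparing the effective library sizes between samples, rather than the raw factors in isolation, shows how each method would rescale the samples relative to one another in downstream modelling.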

Best wishes
Gordon