Hello all,
I have a question concerning calcNormFactors() in edgeR. There are three methods that I can choose from: "TMM", "RLE", and "upperquartile". How should I decide which one to use?
For example, consider a simple case like this: there are 10 genes in total and 4 individuals in each of two groups. The count data is therefore a 10 x 8 matrix, where each row is a gene and each column is an individual; columns 1-4 are the first group and columns 5-8 are the second group. Among the 10 genes, 60% are differentially expressed: the counts of genes No. 3, 4, 5, 6, 8, and 9 are doubled in the first group, while the others are the same in both groups. Please see the attachments for the count data.
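For readers without the attachment, a dataset with the same layout could be sketched roughly as follows. The baseline means, the Poisson noise, and the seed are illustrative assumptions and not the attached data, so normalization factors computed from it will not reproduce the numbers posted in this thread.

```r
# Hypothetical stand-in for the attached counts (NOT the original data).
set.seed(1)
n_genes <- 10
n_per_group <- 4

# Assumed baseline expected counts for the 10 genes.
baseline <- seq(100, 1000, by = 100)

# Genes 3, 4, 5, 6, 8, 9 are doubled in the first group (columns 1-4).
de_genes <- c(3, 4, 5, 6, 8, 9)
fold <- rep(1, n_genes)
fold[de_genes] <- 2

# Expected counts: first group with doubled DE genes, second group at baseline.
mu <- cbind(
  matrix(baseline * fold, nrow = n_genes, ncol = n_per_group),
  matrix(baseline, nrow = n_genes, ncol = n_per_group)
)

# Draw Poisson counts around the expected values.
counts <- matrix(
  rpois(length(mu), lambda = mu),
  nrow = n_genes,
  dimnames = list(paste0("Gene", 1:n_genes), paste0("Sample", 1:8))
)

grp <- factor(rep(0:1, each = n_per_group))
```

The `counts` and `grp` objects have the same shape as the ones used in the commands in this post.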
Then I generated the "group" factor via this command:
> grp <- as.factor(rep(0:1, each = 8/2))
After that, I generated the DGEList by:
> d <- DGEList(counts = counts, group = grp )
Then I calculated the normalization factor by edgeR:
> n <- calcNormFactors(d)
By default, this function uses the "TMM" method. However, the normalization factors look like this:
        group lib.size    norm.factors
Sample1     0  5062446 1.1195829383593
Sample2     0  5062340 0.8154739771400
Sample3     0  5062444 1.1195827474525
Sample4     0  5062466 1.1403164060313
Sample5     1  3000123 0.9624162935534
Sample6     1  2999992 0.9624163157255
Sample7     1  2999977 0.9624169648716
Sample8     1  3000156 0.9624160077253
I think this is weird, because the normalization factors for individuals 1 and 2 are quite different (1.11958 and 0.81547), yet their counts are generally the same (please see the attachment for the count data).
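To put a number on that discrepancy, here is a small sketch using only the values printed above. It relies on edgeR's convention that the effective library size is lib.size multiplied by norm.factors; the variable names are made up for illustration.

```r
# Values copied from the TMM output above.
lib_size <- c(5062446, 5062340, 5062444, 5062466,
              3000123, 2999992, 2999977, 3000156)
tmm_factor <- c(1.1195829383593, 0.8154739771400,
                1.1195827474525, 1.1403164060313,
                0.9624162935534, 0.9624163157255,
                0.9624169648716, 0.9624160077253)
names(lib_size) <- names(tmm_factor) <- paste0("Sample", 1:8)

# Effective library size = raw library size x normalization factor.
eff_lib_size <- lib_size * tmm_factor

# Samples 1 and 2 have nearly identical raw library sizes,
# but their effective library sizes differ substantially.
ratio_1_vs_2 <- unname(eff_lib_size["Sample1"] / eff_lib_size["Sample2"])

# The factors are scaled so that their product is approximately 1.
factor_product <- prod(tmm_factor)
```

With these numbers, Sample1 ends up with an effective library size roughly 37% larger than Sample2, even though the two raw library sizes differ by only about 100 reads.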
Then I tried the RLE method:
> n <- calcNormFactors(d, method = "RLE")
The results are:
$samples
        group lib.size    norm.factors
Sample1     0  5062446 1.0886765699045
Sample2     0  5062340 1.0886508565338
Sample3     0  5062444 1.0886766741626
Sample4     0  5062466 1.0886750099086
Sample5     1  3000123 0.9185446848068
Sample6     1  2999992 0.9185578680804
Sample7     1  2999977 0.9185624609049
Sample8     1  3000156 0.9185437155777
I think the results are more reasonable this time. My question is: how do I decide which method to use? Why does TMM give a weird result?
Thank you.
Best regards,
sewen67
Following up on Gordon's answer: with its default settings, the TMM method can tolerate cases where up to 60% of genes are DE. However, that upper limit only holds when the DE genes are evenly split in direction, i.e., 30% of genes are upregulated and 30% are downregulated, because by default TMM trims 30% of the genes from each end of the ranked log-fold changes (logratioTrim = 0.3). In your case, all of the 60% DE genes are upregulated in the same group, so it's not surprising that TMM fails.
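To illustrate the trimming argument, here is a deliberately simplified sketch in base R. It is not edgeR's implementation: real TMM also trims on average abundance and uses precision weights, and `trimmed_mean_logfc` is a hypothetical helper written only for this example. The log2 fold changes are idealized (exactly 0 for non-DE genes and exactly +1 or -1 for DE genes).

```r
# Unweighted trimmed mean of log2 fold changes: drop a fraction `trim`
# of the genes from each end of the sorted values, then average the rest.
trimmed_mean_logfc <- function(logfc, trim = 0.3) {
  n <- length(logfc)
  k <- floor(n * trim)               # number of genes removed from each tail
  kept <- sort(logfc)[(k + 1):(n - k)]
  mean(kept)
}

# 10 genes, 6 DE, all upregulated (as in the question).
one_sided <- c(rep(0, 4), rep(1, 6))

# 10 genes, 6 DE, split evenly: 3 up and 3 down.
balanced <- c(rep(0, 4), rep(1, 3), rep(-1, 3))

est_one_sided <- trimmed_mean_logfc(one_sided)  # 0.75: DE genes survive the trim
est_balanced  <- trimmed_mean_logfc(balanced)   # 0: only non-DE genes remain
```

In the balanced case the three highest and three lowest genes removed by the trim are exactly the DE genes, so the estimate comes from non-DE genes only. In the one-sided case the upper trim removes just three of the six upregulated genes, and the remaining three contaminate the estimate.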