Question: Normalization factor in TMM method
3.0 years ago, Sara wrote:

Hi all,

Like many others here, I apologize for asking what is probably a dumb question, and I appreciate any clarification in advance. As far as I know, calcNormFactors() produces two columns of information: lib.size and norm.factors. Multiplying these two columns together gives the effective library size. However, I don't understand how the normalization factor is calculated. Could you please explain it to me in simple terms, as I'm basically a biologist?

From what I have read, I understand that TMM_count = raw_counts / (libsize * norm.factor). Could you please tell me what the difference is between TMM_count and FPKM values in terms of normalization by library size?

Thank you


normalization edger tmm • 1.3k views
Answer: Normalization factor in TMM method
3.0 years ago, James W. MacDonald (United States) wrote:

The basic idea is that we are trying to account for differences due to library size. Consider two samples, one with 10M reads, and one with 20M reads. All else equal, if you had a gene that was expressed at the same level in both samples, you still expect twice as many reads in the second sample as compared to the first (because there are twice as many total reads).

Dividing the samples by the library size accounts for these differences, but you can get 'compositional biases', where a set of mRNA transcripts in one sample is highly expressed and hogs a large share of the reads on a given lane. Because those transcripts take up so much of the sequencing output, the remaining mRNA transcripts may have lower counts just because they got out-competed for space. The TMM normalization accounts for that by taking a trimmed mean of the gene-wise log-ratios between samples, so genes with extreme log-ratios or extreme abundances are ignored, and when you adjust for library sizes you can arguably get a better adjustment.
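
To make the composition effect concrete, here is a toy sketch in plain Python. It is not edgeR's implementation: the helper name `simple_tmm_factor`, the made-up counts, and the unweighted mean are all assumptions for illustration. The real TMM, as I understand it, also trims on abundance, weights the genes, and rescales the factors across samples.

```python
import math

def simple_tmm_factor(sample, reference, trim=0.3):
    """Unweighted trimmed mean of per-gene log2 ratios (a simplification of TMM)."""
    n_s, n_r = sum(sample), sum(reference)
    # log2 ratio of each gene's share of its library; genes with a zero count are skipped
    m = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    k = int(len(m) * trim)  # number of genes dropped from each tail
    kept = m[k:len(m) - k] if k > 0 else m
    return 2 ** (sum(kept) / len(kept))

# Hypothetical data: ten genes, both libraries sequenced to 1000 reads.
# In `sample`, the last gene is hugely expressed and soaks up 550 reads,
# so the other nine genes get 50 reads each instead of 100.
reference = [100] * 10
sample = [50] * 9 + [550]

factor = simple_tmm_factor(sample, reference)
effective_lib_size = sum(sample) * factor  # lib.size * norm.factors

naive = sample[0] / sum(sample)            # gene 1, divided by raw library size
adjusted = sample[0] / effective_lib_size  # gene 1, divided by effective library size
ref_prop = reference[0] / sum(reference)   # gene 1 in the reference library

print(factor, naive, adjusted, ref_prop)
```

In this toy case the factor comes out at 0.5, so the effective library size is 500. Dividing by the raw library size makes gene 1 look half as expressed as in the reference (0.05 vs 0.1), while dividing by the effective library size puts it back at 0.1.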

FPKM goes one step further, accounting for the length of the transcripts you are measuring. A longer transcript will usually have more reads, because it's longer. And if you were trying to make comparisons between genes or transcripts within a sample, that might be something you care about. But in general you are looking for differences between the SAME genes in different samples, so the transcript length doesn't matter.
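
A small sketch of that difference, again in plain Python with made-up numbers. The function names `cpm` and `fpkm` here are just local helpers for the arithmetic, not calls into any package.

```python
def cpm(count, lib_size):
    """Counts per million reads in the library."""
    return count * 1e6 / lib_size

def fpkm(count, lib_size, length_bp):
    """Counts per kilobase of transcript per million reads in the library."""
    return count * 1e9 / (lib_size * length_bp)

lib_size = 10_000_000  # hypothetical library of 10M reads

# Two genes in the SAME sample: gene B is four times longer and has four times the reads.
gene_a = {"count": 200, "length_bp": 1000}
gene_b = {"count": 800, "length_bp": 4000}

cpm_a, cpm_b = cpm(gene_a["count"], lib_size), cpm(gene_b["count"], lib_size)
fpkm_a = fpkm(gene_a["count"], lib_size, gene_a["length_bp"])
fpkm_b = fpkm(gene_b["count"], lib_size, gene_b["length_bp"])
print(cpm_a, cpm_b)    # differ, because length is not accounted for
print(fpkm_a, fpkm_b)  # equal, because length is divided out

# The SAME gene in two samples: the length cancels, so both ratios agree.
other_lib_size, other_count = 20_000_000, 800
ratio_cpm = cpm(other_count, other_lib_size) / cpm_a
ratio_fpkm = fpkm(other_count, other_lib_size, gene_a["length_bp"]) / fpkm_a
print(ratio_cpm, ratio_fpkm)
```

Within a sample the two genes differ fourfold by CPM but are identical by FPKM. For the same gene across samples, the fold change is the same either way, which is why the length correction doesn't buy you anything for that comparison.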

Does that make sense?


Thank you very much, James. That's very helpful. Sorry for one more question: by "samples" in "Dividing the samples by the library size accounts" in paragraph 2, do you mean the mapped read counts for each gene in a given library?


written 3.0 years ago by Sara

Exactly. You divide the counts for each gene by the library size (in millions, because you don't want to be dealing with normalized counts of like 0.000002 or whatever).
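
As a rough sketch of that arithmetic (plain Python, hypothetical counts; `counts_per_million` is a local helper, and the optional `norm_factor` argument assumes the effective library size is lib.size * norm.factors, as described in the question):

```python
def counts_per_million(counts, lib_size, norm_factor=1.0):
    """Divide each gene's count by the (effective) library size, expressed in millions."""
    effective = lib_size * norm_factor  # effective library size
    return [c * 1e6 / effective for c in counts]

# One gene expressed at the same level in a 10M-read and a 20M-read library:
# the deeper library simply yields twice as many reads for it.
cpm_small = counts_per_million([50], 10_000_000)
cpm_large = counts_per_million([100], 20_000_000)
print(cpm_small, cpm_large)

# Without the "per million" scaling the same value is an awkwardly small proportion.
raw_proportion = 50 / 10_000_000
print(raw_proportion)
```

Both libraries give 5 counts per million for this gene, whereas the unscaled proportion is 0.000005.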

written 3.0 years ago by James W. MacDonald

Thanks a lot for your great help.

written 3.0 years ago by Sara