Question: Normalization factor in TMM method
Sara wrote, 13 months ago:

Hi all,

As many of you do, I apologize for asking what is probably a dumb question, and I appreciate any clarification in advance. As far as I know, calcNormFactors() produces two columns of information: lib.size and norm.factors. Multiplying these two columns together gives the effective library size. However, I don't understand how the normalization factor is calculated. Could you please explain it in a simple way, as I'm basically a biologist?

From what I have read, I understand that TMM_count = raw_counts / ( libsize * norm.factor ). Could you also tell me how TMM_count differs from FPKM values in terms of normalization by library size?

Thank you

 

James W. MacDonald (United States) wrote, 13 months ago:

The basic idea is that we are trying to account for differences due to library size. Consider two samples, one with 10M reads and one with 20M reads. All else being equal, if you had a gene that was expressed at the same level in both samples, you would still expect twice as many reads in the second sample as in the first (because there are twice as many total reads).

Dividing the samples by the library size accounts for these differences, but you can get 'compositional biases' where a set of mRNA transcripts in one sample is highly expressed and hogs a bunch of the space on a given lane. Since they took up so much space, the remaining mRNA transcripts may have lower counts just because they got out-competed for space. The TMM normalization accounts for that by trimming away genes with extreme log-ratios between samples and extreme abundance (such as some of the really highly expressed genes) before computing the scaling factor, so when you adjust for library sizes you can arguably get a better adjustment.
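To make the composition effect concrete, here is a minimal sketch in plain Python. It is not edgeR's code: the real TMM method uses a weighted trimmed mean, trims on both log-ratio and abundance, and calcNormFactors() rescales the factors so they multiply to 1. The function name, the toy counts, and the trim of one gene per tail are all assumptions made up for illustration.

```python
import math

def trimmed_mean_log_ratio(sample, reference, trim=1):
    """Unweighted stand-in for TMM (hypothetical helper, not an edgeR function).

    Returns the mean log2 ratio of library-size-scaled counts after
    dropping the `trim` lowest and `trim` highest ratios.
    """
    n_s, n_r = sum(sample), sum(reference)
    # Per-gene log2 ratio of proportions; genes with a zero count are skipped.
    m = sorted(
        math.log2((s * n_r) / (r * n_s))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    kept = m[trim:len(m) - trim]
    return sum(kept) / len(kept)

# Toy counts (made up): genes 1-4 are unchanged biologically,
# but gene 5 is so abundant in sample B that it crowds out the rest.
sample_a = [100, 100, 100, 100, 100]
sample_b = [50, 50, 50, 50, 800]

norm_factor_b = 2 ** trimmed_mean_log_ratio(sample_b, sample_a)
eff_lib_b = sum(sample_b) * norm_factor_b  # lib.size * norm.factors

scaled_a = [c / sum(sample_a) for c in sample_a]
naive_b = [c / sum(sample_b) for c in sample_b]   # plain library-size scaling
tmm_like_b = [c / eff_lib_b for c in sample_b]    # effective library size

print("gene 1, sample A:", scaled_a[0])
print("gene 1, sample B (naive):", naive_b[0])
print("gene 1, sample B (trimmed-mean factor):", tmm_like_b[0])
```

With plain library-size scaling, gene 1 looks four times lower in sample B even though only gene 5 changed. Dividing by the effective library size instead puts gene 1 at the same value in both samples.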

FPKM goes one step further, accounting for the length of the transcripts you are measuring. A longer transcript will usually have more reads, because it's longer. And if you were trying to make comparisons between genes or transcripts within a sample, that might be something you care about. But in general you are looking for differences between the SAME genes in different samples, so the transcript length doesn't matter.
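A small sketch of the length point, again in plain Python with made-up numbers. FPKM here means fragments per kilobase of transcript per million mapped fragments; the function name and all counts and lengths are hypothetical, and no normalization factor is applied.

```python
def fpkm(count, lib_size, length_bp):
    """Fragments per kilobase per million (illustrative helper)."""
    per_million = count / (lib_size / 1e6)
    return per_million / (length_bp / 1e3)

lib_1, lib_2 = 10_000_000, 20_000_000

# Within one sample: two genes at the same expression level, but the
# second gene is twice as long and therefore collects twice the reads.
short_gene = fpkm(count=200, lib_size=lib_1, length_bp=1000)
long_gene = fpkm(count=400, lib_size=lib_1, length_bp=2000)
print(short_gene, long_gene)  # equal once length is accounted for

# Across samples: the SAME gene has the same length in both samples,
# so the length term cancels out of the between-sample ratio.
cpm_ratio = (1600 / (lib_2 / 1e6)) / (400 / (lib_1 / 1e6))
fpkm_ratio = fpkm(1600, lib_2, 2000) / fpkm(400, lib_1, 2000)
print(cpm_ratio, fpkm_ratio)  # the two ratios agree
```

The second comparison is why length correction adds nothing when you compare the same gene between samples: the fold change is identical with or without it.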

Does that make sense?


Thank you very much, James. That's very helpful. Sorry to ask again, but in paragraph 2, where you write "Dividing the samples by the library size accounts", does "samples" mean the mapped read counts for each gene in a given library?

thanks

Reply written 13 months ago by Sara

Exactly. You divide the counts for each gene by the library size (in millions, because you don't want to be dealing with normalized counts of like 0.000002 or whatever).
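As a rough illustration of the per-million scaling, here is a plain Python sketch using the 10M and 20M read libraries from the example above. The gene counts are made up, and no TMM factor is applied.

```python
def counts_per_million(count, lib_size):
    """Scale a raw gene count by the library size expressed in millions."""
    return count / (lib_size / 1e6)

# One gene expressed at the same level in both samples (toy counts).
count_1, lib_1 = 20, 10_000_000
count_2, lib_2 = 40, 20_000_000

raw_fraction = count_1 / lib_1  # awkwardly small: 2e-06
cpm_1 = counts_per_million(count_1, lib_1)
cpm_2 = counts_per_million(count_2, lib_2)

print(raw_fraction, cpm_1, cpm_2)
```

Both samples come out at 2 counts per million, even though the second has twice the raw reads for this gene.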
 

Reply written 13 months ago by James W. MacDonald

Thanks a lot for your great help.

Reply written 13 months ago by Sara