norm.factors in TMM normalisation

Assa Yeroslaviz ★ 1.5k

Hi all,

I'm having trouble understanding the numbers I get when running the TMM normalisation on my data set from S. cerevisiae.

I have several samples, but one of them is much smaller than the others (X_97_1h_1; see the table below).

From reading some of the posts in this forum, I understand that edgeR does not use the norm.factors on their own, but rather the product norm.factors * lib.size. This product is the effective library size, which is then used in downstream analysis.

What I don't understand is how the norm.factor is calculated. Even though the library size is taken into account when calculating differential gene expression, how come the norm.factors are still similar, when the libraries are not?

Another question in my case is which parameter is then used later on: the effective library size or the norm.factors given in the table (y$samples$norm.factors)?

This is important in my case, as I am not only using edgeR to normalize the data, but also using the result outside of edgeR for visualization. I am calculating the overlap of reads over the genome (using summarizeOverlaps) and counting reads in bins 500 bases long. To create the wig files for the browser, I take the raw read counts of each library and multiply them by the given norm.factor (e.g. for X_97_1h_1, I multiply each row of the count table by 0.9584462).

This is how I ran the normalisation (I only have duplicates):

> y <- DGEList(counts=countTable, group=rep(1:18, each=2))
> y <- calcNormFactors(y, method="TMM")

> y$samples
                      group lib.size norm.factors
51_248_0h_1     51_248_0h_1  4801445    1.0390857
51_248_0h_2     51_248_0h_2  1644252    1.0393724
51_248_1h_1     51_248_1h_1  3297504    1.0464985
51_248_1h_2     51_248_1h_2  2222688    1.0171469
51_248_4h_1     51_248_4h_1  4679946    1.0074098
51_248_4h_2     51_248_4h_2  3024524    0.9885031
ctrl_110_0h_1 ctrl_110_0h_1  3769422    1.0582047
ctrl_110_0h_2 ctrl_110_0h_2  3650192    1.0630055
ctrl_110_1h_1 ctrl_110_1h_1  4275661    1.0542222
ctrl_110_1h_2 ctrl_110_1h_2  4348709    1.0602291
ctrl_110_4h_1 ctrl_110_4h_1  3507238    1.0648724
ctrl_110_4h_2 ctrl_110_4h_2  4324472    1.0700604
ctrl_248_0h_1 ctrl_248_0h_1  4819007    0.9628215
ctrl_248_0h_2 ctrl_248_0h_2  4573513    1.0564647
ctrl_248_1h_1 ctrl_248_1h_1  4834486    0.9610896
ctrl_248_1h_2 ctrl_248_1h_2  4297190    1.0468209
ctrl_248_4h_1 ctrl_248_4h_1  7834379    1.0270228
ctrl_248_4h_2 ctrl_248_4h_2  5017690    1.0747524
ctrl_97_0h_1   ctrl_97_0h_1  4025521    1.0027374
ctrl_97_0h_2   ctrl_97_0h_2  3803086    1.0271279
ctrl_97_1h_1   ctrl_97_1h_1  4124150    1.0060742
ctrl_97_1h_2   ctrl_97_1h_2  4114575    1.0235497
ctrl_97_4h_1   ctrl_97_4h_1  3468361    1.0699728
ctrl_97_4h_2   ctrl_97_4h_2  4065669    1.0654684
X_110_0h_1       X_110_0h_1  2763927    0.9538789
X_110_0h_2       X_110_0h_2  2882729    0.9265470
X_110_1h_1       X_110_1h_1  3059491    0.9551635
X_110_1h_2       X_110_1h_2  3208547    0.9368711
X_110_4h_1       X_110_4h_1  2862389    0.9656174
X_110_4h_2       X_110_4h_2  2984518    0.8909318
X_97_0h_1         X_97_0h_1  2811017    0.9374170
X_97_0h_2         X_97_0h_2  2134669    0.8990688
X_97_1h_1         X_97_1h_1   340190    0.9584462
X_97_1h_2         X_97_1h_2  2108722    0.9048725
X_97_4h_1         X_97_4h_1  3646497    0.9503031
X_97_4h_2         X_97_4h_2  2934569    0.9445557

updated 8.5 years ago by Aaron Lun ★ 28k • written 8.5 years ago by Assa Yeroslaviz ★ 1.5k

score 4 · Answer 1 · 2015-10-27

Assume that you have two libraries. The ratio of the normalization factors between these two libraries represents the systematic fold-difference in the gene counts, beyond that caused by differences in sequencing depth/library size. For example, consider a case where you take a sample of RNA and sequence it to a particular depth. Now imagine taking the same sample, and resequencing it to a greater depth. When you compare the two resulting libraries, the only (systematic) difference in the counts between these two samples is that caused by the library size. As such, the normalization factors for these libraries will be equal such that the ratio is unity (i.e., no additional difference in the counts, beyond that due to library size differences).

Now, consider instead a case where you have two libraries of the same depth but generated from separate biological samples. In one sample, there is heightened expression of a particular gene. Because more reads are spent in sequencing this gene in the corresponding library, fewer reads are available to go around for the rest of the transcriptome, such that coverage is suppressed for all other genes. When you compare the counts between the two libraries, you will observe a systematic difference in the counts between libraries due to this suppressive effect, despite the library sizes being the same. In this case, the ratio of the normalization factors for these libraries will not be equal to unity.
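To make the two scenarios concrete, here is a small base R sketch. It is only an illustration under made-up assumptions: the gene abundances and the depth are invented, the counts are expected values rather than sampled reads, and the median of the per-gene ratios is used as a crude stand-in for the trimmed mean, so this is not what calcNormFactors actually computes.

```r
# Toy illustration only; all numbers are invented and this is NOT edgeR's TMM.
depth <- 1e6
abundance_a <- c(gene1 = 100, gene2 = 100, gene3 = 100, gene4 = 100)
abundance_b <- abundance_a
abundance_b["gene1"] <- 700   # heightened expression of one gene in sample B

# Expected counts: each gene receives its share of a fixed number of reads
counts_a <- depth * abundance_a / sum(abundance_a)
counts_b <- depth * abundance_b / sum(abundance_b)

# Scenario 1: the same sample resequenced at 3x the depth.
# After dividing by library size, every gene has a ratio of 1.
counts_a_deep <- 3 * counts_a
ratio_depth <- (counts_a_deep / sum(counts_a_deep)) / (counts_a / sum(counts_a))

# Scenario 2: same depth, different composition.
# gene2..gene4 are unchanged biologically, yet their ratios fall below 1
# because gene1 takes up a larger share of the reads in sample B.
ratio_comp <- (counts_b / sum(counts_b)) / (counts_a / sum(counts_a))

# Crude robust summary of the systematic shift (stand-in for a trimmed mean)
shift <- median(ratio_comp)

print(ratio_depth)
print(ratio_comp)
print(shift)
```

In the first scenario the library sizes differ threefold but there is nothing left to correct once library size is accounted for. In the second, the library sizes are identical and the correction is far from 1, which is the sense in which the two quantities are independent.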

So, the long and short of it is that the similarity in library sizes (or lack thereof) does not guarantee any (dis)similarity in the normalization factors. In fact, one might say that they are intended to be independent, as the latter should catch differences that are missed by the former. However, the important value is the effective library size, i.e. the product of lib.size and norm.factors; this is the parameter that is used for all downstream analyses in edgeR, and it should be what you use to normalize your genomic coverage, by dividing the counts of each library by its effective library size rather than multiplying them by the norm.factor alone.
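As a rough sketch of what that could look like for the binned counts, the snippet below uses two rows of y$samples from the question and a hypothetical matrix bin_counts (invented values, rows are 500 bp bins, columns are samples). Scaling to counts per million is an assumption on my part about the units you want in the wig files.

```r
# Library sizes and normalization factors copied from y$samples in the question
lib_size     <- c(X_97_1h_1 = 340190, X_97_1h_2 = 2108722)
norm_factors <- c(X_97_1h_1 = 0.9584462, X_97_1h_2 = 0.9048725)

# Effective library size = lib.size * norm.factors
eff_lib_size <- lib_size * norm_factors

# Hypothetical bin counts (invented for illustration)
bin_counts <- cbind(X_97_1h_1 = c(3, 10, 0),
                    X_97_1h_2 = c(20, 61, 2))

# Divide each column by its own effective library size (in millions),
# giving counts per million that are comparable across libraries
scaled <- sweep(bin_counts, 2, eff_lib_size[colnames(bin_counts)] / 1e6, "/")

print(eff_lib_size)
print(scaled)
```

With a full DGEList, the same effective library sizes come from y$samples$lib.size * y$samples$norm.factors. Note that the small X_97_1h_1 library is scaled up strongly by this division, whereas multiplying its counts by 0.9584462 would leave it roughly six times lower than its replicate.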