Question: Implications of TMM Normalization and Possible Alternatives
Asked 25 days ago by rproendo0:

Hi all,

I'm relying on edgeR for differential expression analysis and have thus far been using TMM normalization to account for library composition effects, but I'm finding a very large number of DEGs. In this case, I'm unsure if my data set violates any assumptions for TMM, how robust TMM might be to these violations, and what consequences are to be expected. I'm using edgeR and TMM for two experiments, and both have a large number of DEGs. I've included sample code from one of these, where I'm using the exactTest to detect DEGs between two populations, each with 7 replicates. After filtering, ~17,800 genes are included in the model, and 34-38% are significant in any single direction at FDR<0.05. Thanks for reading.

library(edgeR)

data1 <- read.csv("/path/to/counts/genehits_count_matrix.csv", row.names="Gene")

group1 <- (c(1,1,1,1,1,1,1,2,2,2,2,2,2,2))
y1 <- DGEList(counts=data1,group=group1)

#[1] 24848    14

keep <- rowSums(cpm(y1) > 1) >= 7
y1 <- y1[keep, , keep.lib.sizes=FALSE]
#[1] 17872    14

y1 <- calcNormFactors(y1)

#group lib.size norm.factors
#X9_Sorted.bam       1 23724378    1.0107928
#X15_Sorted.bam      1 22202313    0.8970733
#X17_Sorted.bam      1 22368615    1.0044862
#X43_Sorted.bam      1 22633426    0.8669456
#X45_Sorted.bam      1 23806764    0.9907505
#X79_Sorted.bam      1 23051023    0.9943360
#X149_Sorted.bam     1 21555943    0.9268498
#X110_Sorted.bam     2 26638422    1.0525179
#X139_Sorted.bam     2 27932858    1.0555890
#X23_Sorted.bam      2 22286424    1.0212521
#X63_Sorted.bam      2 21534110    1.0750907
#X69_Sorted.bam      2 24672222    1.0584811
#X87_Sorted.bam      2 21216005    1.0446583
#X94_Sorted.bam      2 23970976    1.0282707

y1 <- estimateDisp(y1)
#[1] 0.06447124

plotMDS(y1, col=as.numeric(y1$samples$group))
legend("bottomleft", as.character(unique(y1$samples$group)), col=1:2, pch=20)

plotMDS(y1, method="bcv", col=as.numeric(y1$samples$group))
legend("bottomleft", as.character(unique(y1$samples$group)), col=1:2, pch=20)


et <- exactTest(y1)
summary(de <- decideTestsDGE(et, p.value=0.05))
#Down   6821
#NotSig 4953
#Up     6098
Tags: normalization, deg, edger, tmm
Answer: Implications of TMM Normalization and Possible Alternatives

Answered 25 days ago by Aaron Lun (Cambridge, United Kingdom):

I don't see a problem here. Your data set has lots of DE genes, and that's unrelated to any failure of TMM normalization. In fact, I would say that the normalization is the reason why the direction of DE is mostly balanced. If you used a different scaling normalization strategy (e.g., based on spike-ins), you could expect more or less the same number of DE genes, but skewed in one direction.

Recall that the main assumption of TMM normalization is that most genes are not DE. This can be interpreted rather generously - for example, violations can be ignored if the log-fold changes are small, or if the direction of DE is mostly balanced. In your case, while the DE genes are a strict majority, I would wager that many of them have small effect sizes and so do not drastically alter the normalization factor estimates. Of course, trying to prove an assumption based on results computed from that assumption tends to be a fool's errand, so I won't.
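One quick way to sanity-check that interpretation is to look at the effect sizes of the significant genes directly. The sketch below assumes the `et` and `de` objects from the question; it is an illustration, not part of the original answer:

```r
# Sketch: inspect effect sizes among the significant genes (assumes the
# 'et' exactTest result and 'de' decideTestsDGE output from the question).
sig.lfc <- et$table$logFC[de != 0]

# If TMM's lenient assumption holds, most significant log-fold changes
# should be modest and roughly balanced around zero.
summary(sig.lfc)
mean(abs(sig.lfc) < 1)    # proportion of DE genes with |logFC| < 1
mean(sig.lfc > 0)         # proportion of DE genes that are up-regulated
```

A large proportion of small, direction-balanced log-fold changes would be consistent with the "generous" reading of the TMM assumption described above.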

At any rate, your concerns are largely academic because what are you going to use if not TMM normalization? Keep in mind that an entire class of methods (e.g., DESeq normalization, all the other options in calcNormFactors) operate on similar assumptions. Library size normalization is probably unacceptable as such extensive DE implies the presence of strong composition biases. And I doubt you have spike-ins, so that's out of the question as well. So, TMM normalization is the only way to proceed here.

(You could compute normalization factors after subsetting down to a set of known "constant" genes. However, I've never thought that to be a good idea. Either the bulk of genes will usually agree with the subset - in which case I might as well have just run TMM on the entire set of genes - or the disagreement will make me doubt whether I selected the subset correctly. Remember that when the assumption of a non-DE majority fails, there's probably some crazy shit happening to your cells, so why should housekeeping genes behave?)
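For completeness, the subsetting idea described (and advised against) above would look something like this. `constant.genes` is a hypothetical character vector of presumed non-DE gene IDs, and the whole snippet is a sketch rather than a recommendation:

```r
# Sketch of the subset-based normalization Aaron argues against.
# 'constant.genes' is a hypothetical vector of "constant" gene IDs.
y.sub <- y1[rownames(y1) %in% constant.genes, , keep.lib.sizes = TRUE]
y.sub <- calcNormFactors(y.sub)

# Transfer the subset-derived factors back to the full object.
y1$samples$norm.factors <- y.sub$samples$norm.factors
```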

Finally, use the quasi-likelihood (QL) methods in edgeR; exactTest is out of date.
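The QL replacement for the exactTest call above would look roughly like this, assuming the `y1` object and `group1` vector from the question:

```r
# Quasi-likelihood pipeline replacing exactTest (assumes 'y1' and
# 'group1' from the question).
design <- model.matrix(~ factor(group1))

y1  <- estimateDisp(y1, design)
fit <- glmQLFit(y1, design)
res <- glmQLFTest(fit, coef = 2)   # tests group 2 vs group 1

summary(decideTests(res))
topTags(res)
```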


Alright, thank you for the feedback Aaron! I'll switch over to QL methods, as well. Happy holidays.

Answer: Implications of TMM Normalization and Possible Alternatives

Answered 23 days ago by Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia):

Aaron has already answered your main question but I would add that your filtering step is a bit more conservative than it needs to be. Given that all your library sizes are over 20 million, I would use cpm(y)>0.05 or cpm(y)>0.04 rather than cpm(y)>1. Alternatively you could simply use:

keep <- filterByExpr(y1, group=group1)

to choose the filtering cutoff for you, which would amount to what I've just suggested.

See the edgeR documentation for a detailed discussion of filtering.

Your analysis is fine as it is, but less stringent filtering would allow you to explore some more modestly expressed genes in case some of them are of interest to you.
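To gauge how much the looser cutoffs matter, one could count the genes retained under each rule. This sketch assumes an unfiltered DGEList, here called `y.raw` (i.e., the object from the question before its filtering step):

```r
# Sketch: count genes kept by each filtering rule, applied to the
# unfiltered DGEList (hypothetical name 'y.raw').
sum(rowSums(cpm(y.raw) > 1) >= 7)        # the original, stricter rule
sum(rowSums(cpm(y.raw) > 0.05) >= 7)     # looser cutoff suggested above
sum(filterByExpr(y.raw, group = group1)) # automatic choice
```

The gap between the first two counts indicates how many modestly expressed genes the stricter filter had excluded.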


Okay, thank you for the advice, Gordon. I'll try your suggestions!
