Implications of TMM Normalization and Possible Alternatives
rproendo

Hi all,

I'm relying on edgeR for differential expression analysis and have thus far been using TMM normalization to account for library composition effects, but I'm finding a very large number of DEGs. I'm unsure whether my data set violates any assumptions of TMM, how robust TMM might be to such violations, and what consequences to expect. I'm using edgeR and TMM for two experiments, and both yield a large number of DEGs. I've included sample code from one of them, where I'm using exactTest to detect DEGs between two populations, each with 7 replicates. After filtering, ~17,800 genes are included in the model, and 34-38% are significant in either direction at FDR<0.05. Thanks for reading.

data1 <-read.csv("/path/to/counts/genehits_count_matrix.csv", row.names="Gene")

group1 <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2)
y1 <- DGEList(counts=data1,group=group1)

dim(y1$counts)
#[1] 24848    14

keep <- rowSums(cpm(y1)>1) >= 7
y1 <- y1[keep, , keep.lib.sizes=FALSE]
dim(y1$counts)
#[1] 17872    14

y1 <- calcNormFactors(y1)
y1$samples
#                group lib.size norm.factors
#X9_Sorted.bam       1 23724378    1.0107928
#X15_Sorted.bam      1 22202313    0.8970733
#X17_Sorted.bam      1 22368615    1.0044862
#X43_Sorted.bam      1 22633426    0.8669456
#X45_Sorted.bam      1 23806764    0.9907505
#X79_Sorted.bam      1 23051023    0.9943360
#X149_Sorted.bam     1 21555943    0.9268498
#X110_Sorted.bam     2 26638422    1.0525179
#X139_Sorted.bam     2 27932858    1.0555890
#X23_Sorted.bam      2 22286424    1.0212521
#X63_Sorted.bam      2 21534110    1.0750907
#X69_Sorted.bam      2 24672222    1.0584811
#X87_Sorted.bam      2 21216005    1.0446583
#X94_Sorted.bam      2 23970976    1.0282707

y1 <- estimateDisp(y1)
y1$common.dispersion
#[1] 0.06447124

plotMDS(y1, col=as.numeric(y1$samples$group))
legend("bottomleft", as.character(unique(y1$samples$group)), col=1:4, pch=20)

plotMDS(y1, method="bcv", col=as.numeric(y1$samples$group))
legend("bottomleft", as.character(unique(y1$samples$group)), col=1:4, pch=20)

plotBCV(y1)

et <- exactTest(y1)
summary(de <- decideTestsDGE(et, p=0.05))
#        2-1
#Down   6821
#NotSig 4953
#Up     6098
edger TMM DEG normalization
Aaron Lun

I don't see a problem here. Your data set has lots of DE genes, and that's unrelated to any failure of TMM normalization. In fact, I would say that the normalization is the reason why the direction of DE is mostly balanced. If you used a different scaling normalization strategy (e.g., based on spike-ins), you could expect more or less the same number of DE genes, but skewed in one direction.

Recall that the main assumption of TMM normalization is that most genes are not DE. This can be rather generously interpreted - for example, violations can be ignored if the log-fold changes are small, or if the direction of DE is mostly balanced. In your case, while the DE genes are a strict majority, I would wager that many of them have small effect sizes and so do not drastically alter the normalization factor estimates. Of course, trying to prove an assumption based on results computed from that assumption tends to be a fool's errand, so I won't.
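To make that intuition concrete, here is a toy base-R sketch of the core TMM idea - it is not edgeR's actual implementation (which also trims on absolute expression and uses precision weights), just an illustration with simulated counts of how a trimmed mean of per-gene log-ratios (M-values) resists a minority of strongly DE genes:

```r
## Simulate one sample vs a reference: most genes not DE, 10% strongly up.
set.seed(1)
ref <- rpois(1000, lambda = 100)        # reference sample
obs <- rpois(1000, lambda = 100)        # same expression for most genes
obs[1:100] <- rpois(100, lambda = 800)  # a minority of strongly DE genes

M <- log2(obs / ref)                    # per-gene log-ratios (M-values)

naive   <- mean(M)                      # pulled upwards by the DE minority
trimmed <- mean(M, trim = 0.3)          # trimming discards the DE tail

## The trimmed mean sits near 0 (the true scaling offset here),
## while the naive mean is visibly inflated by the DE genes.
c(naive = naive, trimmed = trimmed)
```

This is why a minority of DE genes - or a DE majority with small or balanced effects - leaves TMM factors largely intact.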

At any rate, your concerns are largely academic because what are you going to use if not TMM normalization? Keep in mind that an entire class of methods (e.g., DESeq normalization, all the other options in calcNormFactors) operate on similar assumptions. Library size normalization is probably unacceptable as such extensive DE implies the presence of strong composition biases. And I doubt you have spike-ins, so that's out of the question as well. So, TMM normalization is the only way to proceed here.

(You could compute normalization factors after subsetting down to a set of known "constant" genes. However, I've never thought that to be a good idea. Either the bulk of genes will usually agree with the subset - in which case I might as well have just run TMM on the entire set of genes - or the disagreement will make me doubt whether I selected the subset correctly. Remember that when the assumption of a non-DE majority fails, there's probably some crazy shit happening to your cells, so why should housekeeping genes behave?)
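For what it's worth, the subset idea in the parenthetical above could be sketched as follows. This is a deliberately crude stand-in (a ratio of subset totals rather than a proper TMM factor), and the "constant" gene indices are invented for illustration:

```r
## Two samples, 1000 genes, none truly DE in this simulation.
set.seed(2)
counts <- matrix(rpois(2000, lambda = 50), nrow = 1000, ncol = 2)
constant <- 1:200   # genes we *believe* are non-DE (hypothetical subset)

## Scale sample 2 relative to sample 1 using only the trusted subset...
subset_factor <- sum(counts[constant, 2]) / sum(counts[constant, 1])

## ...and using all genes.
global_factor <- sum(counts[, 2]) / sum(counts[, 1])

## When the bulk of genes agrees with the subset, the two factors are
## close - in which case normalizing on all genes would have sufficed.
c(subset = subset_factor, global = global_factor)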

Finally, use the quasi-likelihood (QL) methods in edgeR (glmQLFit() and glmQLFTest()); exactTest() is out of date.


Alright, thank you for the feedback, Aaron! I'll switch over to the QL methods as well. Happy holidays.

Gordon Smyth (WEHI, Melbourne, Australia)

Aaron has already answered your main question, but I would add that your filtering step is a bit more conservative than it needs to be. Given that all your library sizes are over 20 million, I would use cpm(y1)>0.05 or cpm(y1)>0.04 rather than cpm(y1)>1. Alternatively, you could simply use:

keep <- filterByExpr(y1, group=group1)

to choose the filtering cutoff for you, which would amount to much the same as what I've just suggested.
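As a back-of-envelope check of why a lower CPM cutoff is reasonable here (assuming library sizes of roughly 22 million, as in this data set), you can convert a CPM threshold into the raw count it corresponds to:

```r
## Raw count corresponding to a CPM threshold at a given library size.
cpm_to_count <- function(cpm, lib) cpm * lib / 1e6

lib_size <- 22e6                 # roughly the library sizes in this data set
cpm_to_count(1,    lib_size)     # cpm > 1 demands more than ~22 reads
cpm_to_count(0.05, lib_size)     # cpm > 0.05 demands only ~1 read
```

So cpm>1 discards genes supported by up to ~20 reads per sample, which is stricter than necessary at these sequencing depths.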

See https://f1000research.com/articles/5-1408/v3 for a detailed discussion of filtering.

Your analysis is fine as it is, but less stringent filtering would allow you to explore some more modestly expressed genes in case some of them are of interest to you.
