Question

DESeq2 log-fold change shrunk to -0.4 instead of zero. Why?

0

Entering edit mode

lostisle • 0

@lostisle-12032

Last seen 7.4 years ago

I'm using DESeq2 to analyze differential gene expression between two groups. As I understand, the log fold changes for genes with low counts (low mean expression) should be shrunken toward zero, but in the MA-plot (see below), they're shrunken toward -0.4 or so. Why is this so and how can I fix it? I'm worried all my LFC estimates are off. Thanks.

Also posted on Biostars: https://www.biostars.org/p/227508/

deseq2 rnaseq • 1.4k views

ADD COMMENT • link 7.4 years ago lostisle • 0

score 1 · Answer 1 · 2016-12-15

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 9 minutes ago

United States

This came up recently on the support site, but I can't find the post using the search on this site :-/

The shrinkage of LFC is still toward 0 here, and there is not a reason for concern with the analysis results or normalization. The reason that the points are tending toward -0.4 is because the library sizes (sequencing depth) are somewhat confounded with condition. This is not ideal but the computational analysis should still work, as the differences do not seem so extreme. If you look at the size factors ( sizeFactor(dds) ), you will notice that the geometric mean of the 'treated' group is ~75% of the geometric mean of the 'control' group. So the genes in which there are only a few reads total will tend toward a log2 fold change of -0.4. If you prefer, you can filter out these genes ( either before or after DESeq() ) to make better looking figures:

res.filt <- subset(res, baseMean > 5)

ADD COMMENT • link 7.4 years ago Michael Love 41k

0

Entering edit mode

Thanks, Michael, for the explanation! Indeed, the geometric mean of library size for treatment vs control groups are different. In fact there is a 5-fold difference! In addition to the indicator variable for treatment, I've included two other variables, one for the batch of sequencing (1 or 2), and one for whether there were 20M or more reads from the sample. This latter variable must have mitigated some of the effect of library size on the LFCs of treatment vs control.

To unpack your comment a bit: library size factors are on average bigger for the "control" group, and for low expression genes, whose counts (K) are low (say < 5 counts) regardless of library size, bigger size factor (s) makes the real expression levels (q) smaller, since K = s*q. Higher-expressed genes will have more freedom in K, so library size shouldn't pose a problem. Does this last statement seem hand-wavy? Comments from everyone are welcome!

ADD REPLY • link 7.4 years ago lostisle • 0

1

Entering edit mode

The point I wrote about the low expressed genes is just to explain the visual artifact you see. When there is only 1 or 2 reads for a gene, this is more likely to have come from the control group, because it has higher sequencing depth. It's not a problem for the analysis though.

I don't follow the last part.

The important thing is that: the size factor compensates -- for all genes -- when the count K is larger for a sample because of library size differences, not because of the underlying amount of gene expression for that gene.

ADD REPLY • link 7.4 years ago Michael Love 41k