DESeq2 log-fold change shrunk to -0.4 instead of zero. Why?
1
0
Entering edit mode
lostisle • 0
@lostisle-12032
Last seen 7.4 years ago

I'm using DESeq2 to analyze differential gene expression between two groups. As I understand, the log fold changes for genes with low counts (low mean expression) should be shrunken toward zero, but in the MA-plot (see below), they're shrunken toward -0.4 or so. Why is this so and how can I fix it? I'm worried all my LFC estimates are off. Thanks.

Also posted on Biostars: https://www.biostars.org/p/227508/

deseq2 rnaseq • 1.4k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 9 minutes ago
United States

This came up recently on the support site, but I can't find the post using the search on this site :-/

The shrinkage of LFC is still toward 0 here, and there is not a reason for concern with the analysis results or normalization. The reason that the points are tending toward -0.4 is because the library sizes (sequencing depth) are somewhat confounded with condition. This is not ideal but the computational analysis should still work, as the differences do not seem so extreme. If you look at the size factors ( sizeFactor(dds) ), you will notice that the geometric mean of the 'treated' group is ~75% of the geometric mean of the 'control' group. So the genes in which there are only a few reads total will tend toward a log2 fold change of -0.4. If you prefer, you can filter out these genes ( either before or after DESeq() ) to make better looking figures:

res.filt <- subset(res, baseMean > 5)
ADD COMMENT
0
Entering edit mode

Thanks, Michael, for the explanation! Indeed, the geometric mean of library size for treatment vs control groups are different. In fact there is a 5-fold difference! In addition to the indicator variable for treatment, I've included two other variables, one for the batch of sequencing (1 or 2), and one for whether there were 20M or more reads from the sample. This latter variable must have mitigated some of the effect of library size on the LFCs of treatment vs control.

To unpack your comment a bit: library size factors are on average bigger for the "control" group, and for low expression genes, whose counts (K) are low (say < 5 counts) regardless of library size, bigger size factor (s) makes the real expression levels (q) smaller, since K = s*q. Higher-expressed genes will have more freedom in K, so library size shouldn't pose a problem. Does this last statement seem hand-wavy? Comments from everyone are welcome!

ADD REPLY
1
Entering edit mode

The point I wrote about the low expressed genes is just to explain the visual artifact you see. When there is only 1 or 2 reads for a gene, this is more likely to have come from the control group, because it has higher sequencing depth. It's not a problem for the analysis though.

I don't follow the last part. 

The important thing is that: the size factor compensates -- for all genes -- when the count K is larger for a sample because of library size differences, not because of the underlying amount of gene expression for that gene.

 

ADD REPLY

Login before adding your answer.

Traffic: 1129 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6