I am curious about why the calculated log2 fold change value differs from the log2FoldChange of DESeq2 and want to know the cause.
Result (three condition/ Total 16 samples):
Condition 1 normalized counts: 0.000000 4.496866 8.383799 9.168738 5.433209
Condition 2 normalized counts: 5.609478 7.348834 6.021589 6.293060 6.732453
Condition 3 normalized counts: 4.727638 10.062812 8.112052 10.146985 8.873856 8.106829
Condition 1 average: 5.496522
Condition 2 average: 6.401083
Calculated log2 fold change: log2(6.401083/5.496522) = 0.219797
log2 fold change (MLE): condition Condition 2 vs Condition 1 : -0.00487575611632497
Can you tell me how to calculate log2 fold change? If it is difficult to tell me about the detailed method, I would like to know what factors(ex. baseMean lfcSE...) affect calculations and the cause of difference between plus(Calculated log2 fold change) and minus(log2 fold change (MLE): condition Condition 2 vs Condition 1).
txi.rsem.ex <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)
dds <- DESeqDataSetFromTximport(txi.rsem.ex, sampleTable, ~condition)
dds <- DESeq(dds, fitType = "parametric")
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds <- DESeq(dds,fitType = "parametric")
head(counts(dds,normalized=TRUE),1)
ENSG00000000003.14_1 / 0.00000 4.496866 8.383799 9.168738 5.433209 5.609478 7.348834 6.021589 6.29306 6.732453 4.727638 10.06281 8.112052 10.14698 8.873856 8.106829
res = results(dds, contrast=c('condition','Condition2','Condition1'), independentFiltering=FALSE)
res
log2 fold change (MLE): condition Condition2 vs Condition1
Wald test p-value: condition Condition2 vs Condition1
DataFrame with 15669 rows and 6 columns
ENSG00000000003.14_1/6.84488728991658 -0.00487575611632497 0.470063450907851
ENSG00000000419.12_1/ 471.531051580136 -0.61398371888155 0.26185744737714
ENSG00000000457.13_1/ 532.837380784179 0.0970014203408028 0.102600280199951
ENSG00000000460.16_2/ 112.059734944934 0.00740154312148043 0.190685869057898
ENSG00000000938.12_1/ 14989.1210747367 0.248168496953757 0.204711629956007
... ... ... ...
ENSG00000283496.1_1/ 0.603572642839276 2.59110989861822 3.00222231413428
ENSG00000283528.1_1/ 26.5572341542125 0.7132833237189 0.518081989743246
ENSG00000283632.1_1/ 2.18792323399264 0.509781159465374 0.897894139446614
ENSG00000283683.1_1/ 0.675640168262519 0.230315303559532 1.3305502133294
ENSG00000283697.1_1/ 0.686822760894622 -2.67178327010738 1.97936990724239
I would appreciate it if you check it again. Thank you for your quick reply.
Can you post this:
[1] Condition1 Condition1 Condition1 Condition1 Condition1 Condition2 Condition2 Condition2 Condition2 [10] Condition2 Condition3 Condition3 Condition3 Condition3 Condition3 Condition3
I'm not sure why you see something different there. Is it different for all genes? Or just the first gene? Were there any messages printed when you ran
DESeq()
?When I plug in these counts (rounded) into a simulated data set I get nearly the same LFC as using normalized counts (it is not necessarily that the MLE is identical to the simple ratio for a GLM, but typically very close).
Not all genes are different and I did not see any messages when I ran DESeq2.
C=counts(dds,normalized=TRUE)
test1 = apply(C, 1, function(x){ log2( mean(x[6:10]) / mean(x[1:5]) ) })
test2 = data.frame(calculated=test1, log2FC=res$log2FoldChange)
dim(test2[which(abs(test2$calculated - test2$log2FC) > 0.2),])
[1] 1084 2
head(test2[which(abs(test2$calculated - test2$log2FC) > 0.2),],10)
ENSG00000000003.14_1/ 0.21979688 -0.004875756
ENSG00000003137.8_1/ 5.21396910 4.260814729
ENSG00000003436.14_1/ -2.52976087 -2.313205639
ENSG00000005108.15_2/ 2.29681243 2.038137364
ENSG00000006042.11_1/ -2.00756362 -1.597780853
ENSG00000006659.12_1/ 0.93533078 0.519878947
ENSG00000006704.10_2/ -0.03160563 -0.246252646
ENSG00000007062.11_2/ 0.81472789 0.413966060
ENSG00000007171.16_2/ 1.58032878 1.373813189
ENSG00000007306.14_1/ 0.58076389 0.039064557
I took a look, it's the offsets (called size factors / normalization factors in DESeq2) in the GLM, that affect the contribution of the samples per gene, which means that the GLM gives a different answer than the simple ratio of averages. This is ok, there's no issue in the analysis.
For example, for the first gene:
Dear Mr. Love,
is it possible to tell more precisely how the normalization factors affect the final ratio as it is not exactly the log2 ratio of the means of normalized counts?
Sticking to the simple scenario depicted above: One factor with two levels A and B, I tried to inccorporate the normalization factors and later the size factors to somehow get the log2foldchanges as in the
results()
output, but in vain.I am sorry if this was clarified elsewhere already or should rather be a new question on its own.
Thank you very much in advance
How about any notes printed during DESeq()?
Could you share the dds with me via email? I'm
maintainer("DESeq2")