I'm analyzing a large dataset of 250 samples and I'm confronted with some results I cannot explain. I've added the code below. The analysis for this data gave only one gene to have significant padj (for which the log2fc sign was clearly correct). Still, we are interested in looking at the lfc for all the genes, so we investigated the logratios file. Maybe it is important to mention that none of the log2foldchanges in the data is >|0.6|
We noticed that the log2foldchange value, for many genes, does not reflect the situation in pseudo-counts (or vst) values. The contrast is respecting the indicated order of the groups, so it is not flipped order. The gene I picked below is one of the most obvious that sth is not right.
(I have tried using lfcShrink function with apeglm as well and it is the same situation). Does anybody has an idea why?
deseqdata <- DESeqDataSetFromMatrix(countData = countsraw,colData = sampleinfo,design = ~Age_categ+RIN_categ+Sex+plate+grouping1)
deseqdata$grouping1 <- factor(deseqdata$grouping1, levels = c("groupA","groupB","control"))
filter<-rowSums(counts(deseqdata) >= 10) >= 5
dds<-deseqdata[filter,]
dds<-DESeq(dds,betaPrior=TRUE)
vst<-vst(dds, blind=F)
res <- results( dds, contrast = c("grouping1","groupA" ,"groupB") ,addMLE=TRUE,alpha = 0.05)
the pseudocounts for the 3 groups are below, the basemean, the shrunken and unshrunken log2FC and pvalue groupA groupB groupC baseMean log2FoldChange lfcMLE lfcSE pvalue padj 1213.9 832.13 671.47 999.3 -0.35 -0.46 0.10 3.09E-04 2.58E-01
(for the vst values it is the same situation) The plot with y=vst values shows the situation https://ibb.co/JK30y8P
Hi Mike, thanks a lot for such a swift response. It would have been glad to be that mistake!
https://ibb.co/NtvhTVp
Can you show the top line when you print:
In the R console. What does R say the contrast is?
the order is ok, as I said, the only gene with a good padj was ok.
Oh, for the first time I noticed that you have a lot of covariates.
If you have a confounded design (where covariates are correlated with your grouping), then it can easily happen that the sign of LFC changes compared to what you would guess by looking at marginal distributions.
E.g. replace your design with
~grouping
and you will get the expected LFC. However, the problem is the confounding. The best you can do is include the covariates, and you can no longer set your expectations according to what the boxplots look like.To be honest, that was my guess as well!! I'm using DESeq2 for quite some years already and I did not encouter this situation before. For this particular dataset, I have tried to add some covariates to deal with tissue heterogeneity which indeed are correlated with the grouping. Thanks a lot for your help!!