Dear Helpful Strangers,
I am running several different regressions on a rather complex dataset, using DESeq2_1.12.4. I have noticed that setting betaPrior=TRUE versus betaPrior=FALSE makes a huge difference.
As far as I understand, using betaPrior means that the log fold changes are shrunk, and these shrunken fold changes are used in the Wald test. Various posts here on Bioconductor suggest that using betaPrior or not should not really influence the p-values that much; see for example Michael Love's answer in the thread "New function lfcShrink() in DESeq2".
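For what it's worth, I understand that newer DESeq2 releases (1.16 and later; I am on 1.12.4) decouple the shrinkage from the testing: DESeq() defaults to betaPrior=FALSE and lfcShrink() is applied afterwards, so the shrunken fold changes no longer enter the Wald test. A minimal sketch, assuming a recent DESeq2 and a DESeqDataSet `dds` built as in my code further down:

```r
# Sketch assuming DESeq2 >= 1.16; 'dds' is built as in the code below.
dds <- DESeq(dds, betaPrior = FALSE)          # Wald test on unshrunken (MLE) LFCs
res <- results(dds, name = "visit_b_vs_a")    # p-values unaffected by any prior
shr <- lfcShrink(dds, coef = "visit_b_vs_a", res = res)  # shrunken LFCs, same p-values
```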
As I mentioned, I am seeing huge effects on the p-values and I wonder why. In one of my settings I have gene expression data from two time points per patient (often, some patients have only one sample), a very unbalanced design, and I want to model the effect of treatment (visit a to visit b) and recovery (0 or 1). There are also covariates for additional treatments that the patients are on. In this example I get significant results when using the beta prior, but nothing when I do not use it. I am filtering my count matrix to keep genes with at least 1 normalized count in at least 20% of the samples (for reasons that are unrelated to this exact regression).
This sample set includes 128 samples, and a sneak peek of the design:
     pat_id visit response T_after P_after
P09a P09    a     0        0       1
P09b P09    b     0        1       1
P10a P10    a     0        0       1
P10b P10    b     0        0       2
P14a P14    a     0        0       0
P14b P14    b     1        0       1
P15a P15    a     0        0       0
P15b P15    b     1        0       1
P17a P17    a     0        0       0
P17b P17    b     1        0       0
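As a sanity check for the unbalanced design (some patients contribute only one sample), the model matrix can be verified to be full rank before running DESeq(). A base-R sketch, with `A_design` as above:

```r
# Full-rank check for the design (base R); a rank-deficient matrix would mean
# some coefficients are confounded and DESeq() would refuse to fit the model.
mm <- model.matrix(~ pat_id + visit + response + T_after + P_after, data = A_design)
qr(mm)$rank == ncol(mm)   # TRUE if all coefficients are estimable
```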
Code when using betaPrior=FALSE:
dds <- DESeqDataSetFromMatrix(countData = Counts[, rownames(A_design)],
                              colData = A_design,
                              design = ~ pat_id + visit + response + T_after + P_after)
dds <- estimateSizeFactors(dds)
idx <- rowSums(counts(dds, normalized = TRUE) >= 1) >= 26
dds <- dds[idx, ]
ddsIb1F <- DESeq(dds, betaPrior = FALSE, parallel = TRUE, BPPARAM = MulticoreParam(4))
# general effect of treatment:
visit.res1F <- results(ddsIb1F, name = 'visit_b_vs_a', pAdjustMethod = 'BH')
# gene expression related to recovery:
recov.res1F <- results(ddsIb1F, name = 'response_1_vs_0', pAdjustMethod = 'BH')
Code when using betaPrior=TRUE:
dds <- DESeqDataSetFromMatrix(countData = Counts[, rownames(A_design)],
                              colData = A_design,
                              design = ~ pat_id + visit + response + T_after + P_after)
dds <- estimateSizeFactors(dds)
idx <- rowSums(counts(dds, normalized = TRUE) >= 1) >= 26
dds <- dds[idx, ]
ddsIb1 <- DESeq(dds, betaPrior = TRUE, parallel = TRUE, BPPARAM = MulticoreParam(4))
# general effect of treatment:
visit.res1 <- results(ddsIb1, name = 'visitb', pAdjustMethod = 'BH')
# gene expression related to recovery:
recov.res1 <- results(ddsIb1, name = 'response1', pAdjustMethod = 'BH')
When betaPrior=TRUE my results are:
par(mfrow = c(1, 2))
plotMA(visit.res1)
plotMA(recov.res1)
And I am getting a fair number of significant genes (71 and 38, to be exact).
But when I am not shrinking my fold changes I get:
par(mfrow = c(1, 2))
plotMA(visit.res1F)
plotMA(recov.res1F)
Here nothing is significant (well, actually one gene for visit).
The correlation between the p-values from the two runs is rather poor when we consider the more extreme p-values, and the correlation between the fold changes is also low. More importantly, the distribution of FDR values changes dramatically (plots not shown).
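For completeness, this is roughly how the two runs can be compared gene by gene (a sketch using the results objects from the code above; the same genes are assumed to be present in both objects):

```r
# Compare Wald p-values and fold changes between the betaPrior=TRUE and
# betaPrior=FALSE runs, restricted to genes present in both results objects.
common <- intersect(rownames(visit.res1), rownames(visit.res1F))
par(mfrow = c(1, 2))
plot(-log10(visit.res1[common, "pvalue"]),
     -log10(visit.res1F[common, "pvalue"]),
     xlab = "-log10 p (betaPrior=TRUE)", ylab = "-log10 p (betaPrior=FALSE)")
abline(0, 1, col = "red")
plot(visit.res1[common, "log2FoldChange"],
     visit.res1F[common, "log2FoldChange"],
     xlab = "LFC (shrunken)", ylab = "LFC (MLE)")
abline(0, 1, col = "red")
```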
Is there anything I can do to further understand how this shrinkage is affecting my data, and whether it is appropriate to use the shrinkage or not? Any suggestions as to what in my data leads to such a huge effect of using shrinkage or not?
I want to perform the most appropriate testing, but here I am at a loss as to which approach is the most appropriate for my design, and why. In general I really like the idea of shrinking the fold changes, but I have another example where using shrinkage leads to a complete loss of signal (and the significant genes were not all really low in counts).
Thanks in advance,