I would appreciate your feedback on this one. I have been using DESeq2 (version 1.20.0) for differential expression analysis comparing response groups in cancer patients.
My question is about the lfcShrink() function. The function has been updated since previous versions of the package, and although I have read the documentation and many posts here, I still have some doubts about how to do my analysis.
I wanted to be strict, so I set thresholds for both padj and log fold change:
padj.cutoff <- 0.05
lfc.cutoff <- 0.58
dds <- DESeq(dds)
res <- results(dds, contrast=c("response","PD","CR+PR"),
               lfcThreshold=lfc.cutoff, alpha=padj.cutoff,
               altHypothesis="greaterAbs")
res_setThres <- lfcShrink(dds=dds, contrast=c("response","PD","CR+PR"),
                          res=res, lfcThreshold=lfc.cutoff)
res_noThres <- lfcShrink(dds=dds, contrast=c("response","PD","CR+PR"), res=res)
My question is: should I set the LFC threshold inside the lfcShrink() function too? In res_setThres the p-values have changed (as mentioned in the documentation), and I end up with no DEGs at these thresholds; even when I raised the padj cutoff to 0.1 I still got zero significant DEGs. My res_noThres table yields a small set of genes at these thresholds. Is it wrong to use lfcShrink() without setting lfcThreshold and then subset my results table to the genes passing the filters? I know it is important to set these thresholds in the results() function rather than filtering post hoc; I was wondering whether that applies to lfcShrink() too.
Thank you Michael for the quick reply. Your explanation was very helpful!
Thank you for your answer, but I'm still a little confused and have a question as described below:
With res <- results(dds, lfcThreshold=1, ...), we test whether the absolute LFC is larger than 1, but after shrinking the LFC using
res <- lfcShrink(dds, coef="response_PD_vs_CR+PR", type="apeglm", res=res), some genes with an original |LFC| > 1 and padj < 0.05 will end up with a shrunken |LFC| less than 1. Are those genes still significant? Should we use the shrunken LFC together with the p/padj values for downstream analysis, even though the p/padj values were based on the original LFC and do not match the shrunken LFC?
I really appreciate it if you would answer my question. Thank you very much.
You’re always going to find a few inconsistencies if you switch between the two frameworks, so why not just stick with the Bayesian shrinkage framework (and skip the results function)?
dds <- DESeq(dds)
res <- lfcShrink(dds, coef=coef, lfcThreshold=1)
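In that framework, significance is read from s-values rather than adjusted p-values. A sketch of the follow-up filtering (type="apeglm" is spelled out explicitly here, and the 0.005 cutoff is just one common choice, not a prescription):

```r
# with type="apeglm" and an lfcThreshold, lfcShrink reports s-values
res <- lfcShrink(dds, coef=coef, lfcThreshold=1, type="apeglm")
sig <- subset(res, svalue < 0.005)  # aggregate sign-error control
```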
Thank you for your prompt response, Michael.
My concern is that
res <- lfcShrink(dds, coef=coef, lfcThreshold=1) will provide s-values, which are less commonly used than p/padj values. I'm afraid this may cause some confusion during the peer-review process.
I'm considering skipping lfcShrink and just using the LFC generated by the results function. Does this approach make sense to you? Thanks very much!
It’s really up to you. I don’t see an inconsistency between a thresholded test of an unshrunken LFC along with a shrunken LFC, which in a minority of cases may be less than the threshold. They are different procedures and estimators for different purposes, both useful output in my opinion.
I was wrong.
lfcShrink will produce padj values. However, lfcShrink doesn't allow specifying alpha values other than 0.1, whereas the results function does. Can I run
res <- lfcShrink(dds, coef=coef, lfcThreshold=1) and then filter significant genes by padj < 0.05? I ask because you mentioned in another post that choosing a different value for alpha than the one you use to filter is sub-optimal.
Would you please tell me which approach would be better: use results with alpha = 0.05, or use lfcShrink and then filter by padj < 0.05?
Thank you very much!
I see, there's no access to the 'alpha' argument of results(), which is called inside lfcShrink. Although I have said elsewhere that it's optimal to pick the same 'alpha' that you threshold on, it's not that big of a deal between 5% and 10%. I would go with using lfcShrink and filtering on padj < .05.
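In code, that recommendation amounts to something like the following (a sketch; coef stands for whatever coefficient name resultsNames(dds) reports):

```r
# the results() call inside lfcShrink uses its default alpha of 0.1;
# filter the returned adjusted p-values at 0.05 afterwards
res <- lfcShrink(dds, coef=coef, lfcThreshold=1)
sig <- subset(res, padj < 0.05)
```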
Thank you very much, Michael.
Hi Michael, this post was super informative, but I also have some questions. I am concerned about the large differences in the DEG lists I get when I use lfcThreshold=1 inside lfcShrink (filtered by s-value < .005) versus when I pass the results output to lfcShrink (filtered by lfcThreshold=1 in results(), adjusted p < .1).
I get 91 DEGs for resNorm1 and 81 DEGs for resNorm2, only 28 of which are shared between the two. I understand that the two approaches differ in the way they estimate error, but is it common for them to produce such different results? The issue I have with the recommended second approach (resNorm2) is that it outputs many DEGs with log2FC < 1 (53 out of 81 DEGs have a log2FC of 0.1 or below), which I find difficult to interpret. I was wondering if it is common to have this many low fold changes after shrinking. I know it's hard to say which approach is the "right" one, but in this case, would the first approach (resNorm1) be more appropriate? My concern is that it identifies 63 DEGs that were not found with approach 2, and while I love having larger DEG lists, I worry about having so many that are "unique". I hope my questions make sense. Thank you!
This seems like a case where apeglm shrinkage is stricter than null-hypothesis testing of the MLE effect sizes (even with the lfcThreshold). Can you look at the mean count of the genes that are significant by resNorm2 but have a small shrunken LFC? E.g.
with(resNorm2, hist(baseMean[padj < 0.1 & abs(log2FoldChange) < 0.1])). Do these tend to have a small mean count? If so, one way to square these numbers would be to filter at the top of the script, e.g.:
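The filter described here would look something like the following sketch (X is a placeholder, as explained just below):

```r
# keep genes with a count of 10 or more in at least X samples
keep <- rowSums(counts(dds) >= 10) >= X
dds <- dds[keep, ]
```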
Where X is some minimal number of samples with a count of 10 or more.
Also, I'd be curious whether the results overlap more with a less strict LFC threshold, e.g.
log2(1.5), i.e. asking for statistical evidence that the fold changes are more than 1.5-fold.
I did include a filtering step at the top:
keep <- rowSums(counts(dds) >= 5) >= 3, but maybe my choice of 3 for X (since I had 3 biological replicates) was too low? The genes with the super-low log2FCs all had 0 mean counts in one of the two conditions being compared. I know there is a chance that some genes may have low expression in one condition and none at all in the other, but I don't usually feel very confident about these.
I tried using the lower LFC threshold of 0.585 (fold change = 1.5) but I'm getting fairly similar results. Shrinking using
resNorm2 <- lfcShrink(dds, coef="Treat2", res=res, type="apeglm") results in 131 DEGs, of which 64 had an LFC of 0.1 or below. Shrinking using
resNorm1 <- lfcShrink(dds, coef="Treat2", lfcThreshold=0.585, type="apeglm") gave 184 DEGs. All of the genes with LFCs > 0.1 from resNorm2 (n = 67) were also identified with resNorm1. However, resNorm1 gave an additional 117 DEGs. So it seems that setting the LFC threshold in the lfcShrink() function is less conservative than setting the threshold in the results() function and passing that output (res) to lfcShrink(). That seems counterintuitive to me somehow.
Note that the s-value and the adjusted p-value control different types of error. The adjusted p-value controls, in aggregate, the rate of genes with LFC = 0 being called significant. The s-value controls, in aggregate, the rate of a gene with LFC < 0 being called as LFC > 0, and vice versa.
I think the apeglm shrinkage is probably strong in this dataset due to the limited number of replicates and the high within-group variability, so you may not want to apply shrinkage but instead use lfcThreshold alone.
My concern with not using shrinkage is that all of the LFCs are then very high, ranging from 4 to 25 and -6 to -30 (mainly for the genes that had zero counts in one condition). I suppose I can remove those DEGs from the analysis. Would an alternative approach be to apply more stringent filtering to remove all genes with zero counts, e.g.
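Such a filter might look like the following sketch (requiring a nonzero count in every sample is my reading of "remove all genes with zero counts"; X is a placeholder defined just below):

```r
# keep only genes with a nonzero count in all X samples
keep <- rowSums(counts(dds) > 0) == X
dds <- dds[keep, ]
```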
where X is the total # of samples being analyzed?
Right, those are just genes (the ones with LFCs going above 10) where one group has all zero counts.
I wouldn't remove the genes from the analysis. Either go with the LFC and s-values, which will give you more conservative results (apeglm is shrinking a lot because of the low replicate count and high variability), or:
If you'd rather not report those with a large LFC because it is not well defined, you could easily report them separately with:
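A sketch of such a split (the grouping column "condition" and reference level "ref" are placeholders for illustration, not names from this thread):

```r
# flag genes whose counts are all zero in the reference group,
# where the LFC is not well defined
zeroRef <- rowSums(counts(dds)[, dds$condition == "ref"]) == 0
setAside <- res[which(res$padj < 0.1 & zeroRef), ]
mainList <- res[which(res$padj < 0.1 & !zeroRef), ]
```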
This will allow you to set aside those genes with an FDR < 0.1 whose LFC is not well defined because the counts were all zero in the reference group.