Pre-filtering, adjusting independent filtering, or lfcThreshold in DESeq2? Low counts and dropouts produce interesting volcano plots.
Alexander • 0
@f109e3f9
Last seen 8 weeks ago
United States

Hello,

I am running DESeq2 on bulk RNA sequencing data. Our lab has a legacy pipeline for identifying differentially expressed genes, but I have recently updated it to include functionality such as lfcShrink(). I noticed that in the past, graduate students would use a pre-filter to eliminate genes that were likely not biologically meaningful, as many samples contained drop-outs and had lower counts overall. Here is an example from my data of a gene that was considered significant despite this pattern:

[attached image: counts for a gene called significant despite drop-outs in several samples]

I also see examples at the other end of the spectrum, where a gene has quite a few dropouts but no significant difference is detected, as you can see here:

[attached image: counts for a gene with many drop-outs and no significant difference]

I have read in the vignette and on the forums that pre-filtering is not necessary (it is only used to speed up the process) and that independent filtering should take care of these types of genes. However, upon shrinking my log2 fold changes, strange lines appear on my volcano plots. I am attaching one here:

[attached image: volcano plot with a vertical line of significant genes at the center]

I know that DESeq2 calculates the log2 fold changes before shrinking, which is why this may look a little strange (I am referring to the string of significant genes in a straight line at the volcano center). My question is why these genes are not filtered out in the first place. I can remove them with pre-filtering (for example, a rule that 50-75% of samples must have a count greater than 10), but that seems entirely arbitrary and unscientific. All of these genes have drop-outs and low counts in some samples. Should I adjust the independent filtering instead? Is that the better approach? Or would applying an lfcThreshold be better? Mind you, I am performing gene set enrichment analysis directly after this (with normalized counts taken directly from DESeq2), if that changes anything. I keep rereading the vignette to try to uncover the answer, but as someone with limited experience in the field, I want to ensure I am doing what is scientifically correct.

I have attached relevant pieces of my code that may be useful. I snipped out any portions for visualizations, etc. to make it cleaner:

# Create coldata
coldata <- data.frame(
  row.names = sample_names,
  occlusion = factor(occlusion, levels = c("0", "70", "90", "100")),
  region = factor(region, levels = c("upstream", "downstream")),
  replicate = factor(replicate)
)

# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(
  countData = cts,
  colData = coldata,
  design = ~ region + occlusion
)

# Pre-filter genes with low expression
keep <- rowSums(counts(dds) >= 10) >= 12  # Have been adjusting this to view volcano plots differently
dds <- dds[keep, ]

# Run DESeq normalization
dds <- DESeq(dds)

# Load apeglm for LFC shrinkage
if (!requireNamespace("apeglm", quietly = TRUE)) {
  BiocManager::install("apeglm")
}
library(apeglm)

# 0% vs 70%
res_70 <- lfcShrink(dds, coef = "occlusion_70_vs_0", type = "apeglm")
write.table(
  cbind(res_70[, c("baseMean", "log2FoldChange", "pvalue", "padj", "lfcSE")],
        SYMBOL = mcols(dds)$SYMBOL),
  file = "06042025_res_0_vs_70.txt", sep = "\t", row.names = TRUE, col.names = TRUE
)

# 0% vs 90%
res_90 <- lfcShrink(dds, coef = "occlusion_90_vs_0", type = "apeglm")
write.table(
  cbind(res_90[, c("baseMean", "log2FoldChange", "pvalue", "padj", "lfcSE")],
        SYMBOL = mcols(dds)$SYMBOL),
  file = "06042025_res_0_vs_90.txt", sep = "\t", row.names = TRUE, col.names = TRUE
)

# 0% vs 100%
res_100 <- lfcShrink(dds, coef = "occlusion_100_vs_0", type = "apeglm")
write.table(
  cbind(res_100[, c("baseMean", "log2FoldChange", "pvalue", "padj", "lfcSE")],
        SYMBOL = mcols(dds)$SYMBOL),
  file = "06042025_res_0_vs_100.txt", sep = "\t", row.names = TRUE, col.names = TRUE
)

Thanks for your assistance! I hope I included everything necessary and am not repeating information found elsewhere on this forum. I scanned a lot, but couldn't find the exact keywords to find it (if it's out there).

EDIT: 06/10/2025 to include lfcThreshold as an option in my title/body text

DESeq2 • 3.8k views
@mikelove
Last seen 8 days ago
United States

There are a couple of posts on here that also mention this, but they may be hard to find: lfcShrink is more aggressive at shrinking than the point null hypothesis testing (LFC = 0). If you want them to line up more, you could use lfcThreshold, which tends to agree more with the behavior of lfcShrink.

However, my question lies in why these genes are not filtered out in the first place?

independentFiltering focuses on average scaled count. These genes with LFC shrunk to zero tend to display high dispersion.
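To see this with made-up numbers: a gene with drop-outs in half the samples but large counts in the rest still has a high average count, so mean-based independent filtering keeps it; only a presence/absence style pre-filter would remove it. A minimal base-R sketch with hypothetical counts:

```r
# Hypothetical counts for one gene across 8 samples: half drop-outs, half high
counts <- c(0, 0, 0, 0, 150, 180, 120, 160)

# Independent filtering ranks genes by average (normalized) count,
# and this gene's mean sits well above a typical filter cutoff
mean(counts)  # 76.25

# ...whereas a presence/absence pre-filter (count >= 10 in >= 6 samples)
# would drop it: only 4 of 8 samples pass
sum(counts >= 10) >= 6  # FALSE
```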


Dr. Love,

Thank you for the support. Is there a hard-and-fast rule for choosing a threshold in bulk RNA-seq analyses using lfcThreshold? As I understand it, most people apply what works best for their data (after consulting threads such as: Effect of lfcThreshold on p-value in DESeq2).

Take care.


It has to come from the biology. What is a meaningful fold change in the system? Depending on the context, that could be a doubling (log2 threshold of 1) or a 50% increase (log2(1.5)).
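For reference, these thresholds on the log2 scale work out to:

```r
# log2-scale threshold for a doubling (2-fold change)
log2(2)    # 1

# log2-scale threshold for a 50% increase (1.5-fold change)
log2(1.5)  # ~0.585
```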


Hi Dr. Love,

I believe I found some of the old threads you referenced earlier and used their suggestions, plus the advice given in this post, to try to get things looking correct. I ran some tests on a single comparison (just my control vs. 90% stenosis for now). I ended up using the first (of two) suggestions from that forum post and am still encountering similar issues. Here is the code for the first suggestion:

# 0% vs 90%
res <- results(dds, name="occlusion_90_vs_0", lfcThreshold=0.5)
res <- lfcShrink(dds, coef="occlusion_90_vs_0", type="apeglm", res=res)
write.table(
  cbind(res[, c("baseMean","log2FoldChange","pvalue","padj","lfcSE")],
        SYMBOL = mcols(dds)$SYMBOL),
  file = "TEST06232025_res_0_vs_90_thr_and_shrunk.txt",
  sep  = "\t", quote = FALSE, row.names = TRUE, col.names = TRUE
)

As you can see in my attached volcano plot, after applying a threshold prior to shrinkage, I still have a clump of genes that form that solid line in the very center of the plot:

[attached image: volcano plot with a vertical line of genes at the center after thresholding and shrinkage]

Moreover, if I follow the second suggestion from that post (again, just using my 0% vs. 90% comparison for simplicity at the moment), my code structure turns into this:

res_90 <- lfcShrink(dds, coef = "occlusion_90_vs_0", type = "apeglm", svalue = TRUE)
write.table(
  cbind(res_90[, c("baseMean", "log2FoldChange", "svalue")],
        SYMBOL = mcols(dds)$SYMBOL),
  file = "06042025_res_0_vs_90.txt", sep = "\t", row.names = TRUE, col.names = TRUE
)
I can then plot my s-values on the volcano plot instead of the adjusted p-values. I also compared how the s-values relate to the adjusted p-values. For the volcano plot, I arbitrarily chose s = 0.05 for the time being to mark "significance." I have attached those images here:

[attached images: volcano plot colored by s-value, and a comparison of s-values against adjusted p-values]

Finally, I can also add the thresholding argument that is built into lfcShrink(). I chose a biologically meaningful fold change of 50%, so I set the threshold to 0.585 (log2(1.5)) and plotted the volcano plot (shown below). This dramatically reduced the number of "significant" genes, and might be too stringent.

[attached image: volcano plot after shrinkage with lfcThreshold = 0.585]

I understand that the s-value provides a measure of the false sign rate, which is the estimated probability that the sign of the log fold change estimate is wrong. What I want to confirm is whether it can be used (with a cutoff at the suggested values of 0.01 or 0.005) instead of an adjusted p-value, which is what is traditionally used to determine significance. I read in Stephens' 2016 paper that having high confidence in a sign's correctness implies we also have high confidence in the number not being 0, which is considered "signless." Is that a correct interpretation?

This is a direct quote from the paper that gave me that interpretation:

[attached image: quoted passage from the paper]

I do not have a statistics background whatsoever, but I believe my interpretation is correct. After graphing, it is evident that the s-values and adjusted p-values correlate relatively well: with a strict cut-off, nearly all genes below a given s-value threshold are also below 0.05 in adjusted p-value.
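For what it's worth, the comparison I describe above can be sketched like this (with hypothetical s-values and adjusted p-values standing in for the real lfcShrink()/results() output, same gene order):

```r
# Hypothetical s-values and adjusted p-values for five genes, same order
svalue <- c(0.001, 0.004, 0.030, 0.200, 0.600)
padj   <- c(0.0005, 0.002, 0.040, 0.300, 0.800)

# Of the genes passing a strict s-value cutoff, what fraction also have padj < 0.05?
strict <- svalue < 0.005
mean(padj[strict] < 0.05, na.rm = TRUE)  # 1 for these toy values
```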

The big question: Would simply reporting s-values in the volcano plots be the best approach in the future (and even disregard thresholding)? And would adding a threshold of s-values to highlight genes of interest in heat maps be appropriate? My intuition is yes, mainly because the approach tends to be more conservative than traditional adjusted p-values, and is readily accepted in your field. But then this raises the question... why would we not just use s-values over adjusted p-values all the time? Is there still debate about prioritizing type S errors over type I errors in these statistical analyses?

Thank you so much as always!

EDIT: 06/24/2025 to rephrase some of my paragraphs after doing some additional reading


There is no problem with reporting the s-values with the volcano plots from lfcShrink. No reason not to use these, and with the increasing use of ashr I think they are more widely known now.
