Hi there, I am using DESeq2 to perform DE analysis between a disease group and a control group from RNA-seq data. I learned from this tutorial that low-count genes can affect false positives in the multiple-testing correction, so pre-filtering is advised: https://www.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#using-sva-with-deseq2. However, I am a bit confused about the DEGs I got after applying 4 different filtering criteria: the DEG sets overlap to some extent, but they also differ from each other. So I have the following two questions:
- Is it better to use only the common DEGs shared by the different filtering strategies, since these common DEGs seem quite robust across strategies?
- Where do the non-overlapping DEGs come from? Do they indicate that DESeq2 has generated some false positives during GLM fitting? One thing to mention is that I have about 60 samples in each of the two conditions I am comparing. Is it likely that 60 samples lead to such large within-group variance that a single NB model per condition is no longer applicable, so that DESeq2 produces false positives? Any other suggestions? Thank you for your comments!
The four filtering strategies are as follows:
- Strategy 1: keep genes with expression greater than 50 in at least 8 samples; 18819 genes remained.
- Strategy 2: keep genes with expression greater than 1 in at least 1 sample; 47826 genes remained.
- Strategy 3: keep genes with expression greater than 20 in at least 3 samples; 24669 genes remained.
- Strategy 4: keep genes with expression greater than 5 in at least 3 samples; 34878 genes remained.
Below are the p-value distributions for the four strategies, which also indicate the number of significant DEGs; the relative abundance of up/down DEGs is nearly the same across strategies.
For the up/down DEGs obtained from the four filtering strategies above, I used Venn diagrams to show the overlapping and non-overlapping sets of genes. As you can see, the common DEGs are quite robust, but there are still many genes that are unique to each strategy. Venn diagram for common DEGs between strategies 1 and 2:
Venn diagram for common DEGs among strategies 1, 2 and 3:
Venn diagram for common DEGs among all four strategies:
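Apart from the plots, the same overlaps can be tabulated directly. The following is only a minimal base-R sketch: the list `degs` and its gene IDs are made-up placeholders, not my actual results, and it assumes each strategy's DEGs are available as a character vector of gene IDs.

```r
# Hypothetical DEG lists, one per filtering strategy (placeholder gene IDs).
degs <- list(
  s1 = c("geneA", "geneB", "geneC"),
  s2 = c("geneA", "geneB", "geneD"),
  s3 = c("geneA", "geneB", "geneC", "geneE"),
  s4 = c("geneA", "geneB", "geneE")
)

# DEGs shared by all strategies (the centre of the four-way Venn diagram).
common <- Reduce(intersect, degs)

# Count in how many strategies each gene was called a DEG.
tab <- table(unlist(degs, use.names = FALSE))

# DEGs called by exactly one strategy, listed per strategy.
unique_to_one <- lapply(degs, function(g) g[tab[g] == 1])

print(common)
print(unique_to_one)
```

With real results, each element of `degs` would be the gene IDs passing the significance cutoff for that strategy.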
The code follows the basic DESeq2 workflow comparing the disease group versus control, except that a different number of genes is pre-filtered before creating the DESeq2 dds object. DEGs are defined as genes with adjusted p-value (padj) < 0.05.
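To make the pre-filtering step concrete, here is a minimal base-R sketch of the four strategies. It is an illustration under assumptions: `cts` is a simulated toy count matrix rather than my data, `filter_genes` is a hypothetical helper name, and "expression greater than X" is taken to mean a count strictly greater than X.

```r
# Toy count matrix standing in for the real data (hypothetical: 200 genes x 12 samples).
set.seed(1)
cts <- matrix(
  rnbinom(200 * 12, mu = 10, size = 0.5),
  nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("s", 1:12))
)

# Keep genes whose count exceeds `min_count` in at least `min_samples` samples.
filter_genes <- function(counts, min_count, min_samples) {
  keep <- rowSums(counts > min_count) >= min_samples
  counts[keep, , drop = FALSE]
}

# The four strategies as (min_count, min_samples) pairs.
strategies <- list(s1 = c(50, 8), s2 = c(1, 1), s3 = c(20, 3), s4 = c(5, 3))
filtered <- lapply(strategies, function(p) filter_genes(cts, p[1], p[2]))

# Number of genes remaining under each strategy.
n_kept <- sapply(filtered, nrow)
print(n_kept)
```

Note that with these thresholds the filtered gene sets are nested (strategy 1 within 3 within 4 within 2), so the strategies differ only in how many additional low-count genes enter the analysis. Each filtered matrix would then be used to build the dds object.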