Questions about the false positives indicated by DESeq2 and QQ plot
Entering edit mode
Weiqian ▴ 10
Last seen 9 months ago
United States

Hi Michael, I am using DESeq2 to analyze my RNA seq dataset, but I am a bit confused about the conflicting result between DESeq2 and Quantile Quantile plot. As far as I know, Deseq2 already enables me to choose 'BH/fdr etc' to correct for false positives from multiple comparison tests, but if I further take the p values or p-adj values generated by Deseq2, to draw a QQ plot (as follows), it just indicates that I have too many false positives and the model is inflated.

So I just wonder is it necessary or reasonable to use a QQ plot to decide whether I have too many false positives of a RNA seq analysis, and further use another package such as 'fdrtool' to further correct for false positives, and replace the original p-adj values of Deseq2 with the regenerated fdr values? Because on one hand Deseq2 already has multiple test correction, on the other hand a QQ plot is used in GWAS study as far as I know.

And my second question is as low-expressed genes can negatively affect the result of multiple test correction (more unuseful genes lead to more false positives), so I decide to use a criterion where the number of sig DE genes reaches the peak (maximum) of the following curve, or keep genes which express at least 50 reads in at least 8 samples (18828 genes after filtering). I am not sure whether this strategy is reasonable?

For my dataset, I have 134 samples, with the design model of 'Area + Sex + Sequencing_batch + Age + Group', where variable 'Age' is continuous and all other variables are binary categorical variables. The following PCA result uses VST data and DE compares between case and control group. Thanks for you comment!

Code as shown below

dds <- DESeqDataSetFromMatrix(data_final,
                              colData = metadata_final,
                              design= ~ group_id)
dds$group_id<-relevel(dds$group_id,ref = 'control')
design(dds)<- ~ Area+Sex+Sequencing_batch+Age+Group
dds <- DESeq(dds)
normalized_counts <- counts(dds,
                            normalized = TRUE)
vsd <- vst(object=dds,blind=FALSE)
pcaData <- plotPCA(vsd, 
                                   ntop = 5000,
                                   intgroup=c("Group", "Area",'Sex','Sequencing_batch'), 
contrast_fmt = c("group_id", "AD", "control")
res <- results(dds,
               contrast = contrast_fmt,
               pAdjustMethod = "fdr",
               alpha = 0.05,
# log fold change shrinkage ignored here
hist(res$padj[res$baseMean > 1], breaks = 0:20/20,
     col = "grey50", border = "white")

#code for number of sig DE genes by changing different filtering threshold not shown, threshold is #like this pattern, for each iteration, run Deseq2 and detect the number of sig DE genes with p-adj < #0.05
idx <- rowSums((df)>=X) >= Y
data_final <- df[idx,]

P adj value histogram

qq plot

filtering low-expressed genes

PC1 Vs PC2

PC1 Vs PC3

pre-filterin DESeq2 • 1.3k views
Entering edit mode
Last seen 4 hours ago
United States

PC1 has a lot of signal here separating the two groups, so I would expect a lot of DE genes. If you had genes that you think should not be DE across the groups it would be interesting to perform hist(res$pvalue) on that subset.

What do you get with the following (this is both using LRT over Wald, as well as a faster engine for computing dispersion because of large sample size):

design(dds) <- ~Area+Sex+Sequencing_batch+Age+Group
dds <- DESeq(dds, test="LRT", reduced=~Area+Sex+Sequencing_batch+Age, fitType="glmGamPoi")
res <- results(dds)
Entering edit mode

(glmGamPoi can be obtained with a version of DESeq2 since 1.30, now we are on 1.36)

Entering edit mode

Yeah my current project is on R 4.0.2 and seems the highest version of Deseq2 I get is just 1.28.1; Maybe in future I'll immigrant my project and all related packages onto higher R version :)

Entering edit mode

Hi Michael, Thank you for kind reply, as my current Deseq2 version is not that high, so I just choose fitType = 'parametric', and I get the following result, quite similar to what I got from Wald test, so is it still needed to use a QQ plot to detect and correct for the false positives? Thanks!


out of 18819 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 6045, 32% LFC < 0 (down) : 5720, 30% outliers [1] : 0, 0% low counts [2] : 0, 0% (mean count < 8)

Entering edit mode

You can't say these are false positives from your current data. That would presume that the vast majority of the genes are null, that the LFC is truly 0 for most genes.

While this is appropriate for some studies, I don't think it is for a generic RNA-seq study.

From the PCA plot there are systematic differences between these groups, it in fact dominates PC1. So the null is clearly false across the genes that contribute to the most dominant axis of variance (unsupervised) in your data.

If you care about large effects, use the following:

res <- results(dds, lfcThreshold=1)
Entering edit mode

Thanks for your comment, it really helps a lot!

Entering edit mode

Could you please tell me why do you think Q-Q plot is inappropriate for a generic RNA-seq study in detail?

I would really appreciate that if you could answer my question.

Best regards,

Yujin Kim.

Entering edit mode

We discuss in the original paper, there are many studies where the majority of genes are affected in some amount by treatment. QQ-plot as I use it is helpful when the signal is sparse and the data consist with the null is dominant, as you expect the points to mostly fall on the theoretical line under the null. So if you have a dataset which is of the former type, I'm not sure what you learn from the QQ-plot.

Entering edit mode

Thanks for your reply :)


Login before adding your answer.

Traffic: 639 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6