Questions about the false positives indicated by DESeq2 and QQ plot
1
0
Entering edit mode
Weiqian • 0
@912d91f3
Last seen 4 months ago
United States

Hi Michael, I am using DESeq2 to analyze my RNA seq dataset, but I am a bit confused about the conflicting result between DESeq2 and Quantile Quantile plot. As far as I know, Deseq2 already enables me to choose 'BH/fdr etc' to correct for false positives from multiple comparison tests, but if I further take the p values or p-adj values generated by Deseq2, to draw a QQ plot (as follows), it just indicates that I have too many false positives and the model is inflated.

So I just wonder is it necessary or reasonable to use a QQ plot to decide whether I have too many false positives of a RNA seq analysis, and further use another package such as 'fdrtool' to further correct for false positives, and replace the original p-adj values of Deseq2 with the regenerated fdr values? Because on one hand Deseq2 already has multiple test correction, on the other hand a QQ plot is used in GWAS study as far as I know.

And my second question is as low-expressed genes can negatively affect the result of multiple test correction (more unuseful genes lead to more false positives), so I decide to use a criterion where the number of sig DE genes reaches the peak (maximum) of the following curve, or keep genes which express at least 50 reads in at least 8 samples (18828 genes after filtering). I am not sure whether this strategy is reasonable?

For my dataset, I have 134 samples, with the design model of 'Area + Sex + Sequencing_batch + Age + Group', where variable 'Age' is continuous and all other variables are binary categorical variables. The following PCA result uses VST data and DE compares between case and control group. Thanks for you comment!

Code as shown below

dds <- DESeqDataSetFromMatrix(data_final,
design= ~ group_id)
dds$group_id<-relevel(dds$group_id,ref = 'control')
design(dds)<- ~ Area+Sex+Sequencing_batch+Age+Group
dds <- DESeq(dds)
normalized_counts <- counts(dds,
normalized = TRUE)
vsd <- vst(object=dds,blind=FALSE)
ntop = 5000,
intgroup=c("Group", "Area",'Sex','Sequencing_batch'),
returnData=TRUE)
res <- results(dds,
contrast = contrast_fmt,
alpha = 0.05,
lfcThreshold=0)
# log fold change shrinkage ignored here
hist(res$padj[res$baseMean > 1], breaks = 0:20/20,
col = "grey50", border = "white")
library(qqman)
qq(res$padj) #code for number of sig DE genes by changing different filtering threshold not shown, threshold is #like this pattern, for each iteration, run Deseq2 and detect the number of sig DE genes with p-adj < #0.05 idx <- rowSums((df)>=X) >= Y data_final <- df[idx,]  pre-filterin DESeq2 • 319 views ADD COMMENT 0 Entering edit mode @mikelove Last seen 1 day ago United States PC1 has a lot of signal here separating the two groups, so I would expect a lot of DE genes. If you had genes that you think should not be DE across the groups it would be interesting to perform hist(res$pvalue) on that subset.

What do you get with the following (this is both using LRT over Wald, as well as a faster engine for computing dispersion because of large sample size):

design(dds) <- ~Area+Sex+Sequencing_batch+Age+Group
dds <- DESeq(dds, test="LRT", reduced=~Area+Sex+Sequencing_batch+Age, fitType="glmGamPoi")
res <- results(dds)
summary(res)

0
Entering edit mode

(glmGamPoi can be obtained with a version of DESeq2 since 1.30, now we are on 1.36)

0
Entering edit mode

Yeah my current project is on R 4.0.2 and seems the highest version of Deseq2 I get is just 1.28.1; Maybe in future I'll immigrant my project and all related packages onto higher R version :)

0
Entering edit mode

Hi Michael, Thank you for kind reply, as my current Deseq2 version is not that high, so I just choose fitType = 'parametric', and I get the following result, quite similar to what I got from Wald test, so is it still needed to use a QQ plot to detect and correct for the false positives? Thanks!

## #

out of 18819 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 6045, 32% LFC < 0 (down) : 5720, 30% outliers [1] : 0, 0% low counts [2] : 0, 0% (mean count < 8)

0
Entering edit mode

You can't say these are false positives from your current data. That would presume that the vast majority of the genes are null, that the LFC is truly 0 for most genes.

While this is appropriate for some studies, I don't think it is for a generic RNA-seq study.

From the PCA plot there are systematic differences between these groups, it in fact dominates PC1. So the null is clearly false across the genes that contribute to the most dominant axis of variance (unsupervised) in your data.

If you care about large effects, use the following:

res <- results(dds, lfcThreshold=1)

0
Entering edit mode

Thanks for your comment, it really helps a lot!