Question: DESeq2: Extremely low pvalues
0
2.7 years ago by
szenitha0
szenitha0 wrote:

Dear all,

I have a question regarding DESeq2 package. Please help. I am using this package to analyze a RNA seq dataset. The dataset has  3 different groups and my goal is to find out the DE genes between groups. I used the test 'LRT' to find out genes that are differentially expressed between all the groups. The problem is that I am getting extremly low p-values and do not understand why. I am getting p-values as low as 3.067392e-314.

Code:

dds.fc <- DESeqDataSetFromMatrix(dat0,colData=pheno,design=~ State) #condition
#relevel groups
dds.fc@colData$State<-factor(dds.fc@colData$State,levels=c("non-infected spleen","infected lung","infected spleen"))
dds<-DESeq(dds.fc,test = "LRT",reduced = ~1)

res<-results(dds,name =  "State_infected.lung_vs_non.infected.spleen")

#order with respect to p-value

temp<-res[with(res,order(res[,5])),]

#get result with pvalue below 0.05

sub <- subset(temp,temp$pvalue<0.05) Result: head(sub,25) log2 fold change (MLE): State infected.lung vs non.infected.spleen LRT p-value: '~ State' vs '~ 1' DataFrame with 25 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> Arg1 2337.0486 14.430641 1.1909019 2463.520 0 0 Ccl12 817.8881 11.012091 0.7409714 1723.197 0 0 Ccl7 855.1171 12.086512 1.0304499 2447.651 0 0 Cd38 881.1564 6.760928 0.2047603 1766.141 0 0 Cd72 533.9823 7.131620 0.2767360 1654.121 0 0 ... ... ... ... ... ... ... Irg1 2944.5393 9.067204 0.2761566 1443.782 3.067392e-314 1.831233e-311 Htr7 393.6399 8.571863 0.4775459 1436.052 1.462933e-312 8.336721e-310 C3ar1 1772.5935 6.709288 0.2091927 1432.418 9.002270e-312 4.907020e-309 Rasgef1b 805.7408 4.032636 0.1431151 1407.784 2.011923e-306 1.050978e-303 Ets2 3110.7434 5.978289 0.1886460 1364.882 4.164787e-297 2.088558e-294 Session Info sessionInfo() R version 3.3.0 (2016-05-03) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] pheatmap_1.0.8 data.table_1.9.6 DESeq2_1.12.3 [4] rgl_0.95.1441 corrplot_0.77 RColorBrewer_1.1-2 [7] ffpe_1.16.0 TTR_0.23-1 RUVSeq_1.6.2 [10] edgeR_3.14.0 limma_3.28.14 EDASeq_2.6.2 [13] ShortRead_1.30.0 GenomicAlignments_1.8.4 Rsamtools_1.24.0 [16] Biostrings_2.40.2 XVector_0.12.0 BiocParallel_1.6.2 [19] RSkittleBrewer_1.1 zebrafishRNASeq_0.106.2 BiocInstaller_1.22.3 [22] scales_0.4.0 ggplot2_2.1.0 SummarizedExperiment_1.2.3 [25] Biobase_2.32.0 GenomicRanges_1.24.2 GenomeInfoDb_1.8.2 [28] IRanges_2.6.1 S4Vectors_0.10.1 BiocGenerics_0.18.0 [31] sva_3.20.0 genefilter_1.54.2 mgcv_1.8-12 [34] nlme_3.1-128  deseq2 • 958 views ADD COMMENTlink modified 2.7 years ago by Steve Lianoglou12k • written 2.7 years ago by szenitha0 Answer: DESeq2: Extremely low pvalues 2 2.7 years ago by Michael Love23k United States Michael Love23k wrote: Why don't you plot an example of one of these genes using plotCounts. I'd guess that these are all "off" in one group and have high expression in another. Note that you are testing that there are no differences across State, so if there are any large differences it makes sense that the probability of the data under the null is very small. The p-value can be arbitrarily low when the null hypothesis is false. In fact it becomes not so interesting as a statistic when the null is clearly false, and other statistics become more interesting, such as the effect size: the log2 fold change between groups. Actually, we discuss just this topic in the DESeq2 paper. ADD COMMENTlink written 2.7 years ago by Michael Love23k Thank you michael. Yes I took the plotCounts for some genes and found that they were expressed in one condition and off in the other. I also went through the paper. I hope that I have understood it correctly. By default we check if there is no differences across condition(null hypothesis). But if there is a clear biological difference across condition then even genes with very small foldchange will be flagged as significant because the null hypothesis is clearly false. In such a scenario I would need to focus on the magnitude of the fold change.So if I have to find out the significant DE genes between two conditions then I would need to specify a composite null hypothesis wherein the null hypothesis would be that genes with fold change > or < threshold is not significant. So if I am interested in finding DE genes between two groups "infected lung" and "non-infected spleen" that have a fold change > 2 or <-2 then is this correct? res<- results(dds,contrast=c("State", "infected lung", "non-infected spleen"), altHypothesis="greaterAbs",lfcThreshold = 1) res[which(res$padj<0.05),]

exactly.

one correction is that lfcThreshold=1 and altHypothesis="greaterAbs" corresponds to a fold change of > 2 or < 1/2.