Question

Different padj value after multiple testing

0

Entering edit mode

tarun2 • 0

@tarun2-11885

Last seen 4.2 years ago

United States

I am rather new to Bioconductor and RNA-Seq analysis and tried to go over several papers and workflow to follow DESeq2 pipeline. Apologies if this post will be quite long.

I have a 2x2 factorial reproductive-stage drought stress experiment in rice. I have 2 genotypes (NIL-tolerant, Swarna-sensitive) and 2 conditions (Drought and Control). I have 4 biological replicates. We sampled two different tissues; flag leaf and panicle at maximum booting. I'm trying to analyze the data per tissue first.

I used the tximport package to import my transcript-level dataset using the following codes:

tx.all <- tximport(myfiles, type = "salmon", txOut = TRUE, reader = read_tsv)

I created the DESeqDataset for use with DESeq2 as follows:

colData <- data.frame(geno=rep(c("NIL","Swarna"),each=8),condition=rep(rep(c("Control","Drought"),each=4),times=2))

rownames(colData) <- colnames(tx.all$counts)

dds <- DESeqDataSetFromTximport(tx.all, colData, formula(~geno+condition+geno:condition))

I grouped them:

dds$group<-factor(paste0(dds$geno, dds$condition))

design(dds) <- ~group

I did pre-filtering to filter out samples that have no counts, or only a single count across all samples.

nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)

I ran the DESeq2 and generated the multiple condition effects.

dds<-DESeq(dds)
> resultsNames(dds)
[1] "Intercept" "groupNILControl" "groupNILDrought" "groupSwarnaControl"
[5] "groupSwarnaDrought"

Since we're interested in identifying gene/s in NIL that confers drought tolerance at reproductive growth stage of rice. I set-up the result using contrast. The following is an example:

res<-results(dds, contrast=c("group","NILDrought","SwarnaDrought"))

res

mcols(res, use.names=TRUE)

summary(res)

I got this nice table:

out of 49722 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 2538, 5.1%
LFC < 0 (down) : 2292, 4.6%
outliers [1] : 892, 1.8%
low counts [2] : 10260, 21%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

One thing I noticed is that even though i did not specified the log fold change, it does calculate for all the transcripts.

Then, I did multiple testing as follows:

resLFC1 <- results(dds, contrast=c("group","NILDrought","SwarnaDrought"), lfcThreshold = 1, alpha = 0.05)
table(resLFC1$padj < 0.05)
summary(resLFC1)
sum( resLFC1$padj < 0.05, na.rm=TRUE )

And the results:

out of 49722 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 166, 0.33%
LFC < 0 (down) : 99, 0.2%
outliers [1] : 892, 1.8%
low counts [2] : 4819, 9.7%
(mean count < 0)

I extracted all the genes that were differentially expressed based from this result which brings me to my question:

First, i observed that the log fold change for the default (res) generation and from the multiple testing (resLFC1) is the same as well as the base means and the lfcSE. But the padj is TOTALLY DIFFERENT!? I may totally not get it but why the padj value is different from res and resLFC1 after the addition of lfcThreshold=1 and lowering the padj to 0.05?

This is an example from res table:

FeatureID

baseMean

log2FoldChange

lfcSE

stat

pvalue

padj

LOC_Os01g02700.1

300.8324

2.433062

0.384283

6.331434

2.43E-10

2.52E-08

This same gene is identified in resLFC1;

FeatureID	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
LOC_Os01g02700.1	300.8324	2.433062	0.384283	3.729184	0.000192	0.023987

I know DESeq2 uses the so-called Benjamini-Hochberg (BH) adjustment that calculates for each gene and adjusted p-value but please and kindly enlighten me what's happening during the addition of lfcThreshold and lowering the padj to 0.05 in the DESeq2 package? Why are the padj values differs but the log2FoldChange remains the same?

Which then brings me to my other question. Since I used the extracted genes from res to generate Venn diagram for our four defined contrasts using an example subsetting below;

#Set up thresholds#
Upregulated = 1
Downregulated = -1

NILD_NILC_upregulated <- subset(NILD_NILC, (log2FoldChange > Upregulated & padj < 0.05))

I'm getting a different number of upregulated or downregulated genes based on these parameters for Venn diagram than that generated from resLFC1. Well, that is because the padj is different from res and resLFC1.

Should I still use those genes generated from res for generation of Venn diagram?

Please and kindly advise. Any comments on the workflow that I'm following is much appreciated.

Thanks and best regards.

Asher

R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] readr_1.1.0 tximport_1.2.0 genefilter_1.56.0
[4] pheatmap_1.0.8 RColorBrewer_1.1-2 ggplot2_2.2.1
[7] gplots_3.0.1 DESeq2_1.14.1 SummarizedExperiment_1.4.0
[10] Biobase_2.34.0 GenomicRanges_1.26.4 GenomeInfoDb_1.10.3
[13] IRanges_2.8.1 S4Vectors_0.12.1 BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.10 locfit_1.5-9.1 lattice_0.20-34 gtools_3.5.0
[5] assertthat_0.1 digest_0.6.12 R6_2.2.0 plyr_1.8.4
[9] backports_1.0.5 acepack_1.4.1 RSQLite_1.1-2 zlibbioc_1.20.0
[13] lazyeval_0.2.0 data.table_1.10.4 annotate_1.52.1 gdata_2.17.0
[17] rpart_4.1-10 Matrix_1.2-7.1 checkmate_1.8.2 splines_3.3.2
[21] BiocParallel_1.8.1 geneplotter_1.52.0 stringr_1.2.0 foreign_0.8-67
[25] htmlwidgets_0.8 RCurl_1.95-4.8 munsell_0.4.3 base64enc_0.1-3
[29] htmltools_0.3.5 nnet_7.3-12 tibble_1.2 gridExtra_2.2.1
[33] htmlTable_1.9 Hmisc_4.0-2 XML_3.98-1.5 bitops_1.0-6
[37] grid_3.3.2 xtable_1.8-2 gtable_0.2.0 DBI_0.6
[41] magrittr_1.5 scales_0.4.1 KernSmooth_2.23-15 stringi_1.1.3
[45] XVector_0.14.1 latticeExtra_0.6-28 Formula_1.2-1 tools_3.3.2
[49] hms_0.3 survival_2.39-5 AnnotationDbi_1.36.2 colorspace_1.3-2
[53] cluster_2.0.5 caTools_1.17.1 memoise_1.0.0 knitr_1.15.1

deseq2 padj multiple comparisons differential gene expression • 2.4k views

ADD COMMENT • link updated 8.8 years ago by Michael Love 43k • written 8.8 years ago by tarun2 • 0

score 0 · Answer 1 · 2017-04-04

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 5 hours ago

United States

In setting an lfcThreshold, you performed a different statistical test, hence the p-values and adjusted p-values are different.

Please read over the section of the DESeq2 paper, "Hypothesis tests with thresholds on effect size":

https://dx.doi.org/10.1186/s13059-014-0550-8

ADD COMMENT • link 8.8 years ago Michael Love 43k