Different padj value after multiple testing
1
0
Entering edit mode
tarun2 • 0
@tarun2-11885
Last seen 2.4 years ago
United States

I am rather new to Bioconductor and RNA-Seq analysis and tried to go over several papers and workflow to follow DESeq2 pipeline. Apologies if this post will be quite long.

I have a 2x2 factorial reproductive-stage drought stress experiment in rice. I have 2 genotypes (NIL-tolerant, Swarna-sensitive) and 2 conditions (Drought and Control). I have 4 biological replicates. We sampled two different tissues; flag leaf and panicle at maximum booting. I'm trying to analyze the data per tissue first.

I used the tximport package to import my transcript-level dataset using the following codes:

tx.all <- tximport(myfiles, type = "salmon", txOut = TRUE, reader = read_tsv)

I created the DESeqDataset for use with DESeq2 as follows:

colData <- data.frame(geno=rep(c("NIL","Swarna"),each=8),condition=rep(rep(c("Control","Drought"),each=4),times=2))

rownames(colData) <- colnames(tx.all$counts)

dds <- DESeqDataSetFromTximport(tx.all, colData, formula(~geno+condition+geno:condition))

I grouped them:

dds$group<-factor(paste0(dds$geno, dds$condition))

design(dds) <- ~group

I did pre-filtering to filter out samples that have no counts, or only a single count across all samples.

nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)

I ran the DESeq2 and generated the multiple condition effects.

dds<-DESeq(dds)
> resultsNames(dds)
[1] "Intercept"          "groupNILControl"    "groupNILDrought"    "groupSwarnaControl"
[5] "groupSwarnaDrought"

Since we're interested in identifying gene/s in NIL that confers drought tolerance at reproductive growth stage of rice. I set-up the result using contrast. The following is an example:

res<-results(dds, contrast=c("group","NILDrought","SwarnaDrought"))

res

mcols(res, use.names=TRUE)

summary(res)

I got this nice table:

out of 49722 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 2538, 5.1% 
LFC < 0 (down)   : 2292, 4.6% 
outliers [1]     : 892, 1.8% 
low counts [2]   : 10260, 21% 
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

One thing I noticed is that even though i did not specified the log fold change, it does calculate for all the transcripts.

Then, I did multiple testing as follows:

resLFC1 <- results(dds, contrast=c("group","NILDrought","SwarnaDrought"), lfcThreshold = 1, alpha = 0.05)
table(resLFC1$padj < 0.05)
summary(resLFC1)
sum( resLFC1$padj < 0.05, na.rm=TRUE )

 

And the results:

out of 49722 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 166, 0.33% 
LFC < 0 (down)   : 99, 0.2% 
outliers [1]     : 892, 1.8% 
low counts [2]   : 4819, 9.7% 
(mean count < 0)

I extracted all the genes that were differentially expressed based from this result which brings me to my question:

First, i observed that the log fold change for the default (res) generation and from the multiple testing (resLFC1) is the same as well as the base means and the lfcSE. But the padj is TOTALLY DIFFERENT!?  I may totally not get it but why the padj value is different from res and resLFC1 after the addition of lfcThreshold=1 and lowering the padj to 0.05?

This is an example from res table:

FeatureID baseMean log2FoldChange lfcSE stat pvalue padj
LOC_Os01g02700.1 300.8324 2.433062 0.384283 6.331434 2.43E-10 2.52E-08

This same gene is identified in resLFC1;

FeatureID baseMean log2FoldChange lfcSE stat pvalue padj
LOC_Os01g02700.1 300.8324 2.433062 0.384283 3.729184 0.000192 0.023987

I know DESeq2 uses the so-called Benjamini-Hochberg (BH) adjustment that calculates for each gene and adjusted p-value but please and kindly enlighten me what's happening during the addition of lfcThreshold and lowering the padj to 0.05 in the DESeq2 package?  Why are the padj values differs but the log2FoldChange remains the same?

 

Which then brings me to my other question. Since I used the extracted genes from res to generate Venn diagram for our four defined contrasts using an example subsetting below;

#Set up thresholds#
Upregulated = 1
Downregulated = -1

 

NILD_NILC_upregulated <- subset(NILD_NILC, (log2FoldChange > Upregulated & padj < 0.05))

I'm getting a different number of upregulated or downregulated genes based on these parameters for Venn diagram than that generated from resLFC1. Well, that is because the padj is different from res and resLFC1. 

Should I still use those genes generated from res for generation of Venn diagram?

Please and kindly advise. Any comments on the workflow that I'm following is much appreciated.

 

Thanks and best regards.

Asher

 

R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readr_1.1.0                tximport_1.2.0             genefilter_1.56.0         
 [4] pheatmap_1.0.8             RColorBrewer_1.1-2         ggplot2_2.2.1             
 [7] gplots_3.0.1               DESeq2_1.14.1              SummarizedExperiment_1.4.0
[10] Biobase_2.34.0             GenomicRanges_1.26.4       GenomeInfoDb_1.10.3       
[13] IRanges_2.8.1              S4Vectors_0.12.1           BiocGenerics_0.20.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10         locfit_1.5-9.1       lattice_0.20-34      gtools_3.5.0        
 [5] assertthat_0.1       digest_0.6.12        R6_2.2.0             plyr_1.8.4          
 [9] backports_1.0.5      acepack_1.4.1        RSQLite_1.1-2        zlibbioc_1.20.0     
[13] lazyeval_0.2.0       data.table_1.10.4    annotate_1.52.1      gdata_2.17.0        
[17] rpart_4.1-10         Matrix_1.2-7.1       checkmate_1.8.2      splines_3.3.2       
[21] BiocParallel_1.8.1   geneplotter_1.52.0   stringr_1.2.0        foreign_0.8-67      
[25] htmlwidgets_0.8      RCurl_1.95-4.8       munsell_0.4.3        base64enc_0.1-3     
[29] htmltools_0.3.5      nnet_7.3-12          tibble_1.2           gridExtra_2.2.1     
[33] htmlTable_1.9        Hmisc_4.0-2          XML_3.98-1.5         bitops_1.0-6        
[37] grid_3.3.2           xtable_1.8-2         gtable_0.2.0         DBI_0.6             
[41] magrittr_1.5         scales_0.4.1         KernSmooth_2.23-15   stringi_1.1.3       
[45] XVector_0.14.1       latticeExtra_0.6-28  Formula_1.2-1        tools_3.3.2         
[49] hms_0.3              survival_2.39-5      AnnotationDbi_1.36.2 colorspace_1.3-2    
[53] cluster_2.0.5        caTools_1.17.1       memoise_1.0.0        knitr_1.15.1       

 

 

 

 

 

 

deseq2 padj multiple comparisons differential gene expression • 1.8k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 2 hours ago
United States

In setting an lfcThreshold, you performed a different statistical test, hence the p-values and adjusted p-values are different.

Please read over the section of the DESeq2 paper, "Hypothesis tests with thresholds on effect size":

https://dx.doi.org/10.1186/s13059-014-0550-8

ADD COMMENT

Login before adding your answer.

Traffic: 871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6