Question

different stat and p values when filtering or not by lfc in DESeq2

0

Entering edit mode

eve.olszanowski • 0

@dd6eb667

Last seen 14 months ago

Argentina

Hello! I calculated DGE with or without filtering by LFC. The thing that concerns me is that, in the results table, even though the baseMean, the log2FoldChange, the lfcSE, the stat value is different, and because of that the pvalue and the padj are also different.

Here is an example of one gene. The first row is the result of results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05) and the second row is results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05, lfcThreshold = 1)

ENSID            SYMBOL     baseMean         log2FoldChange         lfcSE               stat                  pvalue                 padj
ENSG00000103257 SLC7A5  35252.0316507722    1.14628745148079    0.0571527021487077  20.0565749017119    1.76854266047002E-89    5.25586898452226E-87
ENSG00000103257 SLC7A5  35252.0316507722    1.14628745148079    0.0571527021487077  2.55958941539029    0.0104795893216153         1


sessionInfo( )
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=es_AR.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=es_AR.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=es_AR.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] venn_1.11                   forcats_0.5.2               stringr_1.4.1               dplyr_1.0.9                
 [5] purrr_0.3.4                 readr_2.1.2                 tidyr_1.2.0                 tibble_3.1.8               
 [9] ggplot2_3.3.6               tidyverse_1.3.2             DESeq2_1.36.0               SummarizedExperiment_1.26.1
[13] Biobase_2.56.0              MatrixGenerics_1.8.1        matrixStats_0.62.0          GenomicRanges_1.48.0       
[17] GenomeInfoDb_1.32.3         IRanges_2.30.1              S4Vectors_0.34.0            BiocGenerics_0.42.0        

loaded via a namespace (and not attached):
 [1] bitops_1.0-7           fs_1.5.2               lubridate_1.8.0        bit64_4.0.5            RColorBrewer_1.1-3    
 [6] httr_1.4.4             tools_4.2.1            backports_1.4.1        utf8_1.2.2             R6_2.5.1              
[11] DBI_1.1.3              colorspace_2.0-3       withr_2.5.0            tidyselect_1.1.2       bit_4.0.4             
[16] compiler_4.2.1         cli_3.3.0              rvest_1.0.3            xml2_1.3.3             DelayedArray_0.22.0   
[21] labeling_0.4.2         scales_1.2.1           genefilter_1.78.0      digest_0.6.29          XVector_0.36.0        
[26] pkgconfig_2.0.3        dbplyr_2.2.1           fastmap_1.1.0          limma_3.52.2           rlang_1.0.4           
[31] readxl_1.4.1           rstudioapi_0.14        RSQLite_2.2.16         farver_2.1.1           generics_0.1.3        
[36] jsonlite_1.8.0         BiocParallel_1.30.3    googlesheets4_1.0.1    RCurl_1.98-1.8         magrittr_2.0.3        
[41] GenomeInfoDbData_1.2.8 Matrix_1.4-1           Rcpp_1.0.9             munsell_0.5.0          fansi_1.0.3           
[46] lifecycle_1.0.1        stringi_1.7.8          zlibbioc_1.42.0        grid_4.2.1             blob_1.2.3            
[51] parallel_4.2.1         crayon_1.5.1           lattice_0.20-45        Biostrings_2.64.1      haven_2.5.1           
[56] splines_4.2.1          annotate_1.74.0        hms_1.1.2              KEGGREST_1.36.3        locfit_1.5-9.6        
[61] pillar_1.8.1           admisc_0.29            geneplotter_1.74.0     codetools_0.2-18       reprex_2.0.2          
[66] XML_3.99-0.10          glue_1.6.2             renv_0.15.5            modelr_0.1.9           png_0.1-7             
[71] vctrs_0.4.1            tzdb_0.3.0             cellranger_1.1.0       gtable_0.3.0           assertthat_0.2.1      
[76] cachem_1.0.6           xtable_1.8-4           broom_1.0.0            survival_3.4-0         googledrive_2.0.0     
[81] gargle_1.2.0           AnnotationDbi_1.58.0   memoise_2.0.1          ellipsis_0.3.2

DESeq2 wald stat • 1.7k views

ADD COMMENT • link updated 3.4 years ago by Peter Hickey ▴ 760 • written 3.4 years ago by eve.olszanowski • 0

score 1 · Answer 1 · 2022-09-06

1

Entering edit mode

Peter Hickey ▴ 760

@petehaitch

Last seen 5 weeks ago

WEHI, Melbourne, Australia

You're testing 2 different null hypotheses, which is why the statistics are different:

results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05) is testing a null hypothesis of logFC = 0 vs. an alternative hypothesis of logFC != 0.
results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05, lfcThreshold = 1) is testing a null hypothesis of |logFC| <= 1 vs. an alternative hypothesis of |logFC| > 1

See also the altHypothesis argument of results() for other tweaks that can be made to the definition of the alternative hypothesis.

ADD COMMENT • link 3.4 years ago Peter Hickey ▴ 760

0

Entering edit mode

Thank you for the answer! The thing is that, when I use res <- results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05) and then filter the data frame by |logFC| > 1 I get 223 upregulated genes and 105 downregulated. But when I use res_LFC <- results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05, lfcThreshold=1) I get 67 upregulated genes and 11 downregulated. So, which is the correct approach?

> res <- results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05)
> summary(res)

out of 30576 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 2880, 9.4%
LFC < 0 (down)     : 2314, 7.6%
outliers [1]       : 0, 0%
low counts [2]     : 13042, 43%
(mean count < 37)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> res=data.frame(res)
> res_up <- res[which(res$log2FoldChange>1 & res$padj<0.05)  ,]
> nrow(res_up)
[1] 223
> res_down <- res[which(res$log2FoldChange< -1  & res$padj<0.05) ,]
> nrow(res_down)
[1] 105
> 
> res_LFC <- results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05, lfcThreshold=1)
> summary(res_LFC)

out of 30576 with nonzero total read count
adjusted p-value < 0.05
LFC > 1.00 (up)    : 67, 0.22%
LFC < -1.00 (down) : 11, 0.036%
outliers [1]       : 0, 0%
low counts [2]     : 2372, 7.8%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

ADD REPLY • link 3.4 years ago eve.olszanowski • 0

0

Entering edit mode

If you want to "identify genes that have a statistically significant |logFC| > 1" then the correct approach is results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05, lfcThreshold=1).

It is incorrect to apply a logFC filter after applying an FDR cutoff. This is discussed, for example, in https://f1000research.com/articles/5-1438:

A commonly used approach is to apply FDR and logFC cutoffs simultaneously. However this tends to favor lowly expressed genes, and also fails to control the FDR correctly. A better and more rigorous approach is to modify the statistical test so as to detect expression changes greater than a specified threshold. In edgeR, this can be done using the glmTreat function. This function is analogous to the TREAT method for microarrays but is adapted to the NB framework.

The TREAT method is described in https://pubmed.ncbi.nlm.nih.gov/19176553/ and I believe DESeq2 implements a similar procedure when you supply a lfcThreshold > 0 (I'm more familiar with edgeR than DESeq2).

ADD REPLY • link 3.4 years ago Peter Hickey ▴ 760