Hello! I calculated DGE with or without filtering by LFC. The thing that concerns me is that, in the results table, even though the baseMean, the log2FoldChange, the lfcSE, the stat value is different, and because of that the pvalue and the padj are also different.
Here is an example of one gene.
The first row is the result of results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05)
and the second row is results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05, lfcThreshold = 1)
ENSID SYMBOL baseMean log2FoldChange lfcSE stat pvalue padj
ENSG00000103257 SLC7A5 35252.0316507722 1.14628745148079 0.0571527021487077 20.0565749017119 1.76854266047002E-89 5.25586898452226E-87
ENSG00000103257 SLC7A5 35252.0316507722 1.14628745148079 0.0571527021487077 2.55958941539029 0.0104795893216153 1
sessionInfo( )
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=es_AR.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=es_AR.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=es_AR.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods base
other attached packages:
[1] venn_1.11 forcats_0.5.2 stringr_1.4.1 dplyr_1.0.9
[5] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.8
[9] ggplot2_3.3.6 tidyverse_1.3.2 DESeq2_1.36.0 SummarizedExperiment_1.26.1
[13] Biobase_2.56.0 MatrixGenerics_1.8.1 matrixStats_0.62.0 GenomicRanges_1.48.0
[17] GenomeInfoDb_1.32.3 IRanges_2.30.1 S4Vectors_0.34.0 BiocGenerics_0.42.0
loaded via a namespace (and not attached):
[1] bitops_1.0-7 fs_1.5.2 lubridate_1.8.0 bit64_4.0.5 RColorBrewer_1.1-3
[6] httr_1.4.4 tools_4.2.1 backports_1.4.1 utf8_1.2.2 R6_2.5.1
[11] DBI_1.1.3 colorspace_2.0-3 withr_2.5.0 tidyselect_1.1.2 bit_4.0.4
[16] compiler_4.2.1 cli_3.3.0 rvest_1.0.3 xml2_1.3.3 DelayedArray_0.22.0
[21] labeling_0.4.2 scales_1.2.1 genefilter_1.78.0 digest_0.6.29 XVector_0.36.0
[26] pkgconfig_2.0.3 dbplyr_2.2.1 fastmap_1.1.0 limma_3.52.2 rlang_1.0.4
[31] readxl_1.4.1 rstudioapi_0.14 RSQLite_2.2.16 farver_2.1.1 generics_0.1.3
[36] jsonlite_1.8.0 BiocParallel_1.30.3 googlesheets4_1.0.1 RCurl_1.98-1.8 magrittr_2.0.3
[41] GenomeInfoDbData_1.2.8 Matrix_1.4-1 Rcpp_1.0.9 munsell_0.5.0 fansi_1.0.3
[46] lifecycle_1.0.1 stringi_1.7.8 zlibbioc_1.42.0 grid_4.2.1 blob_1.2.3
[51] parallel_4.2.1 crayon_1.5.1 lattice_0.20-45 Biostrings_2.64.1 haven_2.5.1
[56] splines_4.2.1 annotate_1.74.0 hms_1.1.2 KEGGREST_1.36.3 locfit_1.5-9.6
[61] pillar_1.8.1 admisc_0.29 geneplotter_1.74.0 codetools_0.2-18 reprex_2.0.2
[66] XML_3.99-0.10 glue_1.6.2 renv_0.15.5 modelr_0.1.9 png_0.1-7
[71] vctrs_0.4.1 tzdb_0.3.0 cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1
[76] cachem_1.0.6 xtable_1.8-4 broom_1.0.0 survival_3.4-0 googledrive_2.0.0
[81] gargle_1.2.0 AnnotationDbi_1.58.0 memoise_2.0.1 ellipsis_0.3.2
Thank you for the answer! The thing is that, when I use
res <- results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05)
and then filter the data frame by |logFC| > 1 I get 223 upregulated genes and 105 downregulated. But when I useres_LFC <- results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05, lfcThreshold=1)
I get 67 upregulated genes and 11 downregulated. So, which is the correct approach?If you want to "identify genes that have a statistically significant |logFC| > 1" then the correct approach is
results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05, lfcThreshold=1)
.It is incorrect to apply a logFC filter after applying an FDR cutoff. This is discussed, for example, in https://f1000research.com/articles/5-1438:
The TREAT method is described in https://pubmed.ncbi.nlm.nih.gov/19176553/ and I believe DESeq2 implements a similar procedure when you supply a
lfcThreshold > 0
(I'm more familiar with edgeR than DESeq2).