Hello everyone. I have a question regarding how to interpret the value of the Fold Changes when a batch effect is present. I am running DEseq2 including the batch in the design matrix, as recommended. When looking at the log2FC of the results table I reckon this values do not have any normalization regarding the batch effect (correct me if I am wrong). So, when I apply a lfc threshold the amount of differentially expressed genes decreases significantly because the lfc values tend to be low. So, how can I filter the differentially expressed genes taking into account lfc values that are corrected by batch effect?
> dds <- DESeqDataSetFromMatrix(
+ countData = countdata,
+ colData = coldata,
+ design = ~ replicado + condition)
> dds
class: DESeqDataSet
dim: 61852 8
metadata(1): version
assays(1): counts
rownames(61852): ENSG00000000003.15 ENSG00000000005.6 ... ENSG00000290165.1 ENSG00000290166.1
rowData names(0):
colnames(8): UNT_1 UNT_2 ... E2GW_1 E2GW_2
colData names(2): condition replicado
> keep <- rowSums(counts(dds)) >= 10
> dds <- dds[keep,]
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> comp <- results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05)
> summary(comp)
out of 27025 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 3078, 11%
LFC < 0 (down) : 2316, 8.6%
outliers [1] : 0, 0%
low counts [2] : 9955, 37%
(mean count < 25)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> comp
log2 fold change (MLE): condition E2 vs UNT
Wald test p-value: condition E2 vs UNT
DataFrame with 27025 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003.15 1952.07834 -0.134989 0.0697229 -1.936082 0.05285771 0.1343480
ENSG00000000419.14 6064.33523 0.113285 0.0522518 2.168069 0.03015345 0.0850842
ENSG00000000457.14 1331.59221 -0.216712 0.0796866 -2.719553 0.00653702 0.0234821
ENSG00000000460.17 1517.08053 -0.202480 0.0749210 -2.702582 0.00688032 0.0245808
ENSG00000000938.13 1.14802 1.367265 2.5395499 0.538389 0.59030883 NA
... ... ... ... ... ... ...
ENSG00000290124.1 12.40922 -0.0436103 0.806648 -0.0540636 0.956884 NA
ENSG00000290126.1 38.47627 -0.1570125 0.459964 -0.3413581 0.732834 0.852820
ENSG00000290127.1 3.26398 -0.8855532 1.439226 -0.6152984 0.538358 NA
ENSG00000290146.1 73.48168 0.0808059 0.318618 0.2536133 0.799794 0.894583
ENSG00000290165.1 2.78590 1.1607220 1.568872 0.7398451 0.459394 NA
> comp_lfc <- results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05, lfcThreshold=1)
> summary(comp_lfc)
out of 27025 with nonzero total read count
adjusted p-value < 0.05
LFC > 1.00 (up) : 70, 0.26%
LFC < -1.00 (down) : 8, 0.03%
outliers [1] : 0, 0%
low counts [2] : 3144, 12%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> comp_lfc
log2 fold change (MLE): condition E2 vs UNT
Wald test p-value: condition E2 vs UNT
DataFrame with 27025 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003.15 1952.07834 -0.134989 0.0697229 0.000000 1.000000 1
ENSG00000000419.14 6064.33523 0.113285 0.0522518 0.000000 1.000000 1
ENSG00000000457.14 1331.59221 -0.216712 0.0796866 0.000000 1.000000 1
ENSG00000000460.17 1517.08053 -0.202480 0.0749210 0.000000 1.000000 1
ENSG00000000938.13 1.14802 1.367265 2.5395499 0.144618 0.885013 NA
... ... ... ... ... ... ...
ENSG00000290124.1 12.40922 -0.0436103 0.806648 0.000000 1.000000 1
ENSG00000290126.1 38.47627 -0.1570125 0.459964 0.000000 1.000000 1
ENSG00000290127.1 3.26398 -0.8855532 1.439226 0.000000 1.000000 1
ENSG00000290146.1 73.48168 0.0808059 0.318618 0.000000 1.000000 1
ENSG00000290165.1 2.78590 1.1607220 1.568872 0.102444 0.918404 NA
sessionInfo( )
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=es_AR.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=es_AR.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=es_AR.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods base
other attached packages:
[1] forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10 purrr_0.3.4
[5] readr_2.1.2 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.3.6
[9] tidyverse_1.3.2 DESeq2_1.36.0 SummarizedExperiment_1.26.1 Biobase_2.56.0
[13] MatrixGenerics_1.8.1 matrixStats_0.62.0 GenomicRanges_1.48.0 GenomeInfoDb_1.32.4
[17] IRanges_2.30.1 S4Vectors_0.34.0 BiocGenerics_0.42.0
loaded via a namespace (and not attached):
[1] bitops_1.0-7 fs_1.5.2 lubridate_1.8.0 bit64_4.0.5 RColorBrewer_1.1-3 httr_1.4.4
[7] tools_4.2.1 backports_1.4.1 utf8_1.2.2 R6_2.5.1 DBI_1.1.3 colorspace_2.0-3
[13] withr_2.5.0 tidyselect_1.1.2 bit_4.0.4 compiler_4.2.1 cli_3.4.0 rvest_1.0.3
[19] xml2_1.3.3 DelayedArray_0.22.0 labeling_0.4.2 scales_1.2.1 genefilter_1.78.0 digest_0.6.29
[25] XVector_0.36.0 pkgconfig_2.0.3 dbplyr_2.2.1 fastmap_1.1.0 limma_3.52.3 rlang_1.0.5
[31] readxl_1.4.1 rstudioapi_0.14 RSQLite_2.2.17 farver_2.1.1 generics_0.1.3 jsonlite_1.8.0
[37] BiocParallel_1.30.3 googlesheets4_1.0.1 RCurl_1.98-1.8 magrittr_2.0.3 GenomeInfoDbData_1.2.8 Matrix_1.5-0
[43] Rcpp_1.0.9 munsell_0.5.0 fansi_1.0.3 lifecycle_1.0.2 stringi_1.7.8 zlibbioc_1.42.0
[49] grid_4.2.1 blob_1.2.3 parallel_4.2.1 crayon_1.5.1 lattice_0.20-45 Biostrings_2.64.1
[55] haven_2.5.1 splines_4.2.1 annotate_1.74.0 hms_1.1.2 KEGGREST_1.36.3 locfit_1.5-9.6
[61] pillar_1.8.1 geneplotter_1.74.0 codetools_0.2-18 reprex_2.0.2 XML_3.99-0.10 glue_1.6.2
[67] renv_0.15.5 modelr_0.1.9 png_0.1-7 vctrs_0.4.1 tzdb_0.3.0 cellranger_1.1.0
[73] gtable_0.3.1 assertthat_0.2.1 cachem_1.0.6 xtable_1.8-4 broom_1.0.1 survival_3.4-0
[79] googledrive_2.0.0 gargle_1.2.1 AnnotationDbi_1.58.0 memoise_2.0.1 ellipsis_0.3.2
How does this correction is calculated?
It seems to me that when I filter for lfc the decrease is drastic and I get almost no genes to work within the downstream analyses. Am I missing something?
Here's a simple example. Say we have just one gene, measured in treated and untreated subjects in two batches.
In the above model, the Estimate for treatUntreated is the difference for treated and untreated after controlling for batch. And the batch2 coefficient is the average difference between batch2 and batch1. Algebraically, this value is subtracted from all the batch2 subjects, which corrects all of those subjects for the fact that they come from a different batch. We can do this all by hand as well.
Does that make sense?
As to why you get no genes when you filter on lfc, it's because you are filtering on lfc. That's really only something you should do if there are 'too many' genes and you want to cut the list down.
It makes perfect sense. Thanks for the example, it was helpful. But then my question is, how is the log2FC calculated? Because I thought it was independent of the design. But evidently it is not. More specifically, how the calculation of the log2FC takes into account the batch (or any other covariate) effect?
I just showed you exactly how it works! You said it makes perfect sense and then asked me how it works? I am confused.
Also, how would calculation of the lfc be independent of the design? The design specifies what comparisons you are making, and by definition then what and how the lfc is calculated.
So, are you saying that DEseq2 normalize counts doing that subtraction you mentioned in the d.f$batchadj column, and then calculated the lfc with that normalized values? Sorry if it is repetitive, but I am not finding that details in the DEseq2 manual, and it is very important for the way I am going to analyze my data from now on.
Kind of, but in a GLM it's not just subtracting, because it accounts for the unique variability of count data.
But it's worthwhile to consider James's example as an analogy to what happens inside a GLM, while acknowledging it's not just subtraction.
If you want to see how LFC is calculated, it is
beta
here:https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8#Sec18
The section "Final estimate of logarithmic fold changes" describes the iterative algorithm to set beta.
Thank you Michael. I am reading it now.
The decision of filtering on lfc was to obtain a set of genes with more biological significance, aka the ones that change the most. But when I do that, the number of DGE reduces drastically as i already pointed out. Since I have only two replicates, maybe there is no need to do that filtering on lfc, because of this next phrase extracted from the paper.
"Consequently, with sufficient sample size, even genes with a very small but non-zero LFC will eventually be detected as differentially expressed. A change should therefore be of sufficient magnitude to be considered biologically significant. For small-scale experiments, statistical significance is often a much stricter requirement than biological significance, thereby relieving the researcher from the need to decide on a threshold for biological significance."
Correct me if I am wrong.
Sorry I think there is a mixup, I was posting to explain that this is not an accurate representation of the methods:
It doesn't subtract, it works within a GLM framework.
Ok, but there is some type of normalization of the lfc taking into account all the factors of the design, I assume. So the values in the results of a contrast are normalized, in my case, for batch effects.