This is probably a conceptual question. Or, I'm doing something completely wrong, hopefully.
After doing the DEA I get the results, and want to plot the heatmap to see how the gene expression in samples look like. To plot that I use the data from counts(dds, normalized=T)
I understand that the counts(dds, normalized=T)
will not give me the count values that were used in the calculation of the log2FoldChange directly. In other words, I cannot just take the mean of counts(dds, normalized=T)
for the samples in each group, divide them and get the ratio of 2^log2FoldChange for that gene - I see that the model is more complicated than that. I read the vignette section "Theory behind DESeq2", even though I cannot claim that I understood the theory completely, I think I got the gist of it, maybe :) However, for instance for gene FGD2 below, the log2FoldChange value is 5.36, while the mean values of each group taken from counts(dds, normalized=T)
are 2107 and 2259. Is this normal? What might be the explanation behind these numbers? I sorted the results by padj and here giving the values for the first 6 genes in there.
It would be great if there was a way to get the count matrix that was used to calculate the log2FoldChanges in DESeq2, so that we could plot the heatmap with those values and see what's happening in the samples much better. Is that possible? Would this code give that? limma::removeBatchEffect(counts(ddsfiltered, normalized=T), batch = ddsfiltered@colData$batch, batch2=ddsfiltered@colData$sex)
Thank you all!
countData <- read.csv("test.csv", header = TRUE)
metadata <- read.csv("metadata-test.csv",header = TRUE)
colData <- as.data.frame(cbind(colnames(countData), metadata$batch, metadata$condition, metadata$techRepl, metadata$condName, metadata$sex))
colnames(colData) <- c("sampleID", "batch", "condition", "techRepl", "condName", "sex")
dds <- DESeqDataSetFromMatrix(countData = countDataOnly, colData = colData, design = ~batch+sex+condition)
dds<- collapseReplicates(dds, colData$techRepl, renameCols = TRUE)
keep <- rowSums(counts(dds) >= 10) >= 3
ddsfiltered<- dds[keep,]
ddsfiltered <- DESeq(ddsfiltered)
res <- results(ddsfiltered)
> colData
sampleID batch condition techRepl condName sex
1 S104.1 1 Drug S104 S104-1 M
2 S104.2 2 Drug S104 S104-2 M
3 S23 1 Drug S23 S23 F
4 S77 2 Drug S77 S77 M
5 S104FUP 2 FUP S104FUP S104FUP M
6 S23FUP.1 2 FUP S23FUP S23FUP-1 F
7 S23FUP.2 2 FUP S23FUP S23FUP-2 F
8 S77FUP 2 FUP S77FUP S77FUP M
> head(res[order(res$padj),])
log2 fold change (MLE): condition FUP vs Drug
Wald test p-value: condition FUP vs Drug
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
FGD2 2183.731 5.36262 0.543943 9.85879 6.28028e-23 9.46061e-19
MED14 1063.100 5.35099 0.584255 9.15865 5.25493e-20 3.95802e-16
PCED1B 1800.534 7.68454 0.880384 8.72862 2.57799e-18 1.29450e-14
DOT1L 939.003 6.49484 0.815403 7.96519 1.64976e-15 6.21300e-12
PLA2G4B 834.465 4.94894 0.652518 7.58436 3.34128e-14 1.00666e-10
MAP4 425.414 6.56181 0.923084 7.10858 1.17243e-12 2.94358e-09
> DrugMeans <- rowMeans2(counts(ddsfiltered[,c("S104","S23","S77")],normalized=T))
> DrugMeans[c("FGD2","MED14","PCED1B","DOT1L","PLA2G4B","MAP4")]
FGD2 MED14 PCED1B DOT1L PLA2G4B MAP4
2108.0410 896.7003 1541.0021 826.1538 559.5478 192.3376
> FUPMeans <- rowMeans2(counts(ddsfiltered[,c("S104FUP","S23FUP","S77FUP")],normalized=T))
> FUPMeans[c("FGD2","MED14","PCED1B","DOT1L","PLA2G4B","MAP4")]
FGD2 MED14 PCED1B DOT1L PLA2G4B MAP4
2259.420 1229.500 2060.066 1051.851 1109.382 658.490
> limma::removeBatchEffect(counts(ddsfiltered, normalized=T), batch = ddsfiltered@colData$batch, batch2=ddsfiltered@colData$sex)[c("FGD2","MED14","PCED1B","DOT1L","PLA2G4B","MAP4"),]
S104 S104FUP S23 S23FUP S77 S77FUP
FGD2 2641.2613 3472.4680 2264.2549 2641.2613 863.47672 2833.8262
MED14 1262.0154 1479.7232 1115.8040 1262.0154 405.62533 1608.2749
PCED1B 1619.7350 3157.6165 2456.0578 1619.7350 721.48637 2652.7477
DOT1L 918.8023 1354.8178 1217.9032 918.8023 399.01627 1600.7746
PLA2G4B 876.4230 1190.8063 924.7999 876.4230 229.76985 1305.4466
MAP4 515.1502 603.3985 275.1398 515.1502 -82.97394 545.0052
> counts(ddsfiltered[c("FGD2","MED14","PCED1B","DOT1L","PLA2G4B","MAP4"),])
S104 S104FUP S23 S23FUP S77 S77FUP
FGD2 7691 1770 2316 4499 36 1342
MED14 2812 732 1126 2923 18 815
PCED1B 3478 1624 2389 3869 6 1285
DOT1L 2073 642 1211 2105 7 803
PLA2G4B 1261 659 847 2650 20 733
MAP4 926 461 139 1392 5 421
> counts(ddsfiltered[c("FGD2","MED14","PCED1B","DOT1L","PLA2G4B","MAP4"),], normalized=T)
S104 S104FUP S23 S23FUP S77 S77FUP
FGD2 3196.3914 2665.3859 3071.3369 2086.1311 56.394653 2026.7442
MED14 1168.6715 1102.2952 1493.2320 1355.3592 28.197327 1230.8469
PCED1B 1445.4621 2445.5292 3168.1450 1794.0078 9.399109 1940.6604
DOT1L 861.5420 966.7671 1605.9538 976.0627 10.965627 1212.7240
PLA2G4B 524.0735 992.3669 1123.2394 1228.7725 31.330363 1107.0071
MAP4 384.8470 694.2050 184.3333 645.4533 7.832591 635.8117
> head(assays(ddsfiltered)[["mu"]][c("FGD2","MED14","PCED1B","DOT1L","PLA2G4B","MAP4"),])
S104 S104FUP S23 S23FUP S77 S77FUP
FGD2 7864.3376 1540.8568 2265.8625 4600.517 36.000063 1536.3943
MED14 2889.9627 764.2425 1096.2055 3004.027 18.000037 762.0291
PCED1B 4747.3387 1284.0421 1884.0606 5280.766 6.000041 1280.3234
DOT1L 2597.5432 656.7338 1006.7262 2637.612 7.000035 654.8319
PLA2G4B 1515.6562 642.6250 724.4749 3183.122 20.000042 640.7639
MAP4 770.6558 491.3870 175.4154 1159.055 5.000030 489.9639
> sessionInfo( )
R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.7.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] RColorBrewer_1.1-3 pheatmap_1.0.12 ggprism_1.0.4
[4] DESeq2_1.42.0 SummarizedExperiment_1.32.0 MatrixGenerics_1.14.0
[7] matrixStats_1.2.0 GenomicRanges_1.54.1 GenomeInfoDb_1.38.6
[10] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[13] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[16] tidyr_1.3.1 tibble_3.2.1 tidyverse_2.0.0
[19] org.Hs.eg.db_3.18.0 AnnotationDbi_1.64.1 IRanges_2.36.0
[22] S4Vectors_0.40.2 Biobase_2.62.0 BiocGenerics_0.48.1
[25] limma_3.58.1 ggfortify_0.4.16 ggplot2_3.4.4
loaded via a namespace (and not attached):
[1] tidyselect_1.2.0 farver_2.1.1 blob_1.2.4 Biostrings_2.70.2
[5] bitops_1.0-7 fastmap_1.1.1 RCurl_1.98-1.14 digest_0.6.34
[9] timechange_0.3.0 lifecycle_1.0.4 statmod_1.5.0 KEGGREST_1.42.0
[13] RSQLite_2.3.5 magrittr_2.0.3 compiler_4.3.2 rlang_1.1.3
[17] tools_4.3.2 utf8_1.2.4 yaml_2.3.8 knitr_1.45
[21] labeling_0.4.3 S4Arrays_1.2.0 bit_4.0.5 DelayedArray_0.28.0
[25] abind_1.4-5 BiocParallel_1.36.0 withr_3.0.0 grid_4.3.2
[29] fansi_1.0.6 colorspace_2.1-0 scales_1.3.0 cli_3.6.2
[33] rmarkdown_2.25 crayon_1.5.2 generics_0.1.3 rstudioapi_0.15.0
[37] tzdb_0.4.0 httr_1.4.7 DBI_1.2.2 cachem_1.0.8
[41] zlibbioc_1.48.0 parallel_4.3.2 BiocManager_1.30.22 XVector_0.42.0
[45] vctrs_0.6.5 Matrix_1.6-5 hms_1.1.3 bit64_4.0.5
[49] locfit_1.5-9.8 glue_1.7.0 codetools_0.2-19 stringi_1.8.3
[53] gtable_0.3.4 munsell_0.5.0 pillar_1.9.0 htmltools_0.5.7
[57] GenomeInfoDbData_1.2.11 R6_2.5.1 evaluate_0.23 lattice_0.22-5
[61] png_0.1-8 memoise_2.0.1 Rcpp_1.0.12 gridExtra_2.3
[65] SparseArray_1.2.4 xfun_0.42 pkgconfig_2.0.3
Thank you for the explanation, probably my brain just needed to accumulate enough info to understand what you are saying, but I finally got it :)
"transforming the data, regressing out covariates with linear model, and then computing LFC from the residualized log counts." -> this is exactly what I was thinking. In case someone else comes here with a similar question, reading about GLM helped to finally get what the modeling part is doing.
But just to be clear, this is not the same as regressing out batch from the VST (approx log transformed):
See the vignette, we apply remove batch effect to VST data, not scaled counts.
counts(dds, normalized=TRUE) is just the counts, but with columns divided by size factors.
VST is that plus a transform (log like).