Question

log2FoldChange value is way too different when compared with counts(dds)

0

Entering edit mode

HAK • 0

@3f60d891

Last seen 5 days ago

United States

This is probably a conceptual question. Or, I'm doing something completely wrong, hopefully.
After doing the DEA I get the results, and want to plot the heatmap to see how the gene expression in samples look like. To plot that I use the data from counts(dds, normalized=T)

I understand that the counts(dds, normalized=T) will not give me the count values that were used in the calculation of the log2FoldChange directly. In other words, I cannot just take the mean of counts(dds, normalized=T) for the samples in each group, divide them and get the ratio of 2^log2FoldChange for that gene - I see that the model is more complicated than that. I read the vignette section "Theory behind DESeq2", even though I cannot claim that I understood the theory completely, I think I got the gist of it, maybe :) However, for instance for gene FGD2 below, the log2FoldChange value is 5.36, while the mean values of each group taken from counts(dds, normalized=T) are 2107 and 2259. Is this normal? What might be the explanation behind these numbers? I sorted the results by padj and here giving the values for the first 6 genes in there.

It would be great if there was a way to get the count matrix that was used to calculate the log2FoldChanges in DESeq2, so that we could plot the heatmap with those values and see what's happening in the samples much better. Is that possible? Would this code give that? limma::removeBatchEffect(counts(ddsfiltered, normalized=T), batch = ddsfiltered@colData$batch, batch2=ddsfiltered@colData$sex)

Thank you all!

countData <- read.csv("test.csv", header = TRUE)
metadata <- read.csv("metadata-test.csv",header = TRUE)
colData <- as.data.frame(cbind(colnames(countData), metadata$batch, metadata$condition, metadata$techRepl, metadata$condName, metadata$sex)) 
colnames(colData) <- c("sampleID", "batch", "condition", "techRepl", "condName", "sex")

dds <- DESeqDataSetFromMatrix(countData = countDataOnly, colData = colData, design = ~batch+sex+condition)
dds<- collapseReplicates(dds, colData$techRepl, renameCols = TRUE)
keep <- rowSums(counts(dds) >= 10) >= 3
ddsfiltered<- dds[keep,] 

ddsfiltered <- DESeq(ddsfiltered)
res <- results(ddsfiltered)

> colData
  sampleID batch condition techRepl condName sex
1   S104.1     1      Drug     S104   S104-1   M
2   S104.2     2      Drug     S104   S104-2   M
3      S23     1      Drug      S23      S23   F
4      S77     2      Drug      S77      S77   M
5  S104FUP     2       FUP  S104FUP  S104FUP   M
6 S23FUP.1     2       FUP   S23FUP S23FUP-1   F
7 S23FUP.2     2       FUP   S23FUP S23FUP-2   F
8   S77FUP     2       FUP   S77FUP   S77FUP   M

> head(res[order(res$padj),])
log2 fold change (MLE): condition FUP vs Drug 
Wald test p-value: condition FUP vs Drug 
DataFrame with 6 rows and 6 columns
         baseMean log2FoldChange     lfcSE      stat      pvalue        padj
        <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
FGD2     2183.731        5.36262  0.543943   9.85879 6.28028e-23 9.46061e-19
MED14    1063.100        5.35099  0.584255   9.15865 5.25493e-20 3.95802e-16
PCED1B   1800.534        7.68454  0.880384   8.72862 2.57799e-18 1.29450e-14
DOT1L     939.003        6.49484  0.815403   7.96519 1.64976e-15 6.21300e-12
PLA2G4B   834.465        4.94894  0.652518   7.58436 3.34128e-14 1.00666e-10
MAP4      425.414        6.56181  0.923084   7.10858 1.17243e-12 2.94358e-09

> DrugMeans <- rowMeans2(counts(ddsfiltered[,c("S104","S23","S77")],normalized=T))
> DrugMeans[c("FGD2","MED14","PCED1B","DOT1L","PLA2G4B","MAP4")]
     FGD2     MED14    PCED1B     DOT1L   PLA2G4B      MAP4 
2108.0410  896.7003 1541.0021  826.1538  559.5478  192.3376 
> FUPMeans <- rowMeans2(counts(ddsfiltered[,c("S104FUP","S23FUP","S77FUP")],normalized=T))
> FUPMeans[c("FGD2","MED14","PCED1B","DOT1L","PLA2G4B","MAP4")]
    FGD2    MED14   PCED1B    DOT1L  PLA2G4B     MAP4 
2259.420 1229.500 2060.066 1051.851 1109.382  658.490

> limma::removeBatchEffect(counts(ddsfiltered, normalized=T), batch = ddsfiltered@colData$batch, batch2=ddsfiltered@colData$sex)[c("FGD2","MED14","PCED1B","DOT1L","PLA2G4B","MAP4"),]
             S104   S104FUP       S23    S23FUP       S77    S77FUP
FGD2    2641.2613 3472.4680 2264.2549 2641.2613 863.47672 2833.8262
MED14   1262.0154 1479.7232 1115.8040 1262.0154 405.62533 1608.2749
PCED1B  1619.7350 3157.6165 2456.0578 1619.7350 721.48637 2652.7477
DOT1L    918.8023 1354.8178 1217.9032  918.8023 399.01627 1600.7746
PLA2G4B  876.4230 1190.8063  924.7999  876.4230 229.76985 1305.4466
MAP4     515.1502  603.3985  275.1398  515.1502 -82.97394  545.0052

> counts(ddsfiltered[c("FGD2","MED14","PCED1B","DOT1L","PLA2G4B","MAP4"),])
        S104 S104FUP  S23 S23FUP S77 S77FUP
FGD2    7691    1770 2316   4499  36   1342
MED14   2812     732 1126   2923  18    815
PCED1B  3478    1624 2389   3869   6   1285
DOT1L   2073     642 1211   2105   7    803
PLA2G4B 1261     659  847   2650  20    733
MAP4     926     461  139   1392   5    421

> counts(ddsfiltered[c("FGD2","MED14","PCED1B","DOT1L","PLA2G4B","MAP4"),], normalized=T)
             S104   S104FUP       S23    S23FUP       S77    S77FUP
FGD2    3196.3914 2665.3859 3071.3369 2086.1311 56.394653 2026.7442
MED14   1168.6715 1102.2952 1493.2320 1355.3592 28.197327 1230.8469
PCED1B  1445.4621 2445.5292 3168.1450 1794.0078  9.399109 1940.6604
DOT1L    861.5420  966.7671 1605.9538  976.0627 10.965627 1212.7240
PLA2G4B  524.0735  992.3669 1123.2394 1228.7725 31.330363 1107.0071
MAP4     384.8470  694.2050  184.3333  645.4533  7.832591  635.8117

> head(assays(ddsfiltered)[["mu"]][c("FGD2","MED14","PCED1B","DOT1L","PLA2G4B","MAP4"),])
             S104   S104FUP       S23   S23FUP       S77    S77FUP
FGD2    7864.3376 1540.8568 2265.8625 4600.517 36.000063 1536.3943
MED14   2889.9627  764.2425 1096.2055 3004.027 18.000037  762.0291
PCED1B  4747.3387 1284.0421 1884.0606 5280.766  6.000041 1280.3234
DOT1L   2597.5432  656.7338 1006.7262 2637.612  7.000035  654.8319
PLA2G4B 1515.6562  642.6250  724.4749 3183.122 20.000042  640.7639
MAP4     770.6558  491.3870  175.4154 1159.055  5.000030  489.9639

> sessionInfo( )
R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.7.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RColorBrewer_1.1-3          pheatmap_1.0.12             ggprism_1.0.4              
 [4] DESeq2_1.42.0               SummarizedExperiment_1.32.0 MatrixGenerics_1.14.0      
 [7] matrixStats_1.2.0           GenomicRanges_1.54.1        GenomeInfoDb_1.38.6        
[10] lubridate_1.9.3             forcats_1.0.0               stringr_1.5.1              
[13] dplyr_1.1.4                 purrr_1.0.2                 readr_2.1.5                
[16] tidyr_1.3.1                 tibble_3.2.1                tidyverse_2.0.0            
[19] org.Hs.eg.db_3.18.0         AnnotationDbi_1.64.1        IRanges_2.36.0             
[22] S4Vectors_0.40.2            Biobase_2.62.0              BiocGenerics_0.48.1        
[25] limma_3.58.1                ggfortify_0.4.16            ggplot2_3.4.4              

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0        farver_2.1.1            blob_1.2.4              Biostrings_2.70.2      
 [5] bitops_1.0-7            fastmap_1.1.1           RCurl_1.98-1.14         digest_0.6.34          
 [9] timechange_0.3.0        lifecycle_1.0.4         statmod_1.5.0           KEGGREST_1.42.0        
[13] RSQLite_2.3.5           magrittr_2.0.3          compiler_4.3.2          rlang_1.1.3            
[17] tools_4.3.2             utf8_1.2.4              yaml_2.3.8              knitr_1.45             
[21] labeling_0.4.3          S4Arrays_1.2.0          bit_4.0.5               DelayedArray_0.28.0    
[25] abind_1.4-5             BiocParallel_1.36.0     withr_3.0.0             grid_4.3.2             
[29] fansi_1.0.6             colorspace_2.1-0        scales_1.3.0            cli_3.6.2              
[33] rmarkdown_2.25          crayon_1.5.2            generics_0.1.3          rstudioapi_0.15.0      
[37] tzdb_0.4.0              httr_1.4.7              DBI_1.2.2               cachem_1.0.8           
[41] zlibbioc_1.48.0         parallel_4.3.2          BiocManager_1.30.22     XVector_0.42.0         
[45] vctrs_0.6.5             Matrix_1.6-5            hms_1.1.3               bit64_4.0.5            
[49] locfit_1.5-9.8          glue_1.7.0              codetools_0.2-19        stringi_1.8.3          
[53] gtable_0.3.4            munsell_0.5.0           pillar_1.9.0            htmltools_0.5.7        
[57] GenomeInfoDbData_1.2.11 R6_2.5.1                evaluate_0.23           lattice_0.22-5         
[61] png_0.1-8               memoise_2.0.1           Rcpp_1.0.12             gridExtra_2.3          
[65] SparseArray_1.2.4       xfun_0.42               pkgconfig_2.0.3

DESeq2 • 295 views

ADD COMMENT • link updated 13 days ago by Michael Love 42k • written 14 days ago by HAK • 0

score 0 · Answer 1 · 2024-05-01

0

Entering edit mode

Michael Love 42k

@mikelove

Last seen 18 minutes ago

United States

This has been asked before, but with a multi factor design, the LFC for coefficient X adjust for the others, in a way that is often hard to visualize or is anyway sometimes not intuitive. This is why controlling for important covariates is so important. You can even have signs go in opposite directions, as in Simpson's paradox.

What you get from a GLM is also not the same as transforming the data, regressing out covariates with linear model, and then computing LFC from the residualized log counts. If it was, then there would be not point for the GLM.

It looks like above you are regressing on the counts which would be wrong.

ADD COMMENT • link 14 days ago Michael Love 42k

0

Entering edit mode

Thank you for the explanation, probably my brain just needed to accumulate enough info to understand what you are saying, but I finally got it :)
"transforming the data, regressing out covariates with linear model, and then computing LFC from the residualized log counts." -> this is exactly what I was thinking. In case someone else comes here with a similar question, reading about GLM helped to finally get what the modeling part is doing.

ADD REPLY • link 13 days ago HAK • 0

0

Entering edit mode

But just to be clear, this is not the same as regressing out batch from the VST (approx log transformed):

limma::removeBatchEffect(counts(ddsfiltered, normalized=T), batch = ...)

See the vignette, we apply remove batch effect to VST data, not scaled counts.

counts(dds, normalized=TRUE) is just the counts, but with columns divided by size factors.

VST is that plus a transform (log like).

ADD REPLY • link 13 days ago Michael Love 42k