Fold Change calculation when Batch Effect is present in RNA-seq data (DEseq2)
1
0
Entering edit mode
@dd6eb667
Last seen 6 minutes ago
Argentina

Hello everyone. I have a question regarding how to interpret the value of the Fold Changes when a batch effect is present. I am running DEseq2 including the batch in the design matrix, as recommended. When looking at the log2FC of the results table I reckon this values do not have any normalization regarding the batch effect (correct me if I am wrong). So, when I apply a lfc threshold the amount of differentially expressed genes decreases significantly because the lfc values tend to be low. So, how can I filter the differentially expressed genes taking into account lfc values that are corrected by batch effect?

> dds <- DESeqDataSetFromMatrix( 
+   countData = countdata, 
+   colData = coldata, 
+   design = ~ replicado + condition) 
> dds
class: DESeqDataSet 
dim: 61852 8 
metadata(1): version
assays(1): counts
rownames(61852): ENSG00000000003.15 ENSG00000000005.6 ... ENSG00000290165.1 ENSG00000290166.1
rowData names(0):
colnames(8): UNT_1 UNT_2 ... E2GW_1 E2GW_2
colData names(2): condition replicado
> keep <- rowSums(counts(dds)) >= 10 
> dds <- dds[keep,]
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> comp <- results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05)
> summary(comp)

out of 27025 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 3078, 11%
LFC < 0 (down)     : 2316, 8.6%
outliers [1]       : 0, 0%
low counts [2]     : 9955, 37%
(mean count < 25)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> comp
log2 fold change (MLE): condition E2 vs UNT 
Wald test p-value: condition E2 vs UNT 
DataFrame with 27025 rows and 6 columns
                     baseMean log2FoldChange     lfcSE       stat     pvalue      padj
                    <numeric>      <numeric> <numeric>  <numeric>  <numeric> <numeric>
ENSG00000000003.15 1952.07834      -0.134989 0.0697229  -1.936082 0.05285771 0.1343480
ENSG00000000419.14 6064.33523       0.113285 0.0522518   2.168069 0.03015345 0.0850842
ENSG00000000457.14 1331.59221      -0.216712 0.0796866  -2.719553 0.00653702 0.0234821
ENSG00000000460.17 1517.08053      -0.202480 0.0749210  -2.702582 0.00688032 0.0245808
ENSG00000000938.13    1.14802       1.367265 2.5395499   0.538389 0.59030883        NA
...                       ...            ...       ...        ...        ...       ...
ENSG00000290124.1    12.40922     -0.0436103  0.806648 -0.0540636   0.956884        NA
ENSG00000290126.1    38.47627     -0.1570125  0.459964 -0.3413581   0.732834  0.852820
ENSG00000290127.1     3.26398     -0.8855532  1.439226 -0.6152984   0.538358        NA
ENSG00000290146.1    73.48168      0.0808059  0.318618  0.2536133   0.799794  0.894583
ENSG00000290165.1     2.78590      1.1607220  1.568872  0.7398451   0.459394        NA

> comp_lfc <- results(dds, contrast = c("condition", "E2", "UNT"), alpha=0.05, lfcThreshold=1)
> summary(comp_lfc)

out of 27025 with nonzero total read count
adjusted p-value < 0.05
LFC > 1.00 (up)    : 70, 0.26%
LFC < -1.00 (down) : 8, 0.03%
outliers [1]       : 0, 0%
low counts [2]     : 3144, 12%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> comp_lfc
log2 fold change (MLE): condition E2 vs UNT 
Wald test p-value: condition E2 vs UNT 
DataFrame with 27025 rows and 6 columns
                     baseMean log2FoldChange     lfcSE      stat    pvalue      padj
                    <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003.15 1952.07834      -0.134989 0.0697229  0.000000  1.000000         1
ENSG00000000419.14 6064.33523       0.113285 0.0522518  0.000000  1.000000         1
ENSG00000000457.14 1331.59221      -0.216712 0.0796866  0.000000  1.000000         1
ENSG00000000460.17 1517.08053      -0.202480 0.0749210  0.000000  1.000000         1
ENSG00000000938.13    1.14802       1.367265 2.5395499  0.144618  0.885013        NA
...                       ...            ...       ...       ...       ...       ...
ENSG00000290124.1    12.40922     -0.0436103  0.806648  0.000000  1.000000         1
ENSG00000290126.1    38.47627     -0.1570125  0.459964  0.000000  1.000000         1
ENSG00000290127.1     3.26398     -0.8855532  1.439226  0.000000  1.000000         1
ENSG00000290146.1    73.48168      0.0808059  0.318618  0.000000  1.000000         1
ENSG00000290165.1     2.78590      1.1607220  1.568872  0.102444  0.918404        NA
sessionInfo( )

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=es_AR.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=es_AR.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=es_AR.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] forcats_0.5.2               stringr_1.4.1               dplyr_1.0.10                purrr_0.3.4                
 [5] readr_2.1.2                 tidyr_1.2.1                 tibble_3.1.8                ggplot2_3.3.6              
 [9] tidyverse_1.3.2             DESeq2_1.36.0               SummarizedExperiment_1.26.1 Biobase_2.56.0             
[13] MatrixGenerics_1.8.1        matrixStats_0.62.0          GenomicRanges_1.48.0        GenomeInfoDb_1.32.4        
[17] IRanges_2.30.1              S4Vectors_0.34.0            BiocGenerics_0.42.0        

loaded via a namespace (and not attached):
 [1] bitops_1.0-7           fs_1.5.2               lubridate_1.8.0        bit64_4.0.5            RColorBrewer_1.1-3     httr_1.4.4            
 [7] tools_4.2.1            backports_1.4.1        utf8_1.2.2             R6_2.5.1               DBI_1.1.3              colorspace_2.0-3      
[13] withr_2.5.0            tidyselect_1.1.2       bit_4.0.4              compiler_4.2.1         cli_3.4.0              rvest_1.0.3           
[19] xml2_1.3.3             DelayedArray_0.22.0    labeling_0.4.2         scales_1.2.1           genefilter_1.78.0      digest_0.6.29         
[25] XVector_0.36.0         pkgconfig_2.0.3        dbplyr_2.2.1           fastmap_1.1.0          limma_3.52.3           rlang_1.0.5           
[31] readxl_1.4.1           rstudioapi_0.14        RSQLite_2.2.17         farver_2.1.1           generics_0.1.3         jsonlite_1.8.0        
[37] BiocParallel_1.30.3    googlesheets4_1.0.1    RCurl_1.98-1.8         magrittr_2.0.3         GenomeInfoDbData_1.2.8 Matrix_1.5-0          
[43] Rcpp_1.0.9             munsell_0.5.0          fansi_1.0.3            lifecycle_1.0.2        stringi_1.7.8          zlibbioc_1.42.0       
[49] grid_4.2.1             blob_1.2.3             parallel_4.2.1         crayon_1.5.1           lattice_0.20-45        Biostrings_2.64.1     
[55] haven_2.5.1            splines_4.2.1          annotate_1.74.0        hms_1.1.2              KEGGREST_1.36.3        locfit_1.5-9.6        
[61] pillar_1.8.1           geneplotter_1.74.0     codetools_0.2-18       reprex_2.0.2           XML_3.99-0.10          glue_1.6.2            
[67] renv_0.15.5            modelr_0.1.9           png_0.1-7              vctrs_0.4.1            tzdb_0.3.0             cellranger_1.1.0      
[73] gtable_0.3.1           assertthat_0.2.1       cachem_1.0.6           xtable_1.8-4           broom_1.0.1            survival_3.4-0        
[79] googledrive_2.0.0      gargle_1.2.1           AnnotationDbi_1.58.0   memoise_2.0.1          ellipsis_0.3.2
DESeq2 BatchEffect fold • 2.3k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 57 minutes ago
United States

Adding the batch effect to your model causes the reported logFC to be corrected for batch.

ADD COMMENT
0
Entering edit mode

How does this correction is calculated?

It seems to me that when I filter for lfc the decrease is drastic and I get almost no genes to work within the downstream analyses. Am I missing something?

ADD REPLY
2
Entering edit mode

Here's a simple example. Say we have just one gene, measured in treated and untreated subjects in two batches.

> set.seed(0xabeef)
## so you can replicate exactly
> d.f <- data.frame(gene = rnorm(20), treat = factor(rep(c("Treated","Untreated"), each = 10)),batch = factor(rep(1:2, each = 5, times = 2)))
> d.f
          gene     treat batch
1   1.92572364   Treated     1
2  -0.19719109   Treated     1
3  -0.64609257   Treated     1
4   1.37651205   Treated     1
5  -1.47979636   Treated     1
6   0.70194861   Treated     2
7   1.62166429   Treated     2
8   0.11868270   Treated     2
9   0.74456019   Treated     2
10  1.04686530   Treated     2
11  0.41404017 Untreated     1
12 -0.33484159 Untreated     1
13 -0.04801465 Untreated     1
14 -0.12757933 Untreated     1
15  0.32879812 Untreated     1
16  1.12670854 Untreated     2
17  0.21960504 Untreated     2
18  0.84672079 Untreated     2
19 -0.02770834 Untreated     2
20 -0.86688457 Untreated     2
## now let's fit a model
> fit <- lm(gene~treat+batch, d.f)
> summary(fit)

Call:
lm(formula = gene ~ treat + batch, data = d.f)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.78505 -0.42323 -0.01406  0.47714  1.62047 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.3053     0.3301   0.925    0.368
treatUntreated  -0.3682     0.3811  -0.966    0.348
batch2           0.4321     0.3811   1.134    0.273

Residual standard error: 0.8523 on 17 degrees of freedom
Multiple R-squared:  0.1154,    Adjusted R-squared:  0.01136 
F-statistic: 1.109 on 2 and 17 DF,  p-value: 0.3526

In the above model, the Estimate for treatUntreated is the difference for treated and untreated after controlling for batch. And the batch2 coefficient is the average difference between batch2 and batch1. Algebraically, this value is subtracted from all the batch2 subjects, which corrects all of those subjects for the fact that they come from a different batch. We can do this all by hand as well.

> batchdiff <- mean(subset(d.f, batch == 2)$gene) - mean(subset(d.f, batch == 1)$gene)
> batchdiff
[1] 0.4320604
> d.f$batchadj <- ifelse(d.f$batch == 1L, d.f$gene, d.f$gene - batchdiff)
> treatdiff <- mean(subset(d.f, treat == "Untreated")$batchadj) - mean(subset(d.f, treat == "Treated")$batchadj)
> treatdiff
[1] -0.3682033

Does that make sense?

As to why you get no genes when you filter on lfc, it's because you are filtering on lfc. That's really only something you should do if there are 'too many' genes and you want to cut the list down.

ADD REPLY
0
Entering edit mode

It makes perfect sense. Thanks for the example, it was helpful. But then my question is, how is the log2FC calculated? Because I thought it was independent of the design. But evidently it is not. More specifically, how the calculation of the log2FC takes into account the batch (or any other covariate) effect?

ADD REPLY
0
Entering edit mode

I just showed you exactly how it works! You said it makes perfect sense and then asked me how it works? I am confused.

Also, how would calculation of the lfc be independent of the design? The design specifies what comparisons you are making, and by definition then what and how the lfc is calculated.

ADD REPLY
0
Entering edit mode

So, are you saying that DEseq2 normalize counts doing that subtraction you mentioned in the d.f$batchadj column, and then calculated the lfc with that normalized values? Sorry if it is repetitive, but I am not finding that details in the DEseq2 manual, and it is very important for the way I am going to analyze my data from now on.

ADD REPLY
0
Entering edit mode

Kind of, but in a GLM it's not just subtracting, because it accounts for the unique variability of count data.

But it's worthwhile to consider James's example as an analogy to what happens inside a GLM, while acknowledging it's not just subtraction.

If you want to see how LFC is calculated, it is beta here:

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8#Sec18

The section "Final estimate of logarithmic fold changes" describes the iterative algorithm to set beta.

ADD REPLY
0
Entering edit mode

Thank you Michael. I am reading it now.

The decision of filtering on lfc was to obtain a set of genes with more biological significance, aka the ones that change the most. But when I do that, the number of DGE reduces drastically as i already pointed out. Since I have only two replicates, maybe there is no need to do that filtering on lfc, because of this next phrase extracted from the paper.

"Consequently, with sufficient sample size, even genes with a very small but non-zero LFC will eventually be detected as differentially expressed. A change should therefore be of sufficient magnitude to be considered biologically significant. For small-scale experiments, statistical significance is often a much stricter requirement than biological significance, thereby relieving the researcher from the need to decide on a threshold for biological significance."

Correct me if I am wrong.

ADD REPLY
0
Entering edit mode

Sorry I think there is a mixup, I was posting to explain that this is not an accurate representation of the methods:

DEseq2 normalize counts doing that subtraction

It doesn't subtract, it works within a GLM framework.

ADD REPLY
0
Entering edit mode

Ok, but there is some type of normalization of the lfc taking into account all the factors of the design, I assume. So the values in the results of a contrast are normalized, in my case, for batch effects.

ADD REPLY

Login before adding your answer.

Traffic: 819 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6