How to incorporate batch effect that is only present in one level into the design?
2
0
Entering edit mode
tim.meese • 0
@timmeese-23719
Last seen 4 months ago
Belgium

Hi

I want to perform a differential expression on RNA-Seq data. During the data exploration, the PCA plot indicated that a batch effect was present (plot not shown). I obtained additional information about the experiment and indeed, the samples were processed by two different persons. The metadata of the experiment is shown in this table:

   sample treatment lab_tech
1 sample1   control    tech1
2 sample2   control    tech1
3 sample3   control    tech1
4 sample4    treatA    tech2
5 sample5    treatA    tech2
6 sample6    treatA    tech1
7 sample7    treatA    tech1


My first idea was to perform a differential expression analysis between samples4/5 and samples6/7. The genes that are called differentially expressed are probably due to the batch effect. Therefore, I could use that list to "correct" the results of differential expression analysis of control vs treatA. But then I started wondering if the batch effect couldn't be modelled by including it into the design? However, I do not know how to formulate a correct design. I am not even sure if this is possible. Does anyone want to help?

linear model batch effect DESeq2 design • 2.8k views
2
Entering edit mode
Kevin Blighe ★ 3.9k
@kevin
Last seen 16 hours ago
Republic of Ireland

Edit: it would help to understand if this is your entire dataset (n=7)? - sometimes people only post a small example. As per James's answer, tech2 is nested in Treatment. For all intents and purposes, assuming n=7, you could simply drop tech2 and proceed with just tech1's samples. There is probably some other approach whereby you could find 'anchor' genes whose expression is constant across tech1's samples, and then use these to adjust the data for tech2, but this is difficult given a sample size of 5 and 2 (for tech1 and tech2 respectively).

## ------------

Hey tim.meese,

There have been many questions on this topic across various web-forums over the years. It would have helped to have seen your PCA bi-plot and to understand the amount of variation contributed by the PC(s) (principal component(s)) along which you are observing a batch difference.

Nevertheless, generally, for dealing with batch effects, the broadly-accepted procedure is to not directly modify your input raw counts in order to adjust for batch differences. Instead, one can include batch in the design formula, which means that one is simply modeling and adjusting for the effect size of batch. In your case, you would need:

~ lab_tech + treatment


Then, when you derive test statistics for treatment, these (the statistical inferences) will be adjusted for the effects of lab_tech.

## --------

If, later, you then wish to directly modify your transformed (rlog, vst, or logCPM) expression levels for downstream analyses like clustering, PCA, 'machine learning' stuff, et cetera, you could use limma::removeBatchEffect() to model and directly eliminate the batch effect in these. See Why after VST are there still batches in the PCA plot?.

Other related posts:

Kevin

2
Entering edit mode

It's one of those rare times where the fix for a complication is simpler than you first think it will be.

0
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

I'm pretty sure that since tech2 is nested in treatment, both the treatment effect and the tech-specific effect get absorbed into the tech2 factor level, so it's the same as just dropping the data from tech2, except for the reduction in degrees of freedom. As an example:

> set.seed(0xabeef)
> z <- rnorm(7) + c(0,0,0,2,2,0,0) ## add a tech-specific offset
> trt <- factor(rep(c("control","treat"), c(3,4)))
> tech <- factor(rep(c(1,2,1), c(3,2,2)))
> d.f <- data.frame(z = z, trt = trt, tech = tech)
> summary(lm(z~tech + trt, d.f))

Call:
lm(formula = z ~ tech + trt, data = d.f)

Residuals:
1       2       3       4       5       6       7
1.5649 -0.5580 -1.0069  1.4282 -1.4282 -0.4599  0.4599

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.3608     0.8305   0.434    0.686
tech2         0.7866     1.4384   0.547    0.614
trttreat      0.8010     1.3131   0.610    0.575

Residual standard error: 1.438 on 4 degrees of freedom
Multiple R-squared:  0.2702,    Adjusted R-squared:  -0.09475
F-statistic: 0.7404 on 2 and 4 DF,  p-value: 0.5327

> summary(lm(z~trt, d.f, subset = tech == 1))

Call:
lm(formula = z ~ trt, data = d.f, subset = tech == 1)

Residuals:
1       2       3       6       7
1.5649 -0.5580 -1.0069 -0.4599  0.4599

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.3608     0.6829   0.528    0.634
trttreat      0.8010     1.0798   0.742    0.512

Residual standard error: 1.183 on 3 degrees of freedom
Multiple R-squared:  0.155, Adjusted R-squared:  -0.1267
F-statistic: 0.5503 on 1 and 3 DF,  p-value: 0.512


0
Entering edit mode

To be clear, I meant the treatment effect for the samples from tech2 as well as any effect specific to the second tech. So the coefficients for treatment and intercept stay the same, regardless.

0
Entering edit mode

Good points James. It's not an ideal set-up with tech2 nested in treatment, as you say. I would like to see the user's PCA bi-plot, though, and, out of interest, Treatment_Tech1 vs Control_Tech1 compared to Treatment_Tech2 vs Control_Tech1.