Limma, blocking batch effect
3
2
Entering edit mode
Jeannette ▴ 20
@jeannette-8299
Last seen 9.4 years ago

I want to remove a known batch effect with limma using blocking.

My experiment has 36 samples from 3 donors with 12 different treatments. The batch effect is based on the different donors. I would like to remove batch effect and later compare different treatments (eg. treatment D vs. C).

This can be illustrated as follows:

Samples    Batch    Treatment
1    1    A
2    1    B
3    1    C
4    1    D
5    1    E
6    1    F
7    1    G
8    1    H
9    1    I
10    1    J
11    1    K
12    1    L
13    2    A
14    2    B
15    2    C
16    2    D
17    2    E
18    2    F
19    2    G
20    2    H
21    2    I
22    2    J
23    2    K
24    2    L
25    3    A
26    3    B
27    3    C
28    3    D
29    3    E
30    3    F
31    3    G
32    3    H
33    3    I
34    3    J
35    3    K
36    3    L

How to make a design matrix including the blocking factor?

limma batch effect • 6.2k views
ADD COMMENT
0
Entering edit mode
svlachavas ▴ 830
@svlachavas-7225
Last seen 13 months ago
Germany/Heidelberg/German Cancer Resear…

Dear Jeannette,

you can first check another similar thread i have posted (C: Questions in limma about multifactorial design and possible batch effect correct) regarding a similar problem that i have experienced. However, i would like to ask you if you have performed first any PCA plots or use the function plotMDS() to see how big is the "known batch effect you mention". In other words, because if your samples dont separate well and gather together-that is your patterns in your principal components directly correlate with the batch variable, you should consider performing some batch effect correction using the package sva and combat to "correct" your eset, prior analysis . Nevertheless, as im not an expert on this methodology and other experienced people will answer and provide valuable information(in my case i didnt have so obvious effects based on my plots), so i followed the advice of Mr MacDonald and added the batch in my design matrix. So, you could use:

design <- model.matrix(~Treatment + Batch) # considering that both Treatment and Batch should be factor variables

and proceed with lmFit and eBayes. Also, you might want to give attention to your first level of the Treatment, in case you have a control group without any treatment

Best,

Efstathios

ADD COMMENT
2
Entering edit mode

Just a quick comment - ComBat should NOT be used before running association testing (lmFit); association testing should be run with batch as a covariate on original data. The design Efstathios wrote is almost correct, except that you should add factor():

design <- model.matrix(~factor(Treatment) + factor(Batch))

ComBat is useful if the data is to be analyzed using methods that cannot deal with covariates but if covariates can be included such as in linear models, they should be used.
 

ADD REPLY
0
Entering edit mode

Dear Peter,

thank you for your evaluation-regarding the factor indication, i mention it with the # symbol next to the model.matrix-i didnt include it in case any of the variables needs releveling for any specific purpose

ADD REPLY
0
Entering edit mode
Jeannette ▴ 20
@jeannette-8299
Last seen 9.4 years ago

Dear Efstathios and dear Peter,

thank you very much for your answers. Please have a look below and perhaps answer again my new (beginner) questions:

AEset =ReadAffy()
rAEset=rma(AEset)
targets <- readTargets(targets.txt)

ArrayName    SampleNumber    Batch    Treatment
Hu4-1_(HG-U133_Plus_2)_3    1    1    A
Hu4-2_(HG-U133_Plus_2)_2    2    1    B
Hu4-3_(HG-U133_Plus_2)_2    3    1    C
Hu4-4_(HG-U133_Plus_2)_2    4    1    D
Hu4-5_(HG-U133_Plus_2)_2    5    1    E
Hu4-6_(HG-U133_Plus_2)_2    6    1    F
Hu4-7_(HG-U133_Plus_2)_2    7    1    G
Hu4-8_(HG-U133_Plus_2)_2    8    1    H
Hu4-9_(HG-U133_Plus_2)_2    9    1    I
Hu4-10_(HG-U133_Plus_2)_2    10    1    J
Hu4-11_(HG-U133_Plus_2)_2    11    1    K
Hu4-12_(HG-U133_Plus_2)_3    12    1    L
Hu5-1_(HG-U133_Plus_2)    13    2    A
Hu5-2_(HG-U133_Plus_2)    14    2    B
Hu5-3_(HG-U133_Plus_2)    15    2    C
Hu5-4_(HG-U133_Plus_2)    16    2    D
Hu5-5_(HG-U133_Plus_2)    17    2    E
Hu5-6_(HG-U133_Plus_2)    18    2    F
Hu5-7_(HG-U133_Plus_2)    19    2    G
Hu5-8_(HG-U133_Plus_2)    20    2    H
Hu5-9_(HG-U133_Plus_2)    21    2    I
Hu5-10_(HG-U133_Plus_2)    22    2    J
Hu5-11_(HG-U133_Plus_2)    23    2    K
Hu5-12_(HG-U133_Plus_2)    24    2    L
Hu6-1_(HG-U133_Plus_2)    25    3    A
Hu6-2_(HG-U133_Plus_2)    26    3    B
Hu6-3_(HG-U133_Plus_2)    27    3    C
Hu6-4_(HG-U133_Plus_2)    28    3    D
Hu6-5_(HG-U133_Plus_2)    29    3    E
Hu6-6_(HG-U133_Plus_2)    30    3    F
Hu6-7_(HG-U133_Plus_2)    31    3    G
Hu6-8_(HG-U133_Plus_2)    32    3    H
Hu6-9_(HG-U133_Plus_2)    33    3    I
Hu6-10_(HG-U133_Plus_2)    34    3    J
Hu6-11_(HG-U133_Plus_2)    35    3    K
Hu6-12_(HG-U133_Plus_2)    36    3    L

Treat<-factor(targets$Treatment)
Batch<-factor(targets$Batch)

design <- model.matrix(~factor(Treat) + factor(Batch))

The design matrix doesn't look like expected (for me): I miss Treat A and Batch 1? Could you tell me why they are missing?

(Intercept)    factor(Treat)B    factor(Treat)C    factor(Treat)D    factor(Treat)E    factor(Treat)F    factor(Treat)G    factor(Treat)H    factor(Treat)I    factor(Treat)J    factor(Treat)K    factor(Treat)L    factor(Batch)2    factor(Batch)3
1    0    0    0    0    0    0    0    0    0    0    0    0    0
1    1    0    0    0    0    0    0    0    0    0    0    0    0
1    0    1    0    0    0    0    0    0    0    0    0    0    0
1    0    0    1    0    0    0    0    0    0    0    0    0    0
1    0    0    0    1    0    0    0    0    0    0    0    0    0
1    0    0    0    0    1    0    0    0    0    0    0    0    0
1    0    0    0    0    0    1    0    0    0    0    0    0    0
1    0    0    0    0    0    0    1    0    0    0    0    0    0
1    0    0    0    0    0    0    0    1    0    0    0    0    0
1    0    0    0    0    0    0    0    0    1    0    0    0    0
1    0    0    0    0    0    0    0    0    0    1    0    0    0
1    0    0    0    0    0    0    0    0    0    0    1    0    0
1    0    0    0    0    0    0    0    0    0    0    0    1    0
1    1    0    0    0    0    0    0    0    0    0    0    1    0
1    0    1    0    0    0    0    0    0    0    0    0    1    0
1    0    0    1    0    0    0    0    0    0    0    0    1    0
1    0    0    0    1    0    0    0    0    0    0    0    1    0
1    0    0    0    0    1    0    0    0    0    0    0    1    0
1    0    0    0    0    0    1    0    0    0    0    0    1    0
1    0    0    0    0    0    0    1    0    0    0    0    1    0
1    0    0    0    0    0    0    0    1    0    0    0    1    0
1    0    0    0    0    0    0    0    0    1    0    0    1    0
1    0    0    0    0    0    0    0    0    0    1    0    1    0
1    0    0    0    0    0    0    0    0    0    0    1    1    0
1    0    0    0    0    0    0    0    0    0    0    0    0    1
1    1    0    0    0    0    0    0    0    0    0    0    0    1
1    0    1    0    0    0    0    0    0    0    0    0    0    1
1    0    0    1    0    0    0    0    0    0    0    0    0    1
1    0    0    0    1    0    0    0    0    0    0    0    0    1
1    0    0    0    0    1    0    0    0    0    0    0    0    1
1    0    0    0    0    0    1    0    0    0    0    0    0    1
1    0    0    0    0    0    0    1    0    0    0    0    0    1
1    0    0    0    0    0    0    0    1    0    0    0    0    1
1    0    0    0    0    0    0    0    0    1    0    0    0    1
1    0    0    0    0    0    0    0    0    0    1    0    0    1
1    0    0    0    0    0    0    0    0    0    0    1    0    1

fit<- lmFit(rAEset, design)
fit<-eBayes(fit)
topTable(fit)

X.Intercept. factor.Treat.B factor.Treat.C
213477_x_at         14.32670   0.0066422277   -0.007600883
201429_s_at         14.20578   0.0349340611   -0.002385242
208834_x_at         14.10887  -0.0001973202   -0.041636218
213583_x_at         14.36973   0.0514584539    0.013826467
216231_s_at         14.24558   0.0390321807   -0.044076872
226131_s_at         14.11218   0.0561841384   -0.020003040
213084_x_at         14.10507   0.0041735892   -0.048600165
213614_x_at         14.14842  -0.0153255164   -0.024215554
AFFX-hum_alu_at     14.66200  -0.0074594425    0.077741362
212869_x_at         14.42581  -0.0160917656   -0.064885389
                factor.Treat.D factor.Treat.E factor.Treat.F
213477_x_at       -0.001377460   -0.050702074    0.051865528
201429_s_at        0.025709220   -0.010092836   -0.013081372
208834_x_at        0.007341653    0.045773605    0.013431660
213583_x_at        0.016923252   -0.031997855    0.057666384
216231_s_at       -0.050963361   -0.011837809    0.035742225
226131_s_at        0.029673356   -0.042260419   -0.001585295
213084_x_at        0.013819580    0.009045218   -0.002321539
213614_x_at       -0.015048376   -0.039968926    0.019965938
AFFX-hum_alu_at    0.053618376    0.002911006    0.077713513
212869_x_at       -0.033341353   -0.023474926   -0.002161103
                factor.Treat.G factor.Treat.H factor.Treat.I
213477_x_at       0.0257321453   0.0768203223     0.03021889
201429_s_at      -0.0006291994   0.0491792622    -0.04604729
208834_x_at       0.0351613242  -0.0176276116    -0.02937423
213583_x_at       0.0680759699   0.0962858847     0.08730183
216231_s_at       0.0058945653  -0.0597448531     0.01669841
226131_s_at       0.0481740774   0.0466630885    -0.01247180
213084_x_at       0.0311798064  -0.0006383324    -0.05125879
213614_x_at       0.0371051870   0.0621918189     0.02366662
AFFX-hum_alu_at   0.0011396547   0.0871858929     0.09836153
212869_x_at       0.0091505640   0.0125854227    -0.01723962
                factor.Treat.J factor.Treat.K factor.Treat.L
213477_x_at        0.051355404   -0.028656052    0.018755885
201429_s_at       -0.029621140   -0.019563313   -0.005783606
208834_x_at       -0.059944204    0.003526064    0.023858054
213583_x_at        0.082476431   -0.009219475    0.053840761
216231_s_at        0.042499819   -0.003135958    0.020467677
226131_s_at       -0.019817628   -0.031505195    0.037207605
213084_x_at       -0.087220127   -0.014959050    0.010348036
213614_x_at        0.032127800   -0.032768684    0.026709697
AFFX-hum_alu_at    0.094535560    0.106192684    0.088915943
212869_x_at       -0.009439653   -0.047039333    0.011175808
                factor.Batch.2 factor.Batch.3  AveExpr        F
213477_x_at       -0.018194830     0.03891480 14.34803 99140.40
201429_s_at        0.046022000     0.04233994 14.23379 91804.48
208834_x_at        0.050637472     0.06257520 14.14497 91262.65
213583_x_at       -0.006076321     0.04993156 14.42491 89788.40
216231_s_at        0.027905592     0.19533235 14.31921 86928.12
226131_s_at        0.049201400     0.07235871 14.16023 86768.34
213084_x_at        0.057847385     0.08806016 14.14233 84925.36
213614_x_at       -0.005764947     0.09213428 14.18341 83335.92
AFFX-hum_alu_at   -0.058736541    -0.13878791 14.65290 83109.05
212869_x_at        0.022519146     0.02076862 14.42518 82826.61
                     P.Value    adj.P.Val
213477_x_at     3.959310e-57 1.933176e-52
201429_s_at     1.063226e-56 1.933176e-52
208834_x_at     1.147253e-56 1.933176e-52
213583_x_at     1.414304e-56 1.933176e-52
216231_s_at     2.143908e-56 1.976043e-52
226131_s_at     2.195194e-56 1.976043e-52
213084_x_at     2.892552e-56 1.976043e-52
213614_x_at     3.687342e-56 1.976043e-52
AFFX-hum_alu_at 3.818794e-56 1.976043e-52
212869_x_at     3.989544e-56 1.976043e-52

As I understood from the limma user guide, the treatments are now adjusted for the differences between the batches?

So further on I can make comparisons e.g. D vs. C?

Thanks in advance!

Jeannette

ADD COMMENT
1
Entering edit mode

The intercept represents the expression of your first sample, i.e., the treatment A sample in batch 1. This is why there is no explicit term for factor(treat)A or factor(batch)1. This isn't a problem, as the terms that do exist can be easily interpreted. For example, the factor(treat)B term represents the average log-fold change of each treatment B sample over the treatment A sample in the same batch, averaged across all batches. Similarly, the factor(batch)2 term represents the average log-fold change of all batch 2 samples over the corresponding batch 1 samples for the same treatment. As the batch terms absorb any batch effect across your data set, they will not affect the treatment effects, i.e., the treatments are "adjusted for batches", as it were.

If you want to compare between treatments, you can just use makeContrasts. However, I'd rename your column coefficients to get rid of the parentheses, as these interfere with expression parsing in makeContrasts. Or, more simply, as you've already converted both Treat and Batch to factors:

same.design <- model.matrix(~ Treat + Batch)
con <- makeContrasts(TreatC - TreatD, levels=same.design)

If you feed this into contrasts.fit, eBayes and topTable, the resulting comparison will be that of treatment C against treatment D. Log-fold changes will represent the increase in expression of C over D.

ADD REPLY
0
Entering edit mode
svlachavas ▴ 830
@svlachavas-7225
Last seen 13 months ago
Germany/Heidelberg/German Cancer Resear…

Dear Jeannette,

as i have mentioned above you maybe want to have a look or change the first level of your factor Treatment as your baseline level-if for instance you have a control group. If you dont have a control group in your design, then when you use your above code, the first level of your factor as far as i can see from the output of the model.matrix is the first level of the factor Treatment, that is Treatment A. So, the Intercept is this case i think represents the average expression of the Treatment A samples-while the other coefficients represent the log-fold change of the expression of the other Treatments vs the Treatment A, after taking into account(controlling) for the possible batch effect which is represented by the batch variable. Im not sure about the batch 1 you are mentioning, but i also believe that has been absorved in the Intercept as well. Maybe other experienced people who will see the post have a better explanation of the batch variable.Anyway, in my opinion you should not worry for the batch coefficient as have been included in your design, but more about how you will construct your differences

(* By the way, i also modified my code for a current study i been using 3 levels for my batch variable and i got the same for the batch:)

targets$replicate <- factor(rep(c(1:3), each=1, times=4))
> batch <- targets$replicate
> design <- model.matrix(~samples + batch)
> design
   (Intercept) samplesT47D_OHT samplesT47D_GDC samplesT47D_C10 samplesT47D_GDC_OHT
1            1               0               0               0                   0
2            1               0               0               0                   0
3            1               1               0               0                   0
4            1               1               0               0                   0
5            1               0               1               0                   0
6            1               0               1               0                   0
7            1               0               0               1                   0
8            1               0               0               1                   0
9            1               0               0               0                   1
10           1               0               0               0                   1
11           1               0               0               0                   0
12           1               0               0               0                   0
   samplesT47D_C10_OHT batch2 batch3
1                    0                     0      0
2                    0                     1      0
3                    0                     0      1
4                    0                     0      0
5                    0                     1      0
6                    0                     0      1
7                    0                     0      0
8                    0                     1      0
9                    0                    0      1
10                   0                   0      0
11                   1                   1      0
12                   1                   0      1

ADD COMMENT
0
Entering edit mode

In fact, if you dont have any baseline level or control group included in your Treatment factor,and you want to compare different therapies, then you could possibly move:

design <- model.matrix(~factor(Treat) + factor(Batch))

fit<- lmFit(rAEset, design)

con <- makeContrasts("factor.Treat.B-factor.Treat.C",....)

fit2 <- contrasts.fit(fit, contrasts=con)

ebayes <- eBayes(fit2)

However, if you want to include also the treatment A in your comparisons, you should create an intercept free model matrix:

design <- model.matrix(~0 +factor(Treat) + factor(Batch))

and then move similarly.

Overall, it depends on which comparisons you want to make

 

 

ADD REPLY
0
Entering edit mode

I have controls. Treatments A, B and C are different controls. How should I include this controls in my treatment factor?

Is it a new factor like factor(Treat_control)?

And why? Because one control is a real control (no substance added), two others are 2 different solvents (used for the "real" treatments. So I have factor(Treat_control) 1-3 for model.matrix to add?

A new question is: How can I get only the batch corrected normalized data? If I would like to use this data for other analyses (e.g. differentiell expression not only with limma estimated)?
 

ADD REPLY
0
Entering edit mode

The fact that treatments A, B and C are controls doesn't affect the parametrization of the design matrix. From limma's perspective, it doesn't care what the different treatments actually are. It only needs to know that samples with treatment A are different from samples with treatment B, and so on. This is already specified in your design, so there's no need for additional embellishment.

The identity of the controls only matters when you're setting up your DE comparisons with makeContrasts. For example, it may make more sense to compare your treatment to a control other than A (which is the default "baseline" treatment to which all others are compared, when dropping coefficients in topTable). This depends on the biology of your experiment, so you or your colleagues should be the best judge of which contrasts are most sensible.

As for batch effect removal; check out removeBatchEffect. This will eliminate batch effects from the data for use in other procedures, e.g., PCA, heatmaps. However, don't do any linear modelling with the batch-corrected data.

ADD REPLY
0
Entering edit mode

Dear Aaron, many thanks for your useful comments. I nearly got it :-)

Indeed Treatment A is only a baseline treatment (plan was not to use it for comparisons). I would like to compare D, E, F, J, K, L with control treatment C

and G, H, I with control treatment B.

I started with the contrast matrix and the help of your and Efstathios script examples. I got error messages, which I did not understand:

AEset =ReadAffy()
rAEset=rma(AEset)
Treat<-factor(targets$Treatment)
Batch<-factor(targets$Batch)
design <- model.matrix(~factor(Treat) + factor(Batch))

fit<- lmFit(rAEset, design)
same.design <- model.matrix(~ Treat + Batch)
con <- makeContrasts(TreatC - TreatD, levels=same.design)
Warning message:
In makeContrasts(TreatC - TreatD, levels = same.design) :
  Renaming (Intercept) to Intercept
fit2 <- contrasts.fit(fit, contrasts=con)ebayes <- eBayes(fit2)
Warning message:
In contrasts.fit(fit, contrasts = con) :
  row names of contrasts don't match col names of coefficients
fit2<-eBayes(fit2)

What does the warning messages mean?

ADD REPLY
1
Entering edit mode

I'm "borrowing" the answer of Aaron to the same situation i faced:

"Both warnings stem from the fact that makeContrasts will try to replace the coefficient name"(Intercept)" with "Intercept". To get rid of the warnings, just manually rename the intercept column of the design matrix to "Intercept". Assuming the first column is the intercept (which it should be, for your parametrization of the design):"

I believe after design matrix use: 

colnames(design)[1] <- "Intercept"

and use this design in every downstream step of your current analysis and the warnings should not appear

ADD REPLY

Login before adding your answer.

Traffic: 493 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6