remove batch effect and adjusted model for 4 covariates in Limma
2
1
Entering edit mode
mheydarpour ▴ 10
@mheydarpour-9430
Last seen 5.2 years ago

I have a gene-expression (GE) data for 100 patients at 2 different time (Before & After). So, each patient has 2 samples. 

Patients were under 2 treatment (treatA & treatB). I would like to to find the differential gene expression for the following contrast:

con <- makeContrasts(diff=(treatA_before-treatB_Before)-(treatA_After-treatB_After), levels=design)

However, I also would like to adjust my model for 4 covarites (age, sex, center, BMI) and also remove batch effect. So, I used the following design matrix:

treat=factor(rep(pheno$treat, 2),levels=c("A","B"))

time <- rep(c("Before", "After"), c(ncol(GE_before), ncol(GE_after)))

design=model.matrix(~0+age+sex+center+BMI+batch+treat:time)

 

 Could you please let me know that my design model is correct? can I remove the batch effect using this linear model. please advise. Thanks

limma batch effect covariates contrast differential gene expression • 3.6k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 1 minute ago
The city by the bay

The best design would be to block on the patient ID directly. Consider this example data:

patient <- factor(rep(LETTERS[1:10], each=2))
treatment <- factor(rep(c("X", "Y"), each=10))
time <- factor(rep(c("Before", "After"), 10))

I would set up a design matrix that looks like this:

design <- model.matrix(~ patient + treatment:time)
design <- design[,!grepl("Before", colnames(design))]
colnames(design) <- sub(":", "_", colnames(design))

Looking at colnames(design) gives me:

 [1] "(Intercept)"          "patientB"             "patientC"            
 [4] "patientD"             "patientE"             "patientF"            
 [7] "patientG"             "patientH"             "patientI"            
[10] "patientJ"             "treatmentX_timeAfter" "treatmentY_timeAfter"

The first 10 terms are patient-specific blocking factors and can be ignored. The last two terms represent the effect of treatment with X and Y, respectively, which you can directly compare against each other:

con <- makeContrasts(treatmentX_timeAfter - treatmentY_timeAfter, levels=design)

This is equivalent to the contrast in your original post. However, the setup here is much more general, as it avoids assuming that age/BMI/etc. affect expression in a linear manner. Any patient-specific effect is regressed out, only the differences before/after treatment within each patient are of interest. Of course, this assumes that age/BMI/etc. does not change between time points for each patient.

ADD COMMENT
0
Entering edit mode

Thanks Aaron for your great explanation. How about batch effect? My RNAseq data for those 100 patients prepared at 7 different time date.

So, I have batch effects in my GE data (Before & after). How to remove the batch effect? If I consider "patient" as a random effect in the design model, do you think the batch effect is remove automatically? Please advise

ADD REPLY
0
Entering edit mode

For any given patient, do both of his/her samples belong in the same batch? If yes, the batch effect is automatically removed by blocking on the patient ID. If not, you'll need to include the batch as a separate factor in the design matrix.

Also, the patient ID is a fixed effect when you put it in the design matrix, see A: How does edgeR compare to lme4?.

ADD REPLY
0
Entering edit mode

OK, I got it. If I put "0" in the design model, is there any difference for the DGE results.

Then I have no "intercept" in the output.

ADD REPLY
0
Entering edit mode

The presence or absence of an intercept will not affect the interpretation of the two terms of interest.

ADD REPLY
0
Entering edit mode

Thank you very much!

ADD REPLY
0
Entering edit mode

Does the presence of an intercept make it so that each patient term is relative to the first patient term? For example, isn't the column "patientB" in the design matrix is actually patientB relative to patientA? If so, does this still accomplish the intended blocking effect?
 

ADD REPLY
0
Entering edit mode

Yes, that interpretation is correct. And yes, the blocking still works.

ADD REPLY
0
Entering edit mode

Hi Aaron,

I am just confused why these two contrast are "equivalent" as you mentioned in above: 

con <- makeContrasts(diff=(treatA_before-treatB_Before)-(treatA_After-treatB_After), levels=design)

con <- makeContrasts(treatmentX_timeAfter - treatmentY_timeAfter, levels=design)

Could you please explain more? thanks

ADD REPLY
0
Entering edit mode

I'll assume that treatment A = X and B = Y. The individual equivalences are:

treatmentX_timeAfter = treatA_After - treatA_before
treatmentY_timeAfter = treatB_After - treatB_before

The result should be obvious after some rearrangement of terms:

treatmentX_timeAfter - treatmentY_timeAfter
= treatA_After - treatA_before - (treatB_After - treatB_before)
= (treatA_After - treatB_After) - (treatA_before - treatB_before)

... which is just your original contrast multiplied by a factor of -1. In other words, the results of the two contrasts are (conceptually) equivalent, only differing by the sign of the log-fold change.

ADD REPLY
0
Entering edit mode

Ok, great. But if I choose "timeBefore" for any difference contrast, then the results should be change. is that right or I'm wrong?

treamnetX_timeBefore = treatA_After - treatA_Before

treatmentY_timeBefore= treatB_After - treatB_Before

 

ADD REPLY
0
Entering edit mode

I don't understand your question, nor your equations. In any case, they should be:

treatmentX_timeBefore = treatA_Before - treatA_After
treatmentY_timeBefore = treatB_Before - treatB_After

... assuming you removed all timeAfter coefficients from the design matrix. This will give the same results as in my original response, but with the sign of the log-FC flipped.

ADD REPLY
0
Entering edit mode

My 100 patients sequenced at 7 different times (So, I have 7 different class of batch effect in my data). When I put "Batch" as a separate factor in the model as below:

design=model.matrix(~0+Batch+patient+treatment:time)

I Got the following error:

Coefficients not estimable: Batch 
Warning message: Partial NA coefficients for 18258 probe(s)

What is wrong with this design model?

 

ADD REPLY
0
Entering edit mode

Please move this latest comment to a response to my first post, the space is getting too small to answer it effectively here.

ADD REPLY
0
Entering edit mode
mheydarpour ▴ 10
@mheydarpour-9430
Last seen 5.2 years ago

My 100 patients sequenced at 7 different times (So, I have 7 different class of batch effect in my data). When I put "Batch" as a separate factor in the model as below:

design=model.matrix(~0+Batch+patient+treatment:time)

I Got the following error:

Coefficients not estimable: Batch 
Warning message: Partial NA coefficients for 18258 probe(s)

What is wrong with this design model?

ADD COMMENT
0
Entering edit mode

I meant to move your comment as a response to my first answer, via the "Add comment" button. But never mind.

Anyway, there are two problems here. The first problem - and the cause of the unestimable coefficients - is that your patients are fully nested in your batch. This is probably because both samples for each patient come from the same batch, and was the point I was trying to convey in C: remove batch effect and adjusted model for 4 covariates in Limma. In this case, there is no need to explicitly block on batch in your design matrix; blocking on the patient factor already does that for you.

The second problem is that you've treated batch as a covariate rather than a factor, which is why you only get a warning for one Batch term and not for ~7 batch terms. This is largely irrelevant, though, in light of the first point above.

ADD REPLY
0
Entering edit mode

Both samples (before & after) of a patient are in the same batch, but patients are in different 7 batches. 

When I introduced batch as a factor in the model : Batch=factor(rep(pheno$Batch,2)

I still have problem for coefficients not estimable (NA) for at least 5 patients. 

ADD REPLY
0
Entering edit mode

You are missing the point. It doesn't matter that the patients are in different batches. If both samples for any given patient belong to the same batch, the patient blocking terms will absorb any batch effect. Say all samples in the first batch have increased expression - this will be modelled as larger coefficient estimates of the patient blocking terms for all patients in the first batch. Thus, the batch effect will not affect estimation of the terms of interest, i.e., the before/after log-fold change for each treatment. In summary: you do not need separate blocking terms for the batch effect. Doing the following:

design <- model.matrix(~patient+treatment:time)

... is sufficient for the contrast that you are interested in.

ADD REPLY
0
Entering edit mode

Thank you so much Aaron.

ADD REPLY

Login before adding your answer.

Traffic: 655 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6