Question

How to remove nested batch effects with removeBatchEffect?

Jenny Drnevich ★ 2.0k

Hello,

I have an experimental design almost identical to the one portrayed in the edgeRUsersGuide() Section 3.5, "Comparisons both between and within subjects". My question is not about the statistical analysis, but about how to properly remove the subject/Patient effects for PCA clustering, given that patients are nested within Disease. The relevant code from that section, plus 2 lines of my own:

targets$Patient #numbered 1-9 for the 9 total patients

Patient <- gl(3,2,length=18) #re-numbering patients within Disease group

Disease <- factor(targets$Disease, levels=c("Healthy","Disease1","Disease2"))

Treatment <- factor(targets$Treatment, levels=c("None","Hormone"))

design <- model.matrix(~Disease+Disease:Patient+Disease:Treatment)

fit <- glmFit(y, design)

#Additions for clustering:

logCPM <- cpm(y, log = TRUE, prior.count = 3)

noPatient <- removeBatchEffect(logCPM, design = model.matrix(~Disease+Disease:Treatment), batch = ??)

The design matrix given to removeBatchEffect should not have the Patient effect in it, as that is given separately in the batch argument. However, should I pass the original patient numbering in targets$Patient, or the re-numbering of patients within Disease group in Patient? It seems like I should pass the original numbering, because the nesting of re-numbered patients within Disease isn't specified in the call to removeBatchEffect. I think it would therefore treat all 1s, 2s and 3s as the same batch, instead of only the 1s, 2s and 3s within each Disease group. Am I correct, or is there yet a different way to properly do it?

Thanks,

Jenny

edgeR removeBatchEffect nested design

updated 9.3 years ago by Aaron Lun ★ 29k • written 9.3 years ago by Jenny Drnevich ★ 2.0k

score 0 · Answer 1 · 2016-10-13

You want to use the re-numbered patients. removeBatchEffect just fits a linear model containing both the design you supply and the batch terms, and then subtracts the fitted coefficients for the batch terms (in this case, the patients). If you were to use the original patient numbering, I am pretty sure you would get an error because some of your coefficients won't be estimable, plus you would be fitting a different model that isn't nested.
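The estimability point can be checked without running edgeR or limma at all. The following is a minimal base-R sketch, assuming a made-up 18-sample layout (3 disease groups, 3 patients each, 2 treatments per patient) in place of the real targets table. The names PatientOrig, with_batch and rank_summary are hypothetical helpers for illustration only. The sketch only compares the number of columns with the matrix rank; it does not call removeBatchEffect and says nothing about how that function reports the problem.

```r
# Hypothetical stand-in for the 18-sample layout in the question:
# 3 disease groups x 3 patients x 2 treatments.
Disease <- factor(rep(c("Healthy", "Disease1", "Disease2"), each = 6),
                  levels = c("Healthy", "Disease1", "Disease2"))
Treatment <- factor(rep(c("None", "Hormone"), times = 9),
                    levels = c("None", "Hormone"))
PatientOrig <- factor(rep(1:9, each = 2))  # original numbering, 1-9
Patient <- gl(3, 2, length = 18)           # re-numbered 1-3 within Disease

# Each re-numbered level pools one patient from every disease group.
print(table(Renumbered = Patient, Original = PatientOrig))

# Design passed to removeBatchEffect in the question (no patient terms).
base_design <- model.matrix(~ Disease + Disease:Treatment)

# Append indicator columns for a batch factor to the base design.
with_batch <- function(batch) {
  cbind(base_design, model.matrix(~ batch)[, -1, drop = FALSE])
}

# A design is full rank only if its rank equals its number of columns.
rank_summary <- function(X) c(columns = ncol(X), rank = qr(X)$rank)

res_renumbered <- rank_summary(with_batch(Patient))
res_original   <- rank_summary(with_batch(PatientOrig))
res_nested     <- rank_summary(
  model.matrix(~ Disease + Disease:Patient + Disease:Treatment))

print(rbind(renumbered = res_renumbered,
            original   = res_original,
            nested     = res_nested))
```

Under these assumptions, the combined matrix with the original numbering has 14 columns but rank 12, so two coefficients cannot be estimated, whereas the re-numbered version (8 columns) and the nested design from the user's guide (12 columns) are both full rank. The cross-tabulation also shows that each re-numbered level groups three different patients, one per disease group.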

score 0 · Answer 2 · 2016-10-13

I'm not sure that using the re-numbered patient factor is beneficial. Are patients 1, 4 and 7 related in some way (and similarly for 2, 5 and 8; or 3, 6 and 9)? If not, blocking on the re-numbered patient factor in removeBatchEffect probably won't have much effect, because there shouldn't be any consistent effect across three unrelated patients that can be regressed out. (Though I suppose it probably won't do too much harm, either.) On the other hand, as James said, you can't use the original numbering of the patient factor, because it would be confounded with the disease condition such that the Disease coefficients would not be estimable.

An alternative approach to visualisation is to run PCA/MDS on the treatment-control log-fold change within each patient. More specifically, each patient contributes a treatment-control log-fold change for each gene, and you then run plotMDS on the matrix of log-fold changes across all patients/genes. Any patient-specific effects on overall expression will cancel out as the log-fold change is computed within each patient. In addition, it's probably more consistent with your DE analysis, which looks at the effect of treatment for all patients with a given disease.
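As a rough illustration of the log-fold change idea, here is a minimal base-R sketch. It uses simulated values in place of the real logCPM matrix, and the helper within_patient_lfc, the patient_id factor and the patient_shift matrix are all hypothetical names introduced for this example. It assumes every patient has exactly one "None" and one "Hormone" sample.

```r
set.seed(1)
n_genes <- 100

# Hypothetical sample annotation: 9 patients, each untreated and treated.
patient_id <- factor(rep(1:9, each = 2))
treatment  <- factor(rep(c("None", "Hormone"), times = 9),
                     levels = c("None", "Hormone"))

# Simulated stand-in for the logCPM matrix (genes x samples).
logCPM <- matrix(rnorm(n_genes * 18), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("P", patient_id, "_", treatment)))

# Hormone-minus-None log-fold change per gene within each patient.
# Assumes exactly one sample per patient per treatment level.
within_patient_lfc <- function(x, patient, treatment) {
  is_none <- treatment == "None"
  is_horm <- treatment == "Hormone"
  # Put both halves in the same patient order before subtracting.
  none <- x[, is_none, drop = FALSE][, order(patient[is_none]), drop = FALSE]
  horm <- x[, is_horm, drop = FALSE][, order(patient[is_horm]), drop = FALSE]
  lfc <- horm - none
  colnames(lfc) <- paste0("Patient", levels(patient))
  lfc
}

lfc_raw <- within_patient_lfc(logCPM, patient_id, treatment)

# Add an arbitrary gene-by-patient baseline shift to both samples of each
# patient; the within-patient log-fold changes should not change.
patient_shift <- matrix(rnorm(n_genes * 9, sd = 2), nrow = n_genes)
shifted <- logCPM + patient_shift[, as.integer(patient_id)]
lfc_shifted <- within_patient_lfc(shifted, patient_id, treatment)

print(all.equal(lfc_raw, lfc_shifted))

# With limma installed, the matrix could then be plotted, e.g.:
# limma::plotMDS(lfc_raw)
```

The resulting matrix has one column per patient rather than one per sample, so the plot would show 9 points. Colouring them by Disease would then show whether the treatment response differs between disease groups.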