Question

Remove batch effect between microarray data

0

Entering edit mode

dqt • 0

@7a3defce

Last seen 23 months ago

Spain

Hello everyone,

I am fairly new to limma and I think I am faced with a particularly difficult design for batch effect removal. Basically, what I have done is a proteomic assay that eventually hybridizes DNA probes on a microarray. I am studying two different but related mouse models (each one is a KO for a specific gene). The experimental setup is as follows: we analyzed samples from 40 mice (20 from model A + 20 from model B), for each model we have 10 WT and 10 KO and for each genotype (WT/KO) there are 5 male and 5 female mice.

After some exploratory analysis (PCA) I noticed that there is a batch effect due to the sampling time: 4 different cohorts/batches needed for obtaining the 40 samples, 2 cohorts for model A (cohorts A and B) and 2 cohorts for model B (cohorts C and D):

   genotype    sex model cohort  age
1        WT   Male     A      A 4.97
2        KO Female     A      A 4.87
3        WT   Male     A      A 4.77
4        KO   Male     A      A 4.77
5        WT Female     A      A 4.93
6        WT   Male     A      A 4.97
7        KO   Male     A      A 4.77
8        KO Female     A      A 4.77
9        WT Female     A      A 4.87
10       KO Female     A      A 4.64
11       KO Female     A      A 4.57
12       WT Female     A      B 4.60
13       KO   Male     A      B 4.97
14       WT   Male     A      B 4.97
15       KO Female     A      B 4.70
16       WT   Male     A      B 4.97
17       WT Female     A      B 4.70
18       KO   Male     A      B 4.97
19       KO   Male     A      B 4.54
20       WT Female     A      B 4.74
21       KO   Male     B      C 6.48
22       KO   Male     B      C 6.45
23       KO Female     B      C 6.58
24       WT   Male     B      C 6.45
25       KO Female     B      C 6.58
26       WT Female     B      C 6.58
27       WT   Male     B      C 6.54
28       WT   Male     B      C 6.58
29       KO   Male     B      C 6.58
30       WT   Male     B      C 6.58
31       KO   Male     B      C 6.45
32       KO   Male     B      C 6.45
33       WT Female     B      D 6.61
34       KO Female     B      D 6.61
35       WT   Male     B      D 6.64
36       WT Female     B      D 6.61
37       WT Female     B      D 6.64
38       WT Female     B      D 6.64
39       KO Female     B      D 6.64
40       KO Female     B      D 6.64

Two things worry me: 1) the cohort is confounded with the model (perfectly aligned) and 2) KOs are phenotypically slighly different between models. I think I was able to find my way through differential expression:

groups <- glue::glue("{df$genotype}_{df$sex}_{df$model}")
design <- model.matrix(~ 0 + groups)
colnames(design) <- gsub("groups", "", design)

contrasts <- makeContrasts(
  KOvWT=(KO_Female_modelA+KO_Female_modelB+KO_Male_modelA+KO_Male_modelB)/4-(WT_Female_modelA+WT_Female_modelB+WT_Male_modelA+WT_Male_modelB)/4,
  modelA_KOvWT=(KO_Female_modelA+KO_Male_modelA)/2-(WT_Female_modelA+WT_Male_modelA)/2,
  modelB_KOvWT=(KO_Female_modelB+KO_Male_modelB)/2-(WT_Female_modelB+WT_Male_modelB)/2,
  SEX_KOvWT=(KO_Male_modelB+KO_Male_modelA-WT_Male_modelB-WT_Male_modelA)/4-(KO_Female_modelB+KO_Female_modelA-WT_Female_modelB-WT_Female_modelA)/4,
  MODEL_KOvWT=(KO_Male_modelB+KO_Female_modelB-WT_Male_modelB-WT_Female_modelB)/4-(KO_Male_modelA+KO_Female_modelA-WT_Male_modelA-WT_Female_modelA)/4,
  levels=colnames(design))

counts <- t(df.X)
corfit <- duplicateCorrelation(counts, design, block=df$cohort)
v <- voom(2^counts, design =design, block = df$cohort, correlation=corfit$consensus)
fit <- lmFit(v, design = design, block=df$cohort, correlation=corfit$consensus)
fit <- contrasts.fit(fit, contrasts)
fit2 <- treat(fit, trend = FALSE, robust =FALSE, lfc = log2(1.05))
results <- decideTests(fit2, method="separate")

Apart from differential expression I want to predict the age for the KO mice for each model and compare it to the predicted age for WT mice . We want to check if the poorer health status of KO mice can be inferred from their predicted age. For that I have found a dataset obtained with the same technique of 81 WT mice of different ages. However, when merging both datasets I have found a strong batch effect by PCA. In case it is useful, I noticed that median expression for each mice was lower and the variance slightly higher than mice in my dataset. I tried normalizing both datasets with normalizeBetweenArrays() but it did not show any effect:

df.X_quantile_norm <- normalizeBetweenArrays(df.X, method = "quantile")

For consistency I arbitrarily assigned genotype = "WT", model = "-", cohort = "E" and dataset = "2" to all mice in the new dataset. When approaching this problem with removeBatchEffect() I get this error probably for the correlation between some of my variables:

design_merged <- model.matrix(~ merged$genotype*merged$model + merged$sex + merged$cohort + merged$age)
merged.X_lm_batch_rm <- limma::removeBatchEffect(merged.X, design = design_merged, batch = merged$dataset)

Coefficients not estimable: merged$cohortD merged$cohortE merged$genotypeWT:merged$modelmodelA batch1 
Warning: Partial NA coefficients for 1226 probe(s)

I have thought about combining variables, but wanted to hear your opinion on how you would approach this problem.

Both me and my derived headache would appreciate any correction/suggestion :)

Thank you!!

Microarray DifferentialExpression limma MicroarrayData removebatcheffect • 1.3k views

ADD COMMENT • link 23 months ago dqt • 0

score 0 · Answer 1 · 2023-01-09

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

I wouldn't try to merge two different datasets from different labs, locations and times. Rather than combining disparate datasets into one limma analysis, I would generally analyse the datasets separately and then combine or correlate the results in a downstream analysis, perhaps using gene set tests or correlation or other methods.

ADD COMMENT • link 23 months ago Gordon Smyth 52k

0

Entering edit mode

Sorry, I think I did not properly explain the issue at hand. I think it can be broken down into two different parts. First, I want to make some comparisons within my samples. These would be to look for differentially expressed proteins between WT and KO mice for each model and to see if there are similarities across models and sex- or model-specific differences between WT and KO mice. That is when the problem with cohorts first appeared (batch effect). On the other hand, with this external dataset obtained with the same technique but some years ago, I want to use it to regress the age onto a subset of proteins (I intend to fit a LASSO model to this dataset) and then apply it to my samples to get the predicted age and evaluate if this prediction is accurate for my WT mice and higher than the actual age in the case of KO mice. I was trying to look for the best way to normalize both datasets. KO mice from both models display some features typically associated with aging so what we wanted to test is if this partial "aged" phenotype could also be explained at the proteomic level. The age with which we obtained the samples for each model was dependent on the onset of the phenotype, which occurs later for model B. The WT mice for each model served as control so they are age-matched with KO mice.

I admit that all this is quite complex, but I wanted to know, within the constraints, what would be the most correct way to do it and how you would approach these two issues.

Thank you!

ADD REPLY • link 23 months ago dqt • 0