Question

limma: include samples from multiple treatments?

0

Entering edit mode

smhughes ▴ 10

@smhughes-11737

Last seen 6.2 years ago

I have Illumina HT-12 microarray data from a crossover study. There are samples at study start, after treatment period 1, and after treatment period 2. All subjects had samples at each time point. All subjects had treatment A and treatment B (half of the subjects had Aduring treatment period 1 followed by B during treatment period 2; half had B first followed by A). We are interested in the contrast of Treatment A relative to baseline (study enrollment timepoint, before any treatment) and Treatment B relative to baseline.

I noticed that the p-values reported by topTable() differ for treatment A depending on whether or not the treatment B samples are present in the ExpressionSet. For example:

# Function for fitting the model
make_fit <- function(data) {
   trt_ <- factor(pData(data)$TreatmentCode)
   ptid <- factor(pData(data)$Subject)
   design <- model.matrix(~ 0 + trt_ + Subject)

   fit <- lmFit(data, design)

   contrasts <- makeContrasts(
      A = trt_A - trt_Baseline,
      levels = design)

   fit <- eBayes(contrasts.fit(contrasts, fit = fit))

   topTable(fit, coef = "A")   
}

complete <- data # all of the data (baseline, treatment A and treatment B)

# just the baseline and treatment A samples
baseline_and_A <- data[ , pData(data)$TreatmentCode %in% c("Baseline", "A")]

# apply the make_fit function defined above to the completed and subsetted data
make_fit(complete)
make_fit(baseline_and_A)

The p-values are greater for treatment A when I fit the model on the ExpressionSet containing only the baseline and treatment A samples (baseline_and_A) than the p-values when I fit the model on the ExpressionSet including baseline, treatment A, and treatment B samples (complete). The fold changes are identical.

Why does the presence of the treatment B samples change the p-values for treatment A, given that the treatment B samples are not involved in the contrast? I am trying to determine which method is more appropriate to report.

Thank you in advance.

limma microarray • 990 views

ADD COMMENT • link updated 6.2 years ago by James W. MacDonald 68k • written 6.2 years ago by smhughes ▴ 10

score 2 · Answer 1 · 2019-05-01

When you fit a linear model and compute contrasts, the denominator of your statistic is based on the variance estimate from the model, which is based on the within-group variances (for all groups - you can think of it as being the average variance across all the groups). The sample variance isn't particularly efficient, meaning that it takes a relatively large set of values before it converges towards the underlying population variance you are trying to estimate, so using more groups is in general a good thing, because you have more data from which to estimate the sample variance.

This is not necessarily a good thing if the actual variance for one or more of the groups is different, because then you end up combining data that you should probably not be combining. But in practice this is often difficult to discern, given that people usually have very few samples per group, and given the low efficiency of the sample variance it's hard to say if the 'real' variance is different, or if it's just a difference in sampling. So in general it's probably best to include all your samples in the model because it should be more powerful.