Question: limma: include samples from multiple treatments?
gravatar for smhughes
7 months ago by
smhughes10 wrote:

I have Illumina HT-12 microarray data from a crossover study. There are samples at study start, after treatment period 1, and after treatment period 2. All subjects had samples at each time point. All subjects had treatment A and treatment B (half of the subjects had Aduring treatment period 1 followed by B during treatment period 2; half had B first followed by A). We are interested in the contrast of Treatment A relative to baseline (study enrollment timepoint, before any treatment) and Treatment B relative to baseline.

I noticed that the p-values reported by topTable() differ for treatment A depending on whether or not the treatment B samples are present in the ExpressionSet. For example:

# Function for fitting the model
make_fit <- function(data) {
   trt_ <- factor(pData(data)$TreatmentCode)
   ptid <- factor(pData(data)$Subject)
   design <- model.matrix(~ 0 + trt_ + Subject)

   fit <- lmFit(data, design)

   contrasts <- makeContrasts(
      A = trt_A - trt_Baseline,
      levels = design)

   fit <- eBayes(, fit = fit))

   topTable(fit, coef = "A")   

complete <- data # all of the data (baseline, treatment A and treatment B)

# just the baseline and treatment A samples
baseline_and_A <- data[ , pData(data)$TreatmentCode %in% c("Baseline", "A")]

# apply the make_fit function defined above to the completed and subsetted data

The p-values are greater for treatment A when I fit the model on the ExpressionSet containing only the baseline and treatment A samples (baseline_and_A) than the p-values when I fit the model on the ExpressionSet including baseline, treatment A, and treatment B samples (complete). The fold changes are identical.

Why does the presence of the treatment B samples change the p-values for treatment A, given that the treatment B samples are not involved in the contrast? I am trying to determine which method is more appropriate to report.

Thank you in advance.

microarray limma • 118 views
ADD COMMENTlink modified 7 months ago by James W. MacDonald52k • written 7 months ago by smhughes10
Answer: limma: include samples from multiple treatments?
gravatar for James W. MacDonald
7 months ago by
United States
James W. MacDonald52k wrote:

When you fit a linear model and compute contrasts, the denominator of your statistic is based on the variance estimate from the model, which is based on the within-group variances (for all groups - you can think of it as being the average variance across all the groups). The sample variance isn't particularly efficient, meaning that it takes a relatively large set of values before it converges towards the underlying population variance you are trying to estimate, so using more groups is in general a good thing, because you have more data from which to estimate the sample variance.

This is not necessarily a good thing if the actual variance for one or more of the groups is different, because then you end up combining data that you should probably not be combining. But in practice this is often difficult to discern, given that people usually have very few samples per group, and given the low efficiency of the sample variance it's hard to say if the 'real' variance is different, or if it's just a difference in sampling. So in general it's probably best to include all your samples in the model because it should be more powerful.

ADD COMMENTlink written 7 months ago by James W. MacDonald52k

Thank you very much for your helpful explanation!

ADD REPLYlink written 7 months ago by smhughes10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 220 users visited in the last hour