Hi BioC community,
I have proteomic data for 2 conditions, A and B. 15 patients are in group A and 40 patients in group B. The proteome of each patient was measured at time 0 (before treatment) and time 22 (after treatment), so we study 110 samples corresponding to 55 patients (2 samples per patient).
The questions are:
- What are the DE genes between A group and B group at time 0 (A.0 vs B.0)?
- What are the DE genes between A group time 22 and A group time 0 (A.22 vs A.0)?
- What are the DE genes between B group time 22 and B group time 0 (B.22 vs B.0)?
- What are the DE genes for the comparison (B.22 - B.0) - (A.22 - A.0)?
This design is fully described in Section 9.7 "Multi-level experiments" of the limma user's guide.
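For reference, a minimal sketch of that Section 9.7-style setup in base R (toy factor levels of my own; the contrast names are my invention, and the limma calls are shown as comments since they need expression data):

```r
# Toy metadata: 4 patients per group, measured at t0 and t22 (hypothetical).
gp    <- factor(rep(c("A", "B"), each = 4))
time  <- factor(rep(c("t0", "t22"), 4))
Treat <- factor(paste(gp, time, sep = "."))

design <- model.matrix(~0 + Treat)
colnames(design) <- levels(Treat)   # A.t0, A.t22, B.t0, B.t22

# The four questions, as limma contrasts (requires limma):
# library(limma)
# cm <- makeContrasts(A0vsB0      = A.t0 - B.t0,
#                     A22vsA0     = A.t22 - A.t0,
#                     B22vsB0     = B.t22 - B.t0,
#                     Interaction = (B.t22 - B.t0) - (A.t22 - A.t0),
#                     levels = design)
```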
My problem is that the samples were processed in batches before I was in charge of this study, and I could only control the batch assignment for the last half of the samples.
So the design looks like this (for clarity, I show only some of the samples):
   gp time patient batch
1  B  t1   p1      B1
2  B  t2   p1      B2
3  B  t1   p2      B2
4  B  t2   p2      B2
5  B  t1   p3      B2
6  B  t2   p3      B3
7  B  t1   p4      B4
8  B  t2   p4      B4
9  B  t1   p5      B5
10 B  t2   p5      B5
11 B  t1   p6      B6
12 B  t2   p6      B6
13 A  t1   p7      B5
14 A  t2   p7      B5
15 A  t1   p8      B5
16 A  t2   p8      B5
17 A  t1   p9      B5
18 A  t2   p9      B5
19 A  t1   p10     B6
20 A  t2   p10     B6
21 A  t1   p11     B6
22 A  t2   p11     B6
23 A  t1   p12     B6
24 A  t2   p12     B6
The last batches, B5 and B6, have the same B/A proportion (here 1/3) and a 50/50 split of t1 and t2 samples. However, some batches (like B1 and B3) contain only one sample, and batch B2 contains only group B.
I take the batch effect into account in my model:
Treat <- factor(paste(gp,time,sep="."))
design <- model.matrix(~0+batch+Treat)
Surprisingly, I do not get an error or warning message. When you block by batch, you work within each batch to decrease the variability linked to the batch, but how does it work when a batch has only one sample? Is this sample implicitly excluded from all the analyses? And when a batch contains only group B? Are these samples excluded for the comparison (B.22 - B.0) - (A.22 - A.0) but included for the comparison B.22 vs B.0?
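As a quick check of the singleton-batch question, one can reconstruct the table above in base R with random data for one protein (my own check, not from the original post) and look at the fitted residuals:

```r
# Toy reconstruction of the 24-sample table (hypothetical data).
batch <- factor(c("B1","B2","B2","B2","B2","B3","B4","B4","B5","B5","B6","B6",
                  rep(c("B5", "B6"), each = 6)))
gp    <- factor(rep(c("B", "A"), each = 12))
time  <- factor(rep(c("t1", "t2"), 12))
Treat <- factor(paste(gp, time, sep = "."))
design <- model.matrix(~0 + batch + Treat)

set.seed(1)
y   <- rnorm(24)            # fake log-intensities for one protein
fit <- lm.fit(design, y)

# The lone samples in B1 (row 1) and B3 (row 6) are fitted exactly by their
# own batch coefficients, so they contribute nothing to the variance estimate.
abs(fit$residuals[c(1, 6)]) < 1e-10
```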
If they are not included in the computations, what would you suggest to "save" these samples? Merging batches together?
Thanks in advance for your suggestions and help,
Eléonore
Dear Aaron and svlachavas,
Thanks for your quick answers.
If I understand correctly, Aaron recommends using the two following approaches and seeing what happens:

1. design <- model.matrix(~0+batch+Treat) with patient in duplicateCorrelation
2. design <- model.matrix(~0+Treat) with batch in duplicateCorrelation
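The two approaches can be sketched as follows (toy metadata matching the table above; the limma calls are shown as comments since they need the expression matrix y):

```r
# Hypothetical metadata for the 24 samples shown earlier.
gp      <- factor(rep(c("B", "A"), each = 12))
time    <- factor(rep(c("t1", "t2"), 12))
patient <- factor(rep(paste0("p", 1:12), each = 2))
batch   <- factor(c("B1","B2","B2","B2","B2","B3","B4","B4","B5","B5","B6","B6",
                    rep(c("B5", "B6"), each = 6)))
Treat   <- factor(paste(gp, time, sep = "."))

# Approach 1: batch in the design, patient as the duplicateCorrelation block.
design1 <- model.matrix(~0 + batch + Treat)
# corfit1 <- duplicateCorrelation(y, design1, block = patient)
# fit1    <- lmFit(y, design1, block = patient, correlation = corfit1$consensus)

# Approach 2: treatment only in the design, batch as the block.
design2 <- model.matrix(~0 + Treat)
# corfit2 <- duplicateCorrelation(y, design2, block = batch)
# fit2    <- lmFit(y, design2, block = batch, correlation = corfit2$consensus)
```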
To be sure I have understood correctly, could you correct me if I am wrong:

The first approach has the advantage of taking the pairing between samples into account (with duplicateCorrelation on patient) but does not use the lone samples in the differential analyses. In the second approach, the pairing is not taken into account but the lone samples are used. If the patient-to-patient variability is high, the second approach will be less powerful since the patient variability is not removed.

I understand that the patient variability is removed by duplicateCorrelation in the first approach, but I do not understand very well why the patient variability is "absorbed" in the second approach (sorry for this)?
Thanks again
Eléonore
Yes, that's right. And for the second approach, you're treating samples from different patients as direct replicates because you're not blocking on batch anymore. This means that the variance estimate will include (i.e., absorb) patient-to-patient variability in absolute expression. duplicateCorrelation does not protect against this; rather, it protects against inappropriate estimation of the variance or model coefficients, by ensuring that samples from the same batch are not treated as being independent (which they aren't, because of the underlying batch effect).

Edit: Actually, duplicateCorrelation does okay with high batch variability; even though the variance estimate increases, this is offset by a reduction in the standard error of the coefficient estimates, and even more so by accounting for the correlation within batches when samples from the same batch are compared. The resulting p-value can then be lower than that of the first approach. Of course, this isn't a free lunch, as some assumptions are involved; see my edited answer above.

Edit #2: Upon rereading your post, I realize that you have multiple patients per condition for B5 and B6. Thus, it would seem better to block on Patient and ignore Batch, as the former is nested within the latter. This gives two approaches: one where you block on Patient explicitly in the design matrix (and don't use duplicateCorrelation at all), and another where you only have the treatment in the design matrix and block on Patient in duplicateCorrelation. There's no point or need to account for the batch effect when it gets absorbed by the patient effect, as both are nuisance parameters.

I understand what you say for batches 5 and 6, but for the other batches (batches 2 and 3, for example), the 2 samples from the same patient are not necessarily in the same batch... so not taking the batch effect into account is not correct, because differences within a patient can be due to the batch effect.
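The nesting claim for B5/B6 can be checked directly in base R: when both of a patient's samples sit in one batch, the batch indicator is just the sum of its patients' indicators, so adding batch to a patient-blocked design adds no information (a sketch with hypothetical labels matching the table):

```r
# Patients p7..p12, each fully contained in batch B5 or B6.
patient <- factor(rep(paste0("p", 7:12), each = 2))
batch   <- factor(rep(c("B5", "B6"), each = 6))

X  <- model.matrix(~0 + patient)                # patient blocking terms
Xb <- cbind(X, model.matrix(~0 + batch))        # add batch terms on top

# Each batch column is a sum of patient columns, so the rank is unchanged:
qr(X)$rank == qr(Xb)$rank
```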
Good grief, what an illogical experimental design; I didn't see that detail. Well, you'll have to block on something like:

... in either the design matrix or via duplicateCorrelation. There's no value in blocking on patient and batch separately in the design matrix, as the blocking term for batch 3 is free to vary such that it will always fit observation 6 perfectly (and observation 5 will be fitted perfectly by the blocking term for patient 3).

Alternatively, blocking on total.block in duplicateCorrelation is simpler than your two-step approach where you block on batch in the design followed by blocking on patient with duplicateCorrelation. I guess the difference lies in whether you want to assume that the batch effect is homogeneous or not. If there are large differences between batches (e.g., outlier batches or whatever), then putting batch in the design might be safer.
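The snippet defining total.block is not shown above; one plausible reading (my own hypothetical reconstruction) is a factor that merges batches linked by a patient whose two samples landed in different batches. Here p1 links B1 to B2 and p3 links B2 to B3, so B1/B2/B3 would collapse into one block:

```r
# Hypothetical sketch only -- the original definition of total.block is elided.
batch  <- factor(c("B1","B2","B2","B2","B2","B3","B4","B4","B5","B5","B6","B6",
                   rep(c("B5", "B6"), each = 6)))
merged <- c(B1 = "blk1", B2 = "blk1", B3 = "blk1",   # linked via p1 and p3
            B4 = "blk2", B5 = "blk3", B6 = "blk4")
total.block <- factor(merged[as.character(batch)])

# Then either put total.block in the design alongside Treat, or block on it:
# corfit <- duplicateCorrelation(y, design, block = total.block)
# fit    <- lmFit(y, design, block = total.block,
#                 correlation = corfit$consensus)
```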