Search
Question: Batch with only one sample or condition
1
2.5 years ago by
France
eleonoregravier40 wrote:

Hi BioC community,

I have proteomic data for 2 conditions A and B. 15 patients are included in A group and 40 patients in B group. The proteome of each patient was measured at time 0 (before treatment) and time 22 (after treatment). So we study 110 samples, corresponding to 55 patients (2 samples by patient).

The questions are:

• What are the DE genes between A group and B group at time 0 (A.0 vs B.0)?
• What are the DE genes between A group time 22 and A group time 0 (A.22 vs A.0)?
• What are the DE genes between B group time 22 and B group time 0 (B.22 vs B.0)?
• What are the DE genes for the comparison (B.22 - B.0)-(A.22 - A.0) ?

This design is completely described in the paragraph “9.7 multi-level experiment” of limma user’s guide.

My problem is that the samples were processed in batches before I was in charge of this study, and I could just control the assignment to batches of the last half of the samples.

So the design generates something like that (to be clear, I just show some of the samples):

    gp   time patient batch

1   B    t1      p1    B1

2   B    t2      p1    B2

3   B    t1      p2    B2

4   B    t2      p2    B2

5   B    t1      p3    B2

6   B    t2      p3    B3

7   B    t1      p4    B4

8   B    t2      p4    B4

9   B    t1      p5    B5

10  B    t2      p5    B5

11  B    t1      p6    B6

12  B    t2      p6    B6

13  A    t1      p7    B5

14  A    t2      p7    B5

15  A    t1      p8    B5

16  A    t2      p8    B5

17  A    t1      p9    B5

18  A    t2      p9    B5

19  A    t1     p10    B6

20  A    t2     p10    B6

21  A    t1     p11    B6

22  A    t2     p11    B6

23  A    t1     p12    B6

24  A    t2     p12    B6

The last batches B5 and B6 have the same proportion B/A, here : 1/3 and 50% of T1, 50% of T2. However, some batches (like B1 and B3) have only one sample. The batch B2 have only B group.

I take into account the batch effect in my model :

Treat <- factor(paste(gp,time,sep="."))

design <- model.matrix(~0+batch+Treat)

Surprisingly, I do not got an error or warning message. Indeed, when you block by batch, you work within each batch to decrease variability linked to the batch, but how does it work when you have only one sample in the batch? Is this sample implicitly excluded from all the analyses? And when you have only B group in a batch? Are these samples excluded for the comparison (B.22 - B.0)-(A.22 - A.0) but included for the comparison B.22 vs B.0?

If they are not included in computations, how can you suggest to “save” these samples? Merge batches together?

Eléonore

modified 2.5 years ago by Gordon Smyth33k • written 2.5 years ago by eleonoregravier40
3
2.5 years ago by
Aaron Lun19k
Cambridge, United Kingdom
Aaron Lun19k wrote:

I don't understand why you would expect an error message. All coefficients are estimable in your model. The first sample is used to estimate the batch effect of B1 (similarly, the lone B3 sample is used to estimate that batch effect), and will not be used in any of the DE comparisons. If a batch only contains B samples, it will not participate in direct comparisons between A and B (e.g., A.0 - B.0) as the batch term will absorb any DE between A and B for that group of samples. However, it can still participate in differential log-fold change comparisons like (B.22 - B.0)-(A.22-A.0); you can still estimate B.22-B.0 based on samples within that batch.

As Efstathios has suggested, you will need to block on batch in duplicateCorrelation, rather than in the design matrix, in order to use the batches containing only one sample. This generally results in an increase in detection power, because you can use more information for DE testing when you don't need to estimate the batches. However, it does require that the batch effect is more-or-less homogeneous, i.e., you don't have systematic differences between batches or several outlier batches with aberrant expression profiles. Explicit blocking in the design does better in such cases.

So, it may be better to just accept that you can't use those samples - you'll have to test both approaches and see what happens. The best approach would be to tell the people who generate your data to do a better job with their sample collection.

Dear Aaron and svlachavas,

If I well understand, Aaron recommends to use the two following approaches and see what happens :

design <- model.matrix(~0+batch+Treat) and patient in duplicateCorrelation

design <- model.matrix(~0+Treat) and batch in duplicateCorrelation

To be sure to have correctly understand what is said, could you correct me if I am wrong :

The first approach has the advantage to take into account the pairing between samples (with duplicateCorrelation on patient) but do not use the lone samples in differential analyses.

In the second approach, pairing is not taken into account but lone samples are used.  If the patient-to-patient variability is high, the second approach will be less powerful since the patient variability is not removed.

I understand that patient variability is removed by duplicate correlation in the first approach but do not understand very well why the patient variability is "absorbed" in the second approach (sorry for this) ?

Thanks again

Eléonore

ADD REPLYlink modified 2.5 years ago by Ryan C. Thompson6.7k • written 2.5 years ago by eleonoregravier40

Yes, that's right. And for the second approach, you're treating samples in different patients as direct replicates because you're not blocking on the batch anymore. This means that the variance estimate will include (i.e., absorb) patient-to-patient variability in absolute expression. duplicateCorrelation does not protect against this; rather, it protects against inappropriate estimation of the variance or model coefficients, by ensuring that samples from the same batch are not treated as being independent (which they aren't, because of the underlying batch effect).

Edit: Actually, duplicateCorrelation does okay with high batch variability; even though the variance estimate increases, this is offset by a reduction in the standard error of the coefficient estimates, and even more so by accounting for the correlation within batches when samples from the same batch are compared. The resulting p-value can then be lower than that of the first approach. Of course, this isn't a free lunch, as some assumptions are involved; see my edited answer above.

Edit #2: Upon rereading of your post, I realize that you have multiple patients per condition for B5 and B6. Thus, it would seem better to block on Patient and ignore Batch, as the former is nested within the latter. This would give two approaches, where you block on Patient explicitly in the design matrix (and don't use duplicateCorrelation at all); and another where you only have the treatment in the design matrix, and block on Patient in duplicateCorrelation. There's no point or need to account for the batch effect when it gets absorbed by the patient effect, as both are nuisance parameters.

I understand what you say for batches 5 and 6 but for the other batches (batch 2 and 3 for example), there is not necessary the 2 samples from the same patient in the same batch... so not take into account batch effect is not correct because differences within a patient can be due to batch effect.

Good grief, what an illogical experimental design; I didn't see that detail. Well, you'll have to block on something like:

total.block <- paste0(patient, ".", batch)

... in either the design matrix or via duplicateCorrelation. There's no value from blocking on patient and batch separately in the design matrix, as the blocking term for batch 3 is free to vary such that it will always fit observation 6 perfectly (and observation 5 will be fitted perfectly from the blocking term for patient 3).

Alternatively, blocking on total.block in duplicateCorrelation is simpler than your two-step approach where you block on batch in the design followed by blocking on patient with duplicateCorrelation. I guess the difference lies in whether you want to assume that the batch effect is homogeneous or not. If there are large differences between batches (e.g., outlier batches or whatever), then putting batch in the design might be safer.

1
2.5 years ago by
Gordon Smyth33k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth33k wrote:

Dear Eleonore,

One of the key points of Section 9.7 on multi-level experiments is that you have to use duplicateCorrelation() to handle the correlation between patients. This is compulsory if you want to analyse the study as a multilevel model.

This means amongst other things that you cannot use duplicateCorrelation() for the batches, nor is this likely to be appropriate.

The experiment has a serious problem is that batches B1-B4 apply only to group A and not group B. If you attempt to correct for batch effects, then the samples in batches B1-B4 will essentially be wasted for comparing group A to group B.

You will need to make decisions about what compromises to make in the analysis. It would all be much easier if the batch effects were ignorable. The way to start would be to make an MDS plot of all the samples, to see which effects are important and which are not.

Thanks al lot Gordon and Aaron for your valued help

Eléonore

Hi again,

The last samples (batches B5 and B6) are not yet processed in spectrometry. I think it would be possible to ask for processing again some of the samples already processed in batches B1 to B4 within the batches B5 and B6.

It could simply "save" these samples but I am wondering if it could also be used to correct the batch effects for B1 to B4 and consequently save ALL the samples in batches B1 to B4. In this case, how can I copute the batch effect and which samples do I have to process again ?

Thanks again

Eléonore

I doubt it. You'd end up with a mix of technical and biological replicates scattered across several batches, which does not make it easy to model. Even if you could get rid of the batch effect, you'd still have the patient effect that's nested within it. So, at the end of the day, you'll still have to block on something.

0
2.5 years ago by
svlachavas570
Greece/Athens/National Hellenic Research Foundation
svlachavas570 wrote:

Dear  Eleonoregravier,

just my opinion on this matter, as your experiment is a bit tricky to interpret---so the specialists i believe will provide valuable feedback about the actual interpretation of your design matrix.

Firstly, as you mentioned the multi-level experiment--thus you have both within and between subject comparisons(gp and time, if i have understood well), you will have to use duplicateCorrelation function to treat the variable patient as a random effect.

Secondly, regarding the batch "confounder" variable-the most "appropriate" case is to include the batch variable to your design matrix, especially when batches are of "equal size", which is definately not in your case(example batch B1). However, im not certain if again in this special case, limma should work again just fine. But have you already created some diagnostic plots, like MDS or PCA plots, even cluster your samples and inspect how severe this possible "batch effect" is ?

Finally, regarding your comment of the samples in the last comparison: as(and if) you use all of your samples in your design matrix, then they will all contribute to the estimated variance of each gene. In other words, even if you explicitly compare specific samples of interest-or groups-still the other samples, will still influence the final variance of each gene of your comparisons(if i haven't described it appropriately, please anyone feel free to correct me).

Hope this helps for the start,

Efstathios

1

Not all samples will contribute to the variance. The B1 and B3 samples will not contribute any residual d.f. as they will always be fitted perfectly based on the estimated values of the batch coefficients for B1 and B3.

Dear Aaron than you for your correction !! to be certain of your notion, you mean that these specific samples will not contribute to the variance, based on the above design matrix ? I.e. design <- model.matrix(~0+batch+Treat) due to the batch variable that "will absorve any influence of any samples" that do not participate on the above mentioned comparison (i.e. (B.22 - B.0)-(A.22 - A.0) ) ??

Best,

Efstathios

1

That's right. All of the information in those samples is used to estimate the batch effect, there's nothing left to test DE.