Question

Unbalanced experiment with multiple samples from each patient.

0

Tore • 0

@tore-14664

Saitama, Japan

I'm back from a hiatus from expression analysis and am faced with analyzing a messy gene expression experiment. Any advice would be much appreciated.

The samples consist of multiple normal and diseased regions from multiple individuals. The number of samples varies between donors:

Donor	# healthy samples	# diseased samples
A	1	1
B	2	4
C	3	1
D	2	3
E	2	3
F	1	5

As you can see, it is a very unbalanced setup. My current plan is to simply include the donor in the model:

expression ~ 0 + donor + disease

Is there some better method involving blocking on the patient and using duplicateCorrelation?

Tore

limma • 1.3k views

updated 6.3 years ago by Aaron Lun ★ 28k • written 6.3 years ago by Tore • 0

score 3 · Accepted Answer · 2017-12-20

Aaron Lun ★ 28k

@alun

The city by the bay

There are two choices here. The first is to use duplicateCorrelation:

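# Additive model: one coefficient per donor plus the disease effect
design <- model.matrix(~0 + donor + disease)
# One block per donor/disease combination
block <- paste0(donor, ".", disease)
# Estimate the consensus correlation between samples in the same block
dc <- duplicateCorrelation(y, design=design, block=block)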

The additive formulation design is fairly self-explanatory, but block is less obvious. It aims to account for correlations between samples with the same combination of donor/disease, assuming that they are technical replicates. Donor-wide correlations are already handled by the donor terms in design, while disease-wide correlations are indistinguishable from your disease effect and should not be removed.
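A minimal sketch of how the estimated correlation might then be passed on to the linear model fit. It assumes simulated log-expression values in `y` (200 hypothetical genes) and the sample counts from the table in the question; the factor construction and the coefficient name `diseasediseased` follow from those assumptions and are not part of the original post.

```r
library(limma)  # duplicateCorrelation() also needs the statmod package

set.seed(1)

# Sample counts per donor, copied from the table in the question
n_healthy  <- c(A=1, B=2, C=3, D=2, E=2, F=1)
n_diseased <- c(A=1, B=4, C=1, D=3, E=3, F=5)

donor <- factor(c(rep(names(n_healthy), n_healthy),
                  rep(names(n_diseased), n_diseased)))
disease <- factor(c(rep("healthy", sum(n_healthy)),
                    rep("diseased", sum(n_diseased))),
                  levels=c("healthy", "diseased"))

# Simulated log-expression matrix (assumption): 200 genes x 28 samples
y <- matrix(rnorm(200 * length(donor)), nrow=200)

design <- model.matrix(~0 + donor + disease)
block <- paste0(donor, ".", disease)

# Consensus correlation between samples sharing a donor/disease combination
dc <- duplicateCorrelation(y, design=design, block=block)

# Supply the same blocking factor and the consensus correlation to lmFit
fit <- lmFit(y, design, block=block, correlation=dc$consensus.correlation)
fit <- eBayes(fit)

# Genes ranked by evidence for the disease effect
tab <- topTable(fit, coef="diseasediseased")
```

With real data, `y` would be the normalized expression matrix (or an `EList`) and `donor`/`disease` would come from the sample annotation.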

The second and simpler approach is to just average all samples with the same block together:

y0 <- avearrays(y, ID=block)

This avoids having to deal with correlations; each set of correlated samples is now a single averaged sample, and the averaged samples are independent of each other conditional on the design matrix. You can analyze this with a simple additive design where the number of samples is twice the number of donors (one averaged healthy/diseased sample per donor). For extra sophistication, run arrayWeights to account for the different variances of averaged observations derived from different numbers of samples.
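A rough sketch of the averaging approach, again assuming simulated data in `y` and the sample counts from the question's table. The names `donor0`, `disease0` and `design0` are hypothetical, introduced here only to describe the averaged samples.

```r
library(limma)

set.seed(1)

# Sample counts per donor, copied from the table in the question
n_healthy  <- c(A=1, B=2, C=3, D=2, E=2, F=1)
n_diseased <- c(A=1, B=4, C=1, D=3, E=3, F=5)

donor <- factor(c(rep(names(n_healthy), n_healthy),
                  rep(names(n_diseased), n_diseased)))
disease <- factor(c(rep("healthy", sum(n_healthy)),
                    rep("diseased", sum(n_diseased))),
                  levels=c("healthy", "diseased"))

# Simulated log-expression matrix (assumption): 200 genes x 28 samples
y <- matrix(rnorm(200 * length(donor)), nrow=200)

# Average all samples sharing a donor/disease combination
block <- paste0(donor, ".", disease)
y0 <- avearrays(y, ID=block)

# Annotation for the averaged samples, in order of first appearance of each block
first <- !duplicated(block)
donor0 <- donor[first]
disease0 <- disease[first]

# Additive design on the 12 averaged samples
design0 <- model.matrix(~0 + donor0 + disease0)

# Estimate relative sample precisions and use them as weights in the fit
aw <- arrayWeights(y0, design0)
fit0 <- lmFit(y0, design0, weights=aw)
fit0 <- eBayes(fit0)

tab0 <- topTable(fit0, coef="disease0diseased")
```

The averaged matrix has one column per donor/disease combination, so six donors give twelve columns and the additive design leaves five residual degrees of freedom.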

The first approach will consider the variance between samples in the same block, whereas the second approach does not. Considering that variance does provide some benefit, as it encourages consistent measurements across the same donor/disease combination; however, it comes at the cost of other statistical issues, namely anticonservativeness (see A: limma - technical replicates: duplicateCorrelation() or avereps()? for some thoughts on this). Personally, I would go with the second approach - it's a lot faster, too.

6.3 years ago Aaron Lun ★ 28k

1

Great answer, as usual, but I think the OP should pay particular attention to the "assuming they are technical replicates" bit of this answer. It might be useful for the OP to elaborate on how repeated "healthy" or "diseased" measures from the same patient are ending up in these data.

For instance, imagine that "diseased" meant "tumor" in this question and we have two separate biopsies from the same tumor. I wouldn't call these technical replicates, so I'm not sure that using duplicateCorrelation in the way you describe here makes sense ... right?

6.3 years ago Steve Lianoglou ★ 13k

1

The use of duplicateCorrelation should still be valid with separate biopsies - all the function does is model the correlations, and if you're taking repeated biopsies (i.e., measurements) from the same tumour, there most likely will be correlations between the resulting samples. I just mentioned technical replicates because that was easier to explain. The correlations would probably be much smaller with biopsies, though.

6.3 years ago Aaron Lun ★ 28k

0

Thank you very much for your comments. The diseased samples are biopsies from separate diseased regions of the same organ (not cancer). I think I will go with `avearrays` and `arrayWeights` as Aaron suggested. Am I correct in assuming that `arrayWeights` will detect the reduced variance of my averaged samples and thus increase the weight for these samples when running `lmFit`?

On a related note, would it be possible to use `arrayWeights` to compute weights for `avearrays` as well?

6.3 years ago Tore • 0

1

For your first question, yes. For your second question... I suppose you could apply arrayWeights to the original samples (using a one-way layout with block) and then use those weights in avearrays. This would downweight outliers within each donor/disease combination, but it would only be effective when you have 3 or more samples for that combination.
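A tentative sketch of that second idea, using a hypothetical layout in which every donor/disease combination has exactly 3 samples (the real data set has combinations with only 1 sample, where within-combination weights cannot be informative). The simulated matrix `y`, the block labels and the names `oneway` and `w` are all assumptions made for illustration.

```r
library(limma)

set.seed(1)

# Hypothetical layout (assumption): 4 donor/disease combinations, 3 samples each
block <- rep(c("B.healthy", "B.diseased", "D.healthy", "D.diseased"), each=3)

# Simulated log-expression matrix (assumption): 200 genes x 12 samples
y <- matrix(rnorm(200 * length(block)), nrow=200)

# One-way layout: one coefficient per donor/disease combination
oneway <- model.matrix(~0 + factor(block))

# Sample weights relative to the other samples in the same combination
aw <- arrayWeights(y, oneway)

# avearrays() documents 'weights' as a matrix with the same dimensions as 'y',
# so repeat the per-sample weights across all genes
w <- matrix(aw, nrow=nrow(y), ncol=ncol(y), byrow=TRUE)

# Weighted average within each combination; outlying samples contribute less
y0 <- avearrays(y, ID=block, weights=w)
```

The resulting `y0` has one column per combination and could then be analyzed with the additive design described in the answer.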

6.3 years ago Aaron Lun ★ 28k