Question

Using duplicateCorrelation to handle replicate samples

0

Entering edit mode

rwalke13 • 0

@rwalke13-12983

Last seen 7.0 years ago

Dear all,

I am analysing some data from a DNA methylation array study. The comparison of interest is between DNA methylation in neurons that were differentiated from neuronal precursor cells (NPCs) made by differentiating iPSCs from cases (n = 4) and controls (n = 4). For each individual, between two and six DNA samples were assessed. For a given individual, these DNA samples were obtained from neurons that were grown on separate occasions for varying lengths of time (different passage numbers). For some individuals, methylation was measured in neurons differentiated from independent derivations of NPCs. Unfortunately a lot of the information on the passage number of the neurons and the identity of the NPC source of the neurons was not supplied.

This means that my dataset looks like this:

Sample_Name	Slide	Sentrix_Position	Individual	Sex	Case_status	Passage number	NPC ID
1	1	R01C01	1	M	Control		1_1
2	1	R02C01	1	M	Control		1_2
3	1	R03C01	1	M	Control		1_2
4	1	R04C01	2	M	Case		2_2
5	1	R05C01	2	M	Case	14	2_1
6	1	R06C01	2	M	Case	14	2_2
7	1	R01C02	2	M	Case		2_1
8	1	R02C02	2	M	Case	14	2_1
9	1	R03C02	2	M	Case	14	2_2
10	2	R01C02	3	F	Case		3_1
11	2	R02C02	3	F	Case
12	2	R03C02	3	F	Case
13	2	R04C02	4	F	Control
14	2	R05C02	4	F	Control	13	4_1
15	2	R06C02	4	F	Control	12	4_2
16	3	R01C01	5	F	Case
17	3	R02C01	5	F	Case
18	3	R03C01	5	F	Case	10	5_1
19	3	R04C01	6	M	Control
20	3	R05C01	6	M	Control		6_1
21	3	R06C01	6	M	Control
22	4	R02C02	7	F	Control	23	7_1
23	4	R03C02	7	F	Control
24	4	R04C02	8	M	Case
25	4	R05C02	8	M	Case
26	4	R06C02	8	M	Case

I have been trying to work out how best to treat the replicate arrays from each individual. The fact that the replicate arrays from each individual are not strictly technical replicates has made me reluctant to simply average them prior to carrying out analysis of differential methylation (although I have tried this approach). I, therefore, wondered if duplicateCorrelation, with individual as the blocking factor, might be a reasonable thing to do.

So far, I have carried out the case-control comparison in two ways:

1. averaging across the arrays for each individual, as follows (averaged_phenodata is a matrix where each individual is represented by one row and averaged_meth_data is average array data for each individual obtained using avearrays):

design <- model.matrix(~Case_status + Sex, data = averaged_phenodata)

fit <- lmfit(averaged_meth_data, design)

2. Using duplicateCorrelation, as follows:

design <- model.matrix(~Case_status + Sex, data = phenodata)

corfit <- duplicateCorrelation(meth_data, design, block=phenodata$Individual)

fit <- lmFit(meth_data, design, block=phenodata$Individual, correlation=corfit$consensus)

The results obtained from the two approaches are similar in terms of the identities of the top ranked loci but duplicateCorrelation results in many more significant results.

My questions are:

1. Is this, in theory, a legitimate use of duplicateCorrelation? Does the fact that the expected correlation between the replicates from each individual is not necessarily the same (due to differing levels of similarity of the replicates) invalidate the use of the consensus correlation?

2. Have I gone about implementing it correctly?

Any guidance would be very much appreciated.

Thanks,

Rosie

limma duplicatecorrelation replicates • 2.1k views

ADD COMMENT • link 7.0 years ago rwalke13 • 0

score 4 · Accepted Answer · 2017-05-08

4

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 2 hours ago

The city by the bay

Yes, the unknown number of passages could be an issue as it may introduce additional correlations between samples. For example, samples from different patients may be correlated if they have been passaged a similar number of times - this would be totally missed by the model. This may or may not be a realistic concern, based on the effect of passaging in your particular system. If the effect is weak, it may be fine to ignore it.

In any case, the use of duplicateCorrelation is best suited to situations where correlations are present across different levels of experimental factors (e.g., sex, group). Here, your are modelling correlations between samples that belong to the same patient, which have the same levels based on the level of the patient.

So, you might as well average the observations for each sample; this avoids the assumptions (and computational work) involved in running duplicateCorrelation. Averaging may also provide a little bit of protection against the passage problem. If you can assume that the number of passages is randomly distributed across samples, it should cancel out between patients if you take the average of samples within each patient.

There are several additional points to note with your first approach. The first is that you can explicitly block on the Slide number, which may be a batch effect contributing to some unwanted variance. The second is that you can run arrayWeights to account for the fact that patients have different numbers of averaged samples (and presumably different precision, i.e., more samples = less variance).

See also A: limma - technical replicates: duplicateCorrelation() or avereps()?.

ADD COMMENT • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks Aaron. I'll try out your suggestions re. the first approach.

I note your point about duplicateCorrelation being best suited to situations where correlation is present across different levels of experimental factors; however, for future reference, I just wanted to check whether (in light of Gordon's comment in the related question that you link to, where he notes that "duplicateCorrelation() is intended for more complicated situations in which the technical replicates can't be averaged without losing some information about the treatments"), duplicateCorrelation would become more useful in a situation where all the passage number information, for example, was available and it was desirable to enter this as a covariate?

Many thanks for your help,

Rosie

ADD REPLY • link 7.0 years ago rwalke13 • 0

0

Entering edit mode

I don't think so, because the number of passages isn't something of scientific interest, it's just a nuisance variable. The samples are still nested within the factors of interest, so the averaging approach is still recommended. Of course, if the passaging effect is strong, then there is an argument for keeping the samples separate and explicitly including the passage number as a covariate while blocking on the patient with duplicateCorrelation. This will model strong passage effects accurately (even more so if you use a spline). However, as mentioned above and in Gordon's reply, duplicateCorrelation doesn't come for free - either statistically or computationally - so it becomes a case of the lesser of two evils.

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks again for your advice-it's really helpful.

Rosie

ADD REPLY • link 7.0 years ago rwalke13 • 0