Question: Using duplicateCorrelation to handle replicate samples
0
2.5 years ago by
rwalke130
rwalke130 wrote:

Dear all,

I am analysing some data from a DNA methylation array study. The comparison of interest is between DNA methylation in neurons that were differentiated from neuronal precursor cells (NPCs) made by differentiating iPSCs from cases (n = 4) and controls (n = 4). For each individual, between two and six DNA samples were assessed. For a given individual, these DNA samples were obtained from neurons that were grown on separate occasions for varying lengths of time (different passage numbers). For some individuals, methylation was measured in neurons differentiated from independent derivations of NPCs. Unfortunately a lot of the information on the passage number of the neurons and the identity of the NPC source of the neurons was not supplied.

This means that my dataset looks like this:

 Sample_Name Slide Sentrix_Position Individual Sex Case_status Passage number NPC ID 1 1 R01C01 1 M Control 1_1 2 1 R02C01 1 M Control 1_2 3 1 R03C01 1 M Control 1_2 4 1 R04C01 2 M Case 2_2 5 1 R05C01 2 M Case 14 2_1 6 1 R06C01 2 M Case 14 2_2 7 1 R01C02 2 M Case 2_1 8 1 R02C02 2 M Case 14 2_1 9 1 R03C02 2 M Case 14 2_2 10 2 R01C02 3 F Case 3_1 11 2 R02C02 3 F Case 12 2 R03C02 3 F Case 13 2 R04C02 4 F Control 14 2 R05C02 4 F Control 13 4_1 15 2 R06C02 4 F Control 12 4_2 16 3 R01C01 5 F Case 17 3 R02C01 5 F Case 18 3 R03C01 5 F Case 10 5_1 19 3 R04C01 6 M Control 20 3 R05C01 6 M Control 6_1 21 3 R06C01 6 M Control 22 4 R02C02 7 F Control 23 7_1 23 4 R03C02 7 F Control 24 4 R04C02 8 M Case 25 4 R05C02 8 M Case 26 4 R06C02 8 M Case

I have been trying to work out how best to treat the replicate arrays from each individual. The fact that the replicate arrays from each individual are not strictly technical replicates has made me reluctant to simply average them prior to carrying out analysis of differential methylation (although I have tried this approach). I, therefore, wondered if duplicateCorrelation, with individual as the blocking factor, might be a reasonable thing to do.

So far, I have carried out the case-control comparison in two ways:

1. averaging across the arrays for each individual, as follows (averaged_phenodata is a matrix where each individual is represented by one row and averaged_meth_data is average array data for each individual obtained using avearrays):

design <- model.matrix(~Case_status + Sex, data = averaged_phenodata)

fit <- lmfit(averaged_meth_data, design)

2. Using duplicateCorrelation, as follows:

design <- model.matrix(~Case_status + Sex, data = phenodata)

corfit <- duplicateCorrelation(meth_data, design, block=phenodata$Individual) fit <- lmFit(meth_data, design, block=phenodata$Individual, correlation=corfit\$consensus)

The results obtained from the two approaches are similar in terms of the identities of the top ranked loci but duplicateCorrelation results in many more significant results.

My questions are:

1. Is this, in theory, a legitimate use of duplicateCorrelation? Does the fact that the expected correlation between the replicates from each individual is not necessarily the same (due to differing levels of similarity of the replicates) invalidate the use of the consensus correlation?

2. Have I gone about implementing it correctly?

Any guidance would be very much appreciated.

Thanks,

Rosie

modified 2.5 years ago • written 2.5 years ago by rwalke130
Answer: Using duplicateCorrelation to handle replicate samples
4
2.5 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

Yes, the unknown number of passages could be an issue as it may introduce additional correlations between samples. For example, samples from different patients may be correlated if they have been passaged a similar number of times - this would be totally missed by the model. This may or may not be a realistic concern, based on the effect of passaging in your particular system. If the effect is weak, it may be fine to ignore it.

In any case, the use of duplicateCorrelation is best suited to situations where correlations are present across different levels of experimental factors (e.g., sex, group). Here, your are modelling correlations between samples that belong to the same patient, which have the same levels based on the level of the patient.

So, you might as well average the observations for each sample; this avoids the assumptions (and computational work) involved in running duplicateCorrelation. Averaging may also provide a little bit of protection against the passage problem. If you can assume that the number of passages is randomly distributed across samples, it should cancel out between patients if you take the average of samples within each patient.

There are several additional points to note with your first approach. The first is that you can explicitly block on the Slide number, which may be a batch effect contributing to some unwanted variance. The second is that you can run arrayWeights to account for the fact that patients have different numbers of averaged samples (and presumably different precision, i.e., more samples = less variance).

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Aaron Lun25k

Thanks Aaron. I'll try out your suggestions re. the first approach.

I note your point about duplicateCorrelation being best suited to situations where correlation is present across different levels of experimental factors; however, for future reference, I just wanted to check whether (in light of Gordon's comment in the related question that you link to, where he notes that "duplicateCorrelation() is intended for more complicated situations in which the technical replicates can't be averaged without losing some information about the treatments"), duplicateCorrelation would become more useful in a situation where all the passage number information, for example, was available and it was desirable to enter this as a covariate?

Many thanks for your help,

Rosie

I don't think so, because the number of passages isn't something of scientific interest, it's just a nuisance variable. The samples are still nested within the factors of interest, so the averaging approach is still recommended. Of course, if the passaging effect is strong, then there is an argument for keeping the samples separate and explicitly including the passage number as a covariate while blocking on the patient with duplicateCorrelation. This will model strong passage effects accurately (even more so if you use a spline). However, as mentioned above and in Gordon's reply, duplicateCorrelation doesn't come for free - either statistically or computationally - so it becomes a case of the lesser of two evils.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Aaron Lun25k