Using duplicateCorrelation to handle replicate samples
Entering edit mode
rwalke13 • 0
Last seen 4.7 years ago

Dear all,

I am analysing some data from a DNA methylation array study. The comparison of interest is between DNA methylation in neurons that were differentiated from neuronal precursor cells (NPCs) made by differentiating iPSCs from cases (n = 4) and controls (n = 4). For each individual, between two and six DNA samples were assessed. For a given individual, these DNA samples were obtained from neurons that were grown on separate occasions for varying lengths of time (different passage numbers). For some individuals, methylation was measured in neurons differentiated from independent derivations of NPCs. Unfortunately a lot of the information on the passage number of the neurons and the identity of the NPC source of the neurons was not supplied. 

This means that my dataset looks like this:

Sample_Name Slide Sentrix_Position Individual Sex Case_status Passage number NPC ID
1 1 R01C01 1 M Control   1_1
2 1 R02C01 1 M Control   1_2
3 1 R03C01 1 M Control   1_2
4 1 R04C01 2 M Case   2_2
5 1 R05C01 2 M Case 14 2_1
6 1 R06C01 2 M Case 14 2_2
7 1 R01C02 2 M Case   2_1
8 1 R02C02 2 M Case 14 2_1
9 1 R03C02 2 M Case 14 2_2
10 2 R01C02 3 F Case   3_1
11 2 R02C02 3 F Case    
12 2 R03C02 3 F Case    
13 2 R04C02 4 F Control    
14 2 R05C02 4 F Control 13 4_1
15 2 R06C02 4 F Control 12 4_2
16 3 R01C01 5 F Case    
17 3 R02C01 5 F Case    
18 3 R03C01 5 F Case 10 5_1
19 3 R04C01 6 M Control    
20 3 R05C01 6 M Control   6_1
21 3 R06C01 6 M Control    
22 4 R02C02 7 F Control 23 7_1
23 4 R03C02 7 F Control    
24 4 R04C02 8 M Case    
25 4 R05C02 8 M Case    
26 4 R06C02 8 M Case    

I have been trying to work out how best to treat the replicate arrays from each individual. The fact that the replicate arrays from each individual are not strictly technical replicates has made me reluctant to simply average them prior to carrying out analysis of differential methylation (although I have tried this approach). I, therefore, wondered if duplicateCorrelation, with individual as the blocking factor, might be a reasonable thing to do.

So far, I have carried out the case-control comparison in two ways:

1. averaging across the arrays for each individual, as follows (averaged_phenodata is a matrix where each individual is represented by one row and averaged_meth_data is average array data for each individual obtained using avearrays):

design <- model.matrix(~Case_status + Sex, data = averaged_phenodata)

fit <- lmfit(averaged_meth_data, design)

2. Using duplicateCorrelation, as follows:

design <- model.matrix(~Case_status + Sex, data = phenodata)

corfit <- duplicateCorrelation(meth_data, design, block=phenodata$Individual)

fit <- lmFit(meth_data, design, block=phenodata$Individual, correlation=corfit$consensus)

The results obtained from the two approaches are similar in terms of the identities of the top ranked loci but duplicateCorrelation results in many more significant results. 

My questions are:

1. Is this, in theory, a legitimate use of duplicateCorrelation? Does the fact that the expected correlation between the replicates from each individual is not necessarily the same (due to differing levels of similarity of the replicates) invalidate the use of the consensus correlation?

2. Have I gone about implementing it correctly?

Any guidance would be very much appreciated.





limma duplicatecorrelation replicates • 1.1k views
Entering edit mode
Aaron Lun ★ 27k
Last seen 1 hour ago
The city by the bay

Yes, the unknown number of passages could be an issue as it may introduce additional correlations between samples. For example, samples from different patients may be correlated if they have been passaged a similar number of times - this would be totally missed by the model. This may or may not be a realistic concern, based on the effect of passaging in your particular system. If the effect is weak, it may be fine to ignore it.

In any case, the use of duplicateCorrelation is best suited to situations where correlations are present across different levels of experimental factors (e.g., sex, group). Here, your are modelling correlations between samples that belong to the same patient, which have the same levels based on the level of the patient.

So, you might as well average the observations for each sample; this avoids the assumptions (and computational work) involved in running duplicateCorrelation. Averaging may also provide a little bit of protection against the passage problem. If you can assume that the number of passages is randomly distributed across samples, it should cancel out between patients if you take the average of samples within each patient.

There are several additional points to note with your first approach. The first is that you can explicitly block on the Slide number, which may be a batch effect contributing to some unwanted variance. The second is that you can run arrayWeights to account for the fact that patients have different numbers of averaged samples (and presumably different precision, i.e., more samples = less variance).

See also A: limma - technical replicates: duplicateCorrelation() or avereps()?.

Entering edit mode

Thanks Aaron. I'll try out your suggestions re. the first approach.

I note your point about duplicateCorrelation being best suited to situations where correlation is present across different levels of experimental factors; however, for future reference, I just wanted to check whether (in light of Gordon's comment in the related question that you link to, where he notes that "duplicateCorrelation() is intended for more complicated situations in which the technical replicates can't be averaged without losing some information about the treatments"), duplicateCorrelation would become more useful in a situation where all the passage number information, for example, was available and it was desirable to enter this as a covariate?

Many thanks for your help,


Entering edit mode

I don't think so, because the number of passages isn't something of scientific interest, it's just a nuisance variable. The samples are still nested within the factors of interest, so the averaging approach is still recommended. Of course, if the passaging effect is strong, then there is an argument for keeping the samples separate and explicitly including the passage number as a covariate while blocking on the patient with duplicateCorrelation. This will model strong passage effects accurately (even more so if you use a spline). However, as mentioned above and in Gordon's reply, duplicateCorrelation doesn't come for free - either statistically or computationally - so it becomes a case of the lesser of two evils.

Entering edit mode

Thanks again for your advice-it's really helpful.



Login before adding your answer.

Traffic: 269 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6