Dear all,
I am analysing some data from a DNA methylation array study. The comparison of interest is between DNA methylation in neurons that were differentiated from neuronal precursor cells (NPCs) made by differentiating iPSCs from cases (n = 4) and controls (n = 4). For each individual, between two and six DNA samples were assessed. For a given individual, these DNA samples were obtained from neurons that were grown on separate occasions for varying lengths of time (different passage numbers). For some individuals, methylation was measured in neurons differentiated from independent derivations of NPCs. Unfortunately a lot of the information on the passage number of the neurons and the identity of the NPC source of the neurons was not supplied.
This means that my dataset looks like this:
Sample_Name | Slide | Sentrix_Position | Individual | Sex | Case_status | Passage number | NPC ID |
1 | 1 | R01C01 | 1 | M | Control | 1_1 | |
2 | 1 | R02C01 | 1 | M | Control | 1_2 | |
3 | 1 | R03C01 | 1 | M | Control | 1_2 | |
4 | 1 | R04C01 | 2 | M | Case | 2_2 | |
5 | 1 | R05C01 | 2 | M | Case | 14 | 2_1 |
6 | 1 | R06C01 | 2 | M | Case | 14 | 2_2 |
7 | 1 | R01C02 | 2 | M | Case | 2_1 | |
8 | 1 | R02C02 | 2 | M | Case | 14 | 2_1 |
9 | 1 | R03C02 | 2 | M | Case | 14 | 2_2 |
10 | 2 | R01C02 | 3 | F | Case | 3_1 | |
11 | 2 | R02C02 | 3 | F | Case | ||
12 | 2 | R03C02 | 3 | F | Case | ||
13 | 2 | R04C02 | 4 | F | Control | ||
14 | 2 | R05C02 | 4 | F | Control | 13 | 4_1 |
15 | 2 | R06C02 | 4 | F | Control | 12 | 4_2 |
16 | 3 | R01C01 | 5 | F | Case | ||
17 | 3 | R02C01 | 5 | F | Case | ||
18 | 3 | R03C01 | 5 | F | Case | 10 | 5_1 |
19 | 3 | R04C01 | 6 | M | Control | ||
20 | 3 | R05C01 | 6 | M | Control | 6_1 | |
21 | 3 | R06C01 | 6 | M | Control | ||
22 | 4 | R02C02 | 7 | F | Control | 23 | 7_1 |
23 | 4 | R03C02 | 7 | F | Control | ||
24 | 4 | R04C02 | 8 | M | Case | ||
25 | 4 | R05C02 | 8 | M | Case | ||
26 | 4 | R06C02 | 8 | M | Case |
I have been trying to work out how best to treat the replicate arrays from each individual. The fact that the replicate arrays from each individual are not strictly technical replicates has made me reluctant to simply average them prior to carrying out analysis of differential methylation (although I have tried this approach). I, therefore, wondered if duplicateCorrelation, with individual as the blocking factor, might be a reasonable thing to do.
So far, I have carried out the case-control comparison in two ways:
1. averaging across the arrays for each individual, as follows (averaged_phenodata is a matrix where each individual is represented by one row and averaged_meth_data is average array data for each individual obtained using avearrays):
design <- model.matrix(~Case_status + Sex, data = averaged_phenodata) fit <- lmfit(averaged_meth_data, design)
2. Using duplicateCorrelation, as follows:
design <- model.matrix(~Case_status + Sex, data = phenodata) corfit <- duplicateCorrelation(meth_data, design, block=phenodata$Individual) fit <- lmFit(meth_data, design, block=phenodata$Individual, correlation=corfit$consensus)
The results obtained from the two approaches are similar in terms of the identities of the top ranked loci but duplicateCorrelation results in many more significant results.
My questions are:
1. Is this, in theory, a legitimate use of duplicateCorrelation? Does the fact that the expected correlation between the replicates from each individual is not necessarily the same (due to differing levels of similarity of the replicates) invalidate the use of the consensus correlation?
2. Have I gone about implementing it correctly?
Any guidance would be very much appreciated.
Thanks,
Rosie
Thanks Aaron. I'll try out your suggestions re. the first approach.
I note your point about duplicateCorrelation being best suited to situations where correlation is present across different levels of experimental factors; however, for future reference, I just wanted to check whether (in light of Gordon's comment in the related question that you link to, where he notes that "duplicateCorrelation() is intended for more complicated situations in which the technical replicates can't be averaged without losing some information about the treatments"), duplicateCorrelation would become more useful in a situation where all the passage number information, for example, was available and it was desirable to enter this as a covariate?
Many thanks for your help,
Rosie
I don't think so, because the number of passages isn't something of scientific interest, it's just a nuisance variable. The samples are still nested within the factors of interest, so the averaging approach is still recommended. Of course, if the passaging effect is strong, then there is an argument for keeping the samples separate and explicitly including the passage number as a covariate while blocking on the patient with
duplicateCorrelation
. This will model strong passage effects accurately (even more so if you use a spline). However, as mentioned above and in Gordon's reply,duplicateCorrelation
doesn't come for free - either statistically or computationally - so it becomes a case of the lesser of two evils.Thanks again for your advice-it's really helpful.
Rosie