I am analysing some data from a DNA methylation array study. The comparison of interest is between DNA methylation in neurons that were differentiated from neuronal precursor cells (NPCs) made by differentiating iPSCs from cases (n = 4) and controls (n = 4). For each individual, between two and six DNA samples were assessed. For a given individual, these DNA samples were obtained from neurons that were grown on separate occasions for varying lengths of time (different passage numbers). For some individuals, methylation was measured in neurons differentiated from independent derivations of NPCs. Unfortunately, much of the information on the passage number of the neurons and the identity of their NPC source was not supplied.
This means that my dataset looks like this:
| Sample_Name | Slide | Sentrix_Position | Individual | Sex | Case_status | Passage number | NPC ID |
I have been trying to work out how best to treat the replicate arrays from each individual. Because these replicate arrays are not strictly technical replicates, I have been reluctant to simply average them before analysing differential methylation (although I have tried this approach). I therefore wondered whether duplicateCorrelation, with individual as the blocking factor, might be a reasonable alternative.
So far, I have carried out the case-control comparison in two ways:
1. Averaging across the arrays for each individual, as follows (averaged_phenodata is a matrix where each individual is represented by one row, and averaged_meth_data is the per-individual average array data obtained using avearrays):
    design <- model.matrix(~Case_status + Sex, data = averaged_phenodata)
    fit <- lmFit(averaged_meth_data, design)
2. Using duplicateCorrelation, as follows:
    design <- model.matrix(~Case_status + Sex, data = phenodata)
    corfit <- duplicateCorrelation(meth_data, design, block=phenodata$Individual)
    fit <- lmFit(meth_data, design, block=phenodata$Individual, correlation=corfit$consensus)
The two approaches give similar results in terms of the identities of the top-ranked loci, but the duplicateCorrelation approach yields many more significant results.
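For anyone wanting to reproduce the comparison, the two approaches can be run side by side on simulated data. This is only a minimal sketch: the data are made up (8 individuals, 3 replicate arrays each, 200 probes, no true case effect), the helper `count_sig` and the 5% BH threshold are illustrative assumptions rather than part of my actual analysis, and it assumes limma and statmod are installed.

```r
library(limma)

set.seed(1)

# Simulated stand-in for the real data (all values are invented):
# 8 individuals (4 controls, 4 cases), 3 replicate arrays each, 200 probes.
n_ind <- 8
n_rep <- 3
n_probes <- 200
phenodata <- data.frame(
  Individual  = factor(rep(paste0("ind", 1:n_ind), each = n_rep)),
  Case_status = factor(rep(rep(c("control", "case"), each = 4), each = n_rep),
                       levels = c("control", "case")),
  Sex         = factor(rep(rep(c("F", "M"), times = 4), each = n_rep))
)

# Each array = individual-level effect + array-level noise, so replicate
# arrays from the same individual are correlated.
ind_effect <- matrix(rnorm(n_probes * n_ind), n_probes, n_ind)
meth_data <- ind_effect[, as.integer(phenodata$Individual)] +
  matrix(rnorm(n_probes * n_ind * n_rep, sd = 0.5), n_probes)
colnames(meth_data) <- paste0("array", seq_len(ncol(meth_data)))

# Approach 1: average the replicate arrays within each individual.
averaged_meth_data <- avearrays(meth_data, ID = as.character(phenodata$Individual))
averaged_phenodata <- phenodata[!duplicated(phenodata$Individual), ]
design_avg <- model.matrix(~ Case_status + Sex, data = averaged_phenodata)
fit_avg <- eBayes(lmFit(averaged_meth_data, design_avg))

# Approach 2: keep all arrays and block on individual.
design <- model.matrix(~ Case_status + Sex, data = phenodata)
corfit <- duplicateCorrelation(meth_data, design, block = phenodata$Individual)
fit_dc <- lmFit(meth_data, design, block = phenodata$Individual,
                correlation = corfit$consensus.correlation)
fit_dc <- eBayes(fit_dc)

# Hypothetical helper: number of probes passing a BH-adjusted threshold
# for the case-control coefficient.
count_sig <- function(fit, coef = "Case_statuscase", fdr = 0.05) {
  sum(p.adjust(fit$p.value[, coef], method = "BH") < fdr)
}

n_sig <- c(averaged = count_sig(fit_avg), blocked = count_sig(fit_dc))
print(n_sig)
print(corfit$consensus.correlation)
```

Because the simulation contains no true case effect, any probes called significant here are false positives, which makes it a rough way to see how each approach behaves under the null for a given replicate structure.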
My questions are:
1. Is this, in theory, a legitimate use of duplicateCorrelation? Does the fact that the expected correlation between the replicates from each individual is not necessarily the same (due to differing levels of similarity of the replicates) invalidate the use of the consensus correlation?
2. Have I gone about implementing it correctly?
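One thing that can be checked directly is how the consensus correlation compares with the probe-wise estimates that duplicateCorrelation also returns. The sketch below uses simulated data with unequal replicate counts (2 to 6 per individual, chosen arbitrarily); the object names and simulated values are assumptions for illustration only. Note that this only shows the spread of the estimated correlation across probes; it does not by itself reveal whether some pairs of replicates within an individual are more alike than others.

```r
library(limma)

set.seed(2)

# Hypothetical layout: 8 individuals with between 2 and 6 replicate arrays.
reps <- c(2, 3, 4, 6, 2, 3, 5, 6)
individual  <- factor(rep(paste0("ind", 1:8), times = reps))
case_status <- factor(rep(rep(c("control", "case"), each = 4), times = reps),
                      levels = c("control", "case"))
sex         <- factor(rep(rep(c("F", "M"), times = 4), times = reps))

# Simulated values: individual-level effect plus array-level noise.
n_probes <- 100
ind_effect <- matrix(rnorm(n_probes * 8), n_probes, 8)
meth_data <- ind_effect[, as.integer(individual)] +
  matrix(rnorm(n_probes * length(individual), sd = 0.5), n_probes)

design <- model.matrix(~ case_status + sex)
corfit <- duplicateCorrelation(meth_data, design, block = individual)

# Single consensus value used by lmFit(..., correlation = ...).
print(corfit$consensus.correlation)

# Probe-wise estimates are stored on the atanh scale; back-transform to
# correlations and summarise their spread.
probe_cor <- tanh(corfit$atanh.correlations)
print(summary(probe_cor))
```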
Any guidance would be very much appreciated.