I'm using Limma to analyze Illumina 450k methylation data. I'm comparing the methylation (M-Values) in Obese vs Lean subjects, and I have a total of 49 arrays, representing 3 Lean subjects and 11 Obese subjects. Each subject is represented by 3-5 arrays,with the exception of one Lean subject that is represented only by one array. Is it appropriate to use the duplicateCorrelation function when some biological replicates have no technical replication? I would like to keep that array in the analysis since there are so few Lean subjects in the study.
Design setup and code:
head(sample_pheno,n=10)
Subject Condition
1 1 Obese
2 1 Obese
3 1 Obese
4 1 Obese
5 1 Obese
6 2 Lean
7 2 Lean
8 2 Lean
9 2 Lean
10 2 Lean
#Design setup
Condition<-factor(sample_pheno$Condition)
design<-model.matrix(~0+Condition)
colnames(design)<-levels(Condition)
head(design)
Lean Obese
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
6 1 0
#calculate correlation within subjects
corfit<-duplicateCorrelation(M_Val,design,block=sample_pheno$Subject)
fit<-lmFit(M_Val,design,block=sample_pheno$Subject,correlation=corfit$consensus.correlation)
Actually, the estimate of the consensus correlation is not completely unaffected by the presence of samples with no technical duplicates. This is because the former is calculated after fitting a mixed linear model (with
statmod::mixedModel2Fit
, if anyone's interested), in which the latter will still contribute to the estimated values of the coefficients. For example:That said, it's not really a problem, and I'll imagine they'll eventually converge due to improved estimation with an increasing number of levels of your blocking factor. On a related note, the other important thing with
duplicateCorrelation
is that it does better when you have samples across a large number of levels of the blocking factor. With 14 subjects, I think you'll be fine.Thanks for clarifying. As you say, adding samples will always affect the model fit, and I didn't mean to imply that it wouldn't. What I meant (but didn't say clearly) is that the estimate of inter-duplicate correlation will not be substantially aided or hindered by the inclusion of non-duplicated subjects. They don't add much information, nor do they take any information away.