Duplicate Correlation with technical replicates
1
0
Entering edit mode
dsperley • 0
@dsperley-7315
Last seen 3.5 years ago
United States

I'm using Limma to analyze Illumina 450k methylation data. I'm comparing the methylation (M-Values) in Obese vs Lean subjects, and I have a total of 49 arrays, representing 3 Lean subjects and 11 Obese subjects. Each subject is represented by 3-5 arrays,with the exception of one Lean subject that is represented only by one array. Is it appropriate to use the  duplicateCorrelation function when some biological replicates have no technical replication? I would like to keep that array in the analysis since there are so few Lean subjects in the study.

Design setup and code:

 head(sample_pheno,n=10)
Subject Condition
1        1     Obese
2        1     Obese
3        1     Obese
4        1     Obese
5        1     Obese
6        2      Lean
7        2      Lean
8        2      Lean
9        2      Lean
10       2      Lean

#Design setup
Condition<-factor(sample_pheno$Condition) design<-model.matrix(~0+Condition) colnames(design)<-levels(Condition) head(design) Lean Obese 1 0 1 2 0 1 3 0 1 4 0 1 5 0 1 6 1 0 #calculate correlation within subjects corfit<-duplicateCorrelation(M_Val,design,block=sample_pheno$Subject)
fit<-lmFit(M_Val,design,block=sample_pheno$Subject,correlation=corfit$consensus.correlation)

limma duplicatecorrelation • 1.3k views
1
Entering edit mode
@ryan-c-thompson-5618
Last seen 7 months ago
Scripps Research, La Jolla, CA

As far as I know, it's fine to use duplicateCorrelation on such a data set. I've done it before. Just keep in mind that any samples with no technical duplicates will not contribute to the the estimate of inter-duplicate correlation, since there is nothing for them to correlate with.

Side note: for 450k M-values, I recommend having a look at the mean-variance trend. From my experience, it typically has a concave shape, with lower variance in the middle and higher variance for extreme values. Using eBayes with trend=TRUE would be recommended for such data.

1
Entering edit mode

Actually, the estimate of the consensus correlation is not completely unaffected by the presence of samples with no technical duplicates. This is because the former is calculated after fitting a mixed linear model (with statmod::mixedModel2Fit, if anyone's interested), in which the latter will still contribute to the estimated values of the coefficients. For example:

a <- matrix(rnorm(11000), ncol=11)
design <- model.matrix(~factor(rep(1:2,c(5,6))))
grouping <- factor(c(rep(1:5, 2), 6)) # Last sample has no friend.
dc <- duplicateCorrelation(a, design, block=grouping)
dc$consensus # Something near zero dc2 <- duplicateCorrelation(a[,-11], design[-11,], block=grouping[-11]) dc2$consensus # Something also near zero but not exactly the same.

That said, it's not really a problem, and I'll imagine they'll eventually converge due to improved estimation with an increasing number of levels of your blocking factor. On a related note, the other important thing with duplicateCorrelation is that it does better when you have samples across a large number of levels of the blocking factor. With 14 subjects, I think you'll be fine.

0
Entering edit mode

Thanks for clarifying. As you say, adding samples will always affect the model fit, and I didn't mean to imply that it wouldn't. What I meant (but didn't say clearly) is that the estimate of inter-duplicate correlation will not be substantially aided or hindered by the inclusion of non-duplicated subjects. They don't add much information, nor do they take any information away.