Question

Duplicate Correlation with technical replicates

0

Entering edit mode

dsperley • 0

@dsperley-7315

Last seen 7.5 years ago

United States

I'm using Limma to analyze Illumina 450k methylation data. I'm comparing the methylation (M-Values) in Obese vs Lean subjects, and I have a total of 49 arrays, representing 3 Lean subjects and 11 Obese subjects. Each subject is represented by 3-5 arrays,with the exception of one Lean subject that is represented only by one array. Is it appropriate to use the duplicateCorrelation function when some biological replicates have no technical replication? I would like to keep that array in the analysis since there are so few Lean subjects in the study.

Design setup and code:

 head(sample_pheno,n=10)
   Subject Condition
1        1     Obese
2        1     Obese
3        1     Obese
4        1     Obese
5        1     Obese
6        2      Lean
7        2      Lean
8        2      Lean
9        2      Lean
10       2      Lean


#Design setup
Condition<-factor(sample_pheno$Condition)
design<-model.matrix(~0+Condition)
colnames(design)<-levels(Condition)
head(design)

  Lean Obese
1    0     1
2    0     1
3    0     1
4    0     1
5    0     1
6    1     0

#calculate correlation within subjects
corfit<-duplicateCorrelation(M_Val,design,block=sample_pheno$Subject)
fit<-lmFit(M_Val,design,block=sample_pheno$Subject,correlation=corfit$consensus.correlation)

limma duplicatecorrelation • 3.0k views

ADD COMMENT • link updated 8.6 years ago by Ryan C. Thompson ★ 7.9k • written 8.6 years ago by dsperley • 0

score 1 · Accepted Answer · 2016-09-07

1

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 5 months ago

Icahn School of Medicine at Mount Sinai…

As far as I know, it's fine to use duplicateCorrelation on such a data set. I've done it before. Just keep in mind that any samples with no technical duplicates will not contribute to the the estimate of inter-duplicate correlation, since there is nothing for them to correlate with.

Side note: for 450k M-values, I recommend having a look at the mean-variance trend. From my experience, it typically has a concave shape, with lower variance in the middle and higher variance for extreme values. Using eBayes with trend=TRUE would be recommended for such data.

ADD COMMENT • link 8.6 years ago Ryan C. Thompson ★ 7.9k

2

Entering edit mode

Actually, the estimate of the consensus correlation is not completely unaffected by the presence of samples with no technical duplicates. This is because the former is calculated after fitting a mixed linear model (with statmod::mixedModel2Fit, if anyone's interested), in which the latter will still contribute to the estimated values of the coefficients. For example:

a <- matrix(rnorm(11000), ncol=11)
design <- model.matrix(~factor(rep(1:2,c(5,6))))
grouping <- factor(c(rep(1:5, 2), 6)) # Last sample has no friend.
dc <- duplicateCorrelation(a, design, block=grouping)
dc$consensus # Something near zero
dc2 <- duplicateCorrelation(a[,-11], design[-11,], block=grouping[-11])
dc2$consensus # Something also near zero but not exactly the same.

That said, it's not really a problem, and I'll imagine they'll eventually converge due to improved estimation with an increasing number of levels of your blocking factor. On a related note, the other important thing with duplicateCorrelation is that it does better when you have samples across a large number of levels of the blocking factor. With 14 subjects, I think you'll be fine.

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks for clarifying. As you say, adding samples will always affect the model fit, and I didn't mean to imply that it wouldn't. What I meant (but didn't say clearly) is that the estimate of inter-duplicate correlation will not be substantially aided or hindered by the inclusion of non-duplicated subjects. They don't add much information, nor do they take any information away.

ADD REPLY • link 8.6 years ago Ryan C. Thompson ★ 7.9k