Question: duplicateCorrelation and Pearson Correlation
gravatar for Eleni Christodoulou
3.1 years ago by
Eleni Christodoulou150 wrote:

Dear BioC people,

I am analyzing a wet lab experiment on a cell line with 5 biological replicates. Each biological replicate has 3 or 4 technical replicates. Here is my targets$BiolRep entry

 [1] 1 2 2 1 2 1 4 4 1 4 3 3 3 5 5 5

Some of the biological replicates are sensitive and some are resistant to a drug. The purpose of the experiment is to find the differentially expressed genes between the two conditions. My targets$Response entry is

 [1] "sensitive" "sensitive" "sensitive" "sensitive" "sensitive" "sensitive" "sensitive" "sensitive" "sensitive"
[10] "sensitive" "resistant" "resistant" "resistant" "resistant" "resistant" "resistant"

To account for technical and biological replication, I use limma's duplicateCorrelation function. Here is part of my script, where norm.exp is my normalized expression matrix.

design <- model.matrix(~ct)  
colnames(design) <- c(levels(ct))
arrayw <- arrayWeights(norm.exp, design)
w <- asMatrixWeights(arrayw, dim(norm.exp))
dupcor <- duplicateCorrelation(norm.exp, design, block=targets$BiolRep, weights=w)


I am wondering why the dupcor$consensus.correlation is as low as 0.797633. When I calculate the Pearson correlation between the technical replicates of each biological replicate, this is above 0.99.


Thank you very much for your help!



limma duplicatecorrelation • 714 views
ADD COMMENTlink modified 3.1 years ago by Aaron Lun25k • written 3.1 years ago by Eleni Christodoulou150
Answer: duplicateCorrelation and Pearson Correlation
gravatar for Aaron Lun
3.1 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

Your two correlations are not comparable. The value from duplicateCorrelation represents the percentage of variance explained by the blocking factor within a "typical" gene. The average expression level of each gene has (roughly speaking) little effect on this value, as it cancels out when comparing between samples for the same gene. In contrast, correlations between the observed values of two technical replicates will be computed across genes and be affected by the range of expression values in the data.

To illustrate, consider a biological system where genes are expressed at a range of mean values:

ngenes <- 1000
nlibs <- 10
all.means <- rnorm(ngenes)

If we obtain independent biological replicates from this system, the correlation within each gene should be zero by definition, because the observations are independent. This is demonstrated by running duplicateCorrelation and setting dummy.block to force the function to calculate correlations caused by replicate pairing (e.g., as if we were working with a paired-sample design).

all.y <- matrix(rnorm(ngenes*nlibs, mean=all.means, sd=0.1), ncol=nlibs)
dummy.block <- rep(1:5, each=2)
duplicateCorrelation(all.y, block=dummy.block)$consensus # close to 0

However, computing the correlation between any two libraries yields a value quite close to 1. This is driven by the (mostly uninteresting) range of expression values across all genes, rather than any experimental relationship between the replicates.

cor(all.y[,1], all.y[,2]) # close to 1

As an aside, it seems that all of your technical replicates are nested within a single biological level. If so, it may be better to average the technical replicates instead, see A: limma - technical replicates: duplicateCorrelation() or avereps()?.

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by Aaron Lun25k

Thank you, Aaron, for the nice illustration of the differences between the methods! Really insightful!

At the end, do you maybe mean aveArrays() and not avereps()? My aim is to average the technical replicates of the samples.

I also have another case with just a pair of samples: One before and one after treatment. Each one has 3 technical replicates. Do you suggest to use limma with averrays(), or a paired t-test would be better?


Thank you very much!

ADD REPLYlink written 3.1 years ago by Eleni Christodoulou150

Yes, avearrays is the appropriate function here. Your other question should go in a separate post; but if you only have one sample for each condition, you can't really do much, regardless of how many technical replicates you have.

ADD REPLYlink written 3.1 years ago by Aaron Lun25k

Thank you very much!

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Eleni Christodoulou150

Dear Aaron,

I tried avearrays() for a problem similar to the above. I then want to do differential gene detection using limma. The problem is that some of the replicates are in different batches. I thus want to add the batch in my design matrix. Is this possible when the averaged replicates are across batches? Maybe, in this case, avearrays is not the best way to go and duplicateCorrelation() is the way?


Thank you so much.

ADD REPLYlink written 3.1 years ago by Eleni Christodoulou150

First, is there actually a batch effect? Check out a MDS plot, and if there's no real effect, then you don't need to model the batch factor and you can average the arrays as described. Otherwise, if different replicates are in different batches, and there's a different number of replicates from each sample in each batch, then you'll probably have to use duplicateCorrelation.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Aaron Lun25k

Hello, Aaron and apology for the late reply. Thank you for your response. Yes, there is a batch effect which is evident with MDS. I will then apply duplicateCorrelation in my second problem.



ADD REPLYlink written 3.0 years ago by Eleni Christodoulou150
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 315 users visited in the last hour