Question

duplicateCorrelation and Pearson Correlation

0

Entering edit mode

Eleni Christodoulou ▴ 150

@eleni-christodoulou-2653

Last seen 5.6 years ago

Singapore

Dear BioC people,

I am analyzing a wet lab experiment on a cell line with 5 biological replicates. Each biological replicate has 3 or 4 technical replicates. Here is my targets$BiolRep entry

 targets$BiolRep
 [1] 1 2 2 1 2 1 4 4 1 4 3 3 3 5 5 5

Some of the biological replicates are sensitive and some are resistant to a drug. The purpose of the experiment is to find the differentially expressed genes between the two conditions. My targets$Response entry is

targets$Response
 [1] "sensitive" "sensitive" "sensitive" "sensitive" "sensitive" "sensitive" "sensitive" "sensitive" "sensitive"
[10] "sensitive" "resistant" "resistant" "resistant" "resistant" "resistant" "resistant"

To account for technical and biological replication, I use limma's duplicateCorrelation function. Here is part of my script, where norm.exp is my normalized expression matrix.

ct<-factor(targets$Response)  
design <- model.matrix(~ct)  
colnames(design) <- c(levels(ct))
arrayw <- arrayWeights(norm.exp, design)
w <- asMatrixWeights(arrayw, dim(norm.exp))
dupcor <- duplicateCorrelation(norm.exp, design, block=targets$BiolRep, weights=w)

I am wondering why the dupcor$consensus.correlation is as low as 0.797633. When I calculate the Pearson correlation between the technical replicates of each biological replicate, this is above 0.99.

Thank you very much for your help!

Eleni

limma duplicatecorrelation • 1.8k views

ADD COMMENT • link updated 7.5 years ago by Aaron Lun ★ 28k • written 7.5 years ago by Eleni Christodoulou ▴ 150

score 5 · Answer 1 · 2016-10-20

5

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 19 hours ago

The city by the bay

Your two correlations are not comparable. The value from duplicateCorrelation represents the percentage of variance explained by the blocking factor within a "typical" gene. The average expression level of each gene has (roughly speaking) little effect on this value, as it cancels out when comparing between samples for the same gene. In contrast, correlations between the observed values of two technical replicates will be computed across genes and be affected by the range of expression values in the data.

To illustrate, consider a biological system where genes are expressed at a range of mean values:

ngenes <- 1000
nlibs <- 10
all.means <- rnorm(ngenes)

If we obtain independent biological replicates from this system, the correlation within each gene should be zero by definition, because the observations are independent. This is demonstrated by running duplicateCorrelation and setting dummy.block to force the function to calculate correlations caused by replicate pairing (e.g., as if we were working with a paired-sample design).

all.y <- matrix(rnorm(ngenes*nlibs, mean=all.means, sd=0.1), ncol=nlibs)
dummy.block <- rep(1:5, each=2)
duplicateCorrelation(all.y, block=dummy.block)$consensus # close to 0

However, computing the correlation between any two libraries yields a value quite close to 1. This is driven by the (mostly uninteresting) range of expression values across all genes, rather than any experimental relationship between the replicates.

cor(all.y[,1], all.y[,2]) # close to 1

As an aside, it seems that all of your technical replicates are nested within a single biological level. If so, it may be better to average the technical replicates instead, see A: limma - technical replicates: duplicateCorrelation() or avereps()?.

ADD COMMENT • link 7.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you, Aaron, for the nice illustration of the differences between the methods! Really insightful!

At the end, do you maybe mean aveArrays() and not avereps()? My aim is to average the technical replicates of the samples.

I also have another case with just a pair of samples: One before and one after treatment. Each one has 3 technical replicates. Do you suggest to use limma with averrays(), or a paired t-test would be better?

Thank you very much!

ADD REPLY • link 7.5 years ago Eleni Christodoulou ▴ 150

1

Entering edit mode

Yes, avearrays is the appropriate function here. Your other question should go in a separate post; but if you only have one sample for each condition, you can't really do much, regardless of how many technical replicates you have.

ADD REPLY • link 7.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you very much!

ADD REPLY • link 7.5 years ago Eleni Christodoulou ▴ 150

0

Entering edit mode

Dear Aaron,

I tried avearrays() for a problem similar to the above. I then want to do differential gene detection using limma. The problem is that some of the replicates are in different batches. I thus want to add the batch in my design matrix. Is this possible when the averaged replicates are across batches? Maybe, in this case, avearrays is not the best way to go and duplicateCorrelation() is the way?

Thank you so much.

ADD REPLY • link 7.5 years ago Eleni Christodoulou ▴ 150

0

Entering edit mode

First, is there actually a batch effect? Check out a MDS plot, and if there's no real effect, then you don't need to model the batch factor and you can average the arrays as described. Otherwise, if different replicates are in different batches, and there's a different number of replicates from each sample in each batch, then you'll probably have to use duplicateCorrelation.

ADD REPLY • link 7.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hello, Aaron and apology for the late reply. Thank you for your response. Yes, there is a batch effect which is evident with MDS. I will then apply duplicateCorrelation in my second problem.

Best,

Eleni

ADD REPLY • link 7.5 years ago Eleni Christodoulou ▴ 150