Question

Limma, remove batch effects and technical replicates

0

Entering edit mode

John ▴ 10

@john-7115

Last seen 6.5 years ago

United Kingdom

Hello.

I have data that has 2 treatment groups (6 samples, 3 bio reps each group). For each bio rep, I have 9 technical reps, so 54 total samples. I have a co-variate of number of mutations and a date of experiment of batch effect. The size of the gene expression matrix = 5000 genes by 54 samples.

If I run Limma lmFit eBayes like so;

design<-model.matrix(~Date + mutations + treatment, data=pheno)

block<-pheno$block # this refers to the technical replicates

dupcor <- duplicateCorrelation(rmat,design,block=block)

dupcor$consensus # gives 0.193
fit1 <- lmFit(rmat,design,block=block,correlation=dupcor$consensus)
fit1 <- eBayes(fit1)

sum(topTable(fit1,coef="treatment",Inf)[,5]<=0.1) # gives 409 genes at 10% FDR (seems good).

# extracting the batch effect removed data like so, using my basic logic of trying to preserve the tech rep correlations by including block and treatment in the design, not sure if this right?

# known batch = date of experiment

# known covariate is the number mutations

# want to preserve treatment, and from my simple logic, block

y<-removeBatchEffect(rmat, batch=pheno$Date, covariates=pheno$mutations, design=model.matrix(~block + treatment, data=pheno))

The clustering and PCA plots of y look very nice in terms of first two components and clustering separating the treatment groups well.

For interest, if I now run the above Limma analysis using the corrected y matrix with

design<-model.matrix(~treatment, data=pheno)

block<-pheno$block # this refers to the technical replicates

dupcor <- duplicateCorrelation(y,design,block=block)

dupcor$consensus # now is 0.20
fit1 <- lmFit(y,design,block=block,correlation=dupcor$consensus)
fit1 <- eBayes(fit1)

sum(topTable(fit1,coef="treatment",Inf)[,5]<=0.1) # now is 1781 genes at 10% FDR (huge amounts).

I am a bit surprised after batch removal that the number of significant genes is vastly increased. I naively thought it will be similar numbers.

I think and understand that the first approach is the correct thing to do with Limma for differential gene expression, to include batch, mutations in the design and run duplicateCorrelations and lmFit etc; so this is the real result of differentially expressed genes. And to use the batch removal data only for things like MDS and PCA. This is great and has helped my work. I am just curious why big difference in significant genes, which is probably a mistake in my code/logic.

Thanks.

John.

Limma removebatcheffect() duplicatecorrelation • 2.8k views

ADD COMMENT • link 8.5 years ago John ▴ 10

score 4 · Answer 1 · 2016-10-13

If you regress out the date and mutation effect with removeBatchEffect, then you're not accounting for the uncertainty in the estimates of these effects. This means that you'll overstate the precision of the estimates of the coefficients and variance in the fitted linear model. In particular, the standard error of the coefficients will be underestimated, and the residual degrees of freedom will be overestimated. As a result, the significance calculations will become more optimistic (i.e., more rejections of the null hypothesis), because deviations from the null model will be treated with more confidence than they should be.

The degree of this overoptimism depends on how many degrees of freedom are incorrectly "gained" when you fail to model the additional covariates or factors. I suspect that your second design matrix is a lot simpler than your first (i.e., fewer columns), which is why you get so many more detected genes in the second approach. To be clear, this is not a good thing, which is why we would recommend including additional covariates or factors in the design matrix rather than removing them with removeBatchEffect.

On an unrelated note, do all of the technical replicates for a given biological replicate have the same date/mutation/treatment values? If so, it would be better to average them together with avereps rather than using duplicateCorrelation - see A: limma - technical replicates: duplicateCorrelation() or avereps()? for more details.

score 0 · Answer 2 · 2016-10-13

Hello Aaron,

Thank you for a very quick answer and good explanation to quench curiosity, which I can understand. Yes, I understand it is not a good thing, as I stated too.

As for your technical replicate question, yes and no. What I mean is, the technical replicates for each bio replicate have the same number of mutations and treatment as they are repeats of the same sample, but the dates are 3x3, so 3 reps on one date, 3 on another and so on. So I could average the same date samples, so I end up with 3 average tech replicates per biological rep sample, reducing total sample number to 18. Even with this, I understand it is correct to still use duplicateCorrelation.

Thank you very much for interest and help; your cat looks cool.

John.

and a great cat