Question: Removing batch effect from gene expression before Limma linear modelling
gravatar for abelsousa
17 months ago by
abelsousa0 wrote:



I have a question about a differential expression analysis that I am doing using Limma on RNA-seq data.

Basically, I want find differentially expressed genes between tumour and normal samples (in men and women separately), using as covariates ethnicity and age. To do so, I defined a model like this one:

design <- model.matrix(~ ethnicity + age + tissue_type)
(tissue_type: factor with tumour or normal levels)


However, it happens that the samples come from two different research projects, like this:

Data    Tissue
Project1    Tumour
Project1    Normal (15% of the normal samples)
Project2    Normal (85% of the normal samples)


As I was expecting, there is a batch effect between these samples that was clearly seen on a PCA analysis (the tumour and normal samples from Project1 form a clear distinct group from the normal samples of Project2).


So, before fitting the limma linear model, first I removed this batch effect from the gene counts transformed with voom, using as covariate a factor that indicated whether the samples came from research project1 or project2.

After removing the bacth effect I did another PCA, and showed that this time the samples did not separate by research project.

After this I applied the standard limma pipeline for differential expression analysis:

# fit the linear model for each gene
fit <- lmFit(gene_expression_corrected, design)

#empirical bayes statistics for differential expression
fit <- eBayes(fit)

# extract summary and gene table with statistics
diff_expression_summary <- summary(decideTests(fit))
diff_expression_table <- topTable(fit, coef = "tissue_typetumour", n = Inf, = "p", p = 1)


My question is: Is this approach correct?

Thanks in advance for the help.



ADD COMMENTlink modified 17 months ago by Gordon Smyth39k • written 17 months ago by abelsousa0
Answer: Removing batch effect from gene expression before Limma linear modelling
gravatar for Gordon Smyth
17 months ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

No, that approach is not correct, as has been pointed out before on this support site. You need to include the batch factor in the linear model rather than attempt to remove the batch before hand.

The help page for removeBatchEffect() says that it is intended to prepare data for plotting, not for input to linear modeling.

If there is strong batch effect between projects 1 and 2, and if project 1 is the only one to include both normals and tumours, then project 2 will provide only limited additional information. The main contribution of project 2 will be to improve the estimates of the coefficients for the ethnicity and age covariates. Your analysis as it stands risks over-stating the evidence you have for comparing normals and tumours.

ADD COMMENTlink modified 17 months ago • written 17 months ago by Gordon Smyth39k

Thank you Professor Gordon.

In fact, I compared the results of one analysis where I included the batch factor in the linear model with other analysis where I removed the samples from project 2, and the genes that appear to be differentially expressed between tumour and normal are mostly the same.

I just thought in adding these normal samples from project 2 because I have many more tumour samples than normal samples in project 1. But, anyway, probably I am going to remove these normal samples from the analysis.

ADD REPLYlink written 17 months ago by abelsousa0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 221 users visited in the last hour