**0**wrote:

Hello,

I have a question about a differential expression analysis that I am doing using Limma on RNA-seq data.

Basically, I want find differentially expressed genes between tumour and normal samples (in men and women separately), using as covariates ethnicity and age. To do so, I defined a model like this one:

design <- model.matrix(~ ethnicity + age + tissue_type)

(tissue_type: factor with tumour or normal levels)

However, it happens that the samples come from two different research projects, like this:

Data Tissue

Project1 Tumour

Project1 Normal (15% of the normal samples)

Project2 Normal (85% of the normal samples)

As I was expecting, there is a batch effect between these samples that was clearly seen on a PCA analysis (the tumour and normal samples from Project1 form a clear distinct group from the normal samples of Project2).

So, before fitting the limma linear model, first I removed this batch effect from the gene counts transformed with voom, using as covariate a factor that indicated whether the samples came from research project1 or project2.

After removing the bacth effect I did another PCA, and showed that this time the samples did not separate by research project.

After this I applied the standard limma pipeline for differential expression analysis:

# fit the linear model for each gene

fit <- lmFit(gene_expression_corrected, design)

#empirical bayes statistics for differential expression

fit <- eBayes(fit)

# extract summary and gene table with statistics

diff_expression_summary <- summary(decideTests(fit))

diff_expression_table <- topTable(fit, coef = "tissue_typetumour", n = Inf, sort.by = "p", p = 1)

My question is: Is this approach correct?

Thanks in advance for the help.

Abel

**37k**• written 13 months ago by abelsousa •

**0**