I have a question about a differential expression analysis that I am doing using Limma on RNA-seq data.
Basically, I want find differentially expressed genes between tumour and normal samples (in men and women separately), using as covariates ethnicity and age. To do so, I defined a model like this one:
design <- model.matrix(~ ethnicity + age + tissue_type)
(tissue_type: factor with tumour or normal levels)
However, it happens that the samples come from two different research projects, like this:
Project1 Normal (15% of the normal samples)
Project2 Normal (85% of the normal samples)
As I was expecting, there is a batch effect between these samples that was clearly seen on a PCA analysis (the tumour and normal samples from Project1 form a clear distinct group from the normal samples of Project2).
So, before fitting the limma linear model, first I removed this batch effect from the gene counts transformed with voom, using as covariate a factor that indicated whether the samples came from research project1 or project2.
After removing the bacth effect I did another PCA, and showed that this time the samples did not separate by research project.
After this I applied the standard limma pipeline for differential expression analysis:
# fit the linear model for each gene
fit <- lmFit(gene_expression_corrected, design)
#empirical bayes statistics for differential expression
fit <- eBayes(fit)
# extract summary and gene table with statistics
diff_expression_summary <- summary(decideTests(fit))
diff_expression_table <- topTable(fit, coef = "tissue_typetumour", n = Inf, sort.by = "p", p = 1)
My question is: Is this approach correct?
Thanks in advance for the help.