Hello,
I have a question about a differential expression analysis that I am doing using Limma on RNA-seq data.
Basically, I want find differentially expressed genes between tumour and normal samples (in men and women separately), using as covariates ethnicity and age. To do so, I defined a model like this one:
design <- model.matrix(~ ethnicity + age + tissue_type)
(tissue_type: factor with tumour or normal levels)
However, it happens that the samples come from two different research projects, like this:
Data Tissue
Project1 Tumour
Project1 Normal (15% of the normal samples)
Project2 Normal (85% of the normal samples)
As I was expecting, there is a batch effect between these samples that was clearly seen on a PCA analysis (the tumour and normal samples from Project1 form a clear distinct group from the normal samples of Project2).
So, before fitting the limma linear model, first I removed this batch effect from the gene counts transformed with voom, using as covariate a factor that indicated whether the samples came from research project1 or project2.
After removing the bacth effect I did another PCA, and showed that this time the samples did not separate by research project.
After this I applied the standard limma pipeline for differential expression analysis:
# fit the linear model for each gene
fit <- lmFit(gene_expression_corrected, design)
#empirical bayes statistics for differential expression
fit <- eBayes(fit)
# extract summary and gene table with statistics
diff_expression_summary <- summary(decideTests(fit))
diff_expression_table <- topTable(fit, coef = "tissue_typetumour", n = Inf, sort.by = "p", p = 1)
My question is: Is this approach correct?
Thanks in advance for the help.
Abel
Thank you Professor Gordon.
In fact, I compared the results of one analysis where I included the batch factor in the linear model with other analysis where I removed the samples from project 2, and the genes that appear to be differentially expressed between tumour and normal are mostly the same.
I just thought in adding these normal samples from project 2 because I have many more tumour samples than normal samples in project 1. But, anyway, probably I am going to remove these normal samples from the analysis.