Hi all,
I am conducting differential gene expression analysis on a single-cell RNAseq dataset involving three conditions: control, treatmentA, and treatmentB. Due to issues during library preparation, I have two replicates for the control and treatmentA but only one for treatmentB, as the second replicate failed quality control. The experiments spanned two batches, with the first replicate processed in batchX and the second in batchY. Each sample was derived from a different donor. Essentially, the meta data looks like this:
mdata <- data.frame(condition = c("ctrl", "ctrl", "treatmentA", "treatmentA", "treatmentB"),
batch = c("batchX", "batchY", "batchX", "batchY", "batchX"),
donor = c("A", "B", "C", "D", "E"))
condition batch donor
1 ctrl batchX A
2 ctrl batchY B
3 treatmentA batchX C
4 treatmentA batchY D
5 treatmentB batchX E
I am interested in identifying differentially expressed genes between treatmentB and the ctrl for a specific celltype. Because I cant use the pseudobulk method, I plan on using MAST with a random effect as recommanded here and here.
My main concern is selecting the appropriate formula for the linear mixed model, particularly regarding how to account for the batch effects. Intuitively, I thought the model could be structured as follows:
formula <- ~ ngeneson + condition + (1 | donor)
Is my understanding correct that by allowing a random intercept for the donor, I am implicitely also adjusting for batch because the batch effect can be "absorbed" in the donor specific intercept? Or should the correct formula look like this:
formula <- ~ ngeneson + batch + condition + (1 | donor)
Any insights or pointers are much appreciated!
Cheers!