Hello
I am looking for help with the analysis of bulk RNA-seq data. I have an analysis in mind, but I would like to verify its correctness.
Experiment:
We differentiate stem cells into embryonic bodies. These embryonic bodies produce progenitor cells that are further differentiated into a certain cell type. These progenitors can be harvested at two time points: day 18 or day 25 after which the second differentiation protocol starts. These "second differentiated cells" were sent for bulk RNA sequencing.
We have both wild type cells and knock-out (KO) samples. Our primary question is which genes are differentially expressed between wild type and KO samples. A secondary question is whether there are differences between day 18 and day 25.
A PCA plot shows that "Differentiation" (one embryonic body) is a significant source of variation. This variation is captured in PC1. I think it is correct to include "Differentiation" in my model, but I believe it is not possible with fixed effects since there is not always a day 18 and day 25 present in each "Differentiation"
I would like to include this in my analysis, but I believe it is not possible with fixed effects since there is not always a day 18 and day 25 in each "Differentiation".
My Design:
Inspired by section 7.6 of "A Guide to Creating Design Matrices for Gene Expression Experiments," (great resource by the way!) my idea was to use limma
with a design matrix "~Day + Condition2"
and include "Differentiation" as a random effect (block becomes "Differentiation" in duplicateCorrelation
and lmFit
), because the blocks are not complete (not every Day and Condition is in every differentiation). The last parameter in my design is then my effect of primary interest.
Is this a correct way to analyze this data?
Thank you very much for your time and help.