Dear all,
I would like some advice on which covariates and factors to include in the design matrix for a human transcriptomics project. For context: we are comparing two conditions in a retrospective study with around 250 patients per group. In total, around 500 whole-blood samples were sequenced using Lexogen's QuantSeq technology. We are now running a DE analysis, and I have reached the point where I have to decide which factors and covariates to include in the design matrix. We have baseline data for all patients (age, sex, comorbidities, lab values, etc.), and in a post on Biostars we discussed that I should probably run a PCA and check the plot against the factors and covariates to see whether they are the variables that separate the samples. If so, I should include them in the design matrix (see https://www.biostars.org/p/9494249/#9497484).
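To make the PCA check concrete, here is a minimal sketch in Python/NumPy, run on simulated counts rather than our data. It computes log2-CPM, takes the top principal components, and reports the squared correlation of each PC with each covariate. The function names (`simulate_counts`, `log_cpm`, `pca_scores`, `pc_covariate_r2`) and the simulated sex effect are assumptions made up for illustration, and the covariates are assumed to be numeric or 0/1 coded.

```python
import numpy as np

def simulate_counts(n_genes=200, n_samples=60, seed=0):
    """Toy data (hypothetical): sex shifts the first 50 genes, age has no effect."""
    rng = np.random.default_rng(seed)
    sex = np.repeat([0, 1], n_samples // 2)
    age = rng.uniform(20, 80, size=n_samples)
    lam = np.full((n_genes, n_samples), 100.0)
    lam[:50, sex == 1] *= 4.0  # four-fold higher expression when sex == 1
    counts = rng.poisson(lam)
    return counts, {"sex": sex, "age": age}

def log_cpm(counts):
    """log2 counts-per-million for a genes x samples matrix, with a 0.5 offset."""
    lib_size = counts.sum(axis=0)
    return np.log2((counts + 0.5) / (lib_size + 1.0) * 1e6)

def pca_scores(expr, n_pcs=5):
    """PCA on samples via SVD; returns (samples x n_pcs scores, variance fractions)."""
    centered = expr - expr.mean(axis=1, keepdims=True)  # center each gene
    u, s, _ = np.linalg.svd(centered.T, full_matrices=False)
    var_explained = s ** 2 / np.sum(s ** 2)
    return u[:, :n_pcs] * s[:n_pcs], var_explained[:n_pcs]

def pc_covariate_r2(scores, covariates):
    """Squared Pearson correlation between each PC and each numeric covariate."""
    result = {}
    for name, values in covariates.items():
        v = np.asarray(values, dtype=float)
        result[name] = np.array(
            [np.corrcoef(scores[:, k], v)[0, 1] ** 2 for k in range(scores.shape[1])]
        )
    return result

counts, covariates = simulate_counts()
scores, var_explained = pca_scores(log_cpm(counts), n_pcs=3)
for name, r2 in pc_covariate_r2(scores, covariates).items():
    print(name, np.round(r2, 2))
```

A covariate with a high squared correlation on a leading PC is one that separates the samples, which is the criterion discussed in the Biostars thread. A multi-level factor such as a batch label would need an ANOVA-style R² instead of a single correlation.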
Furthermore, as this is a bulk RNA-seq experiment, I think we should also correct for the composition of the sequenced tissue, i.e. for the white blood cell differential (counts of basophils, neutrophils, lymphocytes, etc.). The suggestion on Biostars was to include the cell counts as a random effect in the edgeR model (edgeR is what I used for the DE analysis), but we were not sure how to do this. Does anyone have a suggestion for how to include these variables properly in edgeR?
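For comparison with the random-effect idea, here is a rough sketch of the simpler alternative: entering the cell composition as ordinary fixed-effect covariates in the design matrix. As far as I know, edgeR's GLM framework fits fixed effects only, so this is the form such a term would take there. The sketch is in Python/NumPy for illustration only (in R, `model.matrix` would build the equivalent design), it assumes the cell counts have been converted to proportions that sum to 1 per sample, and `build_design` and `is_full_rank` are hypothetical helper names. It mainly shows one pitfall: with an intercept, all proportions together are collinear, so one cell type has to be left out as the reference.

```python
import numpy as np

def build_design(condition, age, sex, cell_props, drop_cell_type=None):
    """Build a design matrix with an intercept, condition, standardized age, sex,
    and one column per cell-type proportion (optionally dropping one as reference).

    condition, sex: 0/1 arrays; age: numeric array;
    cell_props: dict mapping cell-type name -> per-sample proportion array.
    """
    age = np.asarray(age, dtype=float)
    n = len(age)
    columns = [
        np.ones(n),
        np.asarray(condition, dtype=float),
        (age - age.mean()) / age.std(),
        np.asarray(sex, dtype=float),
    ]
    names = ["intercept", "condition", "age_z", "sex"]
    for cell_type, props in cell_props.items():
        if cell_type == drop_cell_type:
            continue  # reference cell type, absorbed by the intercept
        columns.append(np.asarray(props, dtype=float))
        names.append("prop_" + cell_type)
    return np.column_stack(columns), names

def is_full_rank(design):
    """True if no column is a linear combination of the others."""
    return bool(np.linalg.matrix_rank(design) == design.shape[1])

# Simulated example (not real patient data): 40 samples, three cell types.
rng = np.random.default_rng(1)
n = 40
condition = np.repeat([0, 1], n // 2)
sex = np.tile([0, 1], n // 2)
age = rng.uniform(20, 80, size=n)
props = rng.dirichlet([5, 3, 1], size=n)  # each row sums to 1
cell_props = {
    "neutrophils": props[:, 0],
    "lymphocytes": props[:, 1],
    "basophils": props[:, 2],
}

all_types, _ = build_design(condition, age, sex, cell_props)
reduced, names = build_design(condition, age, sex, cell_props,
                              drop_cell_type="neutrophils")
print("all proportions, full rank:", is_full_rank(all_types))
print("one type dropped, full rank:", is_full_rank(reduced))
print(names)
```

If the measured cell types do not add up to the whole sample, the proportions no longer sum to 1 and the collinearity does not arise, but checking the rank of the design before fitting is still cheap.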
Finally, if anyone has additional advice on which variables to include and why, or on a method for selecting them, please feel free to share it with me!
All right, point taken! Thanks for your advice!
Any other thoughts about the experimental design and which factors/covariates to include in the first place?