Recently, I ran into a question about DESeq2.
I have a dataset with four groups (ctrl, a, b, c), and each group has three samples.
If I want to get DEGs for a vs ctrl, b vs ctrl, and c vs ctrl, there are two ways to do the calculation.
Way1: I construct a matrix with 12 columns (samples) and 20000 rows (protein-coding genes). Then I use DESeq() and results() to get the DEGs for a vs ctrl, b vs ctrl, and c vs ctrl.
Way2: for a vs ctrl, I construct a matrix with 6 columns (ctrl1, ctrl2, ctrl3, a1, a2, a3) and 20000 rows (protein-coding genes). Then I use DESeq() and results() to get the DEGs. I repeat this for the other comparisons.
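To make the two approaches concrete, here is a minimal sketch of what I mean. It is not my real analysis: the counts are simulated with makeExampleDESeqDataSet() as a stand-in for my matrix, only 1000 genes are used to keep it fast, and the object names (dds_all, dds_sub, res_way1, res_way2) are just for illustration.

```r
library(DESeq2)

set.seed(1)

# Simulated counts standing in for the real matrix (assumption: 1000 genes
# instead of 20000, 12 samples, no true differential expression).
counts <- counts(makeExampleDESeqDataSet(n = 1000, m = 12))

# Sample layout: ctrl1..ctrl3, a1..a3, b1..b3, c1..c3
groups <- rep(c("ctrl", "a", "b", "c"), each = 3)
colnames(counts) <- paste0(groups, rep(1:3, times = 4))
col_data <- data.frame(
  condition = factor(groups, levels = c("ctrl", "a", "b", "c")),
  row.names = colnames(counts)
)

# way1: fit one model on all 12 samples, then extract a vs ctrl
dds_all <- DESeqDataSetFromMatrix(counts, col_data, design = ~ condition)
dds_all <- DESeq(dds_all)
res_way1 <- results(dds_all, contrast = c("condition", "a", "ctrl"))

# way2: keep only the 6 ctrl and a samples, then fit and test
keep <- col_data$condition %in% c("ctrl", "a")
col_sub <- droplevels(col_data[keep, , drop = FALSE])
dds_sub <- DESeqDataSetFromMatrix(counts[, keep], col_sub, design = ~ condition)
dds_sub <- DESeq(dds_sub)
res_way2 <- results(dds_sub, contrast = c("condition", "a", "ctrl"))

# Compare the p-values of the same genes between the two ways
summary(res_way1$pvalue - res_way2$pvalue)
```

The same pattern would be repeated with "b" and "c" in place of "a" for the other two comparisons.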
The p-value for the same gene is very different between way1 and way2.
So I have two questions:
- Does including more samples influence how the GLM is fitted for each gene? Until now, I thought the GLM was fitted for each group independently. Is a difference in the fitted GLM the reason why the p-values differ?
- Samples within the same group show a strong batch effect. Given that, which way is more suitable for me: way1 or way2?
Thank you!