Dear DESeq2 community:
I have a quick question about fixed effects. We performed an experiment in which roughly half the individuals were exposed to a treatment and the other half were kept under identical conditions but unexposed (controls). All tissues/individuals were sampled at the exact same time point. Half the individuals from each treatment group (both control and exposed) were run on one sequencing machine (single-end reads), and the other half were run at a later date on a different sequencing machine (paired-end reads); all individuals have ~20 million mapped reads.

If I analyze the two datasets separately (one dataset from each machine), I end up with 172 DE genes from dataset 1 (FDR-adjusted p-value < 0.05) and 874 DE genes from dataset 2. I can also combine the two datasets (using the same merged GTF file generated with StringTie) and then use “sequencingdate” (the equivalent of machine) as a fixed effect. The model is: design = ~ sequencingdate + treatment. After a standard DESeq2 analysis, this combined dataset yields 779 DE genes. The PCA of the combined dataset is available here: https://ibb.co/gzKLRZ0 and you can see a clear effect of sequencing machine (PC1), but also of treatment (PC2), for the 779 DE genes.

My understanding is that, as in a typical statistical model, the combined analysis would be more appropriate (and that it is not surprising that we can still see the fixed effect of sequencing machine). Thus, I am more inclined to use the downstream results from the combined dataset.

My question is: which of the two approaches (approach 1: datasets analyzed separately vs. approach 2: datasets combined and sequencingdate used as a fixed effect) is more appropriate? If approach 1 is more appropriate, could you please explain why?
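For concreteness, approach 2 can be sketched roughly as below. This is only an illustration, not the exact script that was run: the sample layout, the object names (`coldata`, `counts`), the run labels, and the simulated counts are all assumptions standing in for the real data. Only the design formula comes from the description above.

```r
# Sketch only: sample layout, object names, and simulated counts are assumptions.
library(DESeq2)

set.seed(1)

# Hypothetical layout: 12 samples, treatment balanced across two sequencing runs
coldata <- data.frame(
  sequencingdate = factor(rep(c("run1", "run2"), each = 6)),
  treatment      = factor(rep(c("control", "exposed"), times = 6),
                          levels = c("control", "exposed")),
  row.names      = paste0("sample", 1:12)
)

# Placeholder counts; in practice this would be the gene-level count matrix
# built from the merged annotation
counts <- matrix(
  rnbinom(1000 * 12, mu = 100, size = 5),
  ncol = 12,
  dimnames = list(paste0("gene", 1:1000), rownames(coldata))
)

# Additive model: sequencing run enters as a fixed effect alongside treatment
dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = coldata,
  design    = ~ sequencingdate + treatment
)
dds <- DESeq(dds)

# Treatment effect (exposed vs. control), adjusted for sequencing run
res <- results(dds, contrast = c("treatment", "exposed", "control"), alpha = 0.05)
summary(res)
```

With the variable of interest placed last in the design and an explicit `contrast`, the reported log2 fold changes are for exposed vs. control with the sequencing-run effect accounted for in the model.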
Many thanks,
Mark Christie http://christielab.bio.purdue.edu/
Thank you very much! This recommendation is more helpful than you might think.
Kind regards,
Mark