I am using edgeR for differential gene expression analysis on my mRNA-seq data.
I have a question about the proper way to set up this analysis.
Situation 1: Let's say I have an experiment with three conditions, A (control), B (stimulation 1) and C (stimulation 2), each with 5 replicates. I only want to test for differentially expressed genes between A and B and between A and C; I don't care about the difference between B and C.
What is the proper way to do the differential gene expression analysis? Should I put all three conditions into my DGEList object, do the normalization and dispersion estimation with the data from all three conditions, and then only at the GLM testing step (e.g. glmLRT or glmQLFTest) indicate that I want to test A vs. B and A vs. C? This seems odd to me: when testing A vs. B, the dispersion estimate used in the model is also based on the data from condition C, which is not relevant for this comparison.
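To make this first option concrete, here is a minimal sketch in plain Python (not edgeR code) of the design matrix and the two contrasts I have in mind. The group-means parameterization (one column per condition, no intercept) and the contrast names are my assumptions for illustration only.

```python
# Illustration only: plain Python, no edgeR calls.
groups = ["A", "B", "C"]
replicates = 5

# One group label per sample: five A, five B, five C.
sample_groups = [g for g in groups for _ in range(replicates)]

# Group-means design: one column per condition, 1 if the sample belongs to it.
design = [[1 if sg == g else 0 for g in groups] for sg in sample_groups]

def contrast(treatment, control):
    """Contrast vector over `groups` for treatment minus control."""
    return [(g == treatment) - (g == control) for g in groups]

# The only two comparisons of interest; B vs. C is deliberately left out.
contrasts = {
    "B_vs_A": contrast("B", "A"),
    "C_vs_A": contrast("C", "A"),
}

for name, vec in contrasts.items():
    print(name, vec)
```

In this option all 15 samples enter the fit. Condition C gets a zero weight in the B vs. A contrast, but its samples still contribute to the dispersion estimate, which is the part I am unsure about.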
Alternatively, I could first subset my data to include only the samples from conditions A and B, then estimate the dispersion, and then run the GLM test between A and B. The dispersion estimates that I use would then be based only on the data from A and B, which makes more sense to me.
For the comparison between A and C I would do the same: first subset to A and C, then estimate the dispersion and test.
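The subset option can be sketched in the same way. This is again plain Python rather than edgeR, with hypothetical sample names. The residual degrees of freedom are computed as the number of samples minus the number of conditions, which assumes a simple one-factor design with one coefficient per condition.

```python
# Illustration only: plain Python, hypothetical sample names (A1..A5, etc.).
replicates = 5
samples = {g: [f"{g}{i}" for i in range(1, replicates + 1)] for g in "ABC"}

def subset(*conditions):
    """Samples that would go into one separate analysis."""
    return [s for g in conditions for s in samples[g]]

def residual_df(n_samples, n_conditions):
    # Assumes a one-factor design with one coefficient per condition.
    return n_samples - n_conditions

analyses = {
    "all (A, B, C)": ("A", "B", "C"),
    "A vs B only": ("A", "B"),
    "A vs C only": ("A", "C"),
}

summary = {}
for name, conds in analyses.items():
    used = subset(*conds)
    summary[name] = (len(used), residual_df(len(used), len(conds)))
    print(f"{name}: {summary[name][0]} samples, {summary[name][1]} residual df")
```

So the joint analysis would estimate the dispersion from 15 samples (12 residual degrees of freedom), while each subset analysis would use 10 samples (8 residual degrees of freedom).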
Situation 2: Here I have four conditions, A, B, C and D, and I only want to test A vs. B and C vs. D (biologically, no other comparison makes sense). However, the data were acquired in one experiment, the samples were sequenced together, and they are part of the same story, so to speak. This is basically the same situation as above, except that the control is not shared between the two comparisons.
Again, should I put the data from all four conditions together in one DGEList, estimate the dispersion from all four conditions, and then only at the GLM testing step indicate that I want to test A vs. B and C vs. D? I have the same doubt about this approach: when testing A vs. B, the dispersion estimates that I use are also based on the data from C and D, which would then influence the results. Also, I consider the A vs. B and C vs. D comparisons to be independent experiments that happened to be collected on the same day and sequenced in the same batch.
Alternatively, I could subset my data to A and B before estimating the dispersion (as described above), and do the same separately for C and D.
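For this second situation, the equivalent sketch looks like this. It is plain Python again, not edgeR. The sample names are hypothetical, and the 5 replicates per condition are an assumption carried over from Situation 1 purely for illustration.

```python
# Illustration only: plain Python. The replicate count is an assumption.
groups = ["A", "B", "C", "D"]
replicates = 5  # assumed for illustration

samples = {g: [f"{g}{i}" for i in range(1, replicates + 1)] for g in groups}

def contrast(treatment, control):
    """Contrast vector over `groups` for treatment minus control."""
    return [(g == treatment) - (g == control) for g in groups]

# Joint option: one model with four group columns, two contrasts of interest.
contrasts = {"B_vs_A": contrast("B", "A"), "D_vs_C": contrast("D", "C")}

# Subset option: two separate analyses on disjoint sets of samples.
subset_ab = set(samples["A"]) | set(samples["B"])
subset_cd = set(samples["C"]) | set(samples["D"])
shared = subset_ab & subset_cd

print(contrasts)
print("samples shared between the two subsets:", sorted(shared))
```

Unlike in Situation 1, the two subsets here have no samples in common, so in the subset option the two analyses would be completely separate.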
The number of differentially expressed genes differs between the two methods, which is why I would like to know which is the most valid way to do this in both scenarios.
What would you suggest as the most appropriate approach?