Dear reader
I am using edgeR to do differential gene expression analysis on my mRNA-seq data.
I have a question about what would be the proper way to do my differential gene expression analysis.
Situation 1: Let's say I have an experiment with three conditions, A (control), B (stimulation 1) and C (stimulation 2), each with 5 replicates. I only want to test whether there are differentially expressed genes between A&B and A&C; I don't care about the difference between B and C.
What is the proper way to do the differential gene expression analysis? Should I put all three conditions into my DGEList object, do normalization and dispersion estimation with the data of all three conditions, and then only at the GLM testing step indicate that I want to test between A&B and A&C? This seems odd to me, because while testing A&B, the estimated dispersion that I am using in the model was also based on the data of condition C, which is not relevant for this comparison.
Alternatively, I could first subset my data to include only conditions A and B, then estimate the dispersion, and then run the GLM test between A and B. Then the dispersion estimates that I use are based only on the data from A and B, which makes more sense to me.
For the comparison between A&C I will do the same, so first subset for A&C and then dispersion estimates and test.
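For concreteness, the two options for Situation 1 might be sketched as below. This is only a minimal sketch using the QL pipeline: the counts are simulated as a stand-in for real data, and every object name (`counts`, `group`, `y`, `y_ab`, and so on) is a placeholder rather than anything from the actual analysis.

```r
library(edgeR)  # also attaches limma, which provides makeContrasts()

# Simulated counts stand in for the real data (assumption): 1000 genes, 15 samples
set.seed(1)
counts <- matrix(rnbinom(1000 * 15, mu = 50, size = 5), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:15)))
group <- factor(rep(c("A", "B", "C"), each = 5))

# Option 1: fit all three conditions together, test only the contrasts of interest
y <- DGEList(counts = counts, group = group)
y <- y[filterByExpr(y), , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
con <- makeContrasts(BvsA = B - A, CvsA = C - A, levels = design)
res_BvsA_all <- glmQLFTest(fit, contrast = con[, "BvsA"])
res_CvsA_all <- glmQLFTest(fit, contrast = con[, "CvsA"])

# Option 2: subset to A and B before normalization and dispersion estimation
keep <- group %in% c("A", "B")
group_ab <- droplevels(group[keep])
y_ab <- DGEList(counts = counts[, keep], group = group_ab)
y_ab <- y_ab[filterByExpr(y_ab), , keep.lib.sizes = FALSE]
y_ab <- calcNormFactors(y_ab)
design_ab <- model.matrix(~ group_ab)
y_ab <- estimateDisp(y_ab, design_ab)
fit_ab <- glmQLFit(y_ab, design_ab)
res_BvsA_sub <- glmQLFTest(fit_ab, coef = 2)  # coef 2 is B relative to A

# Residual df (samples minus coefficients) behind each dispersion estimate
c(all = ncol(y) - ncol(design), subset = ncol(y_ab) - ncol(design_ab))
```

With 5 replicates per condition, the combined fit has 15 - 3 = 12 residual degrees of freedom for estimating the dispersion, while the A&B subset has 10 - 2 = 8. The A&C comparison under Option 2 would repeat the subsetting with `c("A", "C")`.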
Situation 2: Here I have four conditions, A, B, C and D, where I only want to test between A&B and C&D (biologically it does not make sense to do any other comparisons). But the data were acquired in one experiment, the samples were sequenced together, and they are part of the same story, let's say. Basically, it is the same situation as above, only here the controls are not shared between the two comparisons.
Again, should I put the data from all four conditions together in a DGEList, calculate the dispersion based on the data of all four conditions, and then only at the GLM testing step indicate that I want to test A&B and C&D? I have the same doubt about this approach. If I do it like this, when testing between A&B, the dispersion estimates that I use are also based on the data from C&D, which would then influence the results. Also, I consider the comparisons between A&B and C&D as independent experiments that happened to be collected on the same day and sequenced in the same batch.
Again, alternatively, I could subset my data for A&B before calculating the dispersion (as described above).
The thing is that the number of genes that are differentially expressed differs between the two methods, which is why I am curious what would be the most valid way to do this (in both scenarios).
I am curious what you would suggest as most appropriate approach.
Best, Jurgen
Hi Gordon,
Thank you for your elaborate reply, this is super useful.
I was really struggling with deciding how to approach the analysis here. In most examples online I noticed that people usually analyse all data together but as stated above I was not sure if that would be the best case for my situation. So thanks a lot! Getting these useful and fast replies makes using a package like edgeR also much easier and more pleasant!
Regarding Situation 2: in this case I am working with mosquitoes, and I have been using the same mosquitoes for A&B and C&D. However, A&B were exposed to E. coli bacteria and C&D were exposed to M. luteus bacteria. Then, within each comparison, I silenced gene X (B/D) or performed mock silencing (A/C).
So although the same mosquito was used, the two comparisons were exposed to different bacteria which might influence the variability of the data. So maybe it is better to analyse them separately?
You wrote to check that "the variability is not very different for the two datasets". Is there a rule of thumb or some kind of threshold for deciding that the variability in the two datasets is too different, so that it is better to analyse them separately?
I was indeed using the QL pipeline (as you recommended before). Out of curiosity, would the same advice about analysing the data together instead of subsetting apply when using the Exact Test (classical approach) pipeline?
Thanks a lot,
Jurgen
Your Situation 2 experiment sounds to me as if it would be amenable to being analysed all together. It's your experiment though, I can't tell you how to analyse it. It's not a minefield in which any mistake leads to disaster. Rather, if you're not sure which way to go, then either approach is likely to be fine.
To judge whether variability is different between conditions, just make an MDS plot. If a difference in variability is obvious then there's a difference. If it isn't obvious, then any difference isn't big enough to worry about. It is a matter of judgement, not a formal test.
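A minimal sketch of such an MDS plot, assuming a four-condition layout like Situation 2. The counts are simulated and the object names and colours are placeholders, not part of the original analysis.

```r
library(edgeR)

# Simulated counts stand in for the real data (assumption): 1000 genes, 20 samples
set.seed(1)
counts <- matrix(rnbinom(1000 * 20, mu = 50, size = 5), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:20)))
group <- factor(rep(c("A", "B", "C", "D"), each = 5))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# Label each sample by condition and colour by condition (colours are arbitrary)
cols <- c(A = "black", B = "red", C = "blue", D = "orange")
mds <- plotMDS(y, labels = as.character(group), col = cols[as.character(group)])
```

On the plot, the thing to look at is whether the replicates of A and B are clearly more (or less) spread out than the replicates of C and D. The judgement is by eye, as described above.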
The advice regarding whether to analyse all samples together is the same regardless of which edgeR pipeline you use, but the effects are slightly different. exactTest doesn't take account of the number of df used to estimate the dispersion, so you don't necessarily get a gain in power from analysing all together, because you wouldn't be penalized for small samples in the first place. You will still be affected by uncertain estimation of the dispersion, but it might lead to a modest increase in the FDR rather than a loss of power.
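For the classic pipeline, the same comparison of "all together" versus "subset first" might be sketched like this. Again the counts are simulated and all object names are placeholders; the 0.05 FDR cutoff is only an illustrative choice.

```r
library(edgeR)

# Simulated counts stand in for the real data (assumption): 1000 genes, 20 samples
set.seed(1)
counts <- matrix(rnbinom(1000 * 20, mu = 50, size = 5), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:20)))
group <- factor(rep(c("A", "B", "C", "D"), each = 5))

# All four conditions together: dispersion estimated from every sample
y_all <- DGEList(counts = counts, group = group)
y_all <- calcNormFactors(y_all)
y_all <- estimateDisp(y_all)                     # no design given: classic mode, uses group
et_all <- exactTest(y_all, pair = c("A", "B"))   # B relative to A

# A and B only: dispersion estimated from the 10 A and B samples
keep <- group %in% c("A", "B")
y_ab <- DGEList(counts = counts[, keep], group = droplevels(group[keep]))
y_ab <- calcNormFactors(y_ab)
y_ab <- estimateDisp(y_ab)
et_sub <- exactTest(y_ab, pair = c("A", "B"))

# Number of genes called at FDR < 0.05 under each approach
c(all    = sum(topTags(et_all, n = Inf)$table$FDR < 0.05),
  subset = sum(topTags(et_sub, n = Inf)$table$FDR < 0.05))
```

Both calls to exactTest test the same pair; the only difference is which samples contributed to the dispersion estimates stored in the DGEList.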