edgeR: Should I include all samples/conditions in my dispersion estimation while only testing certain contrasts?
Jurgen • 0
@6b3a35b0
Last seen 3 days ago
The Netherlands

Dear reader

I am using edgeR to do differential gene expression analysis on my mRNA-seq data.

I have a question about what would be the proper way to do my differential gene expression analysis.

Situation 1: Let's say I have an experiment with three conditions, A (control), B (stimulation 1) and C (stimulation 2), each with 5 replicates. I only want to test for differentially expressed genes between A&B and A&C; I don't care about the difference between B and C.

What is the proper way to do the differential gene expression analysis? Should I put all three conditions into my DGEList object, do normalization and dispersion estimation with the data from all three conditions, and then only at the glmTest phase indicate that I want to test A&B and A&C? This seems odd to me because, while testing A&B, the estimated dispersion used in the model is also based on the data from condition C, which is not relevant for this comparison.

Alternatively, I could first subset my data to include only conditions A and B, then estimate the dispersion and run the glmTest between A and B. The dispersion estimates that I use would then be based only on the data from A and B, which makes more sense to me.

For the comparison between A&C I will do the same, so first subset for A&C and then dispersion estimates and test.

Situation 2: Here I have four conditions (A, B, C, D), where I only want to test A&B and C&D (biologically it does not make sense to do any other comparisons). But the data were acquired in one experiment, the samples were sequenced together, and they are part of the same story, let's say. Basically, it is the same situation as above, except that here the controls are not shared between the two comparisons.

Again, should I put the data from all four conditions together in a DGEList, estimate the dispersion from all four conditions, and then only in the GLM test phase indicate that I want to test A&B and C&D? I have the same doubt about this approach: when testing A&B, the dispersion estimates that I use are also based on the data from C&D, which would then influence the results. Also, I consider the comparisons between A&B and C&D to be independent experiments that happened to be collected on the same day and sequenced in the same batch.

Again, alternatively, I could subset my data for A&B before calculating the dispersion (as described above).

The thing is that the number of differentially expressed genes differs between the two methods, which is why I am curious what would be the most valid way to do this (in both scenarios).

I am curious what you would suggest as most appropriate approach.

Best, Jurgen

edgeR • 520 views
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

You don't say which edgeR analysis pipeline you are using but I will assume you're using QL, as you were recommended to do in an earlier thread on this forum. Your reference to "glmTest" is a bit confusing because there is no function by that name in edgeR.

edgeR follows the same principles as ANOVA or General Linear Models. edgeR is designed to analyse complete experiments together rather than to subset datasets differently for every comparison you might make. edgeR assumes that the dispersion parameters are gene-specific but not sample-specific or group-specific. Keeping all the data together allows edgeR to estimate the dispersion parameters as reliably as possible. This also increases statistical power because it allows the F-tests for each comparison to be conducted on more degrees of freedom. On the other hand, edgeR is not designed to analyse unrelated experiments together. The different samples must be closely enough related that the level of biological variability is similar for all samples.
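To make the degrees-of-freedom point concrete, here is a minimal arithmetic sketch for Situation 1. It is plain Python, not edgeR code, and the helper `residual_df` is a hypothetical name introduced only for this illustration. It assumes a simple one-way layout with one coefficient per condition, and it counts only the residual df from the design; the QL pipeline's empirical Bayes step adds prior df on top of this, which is not modelled here.

```python
def residual_df(n_samples: int, n_groups: int) -> int:
    """Residual df for a one-way layout: samples minus fitted group means.

    Hypothetical helper for illustration only; not part of edgeR.
    """
    if n_samples <= n_groups:
        raise ValueError("need more samples than groups")
    return n_samples - n_groups


# Situation 1 from the question: conditions A, B, C with 5 replicates each.
REPLICATES = 5

df_all = residual_df(3 * REPLICATES, 3)     # A, B and C fitted together
df_subset = residual_df(2 * REPLICATES, 2)  # A and B only

print(f"all conditions together: {df_all} residual df")
print(f"A and B only:            {df_subset} residual df")
```

Under these assumptions the joint fit has 12 residual df available for the A vs B comparison, while the subset fit has 8, even though the contrast being tested is the same.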

In Situation 1 it is almost always better to analyse all the conditions together in the usual edgeR way. The only problem that might arise is that one of the conditions might have much more or less dispersion than either of the other two conditions but, even then, your suggestion to subset the dataset for each contrast would not solve the problem. You would still be comparing at least two conditions with different dispersions. In such a case it would be better to either find a covariate that corrects sample heterogeneity or switch to voomLmFit and estimate sample weights to reflect the heterogeneous dispersions.

In Situation 2, the A&B and C&D datasets would usually be analysed separately. I would analyse all four conditions together only if (a) the cell type was the same for all four conditions, (b) the AvsB and CvsD comparisons are part of the same scientific study and will be published in the same paper, and (c) the variability is not very different for the two datasets. In any other situation, I would analyse them separately.


Hi Gordon,

Thank you for your elaborate reply, this is super useful.

I was really struggling with deciding how to approach the analysis here. In most examples online I noticed that people usually analyse all data together but as stated above I was not sure if that would be the best case for my situation. So thanks a lot! Getting these useful and fast replies makes using a package like edgeR also much easier and more pleasant!

Regarding Situation 2: I am working with mosquitoes, and I used the same mosquitoes for A&B and C&D. However, A&B were exposed to E. coli bacteria and C&D were exposed to M. luteus bacteria. Within each comparison I then either silenced gene X (B/D) or performed mock silencing (A/C).

So although the same mosquito was used, the two comparisons were exposed to different bacteria which might influence the variability of the data. So maybe it is better to analyse them separately?

You wrote to check that "the variability is not very different for the two datasets". Is there a rule of thumb or some kind of threshold for deciding that the variability in the two datasets is too different, so that it is better to analyse them separately?

I was indeed using the QL pipeline (as you recommended before). Out of curiosity, would the same advice about analysing the data together instead of subsetting apply when using the Exact Test (classical approach) pipeline?

Thanks a lot,

Jurgen


Your Situation 2 experiment sounds to me as if it would be amenable to being analysed all together. It's your experiment though, I can't tell you how to analyse it. It's not a minefield in which any mistake leads to disaster. Rather, if you're not sure which way to go, then either approach is likely to be fine.

To judge whether variability is different between conditions, just make an MDS plot. If a difference in variability is obvious then there's a difference. If it isn't obvious, then any difference isn't big enough to worry about. It is a matter of judgement, not a formal test.

The advice regarding whether to analyse all samples together is the same regardless of which edgeR pipeline you use, but the effects are slightly different. exactTest doesn't take account of the number of df used to estimate the dispersion, so you don't necessarily get a gain in power from analysing all together, because you wouldn't be penalized for small samples in the first place. You will still be affected by uncertain estimation of the dispersion, but it might lead to a modest increase in the FDR rather than to a loss of power.
