Question

EdgeR: Should I include all samples/conditions in my dispersion estimation while only testing for certain contrasts

Jurgen • 0

The Netherlands

Dear reader

I am using edgeR to do differential gene expression analysis on my mRNA seq data.

I have a question about what would be the proper way to do my differential gene expression analysis.

Situation 1 Let's say I have an experiment with three conditions, A (control), B (stimulation 1) and C (stimulation 2), each with 5 replicates. I only want to test for differentially expressed genes between A&B and A&C; I don't care about the difference between B and C.

What is the proper way to do the differential gene expression analysis? Should I put all three conditions into my DGEList object, do normalization and dispersion estimation with the data from all three conditions, and then only at the glmTest phase indicate that I want to test between A&B and A&C? This seems odd to me because, while testing A&B, the estimated dispersion that I am using in the model is also based on the data from condition C, which is not relevant for this comparison.

Alternatively, I could first subset my data to include only conditions A and B, then estimate the dispersion and run my glmTest between A and B. The dispersion estimates that I use are then based only on the data from A and B, which makes more sense to me.

For the comparison between A&C I would do the same: first subset to A&C, then estimate the dispersion and test.

Situation 2 Here I have four conditions, A, B, C and D, where I only want to test between A&B and C&D (biologically it does not make sense to do any other comparisons). But the data were acquired in one experiment, and the samples were sequenced together and are part of the same story, let's say. Basically, it is the same situation as above, only here the controls are not shared between the two comparisons.

Again, should I put the data from all four conditions together in one DGEList, estimate the dispersion from all four conditions, and then only in the GLM test phase indicate that I want to test A&B and C&D? I have the same doubt about this approach: when testing A&B, the dispersion estimates that I use are also based on the data from C&D, which would then influence the results. Also, I consider the comparisons A&B and C&D to be independent experiments that happened to be collected on the same day and sequenced in the same batch.

Again, alternatively, I could subset my data to A&B before estimating the dispersion (as described above), and likewise for C&D.

The thing is that the number of differentially expressed genes differs between the two methods, which is why I am curious what would be the most valid way to do this (in both scenarios).

I am curious what you would suggest as most appropriate approach.

Best, Jurgen

updated 6 months ago by Gordon Smyth • written 6 months ago by Jurgen

score 2 · Accepted Answer · 2024-01-18

You don't say which edgeR analysis pipeline you are using but I will assume you're using QL, as you were recommended to do in an earlier thread on this forum. Your reference to "glmTest" is a bit confusing because there is no function by that name in edgeR.

edgeR follows the same principles as ANOVA or General Linear Models. edgeR is designed to analyse complete experiments together rather than to subset datasets differently for every comparison you might make. edgeR assumes that the dispersion parameters are gene-specific but not sample-specific or group-specific. Keeping all the data together allows edgeR to estimate the dispersion parameters as reliably as possible. This also increases statistical power because it allows the F-tests for each comparison to be conducted on more degrees of freedom. On the other hand, edgeR is not designed to analyse unrelated experiments together. The different samples must be closely enough related that the level of biological variability is similar for all samples.
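As a rough illustration of analysing the complete experiment together, the following is a minimal sketch of a quasi-likelihood (QL) workflow for Situation 1. The count matrix is simulated purely so that the block runs on its own; the object names (`counts`, `group`, `contr`, `res_BvsA`, `res_CvsA`) are assumptions and not part of the original question. The dispersions are estimated once from all 15 samples, and the two comparisons of interest are then expressed as contrasts.

```r
library(edgeR)

# Simulated counts for illustration only: 1000 genes, 3 groups x 5 replicates
set.seed(1)
ngenes <- 1000
group <- factor(rep(c("A", "B", "C"), each = 5))
counts <- matrix(rnbinom(ngenes * 15, mu = 100, size = 10), nrow = ngenes)
rownames(counts) <- paste0("gene", seq_len(ngenes))
colnames(counts) <- paste0(group, rep(1:5, times = 3))

# One DGEList holding every sample in the experiment
y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)

# Group-means design: one coefficient per condition
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Dispersions and the QL fit use all three conditions
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Only the comparisons of interest are tested
contr <- makeContrasts(BvsA = B - A, CvsA = C - A, levels = design)
res_BvsA <- glmQLFTest(fit, contrast = contr[, "BvsA"])
res_CvsA <- glmQLFTest(fit, contrast = contr[, "CvsA"])

topTags(res_BvsA)
```

In this layout the nominal residual degrees of freedom are 15 samples minus 3 group coefficients, compared with 10 minus 2 if the data were subset to A and B only, which is where the extra power mentioned above comes from.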

In Situation 1 it is almost always better to analyse all the conditions together in the usual edgeR way. The only problem that might arise is that one of the conditions might have much more or less dispersion than either of the other two conditions but, even then, your suggestion to subset the dataset for each contrast would not solve the problem. You would still be comparing at least two conditions with different dispersions. In such a case it would be better to either find a covariate that corrects sample heterogeneity or switch to voomLmFit and estimate sample weights to reflect the heterogeneous dispersions.
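The `voomLmFit` alternative with sample weights might look roughly like the sketch below. The data are simulated, with condition C deliberately given more variability, and all object names are assumptions made for illustration. `voomLmFit` is provided by edgeR, while `contrasts.fit`, `eBayes` and `topTable` come from limma.

```r
library(edgeR)
library(limma)

# Simulated counts for illustration only; condition C is noisier
# (smaller negative binomial size means larger dispersion)
set.seed(2)
ngenes <- 1000
group <- factor(rep(c("A", "B", "C"), each = 5))
size <- rep(c(20, 20, 4), each = 5)
counts <- sapply(size, function(s) rnbinom(ngenes, mu = 100, size = s))
rownames(counts) <- paste0("gene", seq_len(ngenes))
colnames(counts) <- paste0(group, rep(1:5, times = 3))

y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)

design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# voom precision weights combined with estimated per-sample quality weights
fit <- voomLmFit(y, design, sample.weights = TRUE)

# Same two contrasts as in the QL analysis
contr <- makeContrasts(BvsA = B - A, CvsA = C - A, levels = design)
fit2 <- contrasts.fit(fit, contr)
fit2 <- eBayes(fit2, robust = TRUE)

topTable(fit2, coef = "BvsA")
```

All samples stay in one model here as well; the sample weights simply let the more variable samples contribute less to each comparison.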

In Situation 2, the A&B and C&D datasets would usually be analysed separately. I would analyse all four conditions together only if (a) the cell type was the same for all four conditions, (b) the AvsB and CvsD comparisons are part of the same scientific study and will be published in the same paper, and (c) the variability is not very different for the two datasets. In any other situation, I would analyse them separately.
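For Situation 2, a separate analysis of each pair could be sketched as follows. The helper `analyse_pair` is hypothetical, the counts are simulated, and comparing the common biological coefficient of variation (BCV) of the two subsets is only one informal way to look at condition (c); it is an assumption of this sketch rather than a procedure prescribed above.

```r
library(edgeR)

# Simulated counts for illustration only: 4 groups x 5 replicates
set.seed(3)
ngenes <- 1000
group <- factor(rep(c("A", "B", "C", "D"), each = 5))
counts <- matrix(rnbinom(ngenes * 20, mu = 100, size = 10), nrow = ngenes)
rownames(counts) <- paste0("gene", seq_len(ngenes))
colnames(counts) <- paste0(group, rep(1:5, times = 4))

# Hypothetical helper: QL analysis restricted to two conditions.
# pair[1] is treated as the reference level.
analyse_pair <- function(counts, group, pair) {
  sel <- group %in% pair
  g <- factor(group[sel], levels = pair)
  y <- DGEList(counts = counts[, sel], group = g)
  keep <- filterByExpr(y)
  y <- y[keep, , keep.lib.sizes = FALSE]
  y <- calcNormFactors(y)
  design <- model.matrix(~ g)
  y <- estimateDisp(y, design)
  fit <- glmQLFit(y, design)
  list(test = glmQLFTest(fit, coef = 2),
       bcv = sqrt(y$common.dispersion))
}

ab <- analyse_pair(counts, group, c("A", "B"))
cd <- analyse_pair(counts, group, c("C", "D"))

# Informal look at whether variability is similar in the two datasets
c(AB = ab$bcv, CD = cd$bcv)

topTags(ab$test)
topTags(cd$test)
```

If the two BCV values are close and conditions (a) and (b) also hold, fitting all four conditions in one model with two contrasts would be the alternative.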