edgeR shows different results for same dataset
2
0
Entering edit mode
@f68bdf4c
Last seen 1 hour ago
Norway

I am using the edgeR for 2 datasets, but one dataset is a subset of another one. I have 3 different diet (D1, D2, D3), and 3 time points (T1, T2, T3). One dataset looks like-

table(mapping_file4$Time, mapping_file4$Diet)
ctr  D1  D2  D3
T1  12  12  12  12
T2  12  12  12  12
T3  12  12  12  12

mapping_file4$group<-paste(mapping_file4$Diet, mapping_file4$Time, sep = ".") y <- DGEList(counts=rawCounts_total4, genes=row.names(rawCounts_total4)) y <- calcNormFactors(y) design <- model.matrix(~ 0 + group, data=mapping_file4) rownames(design) <- colnames(y) y <- estimateDisp(y, design, robust=TRUE) fit <- glmQLFit(y, design, robust=TRUE) qlt <- glmQLFTest(fit, contrast=makeContrasts(groupmc1.T1 -groupctr.T1, levels=design)) topgenes<-topTags(qlt, n=dim(row.names(rawCounts_total4))) table(topgenes$table$FDR<0.05) FALSE TRUE 69363 26  But when I use another dataset, which is a subset of the above dataset, that looks like- table(mapping_file4$Time, mapping_file4$Diet) ctr D1 T1 12 12 T2 12 12 T3 12 12 qlt <- glmQLFTest(fit, contrast=makeContrasts(groupmc1.T1 -groupctr.T1, levels=design)) topgenes<-topTags(qlt, n=dim(row.names(rawCounts_total4))) table(topgenes$table\$FDR<0.05)
FALSE  TRUE
69378    11


I used the same commands for the above dataset as well. My question is why I am getting different differentially expressed genes for the same Diet (D1) compared with control). But the 11 genes are the subset of 26 genes.

Many thanks!

edgeR • 382 views
2
Entering edit mode
@gordon-smyth
Last seen 4 hours ago
WEHI, Melbourne, Australia

In the first case with four groups you have 44 degrees of freedom to estimate the dispersion (44 = 48 observations minus 4 mean parameters) and the edgeR QL F-test will have 44 degrees of freedom for its denominator. In the second case with two groups you have only 22 degrees of freedom to estimate the dispersion and edgeR QL F-test will have only 22 degrees of freedom for the denominator. F-tests with more degrees of freedom have more power. This is a general principle, not just of limma and edgeR, but for anova and linear models in general.

We recommend you use edgeR with the complete dataset. As you have found, it gives more power and better results.

0
Entering edit mode
@steve-lianoglou-2771
Last seen 15 hours ago
United States

The simple/hand-wavy answer is that edgeR (and limma::voom (and DESeq2)) leverage all of the data in a statistically rigorous manner to say something (pvalues, in this case) about any individual gene in particular.

When you remove samples from the analysis, you reduce the power you have to do so.

I'd argue, though, that the results you are seeing aren't really all that different. Dollars to donuts you are simply observing a thresholding effect here where you have marginal differences in pvalues/FDR's that pushing a few genes above or below your threshold for statistical significance.

To visualize this, you might try plotting something like the -log10(pval) for each gene from the two different analysis against each other, or perhaps the -log10(pval) * its logFC to get something like a "signed pvalue" ... i like to use the t-statistic when doing something similar when using limma::voom, but you don't get those here. Anyway, you should see that the points on this scatter plot will be floating around the x = y 45° line.

0
Entering edit mode

Thank you for the explanation, but I still don't understand how removing the other samples, in this case removing other Diet samples (D2 and D3) reduces the power for D1 group comparing it with control samples. In both the datasets, I am using the contrast function to compare between D1 and control at time point 1.

Lastly, what would you recommend doing in this case? If I want to compare D1 with control at time point 1, should I include all the samples (in this case D2 and D3) or subsetting it?

1
Entering edit mode

I'm having a hard time finding an edgeR specific answer to illustrate how more samples help with statistical power, but take a look at this post by Michael Love where he explains how having more samples helps you to better estimate your gene level dispersions in DESeq2. In particular, pay attention to this:

If we assume that the dispersion is roughly similar for different groups of samples, then we have better estimates of that parameter by putting all the samples together for dispersion estimation. We have a FAQ in the vignette about when this is a good idea or when not.

... and note that there is an entry in the DESeq2 faq that can help explain more.

Although edgeR and DESeq2 have their differences, you can imagine a "similar-enough" thing is happening in the edgeR world. There is always the primary literature you can read through if you really want to know the gory details, but perhaps this may be enough to guide your intuition.

As for my recommendation in this case, since the samples are "similar enough" (conducted in same experiment, same paradigm, same cell type, etc), I'd keep all of the samples together for the analysis, then just pull out the stats for the contrast of interest.