I am new to DESeq2 myself, so take this answer with a grain of salt.
I was initially confused by the difference between a Wald test and a t-test as well. To explain why a Wald test is used, we first need to explain why we are fitting a generalized linear model (GLM).
A lot is going on under the hood in DESeq2, and I would recommend reading the introduction document here to get a better understanding of it. For an explanation of why a linear model is being fit and how it is used, see here. But I can give you a brief summary based on my limited understanding.
The linear model mainly accomplishes three things:
Control for library/sample size
Obtain a better estimate of the variation (dispersion) in each gene by pooling information across genes
Model experimental effects so you can focus in on the differences you care about.
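To make point #1 a bit more concrete, here is a minimal Python sketch of the median-of-ratios idea that DESeq2 uses to estimate a size factor per sample. This is only an illustration under my own assumptions: the function name size_factors and the toy counts are made up, and this is not DESeq2's actual code.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        # A gene with a zero count has no finite log geometric mean, so skip it.
        if min(gene) == 0:
            continue
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # One factor per sample: the median ratio to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]

# Toy data (made up): sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],
]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(gene, factors)] for gene in counts]
```

After dividing by the size factors, the first three genes have identical normalized counts in both samples, which is the point: a pure difference in sequencing depth should not look like differential expression.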
Very roughly speaking, the model fits a parameter (coefficient) for each term that your experimental design expands to, and each coefficient indicates how much that term explains (or affects) changes in gene expression. For example, if your samples fell into two batches and three treatments and you used the design ~ batch + treatment, the model would fit four coefficients per gene under R's default reference-level coding: an intercept (the expected expression at the reference batch and reference treatment), one coefficient for the second batch, and one for each of the two non-reference treatments. Then, if you want to know whether gene x is significantly different between treatment 1 and treatment 2, you take the coefficient for that comparison for gene x (with treatment 1 as the reference level, the treatment 2 coefficient is the log2 fold change of treatment 2 over treatment 1) and use a Wald test to ask whether it differs from zero. This is where the benefits of point #3 come in: since the coefficients come from a model that includes batch, they can be thought of as representing the unique effect of a given treatment after accounting for batch effects and the other treatments.
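As a rough illustration of how a design like ~ batch + treatment turns into coefficients, here is a small Python sketch that builds the design matrix by hand using reference-level (treatment) coding, which is R's default for unordered factors. The function name, the column labels, and the sample layout are my own, chosen for illustration only.

```python
def design_matrix(batches, treatments):
    """Build design-matrix rows for ~ batch + treatment with reference-level coding."""
    batch_levels = sorted(set(batches))
    treatment_levels = sorted(set(treatments))
    # The first level of each factor is the reference; it is absorbed into the intercept.
    columns = (
        ["Intercept"]
        + [f"batch_{b}_vs_{batch_levels[0]}" for b in batch_levels[1:]]
        + [f"treatment_{t}_vs_{treatment_levels[0]}" for t in treatment_levels[1:]]
    )
    rows = []
    for b, t in zip(batches, treatments):
        row = [1]  # intercept column
        row += [1 if b == lvl else 0 for lvl in batch_levels[1:]]
        row += [1 if t == lvl else 0 for lvl in treatment_levels[1:]]
        rows.append(row)
    return columns, rows

# Hypothetical layout: two batches (A, B) crossed with three treatments (1, 2, 3).
batches = ["A", "A", "A", "B", "B", "B"]
treatments = ["1", "2", "3", "1", "2", "3"]
columns, rows = design_matrix(batches, treatments)

for name in columns:
    print(name)
```

Each column corresponds to one fitted coefficient per gene. The column comparing treatment 2 against treatment 1 is the one a Wald test would examine for the treatment 1 vs. treatment 2 question.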
As you may have noticed, this is where the Wald test comes in. Basically, if you want to test coefficients from a regression, you use a Wald test; if you want to compare group means in the raw data directly, you use a t-test.
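For what it's worth, the Wald test itself is simple once you have a coefficient and its standard error: divide one by the other and compare the result to a standard normal distribution. Here is a minimal Python sketch. The function name and the input numbers are made up for illustration, and the hard part (estimating the coefficient and standard error from the GLM) is not shown.

```python
import math

def wald_test(coefficient, standard_error):
    """Return the Wald z statistic and its two-sided p-value under a standard normal."""
    z = coefficient / standard_error
    # Two-sided tail probability of a standard normal: P(|Z| >= |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical gene: estimated log2 fold change of 1.5 with a standard error of 0.5.
z, p = wald_test(1.5, 0.5)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The null hypothesis here is that the coefficient is zero, i.e. no difference between the two conditions for that gene.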
So why not just do a t-test on the raw data? You can, but you are not accounting for batch effects, library size, the fact that low-count genes are much noisier relative to their mean than high-count genes, etc., all of which can impact the accuracy of your results.
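To illustrate the point about low-count genes, here is a small Python sketch using the negative binomial variance, mean + dispersion * mean^2, which is the count model DESeq2 is built on. The dispersion value and the two means are made-up numbers, and I am assuming the same dispersion for both genes purely to keep the example simple.

```python
def nb_cv(mean, dispersion):
    """Coefficient of variation (sd / mean) for a negative binomial count."""
    variance = mean + dispersion * mean ** 2
    return variance ** 0.5 / mean

dispersion = 0.1  # hypothetical value, shared by both genes for simplicity

cv_low = nb_cv(5, dispersion)      # a low-count gene
cv_high = nb_cv(5000, dispersion)  # a high-count gene

print(f"low-count CV:  {cv_low:.3f}")
print(f"high-count CV: {cv_high:.3f}")
```

The squared CV works out to 1/mean + dispersion, so the 1/mean term dominates for low counts and the noise relative to the mean shrinks as counts grow. A t-test on each gene separately has no way of using this relationship.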
DESeq2 uses generalized linear modeling (and many other techniques) to try to capture and control for these sources of variation so that your comparisons are as accurate as possible. The majority of these techniques operate over the entire data set, allowing them to harness all of that additional information. Compare that to a t-test, which only operates on a very small subset of your data (one gene's counts at a time), and you have the basic rationale behind using DESeq2 or similar programs vs. just doing a bunch of t-tests.
NOTE: I am by no means an expert in this area, so some of what I have said may be (very) wrong. This just represents my current understanding of things as I am trying to learn what is going on myself.
Thanks for the comment.