I am new to `DESeq2` myself, so take this answer with a grain of salt.

I was initially confused by the difference between a `Wald test` and a `t-test` as well. To explain why a Wald test is used, we first need to explain why we are fitting a generalized linear model (`GLM`).

A **lot** is going on under the hood in `DESeq`, and I would recommend reading the introduction document here to get a better understanding of it. For an explanation of why a linear model is being fit and how that is used, see here. But I can give you a brief summary based on my limited understanding.

The linear model mainly accomplishes three things:

1. Control for library/sample size.
2. Obtain a better estimate of the variation in each gene by pooling information across samples and genes.
3. Model experimental effects so you can focus on the differences you care about.
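To make the first point a bit more concrete, here is a minimal Python sketch of the median-of-ratios idea that `DESeq2` uses by default to estimate per-sample size factors. This is my own simplified illustration, not the package's code: the function name `size_factors` and the toy counts are made up, and the real implementation (in R) handles more edge cases.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample by the median-of-ratios idea.

    counts: list of genes, each a list of raw counts (one per sample).
    Hypothetical helper for illustration only.
    """
    n_samples = len(counts[0])
    # Keep only genes with no zero counts so the log geometric mean is defined.
    usable = [gene for gene in counts if all(c > 0 for c in gene)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in gene) / n_samples for gene in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the gene's geometric mean.
        log_ratios = [math.log(gene[j]) - m for gene, m in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: sample 2 was sequenced about twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
sf = size_factors(toy_counts)
```

Dividing each sample's counts by its size factor puts the samples on a comparable scale. As far as I understand, `DESeq2` does not divide the counts before fitting but carries the size factors into the model itself.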

**Very** roughly speaking, the model fits a parameter for each element of your experimental design, where each parameter indicates how much that element explains (or affects) changes in gene expression. For example, if you had samples that fell into two batches and three treatments and used the design `~ batch + treatment`, the linear model would fit parameters for batch and for treatment. (Strictly, with R's default coding one level of each factor serves as the reference, so you get an intercept, one batch coefficient and two treatment coefficients rather than 2 + 3.) Then, if you want to know whether `gene x` is significantly different between `treatment 1` and `treatment 2`, you compare their coefficients (parameters) for `gene x` using a `Wald test`. This is where the benefit of point #3 comes in: since the coefficients come from a model that includes batch, they can be thought of as representing the unique effect of `treatment x` after accounting for batch effects and the other treatments.
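Here is a rough Python sketch of what a Wald test on fitted coefficients boils down to: divide the estimate (or a difference of two estimates) by its standard error and compare the result to a standard normal distribution. The function names and all numbers below are hypothetical; in practice the estimates, variances and covariance come out of the fitted model.

```python
import math


def wald_test(estimate, std_error, null_value=0.0):
    """Two-sided Wald test for one coefficient (or one contrast)."""
    z = (estimate - null_value) / std_error
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def contrast_wald_test(beta_1, beta_2, var_1, var_2, cov_12):
    """Wald test for the difference between two coefficients."""
    diff = beta_1 - beta_2
    # Var(b1 - b2) = Var(b1) + Var(b2) - 2 * Cov(b1, b2)
    std_error = math.sqrt(var_1 + var_2 - 2.0 * cov_12)
    return wald_test(diff, std_error)


# Made-up coefficients for "treatment 1" and "treatment 2" of one gene.
z, p = contrast_wald_test(beta_1=2.0, beta_2=0.5, var_1=0.15, var_2=0.15, cov_12=0.025)
```

The covariance term matters because the two coefficients are estimated from the same model and the same samples, so they are generally not independent.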

As you may have noticed, this is where the `Wald test` comes in. Basically, if you want to compare coefficients from a regression, you use a `Wald test`; if you want to compare raw data, you use a `t-test`.
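For contrast, a `t-test` statistic is computed directly from the observed values in the two groups, with no model in between. A minimal sketch, assuming the unequal-variance (Welch) form and using made-up expression values; the function name `welch_t_statistic` is my own:

```python
import math
from statistics import mean, variance


def welch_t_statistic(group_a, group_b):
    """t statistic for two groups of raw values (Welch form, unequal variances)."""
    # Standard error of the difference in means, from the sample variances.
    std_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / std_error


# Hypothetical values of one gene in two treatment groups.
t_stat = welch_t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Note that the only inputs are the handful of values for that one gene in those two groups, which is why nothing like batch or library size can be accounted for here.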

So why not just do a `t-test` on the raw data? You can, but then you are not accounting for batch effects, library size, the fact that low-count genes have much higher variance than high-count genes, etc., all of which can impact the accuracy of your results.

`DESeq` uses linear modeling (and many other techniques) to try to capture and control for these sources of variation so that your comparisons are as accurate as possible. The majority of these techniques operate over the entire data set, allowing them to harness the power of all of that additional information. Compare that to a `t-test`, which only operates on a **very** small subset of your data, and you have the basic rationale behind using `DESeq2` or similar programs vs. just doing a bunch of `t-tests`.

**NOTE:** I am by no means an expert in this area, so some of what I have said may be (very) wrong. This just represents my current understanding of things as I am trying to learn what is going on myself.

written 4 months ago by wunderl (modified 4 months ago)