Question

DESeq2 GLM modelling and Wald's test

1


s.apocarpum ▴ 10

@sapocarpum-8616

Last seen 5.5 years ago

Netherlands

Hi, I recently posted a couple of comments on the Biostars web site: "Why do we need to fit a GLM before performing the Wald test itself (as I understand it, it is roughly just a simple t-test)? Why not just perform a Wald test on the count data?"

"In addition, when I perform a t-test in R, for example, I don't need anything except the trait observations in two groups. In particular, no coefficients are required for that. So why do I need to estimate coefficients in DESeq2 before the test itself? Could you elaborate on this, please (just for understanding)? Finally, as I understand it, the GLM output already contains p-values. So why not just use those to test for DE genes?"

But I was advised to ask this directly on the Bioconductor support site. So, could you clarify it for me, please?

Regards, Denis


deseq2 • 7.3k views

updated 5.5 years ago by wunderl ▴ 40 • written 5.5 years ago by s.apocarpum ▴ 10

score 4 · Answer 1 · 2019-06-14

I am new to DESeq2 myself so take this answer with a grain of salt.

I was initially confused by the difference between a Wald test and a t-test as well. To explain why a Wald test is used we first need to explain why we are fitting a generalized linear model (GLM).

A lot is going on under the hood in DESeq2, and I would recommend reading the introduction document here to get a better understanding of it. For an explanation of why a linear model is fit and how it is used, see here. But I can give you a brief summary based on my limited understanding.

The linear model mainly accomplishes three things:

Control for library/sample size.
Obtain a better estimate of the variation in each gene by pooling information across genes.
Model experimental effects so you can focus on the differences you care about.

Very roughly speaking, the model fits a parameter for each element of your experimental design, where each parameter indicates how much that element explains (or affects) changes in gene expression. For example, if you had samples that fell into two batches and three treatments and used the design ~ batch + treatment, the linear model would conceptually fit a parameter for each batch (2 parameters) and each treatment (3 parameters). (With R's default coding, one level of each factor serves as the reference and is absorbed into an intercept, so what is actually estimated is an intercept plus 1 batch coefficient and 2 treatment coefficients.) Then, if you want to know whether gene x is significantly different between treatment 1 and treatment 2, you compare their coefficients (parameters) for gene x using a Wald test. This is where the benefit of point #3 comes in: since the coefficients come from a model that includes batch, they can be thought of as representing the unique effect of each treatment after accounting for batch effects and the other treatments.
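To make the "one parameter per design element" idea concrete, here is a minimal sketch in plain Python of how a design like ~ batch + treatment can be expanded into columns. The sample sheet, the function name `design_matrix`, and the column names are made up for illustration; this is not DESeq2's own code. It assumes treatment (reference-level) coding, so the first level of each factor is folded into the intercept and the design ends up with four columns.

```python
# Hypothetical sample sheet: (batch, treatment) for six samples.
samples = [
    ("b1", "t1"), ("b1", "t2"), ("b1", "t3"),
    ("b2", "t1"), ("b2", "t2"), ("b2", "t3"),
]

def design_matrix(samples):
    """Treatment-coded design for ~ batch + treatment (first level = reference)."""
    batches = sorted({b for b, _ in samples})
    treatments = sorted({t for _, t in samples})
    # One intercept column, then one column per non-reference level.
    columns = ["Intercept"]
    columns += [f"batch_{b}_vs_{batches[0]}" for b in batches[1:]]
    columns += [f"treatment_{t}_vs_{treatments[0]}" for t in treatments[1:]]
    rows = []
    for b, t in samples:
        row = [1]
        row += [1 if b == level else 0 for level in batches[1:]]
        row += [1 if t == level else 0 for level in treatments[1:]]
        rows.append(row)
    return columns, rows

columns, rows = design_matrix(samples)
print(columns)
for row in rows:
    print(row)
```

Each gene gets one fitted coefficient per column. Under this coding, the "treatment t2 vs t1" comparison is a single coefficient, while "t3 vs t2" is a difference between two coefficients.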

As you may have noticed, this is where the Wald test comes in. Basically, if you want to compare coefficients from a regression, you use a Wald test; if you want to compare raw data, you use a t-test.

More on how a Wald test works here
Why you would choose to use a Wald test instead of a t-test can be found here
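As a rough illustration of the test itself: a Wald test divides an estimated coefficient by its standard error and compares the result to a standard normal distribution. The sketch below uses only the Python standard library; the function name `wald_test` and the numbers are hypothetical, and a real analysis would take the coefficient and standard error from the fitted model.

```python
from statistics import NormalDist

def wald_test(beta, se):
    """Two-sided Wald test of H0: beta == 0 against a standard normal."""
    z = beta / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical values for one gene: estimated log2 fold change and its standard error.
z, p = wald_test(beta=1.2, se=0.4)
print(f"z = {z:.2f}, p = {p:.4f}")
```

This also shows why a coefficient has to be estimated first: the test needs both the estimate and its uncertainty, and both come out of the model fit.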

So why not just do a t-test on the raw data? You can, but then you are not accounting for batch effects, library size, the fact that low-count genes have much higher relative variance (on the log scale) than high-count genes, etc., all of which can affect the accuracy of your results.
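To illustrate the library-size point, here is a minimal sketch of median-of-ratios size factors, the normalization idea described in the DESeq/DESeq2 papers. The toy counts and the function name `size_factors` are made up, and this is a simplified illustration, not the package's implementation.

```python
from math import exp, log
from statistics import median

# Hypothetical raw counts: rows are genes, columns are samples.
# Sample 3 was sequenced 4x deeper than sample 1; no gene truly changes.
counts = [
    [10, 20, 40],
    [100, 200, 400],
    [5, 10, 20],
    [50, 100, 200],
]

def size_factors(counts):
    """Median-of-ratios size factors; genes with any zero count are skipped."""
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Per-gene log ratio of this sample's count to the gene's geometric mean.
        log_ratios = [log(row[j]) - m for row, m in zip(usable, log_geo_means)]
        factors.append(exp(median(log_ratios)))
    return factors

sf = size_factors(counts)
normalized = [[c / s for c, s in zip(row, sf)] for row in counts]
print(sf)
print(normalized[0])
```

A t-test on the raw counts in this toy example would see large differences between samples that are due to sequencing depth alone; after dividing by the size factors, each gene is flat across samples.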

DESeq2 uses linear modeling (and many other techniques) to try to capture and control for these sources of variation so that your comparisons are as accurate as possible. The majority of these techniques operate over the entire data set, allowing them to harness the power of all of that additional information. Compare that to a t-test, which only operates on a very small subset of your data (one gene at a time), and you have the basic rationale behind using DESeq2 or similar programs vs. just doing a bunch of t-tests.
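For reference, the model described in the DESeq2 paper can be summarized roughly as follows. Here $K_{ij}$ is the count for gene $i$ in sample $j$, $s_j$ is the size factor of sample $j$, $\alpha_i$ is the gene's dispersion, $x_{jr}$ are the entries of the design matrix, and $\beta_{ir}$ are the coefficients that the Wald test is applied to. Treat this as a sketch of the main idea and see the paper for the full details.

$$
\begin{aligned}
K_{ij} &\sim \mathrm{NB}(\mu_{ij}, \alpha_i) \\
\mu_{ij} &= s_j \, q_{ij} \\
\log_2 q_{ij} &= \sum_r x_{jr} \, \beta_{ir} \\
\operatorname{Var}(K_{ij}) &= \mu_{ij} + \alpha_i \, \mu_{ij}^2
\end{aligned}
$$

The last line is why counts cannot simply be treated as equal-variance data: the variance depends on the mean, and the dispersion $\alpha_i$ has to be estimated for each gene.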

NOTE: I am by no means an expert in this area, so some of what I have said may be (very) wrong. This just represents my current understanding of things as I am trying to learn what is going on myself.

score 1 · Answer 2 · 2019-06-14

I'm not entirely clear on what you are asking, but I'll take a stab at covering some of the things I think you are inquiring about.

Why can't I just use a "normal" wald or t-test on my data? Why do I have to use some special machinery (like DESeq2)?

A number of methods have been developed to perform more accurate, powerful, sensitive (pick your favorite adjective) analyses of RNA-seq data. To understand why they were written, you should take some time to read a few of the publications written by the authors of some of the more popular approaches:

DESeq2: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
edgeR/QLF: There are many papers to choose from (see the references in the ?glmQLFit help page), but perhaps this one goes into more detail about why they've implemented this approach as opposed to just "doing it the normal way": No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data
limma/voom: voom: precision weights unlock linear model analysis tools for RNA-seq read counts

You can also look at the numerous benchmarking papers that test the performance of one method vs the rest to better understand why just doing "the normal" thing won't suffice. Perhaps you might start with this one:

A comparison of methods for differential expression analysis of RNA-seq data

When I perform T-test [in R], I don't need anything except trait observations in two groups ...

Are you asking why you sometimes add more coefficients to your model, or something else? If it's the former: because you want to account for other sources of variability in your data, so that you have more power to detect DGE for the particular group you are interested in. You can also choose to do the same (i.e. not include more covariates/coefficients) in DESeq2 (or similar)-land, if you like, but ...

If that's not what you are asking, perhaps you'll find your answer by reading through some of the publications listed above.