Question

What is the effect of setting modelMatrixType="standard"

0

Entering edit mode

Gregory Warnes ▴ 50

@gregory-warnes-2155

Last seen 8.3 years ago

United States

I'm comparing results generated by two tools that internally use DESeq2 but yield quite different results. The only relevent difference is that one tool calls DESeq with modelMatrixType="standard" and the other with modelMatrixType= "expanded".

Both are fitting the same model ( ~ GROUP), where group is a factor with four levels, and specifying a contrast in results using the names of the variable and factor levels (contrast = c("GROUP", "A", "C" ).

As far as I can tell, both code blocks should be test the same model, but I get substantially different results (log2FoldChange values, etc.).

What is the difference between the models fit when modelMatrixType="standard" and modelMatrixType= "expanded"?

deseq2 modelMatrixType DESeq • 2.2k views

ADD COMMENT • link updated 8.9 years ago by Michael Love 42k • written 8.9 years ago by Gregory Warnes ▴ 50

0

Entering edit mode

Below is example code showing the difference in results between "expanded" and "standard" model matrix, using dummy data that mirrors the structure of my data. The dummy data shows a regression slope of 0.81 and correlation of 0.98 between the two sets of results, while the actual data I am using have a regression slope of 0.11 and a correlation of 0.56.

# Create dummy data

set.seed(12345)
counts.table <- 0.45 + 2^matrix( rnorm(1e3 * 20, sd=5), ncol=20 )
counts.table[] <- round(counts.table)
mean(counts.table==0)

df <- data.frame(ID=as.numeric(300+1:20),
                 GROUP=c(rep("TMT_6H",  5),
                         rep("TMT_24H", 5),
                         rep("VEH_6H",  5),
                         rep("VEH_24H", 5)
                        )
                )

library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = counts.table,
                              colData = df,
                              design = ~ GROUP
                              )

## DESeq2 with "expanded" matrix
deseq.e <- DESeq(dds,
                 modelMatrixType="expanded"
                 )

res.e <- results(deseq.e,
                 contrast = c(0,1,0,-1,0),
                 independentFiltering=FALSE
                 )

## DESeq2 with "standard" matrix (matches tool's output)
deseq.s <- DESeq(dds,
                 modelMatrixType="standard")

res.s  <- results(deseq.s,
                  contrast = c(0,0,-1,0),
                  independentFiltering=FALSE)
                  )

# comparison of results

fit <- lm(res.s$log2FoldChange ~ res.e$log2FoldChange)
summary(fit)

plot(res.s$log2FoldChange ~ res.e$log2FoldChange,
     xlim=c(-15,15),
     ylim=c(-15,15),
     col="blue"
     )
grid()
abline(a=0, b=1, col="black", lty=2)
abline(fit, lty=3, col="red")

mean(sign(res.e$log2FoldChange) != sign(res.s$log2FoldChange) )

ADD REPLY • link 8.9 years ago Gregory Warnes ▴ 50

0

Entering edit mode

R version: 3.2.0

DESeq2 version: 1.8.1

ADD REPLY • link 8.9 years ago Gregory Warnes ▴ 50

0

Entering edit mode

Setting alpha=1 does not make sense (and I've just now added an error for future versions for giving an alpha=1). You want to provide a number which will be the target FDR for the filtering. alpha=0.1 is default, which corresponds to a target FDR of 10%. If you do not want to perform filtering you should set independentFiltering=FALSE.

"regression slope of 0.81 and correlation of 0.98"

Right so, while not identical, the two methods are obviously giving similar results. The methods are not identical, obviously, and the expanded design is preferable for shrinkage of LFC, as the results will be the same regardless of the choice of reference level for factors.

If I put in alpha=0.1, and build a contingency table you see the two methods largely agree on a 10% FDR set:

        FALSE TRUE
  FALSE   321    4
  TRUE     13  124

Anyway, for smaller correlations or regression slopes that you find between these two methods, it could be that the LFC do not show strong evidence of being larger or smaller than 0, and therefore you're just picking up differences in the methods because there is no real signal there.

ADD REPLY • link 8.9 years ago Michael Love 42k

0

Entering edit mode

Updated code to use independent Filtering=FALSE instead of alpha=1

ADD REPLY • link 8.9 years ago Gregory Warnes ▴ 50

0

Entering edit mode

NB: The actual code for fitting the data used independentFiltering=FALSE and alpha=1.

ADD REPLY • link 8.9 years ago Gregory Warnes ▴ 50

score 0 · Answer 1 · 2015-12-02

0

Entering edit mode

Gregory Warnes ▴ 50

@gregory-warnes-2155

Last seen 8.3 years ago

United States

OK, I found the answer for myself, in Love, Huber, and Anders (2015):

Expanded design matrices

For consistency with our software’s documentation, in the following text we will use the terminology of the R statistical language. In linear modeling, a categorical variable or factor can take on two or more values or levels. In standard design matrices, one of the values is chosen as a reference value or base level and absorbed into the intercept. In standard GLMs, the choice of base level does not influence the values of contrasts (LFCs). This, however, is no longer the case in our approach using ridge-regression-like shrinkage on the coefficients (described below), when factors with more than two levels are present in the design matrix, because the base level will not undergo shrinkage while the other levels do.

To recover the desirable symmetry between all levels, DESeq2 uses expanded design matrices, which include an indicator variable for each level of each factor, in addition to an intercept column (i.e., none of the levels is absorbed into the intercept). While such a design matrix no longer has full rank, a unique solution exists because the zero-centered prior distribution (see below) provides regularization. For dispersion estimation and for estimating the width of the LFC prior, standard design matrices are used.

So, in essence, using modelMatrixType="standard" in this context generates incorrect results by computing a contrast between a non-shrunken term for the first factor level and a shrunken term for the third factor level. Perhaps the current warning generated by the code could be improved to indicate this more clearly.

ADD COMMENT • link 8.9 years ago Gregory Warnes ▴ 50

0

Entering edit mode

re: "using modelMatrixType="standard" in this context generates incorrect results by computing a contrast between a non-shrunken term for the first factor level and a shrunken term for the third factor level."

No, this is not correct. Either model matrix type will give nearly the same results. The shrinkage is applied to the comparisons in either case. The reason for the expanded option was to ensure that the results were identical after changing the reference level. My guess is that there is something else different between the two wrapper software you are using (perhaps different versions of DESeq2).

ADD REPLY • link 8.9 years ago Michael Love 42k

0

Entering edit mode

I've looked at the vendor documentation and have written code that matches the tool output, showing that the only difference (that applies in my example) is modelMatrixType="standard".

I've also compared running DESeq2 with both modelMatrixTypes and the difference between the results are quite dramatic in my data.

If I understand the description above, the key issue here is that I am using a factor with 4 levels. When modelMatrixType="standard", the term for the first factor level is absorbed into the overall intercept, which is not shrunk, while the terms for the other factors levels are shrunk. When modelMatrixType="expanded", all of the terms for the 4-level factor are shrunk. When the contrast is computed, for "standard" it is simply the estimate of for the 3rd (shrunken) factor level, while for "expanded", it is the difference between the shrunken factor for the first level and the shrunken factor for the third level, yielding different results.

In the dummy-data example I provided, even the sign of the contrast is different for about 5% of the 'genes'. The results with my actual data are quite a bit more extreme, with the sign differing for about 12%.

ADD REPLY • link 8.9 years ago Gregory Warnes ▴ 50

0

Entering edit mode

"When modelMatrixType="standard", the term for the first factor level is absorbed into the overall intercept, which is not shrunk"

No, think about it like this: with standard model matrix, there is one intercept and 3 interesting betas: B vs A, C vs A, D vs A, and each of these 3 are shrunk toward 0. So you do end up shrinking all these comparison effects. With expanded model matrix, we have 4 betas, which represent the difference from a middle value between the four groups to the four groups. This also produces shrinkage of all comparisons between the four groups. Either way the comparisons between the four groups undergo shrinkage, but the first method has an asymmetry to the shrinkage, where the distances from B,C,D to A are used to set the prior, and these distances are shrunk. So the expanded method is preferable. I wouldn't get too hung up on the differences though. Different approaches produce different results, and when the data does not contain true signal, the differences between methods will be even larger.

ADD REPLY • link 8.9 years ago Michael Love 42k

score 0 · Answer 2 · 2015-12-02

0

Entering edit mode

Michael Love 42k

@mikelove

Last seen 1 day ago

United States

You will need to provide more information, including the exact code and versions used and more quantitative descriptions or examples of differences. The difference between results with standard and expanded model matrix is typically minimal. There could be other differences going on.

ADD COMMENT • link 8.9 years ago Michael Love 42k