Estimation of non-trivial contrasts in MAST
Entering edit mode
Nik Tuzov ▴ 70
Last seen 5 weeks ago
United States


Could you tell me if there is an easy way to estimate non-trivial contrasts in MAST? E.g. there are two factors, A and B, with 3 and 5 levels, the model is Y = A + B + A*B, and I want to test (A1, B2) vs (A2, B5). I assume the answer is no because a similar question was asked of DESeq2 a while ago and MAST doesn't look different in that sense.

mast • 517 views
Entering edit mode
Last seen 4 months ago

Yes, this is possible, and AFAIK is possible in DESeq2, too. It's easy to get tripped up by R builds design matrices from the symbolic formula that contain interactions. When in doubt, expand the interactions explicitly with e.g. interaction.

ng = 10
nc = 20
cd = expand.grid(A = factor(1:3), B = factor(1:5), rep = 1:nc)
cd$AB = with(cd, interaction(A, B))
mm = model.matrix(~AB, data = cd)
eta = mm %*% c(3, -2, 0, # (A=2, B=1) - (A=1, B=1), not tested
               2, #(A=1, B=2) - (A=1, B=1), tested
               rep(0, 11)) # rest null 
U = matrix(eta, nrow = length(eta), ncol = ng)
U = U + rnorm(length(U))
PV = as.vector( exp(eta) / (1 + exp(eta)))
V =  1*(runif(length(U)) < PV)
dim(V) = dim(U)
Y = U*V
sca = FromMatrix(t(Y),cData = cd)
zz = zlm(~AB, sca = sca)

waldTest(zz, Hypothesis("AB2.5 - AB1.2"))
# Similar to (up to sampling error)
waldTest(zz, Hypothesis("-AB1.2"))
# The AB2.5 coefficient should be approx. zero
waldTest(zz, Hypothesis("AB2.5"))

Really, this is more of a base-R question. Forsake not your prophets Venables and Ripley. Chapter 6 shall provide the Light. the Truth, and the Way.

Entering edit mode

Hello Andrew:

Thanks for replying. I also have a MAST specific question. The paper describes how shrinkage is done but it says little about why it was necessary in the first place. What kind of evidence did you get to justify the application of shrinkage?

Regards, Nik

Entering edit mode

There are two forms of shrinkage happening, and they were developed after a fair amount of iterative exploration of data at the time, though little of this made it into the paper. One for the logistic regression coefficients, and one for the linear regression $\sigma^2$. In practice, the first one is probably more important, because it results in reasonable point estimates (and Wald confidence intervals) when there is linear separation, which is the rule rather than the exception.

The shrinkage on $\sigma^2$ is often less important given the larger number of replicates in scRNAseq, but can still have an impact since the number of cells with expression $>0$ varies from gene to gene. If filtering for the amount of expression hasn't occurred, then sometimes you are trying to make inference with only a handful of cells.

If you are curious what the $\sigma^2$ shrinkage is doing to your data you can get the "effective number of cells" and "prior precision" from the model object,


and turn off shrinkage with zlm(..., method = 'glm', ebayes = FALSE). In the simulation I ran you can see that the effective degrees of freedom is extremely high, reflecting the small number of genes and exact homogeneity in the variance. The variance is overestimated because by default, it is fit just using the intercept (the H0 model). See ?ebayes if you are curious for more details.

Entering edit mode

Hello Andrew:

You are probably familiar with this paper of Soneson and Robinson. Their overall conclusion is that the best single-cell-specific method was MAST, but it did as well as the best of bulk RNA methods. Do you think they got something wrong?

Entering edit mode

Hello Andrew:

What happens in contrast estimation if there is a quantitative covariate? E.g. I have the following toy data set:

y <- c(0.0, 0.0, 0.01981825833, 0.40109357786, 0.34727231426, 0.03889038132, 0.0, 0.0,
      0.11225699507, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.90902480961)
A <- factor(c(rep("A1", 8), rep("A2", 8)))
Cov <- c(1.1, 5.5, -3.2, 4, 7.7, 2.2, 0.5, -4.4, 6.6, 3.3, 8.8, 77.1, 55, 44, 33, 22)
data <- data.frame(Cov, A)

After I run zlm() w/o shrinkage, the hurdle logFC is:

logFC      AA2 -0.1119203593

Given the formula from the reference:

u(contrast1)v(contrast1) − u(contrast0)v(contrast0)

the question is how to set the value of Cov. In general, it can/should be set at its mean, but the means in the continuous and discrete parts are different because the continuous part uses only 6 observations out of 16. I'd say one should use two distinct means here, but I didn't manage to reproduce logFC. I also tried using just one mean obtained from all the observations and it didn't work either. When I use just A in the model, I reproduce logFC, so apparently the discrepancy is due to Cov. Please help me clarify this.

Entering edit mode

What you wrote seems accurate. If not otherwise specified, eg with logFC(contrast0 = ...), the coefficient of your continuous covariate Cov will be set to zero, which in this case would mean that Cov is equal to zero. This may not be desirable, though it depends on what the covariate is.

If you center a continuous covariate, so it has mean zero over all observations, then setting the coefficient to zero would be equivalent to estimating the logFC with the covariate set to its mean over all observations (both zero and non-zero). There's no provision to center it with respect to both all and non-zero observations, since this would mean a different contrast0 and contrast1 for each gene.

Entering edit mode

There is a definite difference between MAST in DESeq2 when designs with interactions are considered. In MAST, all of the regression coefficients are shrunken in the logistic part no matter what, whereas in DESeq2 the shrinkage of regression coefficients is disabled when interactions are present:


Login before adding your answer.

Traffic: 355 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6