Question

Logistic regression in MAST - question

0

Entering edit mode

Nik Tuzov ▴ 80

@nik-tuzov-8783

Last seen 3 months ago

United States

Hello: I am trying to reproduce an example given in MAST reference for zlm() where a data frame with 500 points is used. I assume this data set describes one single feature, hence no shrinkage is applied.

I managed to reproduce the continuous part. The regression coefficients for the discrete part are slightly off. MAST output says that the null deviance is on 499 df (i.e. all 500 points are used), but the residual deviance is on 214 df, even though the logistic regression has only 3 parameters. What did I get wrong?

set.seed(314)
data<- data.frame(x=rnorm(500), z=rbinom(500, 1, .3))
logit.y <- with(data, x*2 + z*2); mu.y <- with(data, 10+10*x+10*z + rnorm(500))
y <- (runif(500)<exp(logit.y)/(1+exp(logit.y)))*1
y[y>0] <- mu.y[y>0]
data$y <- y
fit <- zlm(y ~ x+z, data)

summary.glm(fit$disc)
# Coefficients:
#   Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -0.07739    0.13713  -0.564    0.573    
# x            2.30682    0.21370  10.795  < 2e-16 ***
#   z            2.75969    0.35173   7.846 4.29e-15 ***

# Null deviance: 684.41  on 499  degrees of freedom
# Residual deviance: 396.20  on 214  degrees of freedom
# AIC: 402.2

summary.glm(fit$cont)

# Check whether the continous part can be obtained via lm()
cont.data <- data[data$y > 0, ]
lm.fit <- lm(y ~ x + z, data = cont.data )
summary(lm.fit)
# t-values are the same as in fit$cont.

# Check if the discrete part can be obtained via logistic regression
log.resp <- ifelse(data$y > 0, 1, 0)
logistic.fit <- glm(log.resp ~ data$x + data$z, family=binomial(link='logit'))
summary(logistic.fit)
# Coefficients:
#   Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -0.08737    0.13887  -0.629    0.529    
# data$x       2.38849    0.22437  10.645  < 2e-16 ***
# data$z       2.86694    0.36538   7.847 4.28e-15 ***

# Null deviance: 684.41  on 499  degrees of freedom
# Residual deviance: 396.04  on 497  degrees of freedom
# AIC: 402.04

# Note the difference in residual deviance df.

mast • 1.2k views

ADD COMMENT • link updated 5.1 years ago by Andrew_McDavid ▴ 270 • written 5.1 years ago by Nik Tuzov ▴ 80

score 1 · Answer 1 · 2019-03-11

1

Entering edit mode

Andrew_McDavid ▴ 270

@andrew_mcdavid-11488

Last seen 13 months ago

United States

Actually you still need to specify zlm(..., method = 'glm') to turn off shrinkage in the discrete part. Once you do so, you'll see that the coefficients match exactly. But the residual DOF is still off.

For reasons that are obscure to me at this point (to facilitate "per-term" likelihood ratio tests? ) under the hood, we use glm.fit, add some slots to it, then set the class to be "glm". For even more obscure reasons (I think having to do with how arm::bayesglm were calculated, which was unreasonable), we also over-ride the null and residual degrees of freedom. Honestly I'm not sure why the residual DOF are calculated the way they are for the logistic. Feel free to open a PR with a fix.

ADD COMMENT • link 5.1 years ago Andrew_McDavid ▴ 270

0

Entering edit mode

Hello Andrew:

Thanks for replying. If I don't want to create SingleCellAssay object, how can I put a number of transcripts in a data frame to do a similar analysis. Do I need to somehow add TranscriptID column to that simulated data frame?

ADD REPLY • link 5.1 years ago Nik Tuzov ▴ 80

1

Entering edit mode

It's unclear to me what you are trying to accomplish. If you want to test a sequence of responses (even if they aren't gene expression) with the Hurdle model, the easiest way I can see would be to make a SingleCellAssay object. If for some reason you don't want to do that, you could write a for-loop or use an *apply function.

ADD REPLY • link 5.1 years ago Andrew_McDavid ▴ 270