How can I compute AIC for MAST hurdle model fitted in the whole dataset?
0
0
Entering edit mode
mcalgaro93 • 0
@mcalgaro93-17855
Last seen 8 months ago
Italy

Hi everyone, I'm doing some comparisons between models and for each of them I'm going to compute Akaike information Criterion (AIC) and Bayesian Information Criterion (BIC). When it comes to MAST hurdle model (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0844-5 ... under the chapter "Single-cell RNA-seq hurdle model"), I don't understand how to determine the number of parameters for the Continous part and for the Discrete one. I'll make a toy example:

library(truncnorm)
library(MAST)

# Experiment with 100 cells and 1000 genes
# Generate obs from a truncated gaussian
tn <- matrix(rtruncnorm(100000, a=1e-8, b=Inf, mean = 10, sd = 5),ncol = 100)
# Generate bernoulli obs
bn <- matrix(rbinom(100000,1,c(0.1,0.4,0.7,0.9)),ncol = 100)
tn[bn==1] <- 0
# Converting to single-cell object
sca <- FromMatrix(tn)
# Compute cngeneson in order to have a discrete analog of global normalization
ngeneson <- apply(tn,2,function(x) mean(x>0))
CD <- colData(sca)
CD$ngeneson <- ngeneson CD$cngeneson <- CD\$ngeneson-mean(ngeneson)
colData(sca) <- CD
# hurdle model estimation
hm <- zlm(~ 1 + cngeneson ,method = "bayesglm", sca = sca)

# AIC calculation: How can I compute the number of parameters?
total_AIC <- sum(2 * num_params - 2 * hm@loglik)
total_BIC <- sum(log(100) * num_params - 2 * hm@loglik)


From the glm theory I remember that if I compute the difference between:

num_params = -(hm@df.null - hm@df.resid)
#  That shoul correspond to = n - 1 - (n - p - 1) = p


Then I should add 1 for the intercepts in both Continous and Discrete parts and another 1 for the variance of the truncated normal in the Continous part? But I'm not very sure about this because I don't understand the way df.resid are calculated in this model. Can someone enlighten me please?