Hi everyone, I'm doing some comparisons between models and for each of them I'm going to compute Akaike information Criterion (AIC) and Bayesian Information Criterion (BIC). When it comes to MAST hurdle model (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0844-5 ... under the chapter "Single-cell RNA-seq hurdle model"), I don't understand how to determine the number of parameters for the Continous part and for the Discrete one. I'll make a toy example:
library(truncnorm)
library(MAST)
# Experiment with 100 cells and 1000 genes
# Generate obs from a truncated gaussian
tn <- matrix(rtruncnorm(100000, a=1e-8, b=Inf, mean = 10, sd = 5),ncol = 100)
# Generate bernoulli obs
bn <- matrix(rbinom(100000,1,c(0.1,0.4,0.7,0.9)),ncol = 100)
# Adding zeroes
tn[bn==1] <- 0
# Converting to single-cell object
sca <- FromMatrix(tn)
# Compute cngeneson in order to have a discrete analog of global normalization
ngeneson <- apply(tn,2,function(x) mean(x>0))
CD <- colData(sca)
CD$ngeneson <- ngeneson
CD$cngeneson <- CD$ngeneson-mean(ngeneson)
colData(sca) <- CD
# hurdle model estimation
hm <- zlm(~ 1 + cngeneson ,method = "bayesglm", sca = sca)
# AIC calculation: How can I compute the number of parameters?
total_AIC <- sum(2 * num_params - 2 * hm@loglik)
total_BIC <- sum(log(100) * num_params - 2 * hm@loglik)
From the glm theory I remember that if I compute the difference between:
num_params = -(hm@df.null - hm@df.resid)
# That shoul correspond to = n - 1 - (n - p - 1) = p
Then I should add 1 for the intercepts in both Continous and Discrete parts and another 1 for the variance of the truncated normal in the Continous part? But I'm not very sure about this because I don't understand the way df.resid are calculated in this model. Can someone enlighten me please?
Hi OP, checking in to see if you found a solution to this issue. I'm also interested in using AICs to evaluate model fits in MAST.