Question

Modeling count data that is zero inflated with negative binomial hurdle regression

0

Entering edit mode

mrk.carty ▴ 30

@mrkcarty-7442

Last seen 9.7 years ago

United States

Hi all,

I am modeling count data that is zero-inflated. To model the data, I am using hurdle() function from the R pscl package. I am also refitting the model after the removal of outlier points. I have written a function to remove outliers from the data set. However, I think there is a bug in the code because the nothing as been removed. I noticed that gc and map covariates are not significant for non-zero part of the model, when I omit them I get better results. When I set map as an offset, the gc is very significant for the zero part of the model. I am wondering if this happening because these two covariates explain so little of variance.

Here is the code:

library(pscl)
library(VGAM)
library(data.table)
library(splines)

# Define function

remove.outliers <- function(dat,mod,f=qzanegbin){
  q1 <- f(0.025,size = mod$theta, munb=predict(mod,newdata = dat,
        dispersion = 1/mod$theta,type='response'),pobs=predprob(mod,newdata=dat)[,1])
  q2 <- f(0.975,size = mod$theta, munb=predict(mod,newdata = dat,
        dispersion = 1/mod$theta,type='response'),pobs=predprob(mod,newdata=dat)[,1])
  q <- sapply(seq(0,0.99,by=0.05),function(k) f(k,size = mod$theta,
        munb=predict(mod,newdata = dat,dispersion = 1/mod$theta,type='response'),
        pobs=predprob(mod,newdata=dat)[,1]))
  H <- apply(q,1,function(x){1.5*IQR(x)})
  counts <- dat$counts
  counts[counts < q1 - H] <- NA
  counts[counts > q2 + H] <- NA
  new.dat <- dat[which(!is.na(counts)),]
  return(new.dat)
}

# Main body of algorithm
x <- readRDS('WorkinProgress/Data.rds')
nrows <- nrow(x)
Intervals <- seq(0,2e6,by=500)
cls <- findInterval(x$D,Intervals,rightmost.closed = T)
# Binning the distances
bins <- split(seq_len(nrows),cls)
# Sampling 10% of data from each bin
idx <- unlist(lapply(bins,function(x){sample(x,floor(length(x)*0.01),replace = F)}))
dat <- x[idx,]
# Knots
knots <- quantile(dat$D,probs = c(0.25,0.5,0.75))
bdpts <- quantile(dat$D,probs=c(0,1))
df <- 6 # degrees of freedom

mod <- hurdle(counts ~ gc + map + bs(D,df=df,knots=c(knots),
      Boundary.knots=c(bdpts))| gc + map + bs(D,df=df,knots=c(knots),
      Boundary.knots=c(bdpts)),data=dat,dist='negbin')

# Remove outliers
new.dat <- remove.outliers(dat,mod)
# Refit the model
mod <- hurdle(counts ~ gc + map + bs(D,df=df,knots=c(knots),
        Boundary.knots=c(bdpts))| gc + map + bs(D,df=df,knots=c(knots),
        Boundary.knots=c(bdpts)),data=new.dat,dist='negbin')


> x
          counts     D         gc        map
       1:      2   239 -1.6960761 -0.6069708
       2:      2  4011 -1.1518852 -0.6399841
       3:      3  8875 -0.7788370 -0.5767152
       4:      0 14045 -1.1544460 -0.4697568
       5:      0 17502 -1.4958545 -0.6374427
      ---                                   
16053150:      7 16926  0.4619302 -0.6723117
16053151:     10  4429  0.6345688 -0.6723117
16053152:      4  7942  0.9654208 -0.6723117
16053153:      4  3309  1.4065476 -0.6723117
16053154:     20   307  1.2228625 -0.6723117
> length(which(x$counts==0))
[1] 14271361

cor(x,method = 'spearman')
             counts             D           gc           map
counts  1.000000000 -0.3445307605  0.003804754 -0.0034717774
D      -0.344530761  1.0000000000 -0.006490156 -0.0003908146
gc      0.003804754 -0.0064901556  1.000000000 -0.1263421203
map    -0.003471777 -0.0003908146 -0.126342120  1.0000000000

summary(mod)

Call:
hurdle(formula = counts ~ gc + map + bs(D, df = df, knots = c(knots), Boundary.knots = c(bdpts)) |
    gc + map + bs(D, df = df, knots = c(knots), Boundary.knots = c(bdpts)), data = new.dat,
    dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max
-0.9878 -0.2939 -0.1931 -0.1546 25.157

 sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines   stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.4 VGAM_0.9-7       pscl_1.4.8       lattice_0.20-30  MASS_7.3-39     

loaded via a namespace (and not attached):
[1] chron_2.3-45   grid_3.1.2     plyr_1.8.1     Rcpp_0.11.5    reshape2_1.4.1 stringr_0.6.2
[7] tools_3.1.2

pscl outliers • 2.6k views

ADD COMMENT • link updated 10.3 years ago by Sean Davis 21k • written 10.3 years ago by mrk.carty ▴ 30