Analyzing Clustered Count Data With a Cluster Specific Random Effect Zero-Inflated Conway-Maxwell-Poisson Distribution

J Appl Stat. 2018;45(5):799-814. doi: 10.1080/02664763.2017.1312299. Epub 2017 Apr 8.

Abstract

In recent years, data analysis techniques based on different types of count distributions have been developed in biological and medical research. In particular, zero-inflated versions of parametric count distributions have been used to model the excess zeros that are often present in such data. Perhaps the most common count distribution used for analyzing such data is the Poisson distribution. However, a Poisson distribution, having a single underlying parameter, cannot accommodate any dispersion pattern other than equidispersion. A negative binomial distribution can model overdispersed, but not underdispersed, data. In contrast, a Conway-Maxwell-Poisson (CMP) distribution (Conway, R. W., and Maxwell, W. L., 1962) can handle both overdispersion and underdispersion. We show with an illustrative data set on next-generation sequencing of maize hybrids that both underdispersion and overdispersion can be present in genetic data. Furthermore, if count data consist of clustered observations, one of the most efficient statistical techniques is to introduce a cluster-specific random effect term. Once again, the maize hybrids data present such a situation. We develop inference procedures for a zero-inflated CMP regression that incorporates a cluster-specific random effect term. Unlike in Gaussian models, the underlying likelihood is computationally challenging. We use a numerical approximation via Gaussian quadrature to circumvent this issue. A test for zero-inflation is also developed in our setting. Finite sample properties of our estimators and test are investigated through extensive simulations. Finally, the statistical methodology is applied to analyze the maize data mentioned above.
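The likelihood described above can be illustrated with a minimal Python sketch. It is not the authors' implementation; it assumes a common parameterization in which the CMP mass function is proportional to lambda^y / (y!)^nu, a zero-inflation probability p, a log link with a normally distributed random intercept b ~ N(0, sigma^2) shared within a cluster, and a truncated series for the normalizing constant. The function names (cmp_log_z, zicmp_pmf, cluster_marginal_lik) and the truncation length are hypothetical choices made for illustration.

```python
import math
import numpy as np

def cmp_log_z(lam, nu, max_terms=200):
    """Log of the CMP normalizing constant sum_j lam^j / (j!)^nu.

    Assumption: the series is truncated at max_terms, which is adequate
    only when the terms have decayed by then (moderate lam, nu not near 0).
    """
    j = np.arange(max_terms)
    log_terms = j * math.log(lam) - nu * np.array([math.lgamma(k + 1) for k in j])
    m = log_terms.max()
    # log-sum-exp for numerical stability
    return m + math.log(np.exp(log_terms - m).sum())

def zicmp_pmf(y, lam, nu, p):
    """Zero-inflated CMP mass: extra mass p at zero, (1 - p) times CMP elsewhere."""
    log_cmp = y * math.log(lam) - nu * math.lgamma(y + 1) - cmp_log_z(lam, nu)
    return p * (y == 0) + (1.0 - p) * math.exp(log_cmp)

def cluster_marginal_lik(ys, eta, nu, p, sigma, n_nodes=20):
    """Marginal likelihood of one cluster, integrating out b ~ N(0, sigma^2)
    with Gauss-Hermite quadrature.

    ys  : counts observed in the cluster
    eta : fixed-effect linear predictors x'beta, one per observation
    """
    # Nodes and weights for the weight function exp(-t^2)
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    total = 0.0
    for t, w in zip(nodes, weights):
        b = math.sqrt(2.0) * sigma * t  # change of variable b = sqrt(2) * sigma * t
        cond = 1.0
        for y, e in zip(ys, eta):
            cond *= zicmp_pmf(y, math.exp(e + b), nu, p)
        total += w * cond
    return total / math.sqrt(math.pi)
```

Under this parameterization, nu = 1 recovers the Poisson distribution, nu > 1 gives underdispersion, and nu < 1 gives overdispersion. A full log-likelihood would sum the logarithm of cluster_marginal_lik over clusters and be maximized numerically over the regression coefficients, nu, p, and sigma.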

Keywords: Gaussian-Hermite (G-H) quadrature; Mixed effects model; Next-generation sequencing (NGS) data; Poisson distribution; Under- and over-dispersions.