With an increasing number of causal genes discovered for complex human disorders, it is crucial to assess the genetic risk of disease onset for individuals who are carriers of these causal mutations and compare the distribution of age-at-onset with that in non-carriers. In many genetic epidemiological studies aiming at estimating causal gene effect on disease, the age-at-onset of disease is subject to censoring. In addition, some individuals' mutation carrier or non-carrier status can be unknown due to the high cost of in-person ascertainment to collect DNA samples or death in older individuals. Instead, the probability of these individuals' mutation status can be obtained from various sources. When mutation status is missing, the available data take the form of censored mixture data. Recently, various methods have been proposed for risk estimation from such data, but none is efficient for estimating a nonparametric distribution. We propose a fully efficient sieve maximum likelihood estimation method, in which we estimate the logarithm of the hazard ratio between genetic mutation groups using B-splines, while applying nonparametric maximum likelihood estimation for the reference baseline hazard function. Our estimator can be calculated via an expectation-maximization algorithm which is much faster than existing methods. We show that our estimator is consistent and semiparametrically efficient and establish its asymptotic distribution. Simulation studies demonstrate superior performance of the proposed method, which is applied to the estimation of the distribution of the age-at-onset of Parkinson's disease for carriers of mutations in the leucine-rich repeat kinase 2 gene.
Keywords: Empirical process; Mixture distribution; Parkinson's disease; Semiparametric efficiency; Sieve maximum likelihood estimation.