The directed acyclic graph (DAG) is a powerful tool to model the interactions of high-dimensional variables. While estimating edge directions in a DAG often requires interventional data, one can estimate the skeleton of a DAG (i.e., an undirected graph formed by removing the direction of each edge in a DAG) using observational data. In real data analyses, the samples of the high-dimensional variables may be collected from a mixture of multiple populations. Each population has its own DAG while the DAGs across populations may have significant overlap. In this article, we propose a two-step approach to jointly estimate the DAG skeletons of multiple populations while the population origin of each sample may or may not be labeled. In particular, our method allows a probabilistic soft label for each sample, which can be easily computed and often leads to more accurate skeleton estimation than hard labels. Compared with separate estimation of skeletons for each population, our method is more accurate and robust to labeling errors. We study the estimation consistency for our method, and demonstrate its performance using simulation studies in different settings. Finally, we apply our method to analyze gene expression data from breast cancer patients of multiple cancer subtypes.
Keywords: Bi-level selection; DAG; PC algorithm; Probabilistic labels; Soft classifier.
© 2018, The International Biometric Society.