Parametric modeling of whole-genome sequencing data for CNV identification

Saran Vardhanabhuti; X Jessie Jeng; Yinghua Wu; Hongzhe Li

doi:10.1093/biostatistics/kxt060

Biostatistics. 2014 Jul;15(3):427-41. doi: 10.1093/biostatistics/kxt060. Epub 2014 Jan 28.

Authors

Saran Vardhanabhuti¹, X Jessie Jeng², Yinghua Wu³, Hongzhe Li⁴

Affiliations

¹ Harvard School of Public Health, 651 Huntington Avenue, Boston, MA 02115, USA.
² Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.
³ Division of Biostatistics, University of Pennsylvania, Philadelphia, PA 19104, USA.
⁴ Division of Biostatistics, University of Pennsylvania, Philadelphia, PA 19104, USA. Email: hongzhe@upenn.edu

Abstract

Copy number variants (CNVs) constitute an important class of genetic variants in the human genome and have been shown to be associated with complex diseases. Whole-genome sequencing provides an unbiased way of identifying all the CNVs that an individual carries. In this paper, we consider parametric modeling of the read depth (RD) data from whole-genome sequencing, using both Poisson and negative-binomial models for these count data, with the aim of identifying CNVs. We propose a unified approach that uses a mean-matching variance stabilizing transformation to turn the relatively complicated problem of sparse segment identification for count data into a sparse segment identification problem for a sequence of Gaussian data. We then apply the optimal sparse segment identification procedure to the transformed data to identify the CNV segments. This provides a computationally efficient approach for RD-based CNV identification. Simulation results show that this approach often yields few false CNV identifications and performs similarly to or better than other RD-based approaches in identifying the true CNVs. We demonstrate the methods using the trio data from the 1000 Genomes Project.

Keywords: Natural exponential family; Sparse segment identification; Variance stabilization.
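
The two-step idea in the abstract (stabilize the variance of the counts, then look for short segments in the roughly Gaussian sequence) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the Poisson root transform 2*sqrt(x + 1/4), median centering, the threshold sqrt(2 log(n * max_len)), the greedy non-overlapping selection, the function names poisson_vst and scan_segments, and all simulation parameters.

```python
import numpy as np

def poisson_vst(counts, c=0.25):
    """Root transform 2*sqrt(x + c); roughly unit variance for Poisson counts.
    The constant c = 1/4 is an assumption of this sketch."""
    return 2.0 * np.sqrt(np.asarray(counts, dtype=float) + c)

def scan_segments(z, max_len, threshold):
    """Greedy scan over all windows of length 1..max_len.

    The statistic of a window [start, end) is sum(z[start:end]) / sqrt(length).
    Windows with |statistic| > threshold are accepted from strongest to
    weakest, skipping any window that overlaps one already accepted.
    Returns a list of (start, end, statistic) sorted by start.
    """
    n = len(z)
    csum = np.concatenate(([0.0], np.cumsum(z)))
    candidates = []
    for length in range(1, max_len + 1):
        sums = csum[length:] - csum[:-length]  # sums of every window of this length
        stats = sums / np.sqrt(length)
        for start in np.flatnonzero(np.abs(stats) > threshold):
            s = float(stats[start])
            candidates.append((abs(s), int(start), int(start) + length, s))
    candidates.sort(reverse=True)  # strongest evidence first
    taken = np.zeros(n, dtype=bool)
    segments = []
    for _, start, end, stat in candidates:
        if not taken[start:end].any():
            taken[start:end] = True
            segments.append((start, end, stat))
    return sorted(segments)

# Simulated read depth (hypothetical values): baseline mean 40 reads per
# window, with a one-copy deletion (mean 20) on windows 800..829.
rng = np.random.default_rng(0)
n, max_len = 2000, 50
lam = np.full(n, 40.0)
lam[800:830] = 20.0
depth = rng.poisson(lam)

y = poisson_vst(depth)
z = y - np.median(y)  # center on the baseline; variance is already close to 1
threshold = np.sqrt(2.0 * np.log(n * max_len))
segments = scan_segments(z, max_len, threshold)
best = max(segments, key=lambda seg: abs(seg[2]))
print(best)
```

In this simulation the strongest segment should cover the simulated deletion and carry a negative statistic, since read depth drops there. Weaker segments that also pass the threshold may be false identifications; real RD data would additionally need handling of overdispersion and other sources of bias that this sketch ignores.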

Publication types

Research Support, N.I.H., Extramural

MeSH terms

DNA Copy Number Variations / genetics*
Genome, Human / genetics*
Humans
Models, Statistical*
Sequence Analysis, DNA / methods*