Background: Both somatic copy number alterations (CNAs) and germline copy number variants (CNVs) that are prevalent in healthy individuals can appear as recurrent changes in comparative genomic hybridization (CGH) analyses of tumors. In order to identify important cancer genes CNAs and CNVs must be distinguished. Although the Database of Genomic Variants (DGV) contains a list of all known CNVs, there is no standard methodology to use the database effectively.
Results: We develop a prediction model that distinguishes CNVs from CNAs based on the information contained in the DGV and several other variables, including segment's length, height, closeness to a telomere or centromere and occurrence in other patients. The models are fitted on data from glioblastoma and their corresponding normal samples that were collected as part of The Cancer Genome Atlas project and hybridized to Agilent 244 K arrays.
Conclusions: Using the DGV alone CNVs in the test set can be correctly identified with about 85% accuracy if the outliers are removed before segmentation and with 72% accuracy if the outliers are included, and additional variables improve the prediction by about 2-3% and 12%, respectively. Final models applied to data from ovarian tumors have about 90% accuracy with all the variables and 86% accuracy with the DGV alone.