CNV-P: a machine-learning framework for predicting high confident copy number variations

PeerJ. 2021 Dec 2:9:e12564. doi: 10.7717/peerj.12564. eCollection 2021.

Abstract

Background: Copy-number variants (CNVs) are recognized as one of the major causes of genetic disorders, and disease research depends on reliable detection of CNVs from genome sequencing data. However, current CNV detection software has high false-positive rates and needs further improvement.

Methods: Here, we propose CNV-P, a novel post-processing approach for CNV prediction: a machine-learning framework that efficiently removes false-positive fragments from the results of CNV detection tools. A series of CNV signals around the putative CNV fragments, such as read depth (RD), split reads (SR) and read pairs (RP), were defined as features to train a classifier.
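As an illustration of the post-processing idea described in the Methods, the following minimal sketch turns RD, SR and RP signals into a feature vector and trains a classifier to separate true from false-positive calls. It is not the CNV-P implementation: the abstract does not name the classifier, so a small logistic regression is used here purely as a stand-in, and the `Candidate` fields, the feature definitions and the toy training data are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """One putative CNV call with hypothetical, pre-computed signals."""
    depth_inside: float    # mean read depth inside the call (RD)
    depth_flank: float     # mean read depth in the flanking regions (RD)
    split_reads: int       # split reads near the breakpoints (SR)
    discordant_pairs: int  # abnormal read pairs spanning the call (RP)


def features(c):
    """Map raw signals to a numeric feature vector (assumed definitions)."""
    rd_ratio = c.depth_inside / c.depth_flank if c.depth_flank > 0 else 1.0
    # Deviation from 1 so that deletions and duplications both score high.
    return [abs(rd_ratio - 1.0),
            math.log1p(c.split_reads),
            math.log1p(c.discordant_pairs)]


def sigmoid(z):
    z = max(-50.0, min(50.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def score(model, x):
    """Probability that a candidate with features x is a true CNV."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train(X, y, lr=0.1, epochs=500):
    """Stand-in classifier: logistic regression fitted by plain SGD."""
    model = ([0.0] * len(X[0]), 0.0)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = score(model, xi) - yi
            weights, bias = model
            model = ([w - lr * err * v for w, v in zip(weights, xi)],
                     bias - lr * err)
    return model


# Toy, made-up training set: label 1 = true CNV, 0 = false positive.
training = [
    (Candidate(15.0, 30.0, 6, 8), 1),
    (Candidate(45.0, 30.0, 4, 5), 1),
    (Candidate(29.0, 30.0, 0, 1), 0),
    (Candidate(31.0, 30.0, 0, 0), 0),
]
model = train([features(c) for c, _ in training],
              [label for _, label in training])


def keep(c, threshold=0.5):
    """Post-filter: keep only calls the classifier considers true."""
    return score(model, features(c)) >= threshold


if __name__ == "__main__":
    print(keep(Candidate(14.0, 30.0, 5, 7)))   # well-supported call
    print(keep(Candidate(30.0, 30.0, 0, 0)))   # call with no supporting signal
```

In practice the candidates would come from the output of a CNV detection tool and the signals from the alignment file; both steps are omitted here.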

Results: Prediction results on several real biological datasets showed that our models classify CNVs with over 90% precision and 85% recall, which greatly improves on the performance of state-of-the-art algorithms. Furthermore, our results indicate that CNV-P is robust to different CNV sizes and sequencing platforms.
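For readers less familiar with the two metrics reported above, the short sketch below shows how precision and recall are computed from true labels and classifier decisions. The function name and the example labels are illustrative only and are not taken from the study.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 = true CNV."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: fraction of calls kept by the classifier that are real.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of real CNVs that the classifier kept.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up labels: four real CNVs and three false positives.
    truth = [1, 1, 1, 1, 0, 0, 0]
    kept = [1, 1, 1, 0, 0, 0, 0]
    print(precision_recall(truth, kept))
```

High precision means few false-positive calls survive filtering, while high recall means few real CNVs are discarded by it.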

Conclusions: Our framework for classifying high-confidence CNVs could improve both basic research and clinical diagnosis of genetic diseases.

Keywords: Copy number variant; Genome sequencing; Machine learning.

Grants and funding

This project was supported by the National Key Research and Development Program of China (No. 2018YFC1004900), the National Natural Science Foundation of China (No. 81300075), and the Science, Technology and Innovation Commission of Shenzhen Municipality (grant Nos. JCYJ20170412152854656 and JCYJ20180703093402288). No additional external funding was received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.