Our hypothesis is that machine learning (ML) analysis of whole exome sequencing (WES) data can be used to identify individuals at high risk for schizophrenia (SCZ). This study applies ML to WES data from 2,545 individuals with SCZ and 2,545 unaffected individuals, accessed via the Database of Genotypes and Phenotypes (dbGaP). Single nucleotide variants and small insertions and deletions were annotated with ANNOVAR against the reference genome hg19/GRCh37. We selected rare, predicted functional variants (missense, frameshift, stop gain, stop loss, intronic, and exonic splicing variants) with a minor allele frequency ≤1% and genotype quality ≥90. The input to the ML analysis was a file listing all cases and controls, the names of genes with variants meeting these criteria, and the number of such variants per gene for each individual. The supervised ML algorithm used the patterns of variants observed across genes to determine which subset of genes best predicts whether an individual is affected. Seventy percent of the data was used to train the algorithm, and the remaining 30% (n = 1,526) was used to evaluate its performance. Gradient boosted trees with regularization (the eXtreme Gradient Boosting implementation) was the best-performing algorithm, yielding promising results (accuracy: 85.7%, specificity: 86.6%, sensitivity: 84.9%, area under the receiver operating characteristic curve: 0.95). The top 50 features (genes) of the algorithm were analyzed using bioinformatics resources for new insights into the pathophysiology of SCZ. This manuscript presents a novel predictor that could potentially enable studies exploring disease-modifying interventions in the early stages of the disease.
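The data preparation and evaluation steps described above can be illustrated with a minimal Python sketch. It is not the study's code: the record fields (`sample`, `gene`, `effect`, `maf`, `gq`), the effect labels, the toy gene names, and the choice of a class-stratified split are all assumptions made for illustration. The classifier itself (gradient boosted trees) is not shown; the sketch covers only the variant filter, the per-gene count matrix, a 70/30 split, and the reported metric definitions.

```python
import random
from collections import Counter

# Hypothetical effect labels mirroring the variant classes named in the abstract.
FUNCTIONAL = {"missense", "frameshift", "stop gain", "stop loss",
              "intronic", "exonic splicing"}

def passes_filter(variant, max_maf=0.01, min_gq=90):
    """Keep rare (MAF <= 1%), well-genotyped (GQ >= 90), predicted functional variants."""
    return (variant["maf"] <= max_maf
            and variant["gq"] >= min_gq
            and variant["effect"] in FUNCTIONAL)

def gene_count_matrix(variants, sample_ids):
    """Return (genes, rows), where rows[i][j] counts qualifying variants
    in genes[j] for sample_ids[i]."""
    counts = {s: Counter() for s in sample_ids}
    for v in variants:
        if passes_filter(v):
            counts[v["sample"]][v["gene"]] += 1
    genes = sorted({g for c in counts.values() for g in c})
    rows = [[counts[s][g] for g in genes] for s in sample_ids]
    return genes, rows

def stratified_split(labels, train_frac=0.7, seed=0):
    """Split indices into train/test per class (stratification is an assumption)."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(len(idx) * train_frac)
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity with 1 = affected, 0 = unaffected."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Toy records (invented for illustration, not study data).
toy_variants = [
    {"sample": "s1", "gene": "GENE_A", "effect": "missense", "maf": 0.005, "gq": 99},
    {"sample": "s1", "gene": "GENE_A", "effect": "frameshift", "maf": 0.01, "gq": 90},
    {"sample": "s1", "gene": "GENE_B", "effect": "missense", "maf": 0.05, "gq": 99},    # too common
    {"sample": "s2", "gene": "GENE_B", "effect": "stop gain", "maf": 0.001, "gq": 95},
    {"sample": "s2", "gene": "GENE_A", "effect": "synonymous", "maf": 0.001, "gq": 99}, # not functional
    {"sample": "s2", "gene": "GENE_A", "effect": "missense", "maf": 0.001, "gq": 50},   # low GQ
]
toy_genes, toy_rows = gene_count_matrix(toy_variants, ["s1", "s2"])
```

Each row of the resulting matrix is one individual's feature vector of per-gene variant counts. A supervised classifier would be fit on the training indices and scored on the held-out indices with `classification_metrics`.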
Keywords: artificial intelligence; diagnostic; genomic; prediction; psychosis.
© 2018 Wiley Periodicals, Inc.