Machine learning in schizophrenia genomics, a case-control study using 5,090 exomes

Am J Med Genet B Neuropsychiatr Genet. 2019 Mar;180(2):103-112. doi: 10.1002/ajmg.b.32638. Epub 2018 Apr 28.


Our hypothesis is that machine learning (ML) analysis of whole exome sequencing (WES) data can be used to identify individuals at high risk for schizophrenia (SCZ). This study applies ML to WES data from 2,545 individuals with SCZ and 2,545 unaffected individuals, accessed via the database of genotypes and phenotypes (dbGaP). Single nucleotide variants and small insertions and deletions were annotated by ANNOVAR using the reference genome hg19/GRCh37. Rare (predicted functional) variants with a minor allele frequency ≤1% and genotype quality ≥90, including missense, frameshift, stop gain, stop loss, intronic, and exonic splicing variants, were selected. A file containing all cases and controls, the names of genes with variants meeting our criteria, and the number of variants per gene for each individual was used for ML analysis. The supervised ML algorithm used the patterns of variants observed in the different genes to determine which subset of genes can best predict that an individual is affected. Seventy percent of the data was used to train the algorithm and the remaining 30% of the data (n = 1,526) was used to evaluate its performance. Gradient boosted trees with regularization (the eXtreme Gradient Boosting implementation) was the best-performing supervised ML algorithm, yielding promising results (accuracy: 85.7%, specificity: 86.6%, sensitivity: 84.9%, area under the receiver operating characteristic curve: 0.95). The top 50 features (genes) of the algorithm were analyzed using bioinformatics resources for new insights about the pathophysiology of SCZ. This manuscript presents a novel predictor which could potentially enable studies exploring disease-modifying intervention in the early stages of the disease.
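
The data layout and evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`gene_count_matrix`, `split_70_30`, `binary_metrics`), the record format, and the random seed are all assumptions made for illustration, and the classifier itself (gradient boosted trees) is deliberately left out. The sketch only shows how a per-individual, per-gene variant count table might be built, how a 70/30 split might be drawn, and how accuracy, sensitivity, and specificity follow from predicted and true labels.

```python
import random
from collections import Counter


def gene_count_matrix(variant_records, individuals):
    """Build a per-individual, per-gene table of qualifying variant counts.

    variant_records: iterable of (individual_id, gene_name) pairs, one per
    variant that passed filtering (hypothetical input format).
    Returns (genes, rows), where rows[i][j] is the number of variants that
    individuals[i] carries in genes[j].
    """
    records = list(variant_records)
    genes = sorted({gene for _, gene in records})
    counts = Counter(records)
    rows = [[counts[(ind, gene)] for gene in genes] for ind in individuals]
    return genes, rows


def split_70_30(n, seed=0):
    """Shuffle indices 0..n-1 and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # the seed is an arbitrary assumption
    cut = round(0.7 * n)
    return indices[:cut], indices[cut:]


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for 0/1 labels (1 = affected)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }
```

In such a setup, the rows selected by the training indices would be passed to a classifier, and `binary_metrics` would be applied to its predictions on the held-out rows.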

Keywords: artificial intelligence; diagnostic; genomic; prediction; psychosis.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Alleles
  • Case-Control Studies
  • Computational Biology / methods*
  • Exome / genetics
  • Gene Frequency / genetics
  • Genomics
  • Genotype
  • Humans
  • INDEL Mutation / genetics
  • Machine Learning
  • Polymorphism, Single Nucleotide / genetics
  • ROC Curve
  • Schizophrenia / etiology
  • Schizophrenia / genetics*
  • Schizophrenic Psychology
  • Sensitivity and Specificity
  • Sequence Analysis, DNA / methods*
  • Whole Genome Sequencing / methods