An expanded sequence context model broadly explains variability in polymorphism levels across the human genome

Nat Genet. 2016 Apr;48(4):349-55. doi: 10.1038/ng.3511. Epub 2016 Feb 15.


The rate of single-nucleotide polymorphism varies substantially across the human genome and fundamentally influences evolution and the incidence of genetic disease. Previous studies have considered only the immediately flanking nucleotides around a polymorphic site (the site's trinucleotide sequence context) to study polymorphism levels across the genome. Moreover, the impact of larger sequence contexts has not been fully clarified, even though context substantially influences rates of polymorphism. Using a new statistical framework and data from the 1000 Genomes Project, we demonstrate that a heptanucleotide context explains >81% of variability in substitution probabilities, highlighting new mutation-promoting motifs at ApT dinucleotide, CAAT and TACG sequences. Our approach also identifies previously undocumented variability in C-to-T substitutions at CpG sites, which is not immediately explained by differential methylation intensity. Using our model, we present informative substitution intolerance scores for genes and a new intolerance score for amino acids, and we demonstrate clinical use of the model in neuropsychiatric diseases.
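
To make the notion of a heptanucleotide context concrete, the following is a minimal sketch in Python. It is not the statistical framework described in the abstract. It only shows how a 7-mer centered on a site (three flanking bases on each side) might be extracted, and how a naive per-context polymorphic fraction could be tallied. The function names `context_at` and `polymorphic_fraction_by_context`, and the toy sequence, are hypothetical and chosen for illustration.

```python
from collections import Counter


def context_at(seq, pos, k=7):
    """Return the k-mer centered on seq[pos], or None near the sequence ends.

    k=3 gives the trinucleotide context, k=7 the heptanucleotide context.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so the site sits in the center")
    flank = k // 2
    if pos - flank < 0 or pos + flank >= len(seq):
        return None  # not enough flanking sequence on one side
    return seq[pos - flank : pos + flank + 1].upper()


def polymorphic_fraction_by_context(seq, polymorphic_positions, k=7):
    """Naive estimate: share of sites in each k-mer context that are polymorphic.

    Assumption for illustration only: simple counting, with no strand
    folding, no per-substitution-class breakdown, and no model fitting.
    """
    poly = set(polymorphic_positions)
    total = Counter()  # how often each context occurs
    hits = Counter()   # how often a site in that context is polymorphic
    for pos in range(len(seq)):
        ctx = context_at(seq, pos, k)
        if ctx is None:
            continue
        total[ctx] += 1
        if pos in poly:
            hits[ctx] += 1
    return {ctx: hits[ctx] / n for ctx, n in total.items()}


if __name__ == "__main__":
    toy = "ACGTACGTACG"  # made-up sequence, not real data
    print(context_at(toy, 3, k=3))  # trinucleotide context of position 3
    print(context_at(toy, 3, k=7))  # heptanucleotide context of position 3
    print(polymorphic_fraction_by_context(toy, [3]))
```

Widening `k` from 3 to 7 raises the number of possible contexts from 64 to 16,384, which is why a larger context can capture motif-specific effects that a trinucleotide model averages away, at the cost of needing far more data per context.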

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Autistic Disorder / genetics
  • Base Sequence
  • Bayes Theorem
  • CpG Islands
  • DNA Methylation
  • DNA, Intergenic / genetics
  • Genome, Human
  • Humans
  • Models, Genetic*
  • Mutation
  • Polymorphism, Single Nucleotide*
  • Regression Analysis
  • Sequence Analysis, DNA
