Population identification using genetic data

Annu Rev Genomics Hum Genet. 2012;13:337-61. doi: 10.1146/annurev-genom-082410-101510. Epub 2012 Jun 11.


A large number of algorithms have been developed to classify individuals into discrete populations using genetic data. Recent results show that the information used by both model-based clustering methods and principal components analysis can be summarized by a matrix of pairwise similarity measures between individuals. Similarity matrices have been constructed in a number of ways, usually treating markers as independent but differing in the weighting given to polymorphisms of different frequencies. Additionally, methods are now being developed that take linkage into account. We review several such matrices and evaluate their information content. A two-stage approach for population identification is to first construct a similarity matrix and then perform clustering. We review a range of common clustering algorithms and evaluate their performance through a simulation study. The clustering step can be performed either directly on the matrix or after first applying a dimension-reduction technique; we find that the latter approach substantially improves the performance of most algorithms. Based on these results, we describe the population structure signal contained in each similarity matrix and find that accounting for linkage leads to significant improvements for sequence data. We also perform a comparison on real data, where we find that population genetics models outperform generic clustering approaches, particularly in their robustness to features such as relatedness between individuals.
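The two-stage approach described in the abstract (build a similarity matrix, reduce its dimension, then cluster) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the review: the toy two-population simulator, the frequency-standardised covariance matrix (one common weighting among several), the use of K-1 leading eigenvectors, and the small k-means routine. All function names are hypothetical, and linkage between markers is ignored.

```python
import numpy as np


def simulate_genotypes(n_per_pop, n_markers, rng, drift=0.1):
    """Hypothetical toy simulator: two populations drifting from one ancestor."""
    ancestral = rng.uniform(0.1, 0.9, size=n_markers)
    blocks = []
    for _ in range(2):
        # Balding-Nichols style draw of population allele frequencies.
        a = ancestral * (1 - drift) / drift
        b = (1 - ancestral) * (1 - drift) / drift
        freq = rng.beta(a, b)
        # Genotypes are counts (0, 1, 2) of one allele at independent markers.
        blocks.append(rng.binomial(2, freq, size=(n_per_pop, n_markers)))
    return np.vstack(blocks), np.repeat([0, 1], n_per_pop)


def similarity_matrix(geno):
    """Stage 1: pairwise similarity with frequency-standardised markers."""
    p = geno.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)  # drop monomorphic markers to avoid dividing by 0
    g, p = geno[:, keep], p[keep]
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))
    return z @ z.T / z.shape[1]


def top_components(sim, n_components):
    """Dimension reduction: leading eigenvectors scaled by sqrt(eigenvalue)."""
    vals, vecs = np.linalg.eigh(sim)  # eigenvalues come back in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))


def kmeans(x, k, rng, n_iter=50):
    """Stage 2: plain Lloyd's k-means, standing in for any clustering method."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        new_centers = np.array([
            x[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign


rng = np.random.default_rng(0)
geno, labels = simulate_genotypes(n_per_pop=40, n_markers=2000, rng=rng)
sim = similarity_matrix(geno)
coords = top_components(sim, n_components=1)  # K - 1 components for K = 2
assign = kmeans(coords, k=2, rng=rng)

# Cluster labels are arbitrary, so score agreement up to label swapping.
match = np.mean(assign == labels)
agreement = max(match, 1 - match)
print(f"agreement with simulated populations: {agreement:.2f}")
```

Swapping `similarity_matrix` for a different weighting scheme, or `kmeans` for another clustering algorithm, leaves the rest of the pipeline unchanged, which is the sense in which the two stages are separable.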

Publication types

  • Review

MeSH terms

  • Algorithms*
  • Cluster Analysis
  • Computer Simulation
  • Genetic Linkage
  • Genetics, Population
  • Genome, Human
  • Humans
  • Models, Genetic*
  • Polymorphism, Genetic
  • Principal Component Analysis