Population identification using genetic data

Daniel John Lawson; Daniel Falush

doi:10.1146/annurev-genom-082410-101510

Annu Rev Genomics Hum Genet. 2012:13:337-61. doi: 10.1146/annurev-genom-082410-101510. Epub 2012 Jun 11.

Authors

Daniel John Lawson¹, Daniel Falush

Affiliation

¹ Heilbronn Institute for Mathematical Research, School of Mathematics, University of Bristol, Bristol BS8 1TW, UK. dan.lawson@bristol.ac.uk

PMID: 22703172

Abstract

Many algorithms have been developed to classify individuals into discrete populations using genetic data. Recent results show that the information used by both model-based clustering methods and principal components analysis can be summarized by a matrix of pairwise similarity measures between individuals. Similarity matrices have been constructed in several ways, usually treating markers as independent but differing in the weight given to polymorphisms of different frequencies. Methods that take linkage into account are also now being developed. We review several such matrices and evaluate their information content. A two-stage approach to population identification first constructs a similarity matrix and then performs clustering. We review a range of common clustering algorithms and evaluate their performance in a simulation study. The clustering step can be performed either directly on the matrix or after first applying a dimension-reduction technique; we find that the latter approach substantially improves the performance of most algorithms. Based on these results, we describe the population structure signal contained in each similarity matrix and find that accounting for linkage leads to significant improvements for sequence data. We also perform a comparison on real data, where we find that population genetics models outperform generic clustering approaches, particularly in robustness to features such as relatedness between individuals.
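
The two-stage approach described in the abstract (build a pairwise similarity matrix, reduce its dimension, then cluster) can be illustrated with a minimal sketch. This is not the authors' code or any method evaluated in the review. The toy simulation, the frequency-standardized similarity (one of several possible weightings, assuming unlinked markers), the use of K - 1 leading components as a heuristic, and the simple k-means routine are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np


def simulate_genotypes(n_per_pop=20, n_snps=500, seed=0):
    """Toy data (assumption): two populations with shifted allele frequencies."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.2, 0.8, size=n_snps)
    shift = rng.choice([-0.15, 0.15], size=n_snps)
    freqs = np.stack([base - shift, base + shift])  # stays within [0.05, 0.95]
    labels = np.repeat([0, 1], n_per_pop)
    # Genotype = number of copies (0, 1, 2) of one allele at each SNP.
    genotypes = rng.binomial(2, freqs[labels])
    return genotypes, labels


def similarity_matrix(genotypes):
    """Stage 1: pairwise similarity, treating markers as independent.

    Each SNP is centered and scaled by its estimated binomial standard
    deviation, which is one possible frequency weighting among many.
    """
    g = genotypes.astype(float)
    p = g.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)  # drop monomorphic SNPs to avoid dividing by zero
    z = (g[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
    return z @ z.T / keep.sum()


def top_components(sim, n_components):
    """Dimension reduction: leading eigenvectors scaled by sqrt(eigenvalue)."""
    vals, vecs = np.linalg.eigh(sim)  # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))


def kmeans(points, k, n_iter=50):
    """Stage 2: plain k-means with deterministic farthest-first initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = points[assign == j].mean(axis=0)
    return assign


def identify_populations(genotypes, n_clusters):
    """Similarity matrix -> K - 1 leading components (heuristic) -> k-means."""
    sim = similarity_matrix(genotypes)
    coords = top_components(sim, n_clusters - 1)
    return kmeans(coords, n_clusters)


genotypes, labels = simulate_genotypes()
assignment = identify_populations(genotypes, n_clusters=2)
# Cluster labels are arbitrary, so score agreement up to a label swap.
agreement = max(np.mean(assignment == labels), np.mean(assignment != labels))
print(f"agreement with simulated populations: {agreement:.2f}")
```

Clustering the leading components rather than the raw matrix mirrors the second option the abstract mentions. Replacing `top_components(sim, n_clusters - 1)` with the rows of `sim` itself would give the direct-on-matrix variant for comparison on this toy data.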

Publication types

Review

MeSH terms

Algorithms*
Cluster Analysis
Computer Simulation
Genetic Linkage
Genetics, Population
Genome, Human
Humans
Models, Genetic*
Polymorphism, Genetic
Principal Component Analysis