Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets

Seyoon Ko; Benjamin B Chu; Daniel Peterson; Chidera Okenwa; Jeanette C Papp; David H Alexander; Eric M Sobel; Hua Zhou; Kenneth L Lange

doi:10.1016/j.ajhg.2022.12.008

Am J Hum Genet. 2023 Feb 2;110(2):314-325. doi: 10.1016/j.ajhg.2022.12.008. Epub 2023 Jan 6.

Authors

Seyoon Ko¹, Benjamin B Chu², Daniel Peterson³, Chidera Okenwa⁴, Jeanette C Papp⁵, David H Alexander⁶, Eric M Sobel⁷, Hua Zhou¹, Kenneth L Lange⁸

Affiliations

¹ Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095, USA.
² Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA.
³ Department of Mathematics, Brigham Young University, Provo, UT 84602, USA.
⁴ Department of Mathematics, University of California, Berkeley, Berkeley, CA 94720, USA.
⁵ Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA.
⁶ X Development LLC, Mountain View, CA 94043, USA.
⁷ Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA. Electronic address: esobel@ucla.edu.
⁸ Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095, USA.

Abstract

Admixture estimation plays a crucial role in ancestry inference and genome-wide association studies (GWASs). Computer programs such as ADMIXTURE and STRUCTURE are commonly employed to estimate the admixture proportions of sample individuals. However, these programs can be overwhelmed by the computational burdens imposed by the 10⁵ to 10⁶ samples and millions of markers commonly found in modern biobanks. An attractive strategy is to run these programs on a set of ancestry-informative SNP markers (AIMs) that exhibit substantially different frequencies across populations. Unfortunately, existing methods for identifying AIMs require knowing ancestry labels for a subset of the sample. This supervised learning approach creates a chicken-and-egg problem, because ancestry labels are themselves a product of the ancestry inference that the AIMs are meant to enable. In this paper, we present an unsupervised, scalable framework that seamlessly carries out AIM selection and likelihood-based estimation of admixture proportions. Our simulated and real data examples show that this approach is scalable to modern biobank datasets. OpenADMIXTURE, our Julia implementation of the method, is open source and freely available.

Keywords: AIM; OpenADMIXTURE; OpenMendel; SKFR; admixture; ancestry-informative marker; biobank scale; genetic ancestry; sparse K-means with feature ranking; sparse clustering.
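
The keywords name sparse K-means with feature ranking (SKFR) as the clustering idea behind unsupervised AIM selection. The following is a minimal Python sketch of that idea under stated assumptions, not the OpenADMIXTURE API or the authors' implementation: the function names, the toy genotype matrix, and the starting labels are all hypothetical. It standardizes each marker, alternates between computing cluster centers and ranking markers by how much they separate the clusters, and assigns individuals using only the top-ranked markers.

```python
from statistics import mean, pstdev


def standardize(genotypes):
    """Center and scale each marker (column) to mean 0 and variance 1."""
    columns = list(zip(*genotypes))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard monomorphic markers
    return [[(g - m) / s for g, m, s in zip(row, means, sds)]
            for row in genotypes]


def sparse_kmeans(genotypes, k, n_keep, init_labels, max_iter=50):
    """Toy sparse K-means: cluster on the n_keep highest-ranked markers.

    Hypothetical helper for illustration only. Returns the final cluster
    labels and the sorted indices of the selected markers.
    """
    x = standardize(genotypes)
    n_markers = len(x[0])
    labels = list(init_labels)
    for _ in range(max_iter):
        # Cluster centers over all markers, given the current labels.
        members = [[row for row, lab in zip(x, labels) if lab == c]
                   for c in range(k)]
        centers = [[mean(col) for col in zip(*rows)] if rows
                   else [0.0] * n_markers
                   for rows in members]
        # Rank markers: after standardization, a marker whose centers sit
        # far from zero differs strongly in frequency between clusters.
        scores = [sum(len(members[c]) * centers[c][j] ** 2 for c in range(k))
                  for j in range(n_markers)]
        keep = sorted(range(n_markers), key=lambda j: -scores[j])[:n_keep]
        # Reassign each individual using only the selected markers.
        new_labels = [
            min(range(k),
                key=lambda c: sum((row[j] - centers[c][j]) ** 2 for j in keep))
            for row in x
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, sorted(keep)


# Hypothetical toy data: 8 individuals x 5 markers, genotypes coded 0/1/2.
# Markers 0 and 3 differ in frequency between the first and last four
# individuals; markers 1, 2, and 4 do not.
toy_genotypes = [
    [2, 0, 1, 0, 2],
    [2, 1, 0, 0, 1],
    [2, 2, 1, 1, 0],
    [1, 1, 2, 0, 1],
    [0, 1, 2, 2, 1],
    [0, 2, 1, 1, 0],
    [1, 0, 0, 2, 1],
    [0, 1, 1, 2, 2],
]
# Deliberately imperfect starting labels (individual 3 starts in cluster 1).
labels, aims = sparse_kmeans(toy_genotypes, k=2, n_keep=2,
                             init_labels=[0, 0, 0, 1, 1, 1, 1, 1])
print(labels, aims)
```

On this toy input the sketch selects markers 0 and 3 and moves the mislabeled individual into the first cluster. In the framework the abstract describes, the selected markers would then be passed to likelihood-based admixture estimation, which this sketch does not attempt.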

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Biological Specimen Banks*
Genetics, Population
Genome-Wide Association Study* / methods
Humans
Likelihood Functions
Population Groups
Software
