Efficient genome-wide TagSNP selection across populations via the linkage disequilibrium criterion

Lan Liu; Yonghui Wu; Stefano Lonardi; Tao Jiang

doi:10.1089/cmb.2007.0228

J Comput Biol. 2010 Jan;17(1):21-37. doi: 10.1089/cmb.2007.0228.

Authors

Lan Liu¹, Yonghui Wu, Stefano Lonardi, Tao Jiang

Affiliation

¹ Department of Computer Science and Engineering, University of California, Riverside, California, USA. l.liu@cs.ucr.edu

Abstract

In this article, we studied the tag single-nucleotide polymorphism (tagSNP) selection problem on multiple populations using the pairwise r² linkage disequilibrium criterion. We proposed a novel combinatorial optimization model for the tagSNP selection problem, called the minimum common tagSNP selection (MCTS) problem, and presented efficient solutions for MCTS. Our approach consists of three main steps: (i) partitioning the SNP markers into small disjoint components, (ii) applying data reduction rules to simplify the problem, and (iii) applying either a fast greedy algorithm or a Lagrangian relaxation algorithm to solve the remaining (general) MCTS. These algorithms also provide lower bounds on tagging (i.e., on the minimum number of tagSNPs needed). The lower bounds allow us to evaluate how far our solution is from the optimum. To the best of our knowledge, this is the first time that tagging lower bounds have been discussed in the literature. We assessed the performance of our algorithms on real HapMap data for genome-wide tagging. The experiments demonstrated that our algorithms run 3-4 orders of magnitude faster than existing single-population tagging programs such as FESTA and LD-Select, and than the multiple-population tagging method MultiPop-TagSelect. Our method also greatly reduced the number of tagSNPs required compared with LD-Select on a single population and MultiPop-TagSelect on multiple populations. Moreover, the numbers of tagSNPs selected by our algorithms are almost optimal, because they are very close to the corresponding lower bounds obtained by our method.
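
To make the problem statement concrete, the following is a minimal Python sketch of common tagSNP selection under the pairwise r² criterion. It is not the authors' implementation: it uses a generic greedy set-cover heuristic as a stand-in for step (iii) and omits the component partitioning, data reduction rules, Lagrangian relaxation, and lower bounds. The function names `r_squared` and `greedy_common_tags`, the 0/1 haplotype encoding, and the 0.8 threshold are assumptions made for illustration only.

```python
from itertools import combinations


def r_squared(hap_a, hap_b):
    """Pairwise r^2 between two biallelic SNPs, given 0/1 alleles per haplotype."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # monomorphic SNP: r^2 is undefined, treated here as no LD
    d = p_ab - p_a * p_b
    return d * d / denom


def greedy_common_tags(populations, threshold=0.8):
    """Greedy set cover over (population, SNP) pairs.

    populations: list of dicts mapping SNP name -> list of 0/1 alleles.
    Returns one set of tagSNPs such that, in every population, each SNP is
    either selected or has r^2 >= threshold with a selected SNP.
    """
    # covers[s] = set of (population index, SNP) pairs that s would tag
    covers = {}
    for p, haps in enumerate(populations):
        for s in haps:
            covers.setdefault(s, set()).add((p, s))  # a SNP tags itself
        for s, t in combinations(haps, 2):
            if r_squared(haps[s], haps[t]) >= threshold:
                covers[s].add((p, t))
                covers[t].add((p, s))

    uncovered = set().union(*covers.values())
    tags = []
    while uncovered:
        # Pick the SNP tagging the most still-uncovered pairs (ties: by name)
        best = max(sorted(covers), key=lambda s: len(covers[s] & uncovered))
        tags.append(best)
        uncovered -= covers[best]
    return tags


# Toy data (hypothetical): A, B, C are in perfect LD in population 0,
# but C is independent of A and B in population 1.
pop0 = {"A": [0, 0, 1, 1], "B": [0, 0, 1, 1], "C": [0, 0, 1, 1]}
pop1 = {"A": [0, 0, 1, 1], "B": [0, 0, 1, 1], "C": [0, 1, 0, 1]}
print(greedy_common_tags([pop0, pop1]))  # ['A', 'C']
```

On this toy input, tagging population 0 alone would need one tagSNP, but a common set valid in both populations needs two, which is the kind of cross-population constraint the MCTS model captures.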

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Genome, Human
Humans
Linkage Disequilibrium*
Models, Genetic
Polymorphism, Single Nucleotide*
