SHARE: an adaptive algorithm to select the most informative set of SNPs for candidate genetic association

Biostatistics. 2009 Oct;10(4):680-93. doi: 10.1093/biostatistics/kxp023. Epub 2009 Jul 15.


Association studies have been widely used to identify genetic liability variants for complex diseases. Scanning a chromosomal region one single nucleotide polymorphism (SNP) at a time may not fully exploit linkage disequilibrium, whereas haplotype analyses tend to require a fairly large number of parameters and can therefore lose power. Clustering algorithms, such as the cladistic approach, have been proposed to reduce the dimensionality, yet they have important limitations. We propose a SNP-Haplotype Adaptive REgression (SHARE) algorithm that seeks the most informative set of SNPs for genetic association in a targeted candidate region by growing and shrinking haplotypes, adding or removing one SNP at a time in a stepwise fashion, and comparing the prediction errors of the resulting models via cross-validation. Depending on the evolutionary history of the disease mutations and the markers, this set may contain a single SNP or several SNPs that lay a foundation for haplotype analyses. Haplotype phase ambiguity is accounted for by treating haplotype reconstruction as part of the learning procedure. Simulations and a data application show that our method has improved power over existing methodologies and that the results are informative in the search for disease-causal loci.
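
The stepwise search described in the abstract can be illustrated with a rough Python sketch. It is not the authors' implementation and rests on simplifying assumptions: haplotype phase is taken as known (SHARE instead treats phase reconstruction as part of the learning procedure), the haplotype regression model is replaced by a shrunken per-diplotype mean of a binary phenotype, and squared error stands in for the prediction error. The names `diplotype`, `cv_error`, and `stepwise_select`, the data layout, and the simulated data are all hypothetical.

```python
import random
from collections import defaultdict


def diplotype(person, snps):
    """Unordered pair of the person's two haplotypes, restricted to `snps`."""
    h1, h2 = person["haps"]
    return tuple(sorted((tuple(h1[i] for i in snps), tuple(h2[i] for i in snps))))


def cv_error(data, snps, k=5, prior=1.0):
    """Mean squared prediction error of one SNP set under k-fold cross-validation."""
    total = 0.0
    for fold in range(k):
        train = [p for j, p in enumerate(data) if j % k != fold]
        test = [p for j, p in enumerate(data) if j % k == fold]
        base = sum(p["y"] for p in train) / len(train)
        sums, counts = defaultdict(float), defaultdict(int)
        for p in train:
            key = diplotype(p, snps)
            sums[key] += p["y"]
            counts[key] += 1
        for p in test:
            key = diplotype(p, snps)
            # Shrink each diplotype's mean toward the overall training mean;
            # diplotypes unseen in training fall back to that mean.
            pred = (sums[key] + prior * base) / (counts[key] + prior)
            total += (p["y"] - pred) ** 2
    return total / len(data)


def stepwise_select(data, n_snps, k=5):
    """Grow or shrink the SNP set by one SNP while the CV error keeps dropping."""
    current, best = [], cv_error(data, [], k)
    while True:
        # Candidate moves: add one unused SNP, or remove one selected SNP.
        moves = [sorted(current + [s]) for s in range(n_snps) if s not in current]
        moves += [[t for t in current if t != s] for s in current]
        if not moves:
            return current, best
        err, cand = min((cv_error(data, m, k), m) for m in moves)
        if err >= best:
            return current, best
        current, best = cand, err


# Simulated, phased toy data (an assumption for illustration only):
# five SNPs, with the phenotype determined by carrying allele 1 at SNP index 2.
rng = random.Random(0)
data = []
for _ in range(200):
    haps = tuple(tuple(rng.randint(0, 1) for _ in range(5)) for _ in range(2))
    data.append({"haps": haps, "y": int(haps[0][2] or haps[1][2])})

selected, err = stepwise_select(data, n_snps=5)
print(selected, round(err, 4))
```

In this toy setting the search should settle on the single causal SNP, because adding further SNPs only splits the diplotype groups and raises the cross-validated error. A faithful version would fit a haplotype regression with phase uncertainty inside each cross-validation fold.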

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Aged
  • Algorithms*
  • Biostatistics / methods
  • Female
  • Genome-Wide Association Study / statistics & numerical data*
  • Haplotypes
  • Humans
  • Linkage Disequilibrium
  • Lipoproteins / genetics
  • Middle Aged
  • Models, Statistical
  • Polymorphism, Single Nucleotide*
  • Regression Analysis
  • Venous Thrombosis / genetics


Substances

  • Lipoproteins
  • lipoprotein-associated coagulation inhibitor