Principal component analysis for selection of optimal SNP-sets that capture intragenic genetic variation

Genet Epidemiol. 2004 Jan;26(1):11-21. doi: 10.1002/gepi.10292.


Candidate gene association studies often analyze a single nucleotide polymorphism (SNP), and an initial report is typically not replicated by subsequent studies. The failure to replicate may result from incomplete or poor identification of disease-related variants or haplotypes, possibly due to naive SNP selection. We describe a method for identifying linkage disequilibrium (LD) groups and selecting SNPs that capture sufficient intragenic genetic diversity. We assume that all SNPs with minor allele frequency above a pre-determined threshold have been identified. Principal component analysis (PCA) is applied to evaluate multivariate SNP correlations, to infer groups of SNPs in LD (LD-groups), and to establish an optimal set of group-tagging SNPs (gtSNPs) that provides the most comprehensive coverage of intragenic diversity while minimizing the resources necessary to perform an informative association analysis. This PCA method differs from haplotype block (HB) and haplotype-tagging SNP (htSNP) methods in that an LD-group of SNPs need not be a contiguous DNA fragment. Results of the PCA method compared well with existing htSNP methods while also providing advantages over those methods, including an indication of the optimal number of SNPs needed. Further, evaluation of the method over multiple replicates of simulated data indicated that PCA is a robust method for SNP selection. Our findings suggest that PCA may be a powerful tool for establishing an optimal SNP set that maximizes the amount of genetic variation captured for a candidate gene using a minimal number of SNPs.
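
The general idea of grouping SNPs by PCA of their correlations and picking one tag per group can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the rotation, the retention rule, or the tag-selection criterion, so everything specific here is an assumption. The function names (simulate_genotypes, select_gtsnps), the 90% variance cutoff, the use of unrotated loadings, and the simulated data are all hypothetical choices made for illustration only.

```python
import numpy as np


def simulate_genotypes(n=500, seed=0):
    """Hypothetical toy data: 5 SNPs in 2 LD-groups that are NOT contiguous."""
    rng = np.random.default_rng(seed)
    layout = [0, 1, 0, 1, 0]  # LD-group label for each SNP column (assumed)
    # One underlying allele per group, drawn for two chromosomes per person.
    base = [rng.random((n, 2)) < 0.4 for _ in range(2)]
    cols = []
    for g in layout:
        # Each SNP copies its group's allele, with rare independent flips.
        flips = rng.random((n, 2)) < 0.02
        cols.append((base[g] ^ flips).sum(axis=1))  # genotype coded 0/1/2
    return np.column_stack(cols).astype(float)


def select_gtsnps(genotypes, var_explained=0.9):
    """Assign SNPs to components and pick one tagging SNP per group.

    Assumptions (not from the source): components are kept until they
    explain `var_explained` of the total variance, loadings are unrotated,
    and the tag for a group is the SNP with the largest absolute loading.
    """
    corr = np.corrcoef(genotypes, rowvar=False)  # SNP-by-SNP correlations
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]  # sort components, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, var_explained)) + 1

    loadings = np.abs(eigvecs[:, :k] * np.sqrt(eigvals[:k]))
    component_of = loadings.argmax(axis=1)

    groups = {}
    for snp, comp in enumerate(component_of):
        groups.setdefault(int(comp), []).append(snp)

    gtsnps = sorted(
        max(members, key=lambda s: loadings[s, comp])
        for comp, members in groups.items()
    )
    return groups, gtsnps


if __name__ == "__main__":
    groups, gtsnps = select_gtsnps(simulate_genotypes())
    print("LD-groups by component:", groups)
    print("group-tagging SNPs:", gtsnps)
```

In this toy setup the number of retained components doubles as a rough indication of how many tagging SNPs are needed, and the groups are recovered even though their member SNPs are interleaved along the sequence. A real analysis would need a deliberate choice of retention threshold and would likely benefit from rotating the loadings before assigning SNPs to groups.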

Publication types

  • Comparative Study
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Adult
  • Female
  • Gene Frequency
  • Genetic Predisposition to Disease / genetics
  • Genetic Variation*
  • Haplotypes
  • Humans
  • Linkage Disequilibrium
  • Male
  • Middle Aged
  • Polymorphism, Single Nucleotide / genetics*
  • Principal Component Analysis*