M(3): an improved SNP calling algorithm for Illumina BeadArray data

Bioinformatics. 2012 Feb 1;28(3):358-65. doi: 10.1093/bioinformatics/btr673. Epub 2011 Dec 8.


Summary: Genotype calling from high-throughput platforms such as Illumina and Affymetrix is a critical step in data processing, because phenotype-genotype association studies depend on accurate information on genetic variants. A number of algorithms have been developed to infer genotypes from data generated through the Illumina BeadStation platform, including GenCall, GenoSNP, Illuminus and CRLMM. Most of these algorithms, such as GenCall with the GenTrain clustering algorithm, are built on population-based statistical models that genotype each single nucleotide polymorphism (SNP) in turn, and they require a large reference population to perform well. These approaches may not work well for rare variants, where only a small proportion of individuals carry the variant. A fundamentally different approach, implemented in GenoSNP, adopts a SNP-based model to infer genotypes of all the SNPs in one individual, making it an appealing alternative for calling rare variants. However, compared with the population-based strategies, more SNPs called by GenoSNP may fail the Hardy-Weinberg Equilibrium test. To take advantage of both strategies, we propose a two-stage SNP calling procedure, named the modified mixture model (M(3)), to improve call accuracy for both common and rare variants. The effectiveness of our approach is demonstrated through applications to genotype calling on a set of HapMap samples used for quality control purposes in a large case-control study of cocaine dependence. Depending on the model, the increase in power with M(3) is greater for rare variants than for common variants.
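
For readers unfamiliar with the Hardy-Weinberg Equilibrium test mentioned in the summary, the following is a minimal illustrative sketch of the textbook one-degree-of-freedom chi-square version for a single biallelic SNP. It is not taken from the M(3) software, and the paper may rely on a different variant of the test (for example an exact test). The function name `hwe_chi_square` and the genotype counts in the usage note are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg Equilibrium for one biallelic SNP.

    Illustrative sketch only (hypothetical helper, not part of M(3)).
    Arguments are the observed counts of the AA, AB and BB genotypes.
    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype call is required")

    # Estimate allele frequencies from the observed genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    # Genotype counts expected under HWE: p^2, 2pq, q^2 times sample size.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)

    # Skip cells with zero expectation (monomorphic SNPs) to avoid 0/0.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

As a usage illustration, `hwe_chi_square(25, 50, 25)` describes a SNP whose genotype counts match HWE proportions exactly, so the statistic is 0 and the p-value is 1. A SNP with no heterozygotes, such as `hwe_chi_square(50, 0, 50)`, departs strongly from equilibrium and would be flagged by any conventional threshold.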

Availability: M(3) algorithm: http://bioinformatics.med.yale.edu/group.

Contact: name@bio.com; hongyu.zhao@yale.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms*
  • Case-Control Studies
  • Cluster Analysis
  • Computational Biology / methods*
  • Genotype
  • HapMap Project
  • Humans
  • Models, Genetic
  • Oligonucleotide Array Sequence Analysis*
  • Polymorphism, Single Nucleotide*