A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase
- PMID: 16532393
- PMCID: PMC1424677
- DOI: 10.1086/502802
Abstract
We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both "block-like" patterns of linkage disequilibrium (LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets (e.g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate as, or more accurate than, existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods (e.g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.
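The idea of cluster memberships that change along the chromosome under a hidden Markov model can be illustrated with a small sketch. This is not the fastPHASE implementation: it is a minimal forward-backward pass for a single haplotype under an assumed parameterization, in which `theta` (per-cluster allele frequencies), `alpha` (per-marker cluster weights), and `jump` (the probability of re-drawing the cluster between adjacent markers) are hypothetical names chosen for illustration and are taken as given rather than estimated. The function name `impute_allele_probs` is likewise an assumption.

```python
from typing import List, Optional


def impute_allele_probs(
    hap: List[Optional[int]],
    theta: List[List[float]],
    alpha: List[List[float]],
    jump: List[float],
) -> List[float]:
    """Probability of allele 1 at each marker of one haplotype.

    hap[m]      observed allele (0 or 1) at marker m, or None if missing
    theta[m][k] assumed frequency of allele 1 in cluster k at marker m,
                strictly between 0 and 1
    alpha[m][k] assumed weight of cluster k at marker m (each row sums to 1)
    jump[m]     assumed probability that the cluster is re-drawn from
                alpha[m] between markers m-1 and m (jump[0] is unused)
    """
    n_markers = len(hap)
    n_clusters = len(theta[0])

    def emit(m: int, k: int) -> float:
        # A missing allele carries no information about the cluster.
        if hap[m] is None:
            return 1.0
        return theta[m][k] if hap[m] == 1 else 1.0 - theta[m][k]

    def normalise(values: List[float]) -> List[float]:
        total = sum(values)
        return [v / total for v in values]

    # Forward pass, rescaled at every marker to avoid underflow.
    # Because the previous vector sums to 1, the transition step reduces to
    # (1 - jump) * prev[k] + jump * alpha[m][k].
    fwd = [normalise([alpha[0][k] * emit(0, k) for k in range(n_clusters)])]
    for m in range(1, n_markers):
        prev = fwd[-1]
        fwd.append(normalise([
            ((1.0 - jump[m]) * prev[k] + jump[m] * alpha[m][k]) * emit(m, k)
            for k in range(n_clusters)
        ]))

    # Backward pass, also rescaled.
    bwd = [[1.0] * n_clusters for _ in range(n_markers)]
    for m in range(n_markers - 2, -1, -1):
        nxt = [bwd[m + 1][k] * emit(m + 1, k) for k in range(n_clusters)]
        mix = sum(alpha[m + 1][k] * nxt[k] for k in range(n_clusters))
        bwd[m] = normalise([
            (1.0 - jump[m + 1]) * nxt[k] + jump[m + 1] * mix
            for k in range(n_clusters)
        ])

    probs = []
    for m in range(n_markers):
        if hap[m] is not None:
            probs.append(float(hap[m]))  # observed alleles are kept as-is
            continue
        # Posterior over clusters at the missing marker, then average theta.
        post = normalise([fwd[m][k] * bwd[m][k] for k in range(n_clusters)])
        probs.append(sum(post[k] * theta[m][k] for k in range(n_clusters)))
    return probs


# Toy example: cluster 0 is mostly allele 0, cluster 1 is mostly allele 1.
theta = [[0.01, 0.99]] * 3
alpha = [[0.5, 0.5]] * 3
jump = [0.0, 0.05, 0.05]
print(impute_allele_probs([1, None, 1], theta, alpha, jump))
```

In the toy example the flanking markers both carry allele 1, so the imputed probability at the missing middle marker comes out close to 1. With small `jump` values the cluster rarely changes, giving block-like behaviour; larger values let the dependence between markers decay more quickly. A real application would also need to estimate the parameters (the reference list of the article includes the EM algorithm) and to handle unphased diploid genotypes, both of which this sketch omits.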
