, 71 (5), 1129-37

Haplotype Inference in Random Population Samples



Shin Lin et al. Am J Hum Genet.


Contemporary genotyping and sequencing methods do not provide information on linkage phase in diploid organisms. Statistical methods that infer and reconstruct linkage phase in samples of diploid sequences can potentially save time and labor. The Stephens-Smith-Donnelly (SSD) algorithm is one such method; it incorporates concepts from population genetics theory in a Markov chain Monte Carlo technique. We applied a modified SSD method, as well as the expectation-maximization and partition-ligation algorithms, to sequence data from eight loci spanning >1 Mb on the human X chromosome. We demonstrate that the modified SSD method is more accurate than the other algorithms and can process a larger number of sites. We also find phase reconstructions by the modified SSD method to be highly accurate over regions with high linkage disequilibrium (LD). If only polymorphisms with a minor allele frequency >0.2 are analyzed and scored according to the fraction of neighbor relations correctly called, reconstructions are 95.2% accurate over entire 100-kb stretches and are 98.6% accurate within blocks of high LD.
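To give a sense of what expectation-maximization haplotype inference involves, here is a minimal sketch of the textbook two-locus case. It is not the implementation used in the study, and the function name `em_two_locus`, the 0/1/2 genotype coding, and the toy sample are all assumptions made for illustration. With two biallelic sites, only double heterozygotes are phase-ambiguous, so the E-step needs a single weight.

```python
def _alleles(g):
    """Split a genotype coded as the count of allele 1 into two alleles."""
    return {0: (0, 0), 1: (0, 1), 2: (1, 1)}[g]


def em_two_locus(genotypes, max_iter=200, tol=1e-12):
    """Estimate two-locus haplotype frequencies by EM (illustrative sketch).

    genotypes: list of (g1, g2), each the count (0, 1, or 2) of allele 1.
    Returns a dict mapping haplotype (a, b) to its estimated frequency.
    """
    if not genotypes:
        raise ValueError("need at least one genotype")
    base = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    n_double_het = 0
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1  # phase unknown: 00/11 (cis) or 01/10 (trans)
            continue
        a, b = _alleles(g1), _alleles(g2)
        base[(a[0], b[0])] += 1  # phase is unambiguous here
        base[(a[1], b[1])] += 1
    total = 2 * len(genotypes)
    freq = {h: 0.25 for h in base}
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is in cis phase
        cis = freq[(0, 0)] * freq[(1, 1)]
        trans = freq[(0, 1)] * freq[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: expected haplotype counts, renormalized to frequencies
        counts = dict(base)
        counts[(0, 0)] += n_double_het * w
        counts[(1, 1)] += n_double_het * w
        counts[(0, 1)] += n_double_het * (1 - w)
        counts[(1, 0)] += n_double_het * (1 - w)
        new = {h: c / total for h, c in counts.items()}
        converged = max(abs(new[h] - freq[h]) for h in new) < tol
        freq = new
        if converged:
            break
    return freq


# Toy sample (invented): homozygotes reveal haplotypes 00 and 11 only
sample = [(0, 0)] * 3 + [(2, 2)] * 3 + [(1, 1)] * 4
estimates = em_two_locus(sample)
```

In this toy sample the homozygotes carry only the 00 and 11 haplotypes, so the estimates move toward resolving the double heterozygotes in cis phase. With many sites the number of candidate haplotypes grows quickly, which is one reason the number of sites that can be processed is a meaningful point of comparison between methods.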


Figure 1
Accuracy within |D′| blocks versus |D′| threshold. Accuracy averages of single runs over the eight X-linked loci were treated as independent samples. The means and standard error of the means of 100 such samples are plotted against the |D′| threshold.
Figure 2
Patterns of LD (a) and pairwise accuracies (b) across eight concatenated X-linked loci. a, Plot of LD, constructed in a fashion similar to that of Jeffreys et al. (2001). |D′|, represented in the upper right of the plot, and inverse of the P value, on the lower left, were calculated from 40 haploid sequences in which the minor allele frequency of the sites was >0.20. The P value is from a Fisher’s exact test (Sokal and Rohlf 1981). Each region, colored according to the legend, is plotted as a rectangle centered on each SNP (represented below and to the right of the plots) and extends halfway to each adjacent marker. The region displayed represents the eight X-linked loci, concatenated in the order that they appear on the X chromosome, with white lines demarcating the interloci. b, Corresponding plot of pairwise accuracies, derived from application of the SSD-based method on 100 random pairings of the 40 concatenated X-linked sequences.
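The |D′| statistic plotted in panel a can be computed directly from phased haploid sequences. The sketch below uses the standard definition of D′ for two biallelic sites. The function name `abs_d_prime` and the toy sequences are assumptions for illustration, not the authors' code or data, and the Fisher's exact test P value is not computed here.

```python
def abs_d_prime(haplotypes, i, j):
    """Return |D'| between sites i and j of phased haploid sequences.

    Assumes both sites are biallelic; the reference allele at each site
    is taken from the first sequence (an arbitrary choice that does not
    change |D'|).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    a, b = haplotypes[0][i], haplotypes[0][j]
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    if p_a == 1.0 or p_b == 1.0:
        raise ValueError("both sites must be polymorphic")
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max


# Toy sequences (invented): only two of the four possible haplotypes occur
complete_ld = abs_d_prime(["AC", "AC", "GT", "GT"], 0, 1)
# All four haplotypes occur equally often, so the sites are independent
no_ld = abs_d_prime(["AC", "AT", "GC", "GT"], 0, 1)
```

Applying the same function to every pair of sites that passes the minor allele frequency filter would give the matrix of values shown in the upper right of such a plot.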
Figure 3
Illustration of the rationale behind breaking sequences into blocks to allow for more accurate phase reconstruction. Suppose a list of haplotypes is known from, say, homozygotes. In reconstruction of the haplotypes of the ambiguous individual shown, phasing the whole sequence with the original SSD program will give uncertain calls for the third and fourth positions. However, the phase relationship between the third and fourth positions is clear. By implementation of the method by which final phase relations are called as described in appendix A, the aforementioned phase relationship can be recovered. Incidentally, multiple recombination events presumably occurred between the second and third positions, and the corresponding phase relationship is likely to be impossible to ascertain by statistical methods.
Figure 4
Illustration of how the SSD-based program and LD block information can be used. Genotype i is one member of the sample of genotypes to be input into the SSD-based program. Of course, in a genotyping experiment, the haplotypes of which it is composed will not be known a priori. Nucleotide positions that, when paired, give segregating sites, are marked with colors for each haplotype. The SSD-based program yields phase reconstructions, the one corresponding to individual i labeled as i. Boundaries between blocks of high LD indicate phase relationships that are less likely to be correct. In this example, we see that, indeed, misphasing occurred between segregating sites 8 and 9 (demarcated by an asterisk [*]), which happen to straddle the boundary between blocks 3 and 4. The switch accuracy for this reconstruction is 13/14. This example was culled from a simulation on the X-linked GLRA2 locus in 40 male subjects, performed with q>0.2. Single-letter representations of heterozygous sites are as follows: Y = C/T, S = G/C, and R = G/A.
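The switch accuracy quoted in this caption is the fraction of neighbor relations between adjacent heterozygous sites that are phased correctly. A minimal sketch of that score follows. The function name `switch_accuracy` and the toy sequences are assumptions for illustration; the sequences are invented to mimic a single misphasing between segregating sites 8 and 9 out of 15 and are not the GLRA2 data.

```python
from fractions import Fraction


def switch_accuracy(true_pair, inferred_pair):
    """Fraction of adjacent heterozygous-site relations phased correctly."""
    t1, t2 = true_pair
    g1, g2 = inferred_pair
    if not (len(t1) == len(t2) == len(g1) == len(g2)):
        raise ValueError("all haplotypes must have the same length")
    orientation = []
    for a, b, x, y in zip(t1, t2, g1, g2):
        if sorted((a, b)) != sorted((x, y)):
            raise ValueError("inferred pair is inconsistent with the genotype")
        if a != b:
            # True if the first inferred haplotype matches the first true one
            orientation.append(x == a)
    if len(orientation) < 2:
        raise ValueError("need at least two heterozygous sites")
    # A neighbor relation is correct when adjacent sites share an orientation
    correct = sum(u == v for u, v in zip(orientation, orientation[1:]))
    return Fraction(correct, len(orientation) - 1)


# 15 heterozygous sites with one switch between sites 8 and 9
true_pair = ("A" * 15, "G" * 15)
inferred_pair = ("A" * 8 + "G" * 7, "G" * 8 + "A" * 7)
print(switch_accuracy(true_pair, inferred_pair))  # 13/14
```

Because the score compares only adjacent sites, one switch costs exactly one of the 14 relations even though every site downstream of it ends up on the opposite haplotype.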
