Fast two-stage phasing of large-scale sequence data

Brian L Browning; Xiaowen Tian; Ying Zhou; Sharon R Browning

doi:10.1016/j.ajhg.2021.08.005

Am J Hum Genet. 2021 Oct 7;108(10):1880-1890. doi: 10.1016/j.ajhg.2021.08.005. Epub 2021 Sep 2.

Authors

Brian L Browning¹, Xiaowen Tian², Ying Zhou³, Sharon R Browning⁴

Affiliations

¹ Department of Medicine, Division of Medical Genetics, University of Washington, Seattle, WA 98195, USA; Department of Biostatistics, University of Washington, Seattle, WA 98195, USA. Electronic address: browning@uw.edu.
² Statistical Innovation, Oncology Biometrics, AstraZeneca, Gaithersburg, MD 20878, USA.
³ Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA.
⁴ Department of Biostatistics, University of Washington, Seattle, WA 98195, USA.

Abstract

Haplotype phasing is the estimation of haplotypes from genotype data. We present a fast, accurate, and memory-efficient haplotype phasing method that scales to large-scale SNP array and sequence data. The method uses marker windowing and composite reference haplotypes to reduce memory usage and computation time. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations. For data with many low-frequency variants, such as whole-genome sequence data, the method employs a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage. This haplotype phasing method is implemented in the open-source Beagle 5.2 software package. We compare Beagle 5.2 and SHAPEIT 4.2.1 by using expanding subsets of 485,301 UK Biobank samples and 38,387 TOPMed samples. Both methods have very similar accuracy and computation time for UK Biobank SNP array data. However, for TOPMed sequence data, Beagle is more than 20 times faster than SHAPEIT, achieves similar accuracy, and scales to larger sample sizes.
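The two ideas described in the abstract, splitting markers by frequency for two-stage phasing and fixing confidently phased heterozygotes between iterations, can be illustrated with a minimal sketch. This is not the Beagle 5.2 implementation: the function names, the frequency threshold, the confidence cutoff, and the per-heterozygote phase probabilities are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

Genotype = Tuple[int, int]  # two alleles coded 0 (reference) or 1 (alternate)


def minor_allele_freq(genotypes: List[Genotype]) -> float:
    """Return the minor allele frequency of one biallelic marker."""
    alt_count = sum(a + b for a, b in genotypes)
    freq = alt_count / (2 * len(genotypes))
    return min(freq, 1.0 - freq)


def split_markers(matrix: List[List[Genotype]],
                  threshold: float) -> Tuple[List[int], List[int]]:
    """Partition marker indices into a high-frequency set (stage 1) and a
    low-frequency set (stage 2). `threshold` is a hypothetical cutoff."""
    stage1, stage2 = [], []
    for m, genotypes in enumerate(matrix):
        if minor_allele_freq(genotypes) >= threshold:
            stage1.append(m)
        else:
            stage2.append(m)
    return stage1, stage2


def progressive_fix(phase_probs: Dict[Tuple[int, int], float],
                    fixed: Set[Tuple[int, int]],
                    confidence: float) -> Set[Tuple[int, int]]:
    """One progressive-phasing step: heterozygotes, keyed by
    (marker, sample), whose estimated phase is confident in either
    direction are added to the fixed set and kept in later iterations.
    The probabilities are assumed to come from some phasing model."""
    newly_fixed = {
        het for het, p in phase_probs.items()
        if het not in fixed and max(p, 1.0 - p) >= confidence
    }
    return fixed | newly_fixed


# Toy data: 2 markers x 4 samples (illustrative values only).
matrix = [
    [(0, 1), (1, 1), (0, 0), (0, 1)],  # common variant, MAF 0.5
    [(0, 0), (0, 0), (0, 0), (0, 1)],  # rarer variant, MAF 0.125
]
stage1, stage2 = split_markers(matrix, threshold=0.2)
fixed = progressive_fix({(0, 0): 0.99, (0, 3): 0.6}, set(), confidence=0.95)
```

In this toy run marker 0 lands in the first stage and marker 1 in the second, and only the heterozygote at (marker 0, sample 0) is fixed. In the method described above, the second-stage markers would then be phased by imputing them onto the haplotypes estimated in the first stage.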

Keywords: TOPMed; UK Biobank; genotype phasing; haplotype phasing; phasing.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms
Asthma / genetics*
Atrial Fibrillation / genetics*
Data Interpretation, Statistical*
Female
Genome, Human*
Genome-Wide Association Study
Genotype
Haplotypes*
Humans
Male
Polymorphism, Single Nucleotide*
Software*
