Gigascience. 2016 Oct 11;5(1):42.
doi: 10.1186/s13742-016-0148-z.

The Whole Genome Sequences and Experimentally Phased Haplotypes of Over 100 Personal Genomes

Free PMC article

Qing Mao et al. Gigascience.

Abstract

Background: Since the completion of the Human Genome Project in 2003, it is estimated that more than 200,000 individual whole human genomes have been sequenced, a stunning accomplishment in such a short period of time. However, most of these were sequenced without experimental haplotype data and are therefore missing an important aspect of genome biology. In addition, much of the genomic data is not available to the public and lacks phenotypic information.

Findings: As part of the Personal Genome Project, blood samples from 184 participants were collected and processed using Complete Genomics' Long Fragment Read technology. Here, we present the experimental whole genome haplotyping and sequencing of these samples to an average read coverage depth of 100X. This is approximately three-fold higher than the read coverage applied to most whole human genome assemblies and ensures the highest quality results. Currently, 114 genomes from this dataset are freely available in the GigaDB repository and are associated with rich phenotypic data; the remaining 70 should be added in the near future as they are approved through the PGP data release process. For reproducibility analyses, 20 genomes were sequenced at least twice using independent LFR barcoded libraries. Seven genomes were also sequenced using Complete Genomics' standard non-barcoded library process. In addition, we report 2.6 million high-quality, rare variants not previously identified in the Single Nucleotide Polymorphisms database or the 1000 Genomes Project Phase 3 data.

Conclusions: These genomes represent a unique source of haplotype and phenotype data for the scientific community and should help to expand our understanding of human genome evolution and function.

Keywords: Complete genomics; Haplotypes; LFR; Long fragment read; PGP; Personal Genome Project; Whole genome sequencing.

Figures

Fig. 1
Self-reported participant ethnicity. As part of the Personal Genome Project sample acquisition process, participants were asked to report their ethnicity. The pie chart illustrates the proportion of samples from each ethnic group. Out of the 184 participants, more than 75 % reported themselves as White
Fig. 2
Data directory tree. The output from the Long Fragment Read (LFR) process consists of a series of files and folders. A complete description of everything contained within the Complete Genomics data package can be found in Additional file 3. ASM assembly, CNV copy number variations, dbSNP Single Nucleotide Polymorphisms database, SV structural variations, VCF variant call format
Fig. 3
Coverage map. Mate-pair read coverage across all 384 wells of a Long Fragment Read (LFR) sample for the region on chromosome 14 from 93,100,000 to 95,100,000. From left to right, each column corresponds to one of the 384 wells, with the leftmost column corresponding to well 0 (this represents mate-pair reads for which the well was not called). The position in Mb along chromosome 14 is displayed on the vertical axis. Each red horizontal line corresponds to a 100 kb increment on chromosome 14. The gray scale encodes the number of mate-pairs mapped within each 1 kb bin. The fragments are clearly visible as vertical dark streaks in each column
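
The binning behind a coverage map like this one can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the paper: it assumes each mate-pair has already been reduced to a hypothetical `(well, position)` tuple, with well 0 meaning the well was not called, and it treats the region as a half-open interval.

```python
from collections import Counter


def bin_coverage(mate_pairs, region_start, region_end, bin_size=1000):
    """Count mate-pairs per (well, bin) inside [region_start, region_end).

    mate_pairs: iterable of (well, position) tuples (hypothetical input shape).
    Returns a Counter keyed by (well, bin_index), where bin_index 0 starts
    at region_start and each bin spans bin_size bases.
    """
    counts = Counter()
    for well, position in mate_pairs:
        if region_start <= position < region_end:
            bin_index = (position - region_start) // bin_size
            counts[(well, bin_index)] += 1
    return counts


# Toy input: three mate-pairs in well 1, one just outside the region.
example_pairs = [
    (1, 93_100_000),
    (1, 93_100_999),
    (1, 93_101_000),
    (0, 95_100_000),  # falls outside the half-open region, so it is skipped
]
example_counts = bin_coverage(example_pairs, 93_100_000, 95_100_000)
```

Rendering `example_counts` as a gray-scale image with one column per well would give the layout described in the caption; a long fragment then shows up as a run of consecutive non-empty bins in a single well.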
Fig. 4
Haplotype extraction. Haplotypes can easily be retrieved from Long Fragment Read (LFR) samples starting from the variant file with file name format var-GS0000#####-ASM.tsv_with_wellcount_exc.txt. Following the steps provided in the figure will result in the highest quality haplotypes with an extremely low error rate, but with some loss of real variants. LFR haplotype performance using these filters has previously been described [3, 16]
Fig. 5
Genome quality metrics. Metrics from 225 individual genomic libraries (Additional file 1) from 184 Personal Genome Project (PGP) participants are plotted in each panel. Each dot represents a single genomic library from a PGP sample and is colored by ethnicity as follows: blue, Unreported (Urp); light green, White (Wht); purple, Asian (Asn); dark red, Hispanic or Latino (Hsp); light orange, American Indian/Alaska Native/White (Aaw); light blue, Black or African American (Blk); pink, Asian/White (Awt); dark blue, Asian/Hispanic or Latino (Asp); light purple, Hispanic or Latino/White (Hsw). The large red colored dot in panels a and c–f represents the average across the PGP data set. a The percent called across the genome is plotted on the x-axis and the percent called across the exome is plotted on the y-axis. b The total number of variant sites per genome is plotted on the y-axis and the ethnic group to which each sample was self-reported is plotted on the x-axis. Red colored dots represent the average number of single nucleotide polymorphisms (SNPs) in each population group as reported by the 1000 Genomes (1KG) project [9]. The ethnic groups in our study without a red dot lack a representative population in the 1KG data. c The heterozygous to homozygous SNP ratio (Het/Hom) is plotted on the y-axis and transition to transversion ratio (Ts/Tv) is plotted on the x-axis. d The SNP phasing rate is plotted on the y-axis and the N50 length of the assembled haplotype contigs in kilobases (kb) is plotted on the x-axis. e The average Long Fragment Read (LFR) fragment length is plotted on the y-axis and the N50 length of assembled haplotype contigs is plotted on the x-axis. Both values are in kb. f The number of cells-worth of genomic DNA was calculated based on assembled long fragment coverage and is plotted on the y-axis. The N50 length of the assembled haplotype contigs in kb is plotted on the x-axis
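
Two of the metrics named in this caption, N50 and Ts/Tv, have standard definitions that are easy to state in code. The sketch below uses those textbook definitions and is not taken from the paper's own tooling; the function names and input shapes (a list of contig lengths, a list of `(ref, alt)` base pairs) are assumptions made for illustration.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of length
    >= L together cover at least half of the total assembled length."""
    if not lengths:
        raise ValueError("n50 needs at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Return transitions / transversions for a list of (ref, alt) bases."""
    transitions = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions


example_n50 = n50([100, 200, 300, 400])  # half of 1000 is reached at 300
example_ts_tv = ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
```

In the toy input, the contigs of 400 and 300 together cover 700 of 1000 bases, so the N50 is 300; two transitions against one transversion give a Ts/Tv of 2.0.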
Fig. 6
Venn diagram of the overlap between Personal Genome Project variants and those from the 1000 Genomes Project and the Single Nucleotide Polymorphisms database. Single nucleotide polymorphisms (SNPs) from all 225 Personal Genome Project (PGP) genomic libraries (Additional file 1) were filtered with the following criteria: 1) Each SNP must have a PASS in the “varFilter” field; this helps remove false-positive errors. 2) The variant call – and for heterozygous SNPs also the reference call – must have a “wellCount” of six or more; this removes most of the remaining false-positive errors. 3) For heterozygous SNPs, the “SharedWellCount” field is less than or equal to 0.25 × (“MinExclusiveWellCountInThisLocus” + “SharedWellCount”); this removes potential mapping errors that result in an excess of wells for which both the reference and variant base is called. The combination of this set of filters has previously been shown [16] to remove the vast majority of false-positive errors and was chosen to create a set of very high confidence variants. This set was compared with variants in the 1000 Genomes (1KG, Phase 3) and the SNP database (dbSNP, Build 147) datasets. In total, more than 17 million SNPs were found in the PGP samples and these were compared with over 81 million and 142 million in 1KG and dbSNP, respectively. As expected, more than 85 % of SNPs found in the PGP samples were found in the 1KG and/or dbSNP datasets
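
The three filter criteria in this caption can be written as a single predicate. The following is a sketch under stated assumptions rather than the authors' implementation: each SNP is modeled as a plain dictionary; the keys `varFilter`, `SharedWellCount`, and `MinExclusiveWellCountInThisLocus` follow the field names quoted in the caption, while `isHet`, `variantWellCount`, and `referenceWellCount` are hypothetical names standing in for the zygosity and the per-call “wellCount” values.

```python
def passes_high_confidence_filters(snp, min_wells=6, max_shared_fraction=0.25):
    """Apply the three caption criteria to one SNP record (a dict)."""
    # 1) The call must be marked PASS in the varFilter field.
    if snp["varFilter"] != "PASS":
        return False

    # 2) The variant call needs at least min_wells supporting wells;
    #    heterozygous SNPs need the same for the reference call.
    if snp["variantWellCount"] < min_wells:
        return False
    if snp["isHet"] and snp["referenceWellCount"] < min_wells:
        return False

    # 3) Heterozygous SNPs only: wells shared by both alleles may make up
    #    at most max_shared_fraction of (exclusive + shared) wells.
    if snp["isHet"]:
        shared = snp["SharedWellCount"]
        total = snp["MinExclusiveWellCountInThisLocus"] + shared
        if shared > max_shared_fraction * total:
            return False

    return True


example_snp = {
    "varFilter": "PASS",
    "isHet": True,
    "variantWellCount": 8,
    "referenceWellCount": 7,
    "SharedWellCount": 2,
    "MinExclusiveWellCountInThisLocus": 6,
}
example_result = passes_high_confidence_filters(example_snp)
```

For `example_snp`, the shared-well test compares 2 against 0.25 × (6 + 2) = 2, so the record sits exactly on the threshold and is kept. Raising `SharedWellCount` to 3 would push it over 0.25 × 9 = 2.25 and the record would be dropped.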
Fig. 7
Principal component analysis. SNPRelate [21] was used to project 225 libraries (Additional file 1) from 184 Personal Genome Project (PGP) samples onto a principal component analysis using four different populations from the HapMap 3 project. Hierarchical clustering of this data using SNPRelate suggests that self-reported ethnicity for 182 of the 184 PGP samples matched the correct HapMap 3 ethnicity. The two PGP samples whose self-reported ethnicity did not cluster with the correct HapMap 3 ethnic group self-reported as Asian but their grandparents were of Indian and Sri Lankan ancestry. ASW African ancestry in Southwest USA, CEU Utah residents with Northern and Western European ancestry from the Centre d’Etude du Polymorphisme Humain, Fondation Jean Dausset in Paris, France, CHB Han Chinese in Beijing, China, EV eigenvector, MEX Mexican ancestry in Los Angeles, California
Fig. 8
Pairwise comparisons of haplotype data between replicates. For each sample, replicate libraries were analyzed through pairwise comparisons. Single nucleotide polymorphisms (SNPs) were filtered with the same criteria as used in Fig. 6: 1) Each SNP must have a PASS in the “varFilter” field. 2) The “wellCount” field should be equal to six or greater for both variant and reference calls. 3) The “SharedWellCount” field must be less than or equal to 0.25 × (“MinExclusiveWellCountInThisLocus” + “SharedWellCount”). In addition, overlapping blocks must contain at least ten SNPs and only pairwise comparisons between 1 million or more SNPs were analyzed. This final criterion reduced the number of pairwise comparisons to 35 and the number of participant samples to 12. This filter was applied to remove Long Fragment Read (LFR) libraries that were intentionally made with low coverage and thus have sparse haplotype coverage. Switch discordances were calculated by comparing the phase of heterozygous SNPs in completely overlapping blocks between replicate samples. a Short switch discordance rates were calculated by dividing the total number of discordant SNPs by the total number of phased SNPs in the compared blocks. Long switch discordance rates were calculated by dividing the total number of long switch events by the total number of phased SNPs in the compared blocks. Individual pairwise comparisons are represented by small blue dots on the plot and the average of all 35 comparisons is represented by the large red dot. b The fractions of total blocks with no errors (red dotted line), with one short (black solid line) or long (black dashed line) switch discordance, two to three short (solid grey line) or long (dashed grey line) switch discordances, or four or more short (solid light grey line) or long (dashed light grey line) switch discordances were plotted against the block length in base pairs (bp). The vast majority of compared blocks (~86 %) have no discordances. Of those blocks that are discordant, very few have more than one short or long switch
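
The rate calculations described in panel a reduce to two divisions over pooled block counts. The sketch below assumes the per-block counts have already been obtained from a replicate comparison; the `BlockComparison` class and its field names are hypothetical, and the code does not attempt to classify switches as short or long, since the caption does not define that step.

```python
from dataclasses import dataclass


@dataclass
class BlockComparison:
    """Counts for one pair of completely overlapping haplotype blocks."""
    phased_snps: int            # phased heterozygous SNPs compared
    short_discordant_snps: int  # SNPs counted as short switch discordances
    long_switch_events: int     # number of long switch events


def discordance_rates(blocks):
    """Return (short_rate, long_rate), each pooled over all compared blocks
    and divided by the total number of phased SNPs."""
    total_phased = sum(b.phased_snps for b in blocks)
    if total_phased == 0:
        raise ValueError("no phased SNPs to compare")
    short_rate = sum(b.short_discordant_snps for b in blocks) / total_phased
    long_rate = sum(b.long_switch_events for b in blocks) / total_phased
    return short_rate, long_rate


def fraction_without_discordance(blocks):
    """Return the fraction of blocks with no short or long discordances."""
    if not blocks:
        raise ValueError("no blocks to compare")
    clean = sum(
        1 for b in blocks
        if b.short_discordant_snps == 0 and b.long_switch_events == 0
    )
    return clean / len(blocks)


example_blocks = [
    BlockComparison(phased_snps=1000, short_discordant_snps=2, long_switch_events=1),
    BlockComparison(phased_snps=3000, short_discordant_snps=0, long_switch_events=0),
]
example_short_rate, example_long_rate = discordance_rates(example_blocks)
example_clean_fraction = fraction_without_discordance(example_blocks)
```

With the toy counts above, 2 discordant SNPs and 1 long switch over 4,000 phased SNPs give rates of 0.0005 and 0.00025, and one of the two blocks is free of discordances.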

References

    1. Hayden EC. Technology: the $1,000 genome. Nature. 2014;507(7492):294–5. doi: 10.1038/507294a.
    2. Personal Genome Project. http://www.personalgenomes.org/harvard/sign-up.
    3. Peters BA, Kermani BG, Sparks AB, Alferov O, Hong P, Alexeev A, et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature. 2012;487(7406):190–5. doi: 10.1038/nature11236.
    4. Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P, et al. Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A. 2002;99(8):5261–6. doi: 10.1073/pnas.082089499.
    5. Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010;327(5961):78–81. doi: 10.1126/science.1181498.