Am J Hum Genet, 92 (1), 52-66

Deep Whole-Genome Sequencing of 100 Southeast Asian Malays

Lai-Ping Wong et al. Am J Hum Genet.

Abstract

Whole-genome sequencing across multiple samples in a population provides an unprecedented opportunity for comprehensively characterizing the polymorphic variants in the population. Although the 1000 Genomes Project (1KGP) has offered brief insights into the value of population-level sequencing, the low coverage has compromised the ability to confidently detect rare and low-frequency variants. In addition, the composition of populations in the 1KGP is not complete, despite the fact that the study design has been extended to more than 2,500 samples from more than 20 population groups. The Malays are one of the Austronesian groups predominantly present in Southeast Asia and Oceania, and the Singapore Sequencing Malay Project (SSMP) aims to perform deep whole-genome sequencing of 100 healthy Malays. By sequencing at a minimum of 30× coverage, we have illustrated the higher sensitivity in detecting low-frequency and rare variants and the ability to investigate the presence of hotspots of functional mutations. Compared to the low-pass sequencing in the 1KGP, the deeper coverage allows more functional variants to be identified for each person. A comparison of the fidelity of genotype imputation of Malays indicated that a population-specific reference panel, such as the SSMP, outperforms a cosmopolitan panel with a larger number of individuals for common SNPs. For lower-frequency (<5%) markers, a larger number of individuals might have to be whole-genome sequenced so that the accuracy currently afforded by the 1KGP can be achieved. The SSMP data are expected to be the benchmark for evaluating the value of deep population-level sequencing versus low-pass sequencing, especially in populations that are poorly represented in population-genetics studies.

Figures

Figure 1. PCA of Samples from the SSMP and SGVP
PCA of the 100 samples from the SSMP (black circles) and the 268 samples from the SGVP, which includes 96 Chinese (red), 89 Malays (green), and 83 Indians (blue). A set of 111,776 SNPs present on the Illumina Omni1-Quad array, as well as in the SGVP, was used for this analysis.
Figure 2. Size Distribution of Variants Discovered in the SSMP Compared to the 1KGP
Variants detected in the SSMP include SNPs, small indels between the sizes of −50 and 50 bp, and large deletions between the sizes of 50 bp and 1 Mb. Variants identified by the 1KGP were compared with those in NCBI build 36 and dbSNP129, and variants identified by the SSMP were compared with those in dbSNP132 and the July 2010 release of the 1KGP.
Figure 3. Number of Variants in Individual Whole-Genome Sequencing
Illustration of the number of variants detected in the individual whole-genome-sequencing projects that have been performed, along with the number of samples and the corresponding sequence depth. The number shown at the start of each horizontal bar indicates the exact number of variants discovered in the individual sequencing or the average number of variants discovered in multisample sequencing (for the Koreans and the Malays). The error bars for the multisample sequencing of the Koreans and Malays show the minimum and the maximum number of variants detected across the samples.
Figure 4. Density of SNPs in the SSMP
Density of SNPs discovered in the SSMP. Each chromosome is divided into nonoverlapping windows of 1 Mb, and the number of SNPs per window is shown for each of three categories: all SNPs (A), nsSNPs (B), and damaging nsSNPs (C). Horizontal dashed lines correspond to the thresholds used for defining the regions of interest where the SNP densities are at least 50% of those observed at the HLA region on chromosome 6.
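The windowing described in the Figure 4 caption can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used by the authors: the function names, the 1-based coordinate convention, and the idea of passing the HLA-region density in as a plain number (`reference_density`) are all hypothetical.

```python
from collections import Counter

WINDOW_BP = 1_000_000  # 1 Mb nonoverlapping windows, as in the caption


def window_counts(positions, window=WINDOW_BP):
    """Count SNPs per nonoverlapping window on one chromosome.

    positions: iterable of 1-based base-pair coordinates (assumed convention).
    Returns a Counter mapping window index (0, 1, 2, ...) to SNP count.
    """
    return Counter((pos - 1) // window for pos in positions)


def regions_of_interest(counts, reference_density, fraction=0.5):
    """Return window indices whose SNP count is at least `fraction`
    of a reference density (for example, the count observed in a
    window of the HLA region; supplied by the caller, hypothetical)."""
    cutoff = fraction * reference_density
    return sorted(w for w, c in counts.items() if c >= cutoff)


# Toy example with made-up coordinates.
counts = window_counts([1, 500_000, 1_000_000, 1_000_001, 2_500_000])
dense = regions_of_interest(counts, reference_density=4)
```

The same two functions could be applied separately to all SNPs, nsSNPs, and damaging nsSNPs to mirror the three panels.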
Figure 5. Size Distribution of Indels by Population Frequency
Indels discovered in the SSMP are distributed by size and categorized into three MAF bins: rare (≤1%), low frequency (1%–5%), and common (≥5%). Previously identified indels refer to those that are present in dbSNP132 or in the July 2010 release of the 1KGP (lower panel), whereas nonoverlapping indels are defined as those present in only the SSMP and not in either dbSNP132 or the 1KGP (upper panel). The lines shown in the upper panel indicate the proportion of nonoverlapping indels identified by the SSMP (orange line) and the 1KGP (green).
Figure 6. Number of SNPs Detectable by Sequencing at Different Depths
Pictorial representation of the number of SNPs detected by sequencing at 30× or 5× coverage. The blue bars represent the number of SNPs found by both 5× and 30× sequencing, and the red bars represent those that were only detected by sequencing at 30×.
Figure 7. Genomic Coverage of Genotyping Arrays
Coverage of SNP variation for Southeast Asian Singapore Malays (SSM), Europeans (CEU), East Asians (CHB + JPT), and Africans (YRI) from the 1KGP on various commercially available genome-wide genotyping arrays. (A) SNPs of common frequency (≥5% in each population) were assessed. (B) SNPs of low and common frequency (≥1% in each population) were assessed.
Figure 8. Genomic Coverage of Exome Arrays
The percentage of variation covered by the two currently commercially available exome-focused genotyping arrays for the different population groups: Southeast Asian Singapore Malays (SSM), Europeans (CEU), East Asians (CHB + JPT), and Africans (YRI). Coverage of common exonic variants is shown for the Illumina HumanExome Beadchip (A) and the Affymetrix Axiom Exome Array (B). Additionally, low-frequency exonic SNPs are included in the coverage assessment of the Illumina HumanExome Beadchip (C) and the Affymetrix Axiom Exome Array (D).
Figure 9. Comparison of Reference Panels in Genotype Imputation
Evaluation of the performance of genotype imputation of 2,542 Singapore Malays who were genotyped on the Illumina610 array against the two reference panels constructed from (1) 96 Malays from the SSMP and (2) 1,092 samples from 14 populations in phase 1 of the 1KGP. The correlation r² between the allele dosages and the actual genotype calls was calculated for each SNP on the microarray. The vertical bars represent the percentage of SNPs in each MAF bin where r² is less than 0.9. The figure at the top of each frequency bin represents the number of SNPs with a MAF (calculated from 2,542 Malay samples) that falls within the frequency spectrum of the bin. The vertical axis is represented in logarithmic scale for ease of interpretation.
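The per-SNP metric in the Figure 9 caption, the squared correlation between imputed allele dosages and observed genotype calls, can be illustrated with a minimal sketch. It assumes genotypes are coded as 0, 1, or 2 copies of one allele and dosages are real numbers between 0 and 2; the function names are hypothetical, and the imputation software and exact computation used in the study are not stated in the caption.

```python
def dosage_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages (0..2)
    and genotype calls (0, 1, or 2) for a single SNP."""
    n = len(dosages)
    if n != len(genotypes) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return float("nan")  # correlation is undefined for a monomorphic SNP
    return cov * cov / (var_d * var_g)


def minor_allele_frequency(genotypes):
    """MAF from genotype calls coded as allele counts (0, 1, 2)."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def percent_below(r2_values, threshold=0.9):
    """Percentage of SNPs whose r2 is below the threshold.
    NaN values compare as False and so are not counted as below."""
    below = sum(1 for r2 in r2_values if r2 < threshold)
    return 100.0 * below / len(r2_values)
```

Grouping SNPs by `minor_allele_frequency` and calling `percent_below` on each group would give one bar per MAF bin, in the spirit of the figure.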
