A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research

Comput Struct Biotechnol J. 2021 Jun 27;19:3747-3754. doi: 10.1016/j.csbj.2021.06.040. eCollection 2021.

Abstract

Two major forces have contributed to the fast growth of human genetic data: medical research supported by governments and academic institutes, and direct-to-consumer (DTC) sequencing companies. Data from the former benefit from meticulously designed sequencing standards and quality control procedures, whereas data from the latter come in various formats and from various sequencing methods, both of which change over time and with the particular needs of different companies. Thanks to members of the general public who shared their DNA data without constraint, we provide here a review of over 7000 genomes made public between 2011 and 2020 and produced by more than six DTC sequencing companies. To prepare consumer DNA datasets for research, we provide an open-source toolkit, GenomePrep, that systematically parses, quality-checks and filters genome files and statistically problematic alleles. The GenomePrep output is available in two common DNA data file formats to enable further analysis with other tools. We also provide for download the combined output for all OpenSNP array genomes processed in this paper as a single data-freeze file.

Keywords: Direct-to-consumer sequencing; Genotyping; Open genome; Personal genome; SNP arrays.
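
The parse, quality-check and filter steps named in the abstract can be illustrated with a minimal sketch. It is not GenomePrep's implementation. It assumes a 23andMe-style raw file (tab-separated rsid, chromosome, position and genotype columns, with comment lines starting with `#`), which is only one of the layouts DTC companies use. The function names, the sample records and the no-call rule are hypothetical, chosen for illustration.

```python
import io

VALID_BASES = set("ACGT")


def parse_genotype_lines(lines):
    """Yield (rsid, chromosome, position, genotype) tuples.

    Assumes a four-column, tab-separated layout; comment lines,
    blank lines and malformed rows are skipped rather than raised.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # header/comment or empty line
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # wrong number of columns
        rsid, chrom, pos, genotype = fields
        if not pos.isdigit():
            continue  # position must be a non-negative integer
        yield rsid, chrom, int(pos), genotype.upper()


def is_called(genotype):
    """True if the genotype is one or two bases drawn from A, C, G, T."""
    return 1 <= len(genotype) <= 2 and set(genotype) <= VALID_BASES


def filter_calls(records):
    """Keep only records with a called genotype (drops e.g. '--')."""
    return [r for r in records if is_called(r[3])]


# Hypothetical sample data, not taken from any real genome file.
SAMPLE = (
    "# hypothetical header comment\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\t--\n"
    "rs0000003\tX\t3000\tG\n"
    "malformed line\n"
    "rs0000004\t2\tnot_a_number\tCT\n"
    "rs0000005\t2\t4000\tct\n"
)

records = list(parse_genotype_lines(io.StringIO(SAMPLE)))
kept = filter_calls(records)
```

Here `records` holds the rows that parsed cleanly and `kept` the subset with called genotypes. A real pipeline would also need per-company format detection and allele-level statistical checks, which this sketch does not attempt.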