VariantSpark: population scale clustering of genotype information

Aidan R O'Brien; Neil F W Saunders; Yi Guo; Fabian A Buske; Rodney J Scott; Denis C Bauer

doi:10.1186/s12864-015-2269-7

BMC Genomics. 2015 Dec 10:16:1052. doi: 10.1186/s12864-015-2269-7.

Authors

Aidan R O'Brien¹ ², Neil F W Saunders¹, Yi Guo³, Fabian A Buske⁴ ⁵, Rodney J Scott⁶, Denis C Bauer⁷

Affiliations

¹ CSIRO, Health & Biosecurity Flagship, 11 Julius Av, Sydney, 2113, Australia.
² School of Biomedical Sciences and Pharmacy, Faculty of Health, Newcastle, 2308, Australia.
³ CSIRO, Data61, Sydney, 2052, Australia. Yi.Guo@csiro.au.
⁴ Cancer Epigenetics Program, Cancer Research Division, Kinghorn Cancer Centre, Garvan Institute of Medical Research, 384 Victoria St, Sydney, 2010, Australia.
⁵ UNSW Medicine, University of New South Wales, Sydney, 2052, Australia.
⁶ School of Biomedical Sciences and Pharmacy, Faculty of Health, Newcastle, 2308, Australia. rodney.scott@newcastle.edu.au.
⁷ CSIRO, Health & Biosecurity Flagship, 11 Julius Av, Sydney, 2113, Australia. Denis.Bauer@CSIRO.au.

Abstract

Background: Genomic information is increasingly used in medical practice, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks. However, many genomic analyses do not fit the MapReduce paradigm. We therefore utilise the recently developed SPARK engine, along with its associated machine learning library, MLlib, which offers more flexibility in the parallelisation of population-scale bioinformatics tasks. The resulting tool, VARIANTSPARK, provides an interface from MLlib to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results.
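To illustrate what an interface between VCF and a machine learning library has to do, the following is a minimal sketch in plain Python, not VARIANTSPARK's actual code. It assumes each genotype is encoded as the number of non-reference alleles (0, 1 or 2) and that missing calls count as reference; the function names `genotype_to_dosage` and `vcf_to_sample_vectors` and the toy VCF lines are hypothetical.

```python
# Sketch only: hypothetical helpers, not part of VariantSpark or MLlib.

def genotype_to_dosage(sample_field):
    """Count non-reference alleles in a VCF sample field such as '0|1' or '1/1:35'."""
    gt = sample_field.split(":")[0]          # GT is assumed to be the first FORMAT key
    alleles = gt.replace("|", "/").split("/")
    # Assumption: missing alleles ('.') are treated like the reference allele.
    return sum(1 for allele in alleles if allele not in ("0", "."))


def vcf_to_sample_vectors(lines):
    """Turn VCF text lines into one feature vector per individual.

    Each vector holds one entry per variant, so individuals (not variants)
    become the rows handed to a clustering algorithm.
    """
    samples = []
    vectors = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue                          # skip blank and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]              # sample names follow the 9 fixed columns
            vectors = {name: [] for name in samples}
            continue
        for name, sample_field in zip(samples, fields[9:]):
            vectors[name].append(genotype_to_dosage(sample_field))
    return vectors


# Toy input with three individuals and two variants (made-up data).
TOY_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1|1\t0|0\t./.",
]

if __name__ == "__main__":
    for name, vector in vcf_to_sample_vectors(TOY_VCF).items():
        print(name, vector)
```

In a distributed setting the same per-line transformation would be applied in parallel across partitions of the VCF file; the sketch keeps everything in memory for clarity.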

Results: To demonstrate the capabilities of VARIANTSPARK, we clustered more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VARIANTSPARK is 80 % faster than each of three alternatives: ADAM, the SPARK-based genome clustering approach; the comparable implementation using Hadoop/Mahout; and ADMIXTURE, a commonly used tool for determining individual ancestries. It is over 90 % faster than traditional implementations using R and Python.
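As a rough illustration of clustering individuals by genotype, the sketch below runs k-means (Lloyd's algorithm with Euclidean distance) on a handful of made-up genotype vectors. The abstract does not name the clustering algorithm, so the choice of k-means here is an assumption, as are the deterministic farthest-point initialisation, the function names and the toy data; it is a single-machine stand-in, not the MLlib implementation.

```python
# Sketch only: single-machine k-means on toy data, not the MLlib implementation.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic start: first point, then repeatedly the point farthest
    from all centroids chosen so far (an assumption made for reproducibility)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, max_iter=20):
    """Return one cluster label per point."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each individual joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break                             # assignments stable: converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up genotype vectors (alternate-allele counts) for six individuals.
TOY_GENOTYPES = {
    "A1": [0, 0, 1, 0], "A2": [0, 1, 0, 0], "A3": [0, 0, 0, 0],
    "B1": [2, 2, 1, 2], "B2": [2, 1, 2, 2], "B3": [2, 2, 2, 2],
}

if __name__ == "__main__":
    names = list(TOY_GENOTYPES)
    labels = kmeans([TOY_GENOTYPES[n] for n in names], k=2)
    for name, label in zip(names, labels):
        print(name, "-> cluster", label)
```

At population scale the assignment step is the part that parallelises naturally, since each individual's nearest centroid can be computed independently.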

Conclusion: The benefits of speed, resource consumption and scalability enable VARIANTSPARK to open up the usage of advanced, efficient machine learning algorithms to genomic data.

Publication types

Comparative Study

MeSH terms

Algorithms
Cluster Analysis
Computational Biology / methods*
Genotype*
Humans
Polymorphism, Single Nucleotide
Software

Grants and funding

1051757/Medical Research Council/United Kingdom