Dynamic compression schemes for graph coloring

Harun Mustafa; Ingo Schilken; Mikhail Karasikov; Carsten Eickhoff; Gunnar Rätsch; André Kahles

doi:10.1093/bioinformatics/bty632

Dynamic compression schemes for graph coloring

Bioinformatics. 2019 Feb 1;35(3):407-414. doi: 10.1093/bioinformatics/bty632.

Authors

Harun Mustafa^{1

2

3}, Ingo Schilken¹, Mikhail Karasikov^{1

2

3}, Carsten Eickhoff⁴, Gunnar Rätsch^{1

2

3}, André Kahles^{1

2

3}

Affiliations

¹ Department of Computer Science, ETH Zurich, Zurich, Switzerland.
² Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland.
³ SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland.
⁴ Brown Center for Biomedical Informatics, Brown University, Providence, RI, USA.

Abstract

Motivation: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata.

Results: We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain.

Availability and implementation: We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Color
Computational Biology*
Data Compression*
Genomics
High-Throughput Nucleotide Sequencing
Software*