2016 May;9(9):624-635.
doi: 10.14778/2947618.2947619.

Decibel: The Relational Dataset Branching System

Free PMC article

Michael Maddox et al. Proceedings VLDB Endowment.

Abstract

As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This not only wastes storage but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs.
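
To make the idea of dataset branching concrete, the following is a minimal sketch of a versioned table in which branches share history instead of holding full copies. It is an illustration only and is not one of the storage engine designs evaluated in the paper. The class and method names (`ToyVersionedTable`, `branch`, `commit`, `snapshot`, `diff`) and the sample rows are hypothetical.

```python
from typing import Dict, Optional, Tuple

Row = Tuple[str, float]
Delta = Dict[int, Optional[Row]]  # key -> new row, or None to mark a delete


class Commit:
    """One set of changes, linked to the commit it was made on top of."""

    def __init__(self, parent: Optional["Commit"], delta: Delta) -> None:
        self.parent = parent
        self.delta = dict(delta)


class ToyVersionedTable:
    """Hypothetical in-memory table; each branch is a pointer to a commit."""

    def __init__(self) -> None:
        self.heads: Dict[str, Commit] = {"main": Commit(None, {})}

    def branch(self, new: str, source: str) -> None:
        # Branching copies a pointer, not the data.
        self.heads[new] = self.heads[source]

    def commit(self, branch: str, delta: Delta) -> None:
        # Record only the rows that changed on this branch.
        self.heads[branch] = Commit(self.heads[branch], delta)

    def snapshot(self, branch: str) -> Dict[int, Row]:
        # Replay the chain of deltas from the oldest commit to the head.
        chain = []
        node: Optional[Commit] = self.heads[branch]
        while node is not None:
            chain.append(node)
            node = node.parent
        rows: Dict[int, Row] = {}
        for node in reversed(chain):
            for key, row in node.delta.items():
                if row is None:
                    rows.pop(key, None)
                else:
                    rows[key] = row
        return rows

    def diff(self, a: str, b: str) -> Dict[int, Tuple[Optional[Row], Optional[Row]]]:
        # Keys whose row differs between the two branches.
        ra, rb = self.snapshot(a), self.snapshot(b)
        return {k: (ra.get(k), rb.get(k))
                for k in sorted(ra.keys() | rb.keys())
                if ra.get(k) != rb.get(k)}


table = ToyVersionedTable()
table.commit("main", {1: ("sample_a", 0.5), 2: ("sample_b", 1.5)})
table.branch("cleaning", "main")
table.commit("cleaning", {2: None, 3: ("sample_c", 2.0)})  # one user's edits
table.commit("main", {1: ("sample_a", 0.7)})               # a concurrent edit
changes = table.diff("main", "cleaning")
```

Here `changes` lists the three keys on which the two branches disagree, which is the kind of provenance that separate copies of a dataset do not provide. Replaying deltas on every read is the simplest possible choice and would be slow on long histories; it is used here only to keep the sketch short.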

Figures

Figure 1. Two Example Workflows
Figure 2. Example of Tuple-First
Figure 3. Example of Version-First (depicts branches resulting from merges, but could be commits)
Figure 4. Example of Hybrid
Figure 5. The various branching strategies in the versioning benchmark: a) Deep b) Flat c) Science (Sci.) d) Curation (Cur.)
Figure 6. The Impact of Scaling Branches
Figure 7. Query 1
Figure 8. Query 2
Figure 9. Query 3
Figure 10. Query 4
Figure 11. Table-Wise Updates: Query 1 (10 Branches)
