PLoS One. 2019 Jun 4;14(6):e0217852.
doi: 10.1371/journal.pone.0217852. eCollection 2019.

Split4Blank: Maintaining Consistency While Improving Efficiency of Loading RDF Data With Blank Nodes

Free PMC article

Atsuko Yamaguchi et al. PLoS One.

Abstract

In the life sciences, the rapid growth of sequencing technology and the advancement of research are generating vast amounts of data. As Resource Description Framework (RDF) datasets grow, loading them efficiently into triple stores becomes crucial. For example, the RDF version of UniProt contained 44 billion triples as of December 2018, and PubChem provides an RDF dataset with 137 billion triples. Loading datasets of this size into a triple store is time-consuming. To improve the efficiency of this task, parallel loading has been recommended for several stores. However, with parallel loading, dataset consistency must be considered if the dataset contains blank nodes. By definition, blank nodes do not have global identifiers; thus, occurrences of the same blank node in the original dataset are recognized as different nodes if they reside in separate files after the dataset is split for parallel loading. To address this issue, we propose the Split4Blank tool, which splits a dataset into multiple files under the condition that identical blank nodes are not separated. The proposed tool uses connected component and multiprocessor scheduling algorithms to satisfy this condition. To confirm the effectiveness of the proposed approach, we applied Split4Blank to two life sciences RDF datasets. In addition, we generated synthetic RDF datasets to evaluate scalability based on the properties of various graphs, such as scale-free and random graphs.
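
The splitting condition described in the abstract can be illustrated with a minimal Python sketch. This is not the Split4Blank implementation. It assumes a simplified N-Triples-like input with one triple per line, groups triples that share blank nodes using a union-find structure (one way to compute connected components), and balances the groups across files with the greedy longest-processing-time heuristic (one well-known multiprocessor scheduling heuristic; the abstract does not say which algorithm the tool uses). The names blank_nodes, find, and split_triples are hypothetical.

```python
import heapq
from collections import defaultdict


def blank_nodes(line):
    # Return the blank-node labels in subject/object position of one
    # simplified N-Triples line (assumption: terms are space-separated).
    s, _p, rest = line.split(None, 2)
    o = rest.rstrip().rstrip(".").rstrip()
    return [t for t in (s, o) if t.startswith("_:")]


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def split_triples(lines, k):
    # Step 1: connected components over blank nodes. Two blank nodes
    # that occur in the same triple are merged into one component.
    parent = {}
    for line in lines:
        nodes = blank_nodes(line)
        for n in nodes:
            parent.setdefault(n, n)
        for n in nodes[1:]:
            parent[find(parent, nodes[0])] = find(parent, n)

    # Step 2: group triples by component. A triple without blank
    # nodes forms its own group and may go to any file.
    groups = defaultdict(list)
    for i, line in enumerate(lines):
        nodes = blank_nodes(line)
        key = find(parent, nodes[0]) if nodes else ("free", i)
        groups[key].append(line)

    # Step 3: greedy scheduling. Assign the largest remaining group
    # to the file that currently holds the fewest triples.
    heap = [(0, i) for i in range(k)]
    files = [[] for _ in range(k)]
    for group in sorted(groups.values(), key=len, reverse=True):
        size, i = heapq.heappop(heap)
        files[i].extend(group)
        heapq.heappush(heap, (size + len(group), i))
    return files


# Hypothetical example data using example.org IRIs.
lines = [
    '_:a <http://example.org/p> _:b .',
    '_:b <http://example.org/p> "x" .',
    '<http://example.org/s> <http://example.org/p> _:c .',
    '_:c <http://example.org/q> <http://example.org/o> .',
    '<http://example.org/s1> <http://example.org/p> <http://example.org/o1> .',
    '<http://example.org/s2> <http://example.org/p> <http://example.org/o2> .',
]
files = split_triples(lines, 2)
```

In this example the triples mentioning _:a and _:b always land in the same file, as do the two triples mentioning _:c, so a loader that assigns fresh identifiers to blank nodes per file would not split one blank node into two. A real dataset would need a proper RDF parser rather than whitespace splitting.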

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Computation time to split the Allie dataset.
x-axis and y-axis correspond to the number of files from two to ten and the average of running time [ms], respectively.
Fig 2. Computation time to split the Nikkaji dataset.
x-axis and y-axis correspond to the number of files from two to ten and the average of running time [ms], respectively.
Fig 3. Computation time to split the Allie dataset.
x-axis and y-axis correspond to the number of files (2, 10, 100, and 1000) and the average of running time [ms], respectively.
Fig 4. Computation time to split the Nikkaji dataset.
x-axis and y-axis correspond to the number of files (2, 10, 100, and 1000) and the average of running time [ms], respectively.
Fig 5. Computation time required to split the dataset generated using the random graph.
x-axis and y-axis correspond to the number of nodes and the average of running time [ms], respectively.
Fig 6. Computation time required to split the dataset generated using the Watts–Strogatz model.
x-axis and y-axis correspond to the number of nodes and the average of running time [ms], respectively.
Fig 7. Computation time required to split the dataset generated using the Barabási–Albert model.
x-axis and y-axis correspond to the number of nodes and the average of running time [ms], respectively.

Grant support

NBDC (https://biosciencedbc.jp/en/) financially supported this work; there is no grant number. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.