Exploiting Reused-Based Sharing Work Opportunities in Big Data Multiquery Optimization with Flink

Big Data. 2021 Dec;9(6):454-479. doi: 10.1089/big.2020.0141. Epub 2021 Oct 6.

Abstract

Multiquery optimization is fundamental to retrieving data from different sources within a specific time frame, which sustains the credibility of big data applications. To avoid costly input/output operations over large-scale data, exploiting sharing opportunities among join, aggregation, and sort operations helps improve the performance of multiple queries. In-memory big data platforms such as Flink are likewise essential to improving multiquery performance. We have extended our previously proposed system, Multi-Query Optimization using Tuple Size and Histogram (MOTH), to exploit shared work among multiple queries, including join, aggregation, and sort. The resulting comprehensive system, called the Join-Aggregation-Sort (JAS)-MOTH system, minimizes in-network data movement time, that is, the shuffle time needed to transfer intermediate data. It adds two modules to investigate shared work: a query explorer and a JAS-MOTH optimizer, which includes the sort exploiter module. The JAS-MOTH system can exploit shared explicit and implicit sorts among multiple sort and aggregation queries. It refines pipelined multiway join execution for multiple join queries by considering coarse-grained data sharing, join ordering, join pipelining, and shared implicit sorts. Furthermore, it introduces an end-to-end multiway join optimizer over Flink. A detailed experimental comparison with the naive and state-of-the-art techniques across a broad class of queries (i.e., join, aggregation, and sort) is presented. On Flink, the proposed system improved query execution time over the naive and state-of-the-art techniques by 47% and 30%, respectively.
The JAS-MOTH system also reduced the intermediate data size relative to the naive and state-of-the-art techniques by 50% and 31% on average, respectively, over Hadoop-like infrastructures.
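
The idea of sharing one sort between a sort query and an aggregation query can be illustrated with a minimal sketch. This is not the JAS-MOTH implementation and does not use the Flink API; it is plain Python, and the dataset `ORDERS` and the function `run_shared` are hypothetical names introduced only for illustration. It assumes both queries use the same key, so that the explicit sort of the first query can serve as the implicit sort that the grouping step of the second query needs.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: (customer, amount) tuples standing in for a large dataset.
ORDERS = [("carol", 30), ("alice", 10), ("bob", 25), ("alice", 5), ("bob", 15)]


def run_shared(rows):
    """Answer a sort query and an aggregation query from one shared sort."""
    # Shared work: sort once on the common key (customer).
    shared_sorted = sorted(rows, key=itemgetter(0))

    # Query 1 (explicit sort): all rows ordered by customer.
    sort_result = list(shared_sorted)

    # Query 2 (implicit sort): total amount per customer.
    # groupby needs key-sorted input, which the shared sort already provides,
    # so no second sort or second pass over unsorted data is required.
    agg_result = [
        (customer, sum(amount for _, amount in group))
        for customer, group in groupby(shared_sorted, key=itemgetter(0))
    ]
    return sort_result, agg_result


sort_result, agg_result = run_shared(ORDERS)
print(sort_result)
print(agg_result)
```

Run independently, the two queries would each sort the input; in a distributed setting each sort would also imply its own shuffle of intermediate data, which is the cost the abstract says the system targets.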

Keywords: big data; data granularities; join, aggregation, sort; multiquery optimization; reused-based opportunity; sharing data; sharing opportunity; sharing work.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Big Data*