Authors:
(1) Limeng Zhang, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia;
(2) M. Ali Babar, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia. Table of Links Abstract and 1 Introduction 1.1 Configuration Parameter Tuning Challenges and 1.2 Contributions 2 Tuning Objectives 3 Overview of Tuning Framework 4 Workload Characterization and 4.1 Query-level Characterization 4.2 Runtime-based Characterization 5 Feature Pruning and 5.1 Workload-level Pruning 5.2 Configuration-level Pruning 5.3 Summary 6 Knowledge from Experience 7 Configuration Recommendation and 7.1 Bayesian Optimization 7.2 Neural Network 7.3 Reinforcement Learning 7.4 Search-based Solutions 8 Experimental Setting 9 Related Work 10 Discussion and Conclusion, and References 4 WORKLOAD CHARACTERIZATION The initial step in the tuning system involves modeling the characteristics of the target workload to facilitate the tuning algorithm in acquiring knowledge from historical data, thereby accelerating the tuning process or improving coverage to achieve optimal configurations. In this section, we present two main aspects for modeling DBMS workloads: a query-level characterization based on the queries composed in a workload and a runtime-based characterization based on DBMS runtime statistics obtained during the execution of the instance. 4.1 Query-level Characterization Query-level characterization aims at modeling a workload from its composed queries [3]. A DBMS application allows users and other applications to engage with the database by formulating and executing queries, such as SELECT, INSERT, UPDATE, DELETE—tailored to manipulate and retrieve information and accomplish various tasks. Hence, researchers have endeavored to model a workload by analyzing the composition of queries and utilizing query characteristics to represent the workload [3]. Existing work on query-level characterization can be summarized as follows: • Features from the query text: In a DBMS workload, each query will perform a certain type of operation within the DBMS. Hence, some researchers try to characterize a workload by analyzing its static query text and then merging each query vector to ultimately represent the workload. RestTune [6] analyzes the cost of a MySQL workload by examining the distribution of queries within each workload based on their reserved SQL keywords (e.g., SELECT, UPDATE, DISTINCT). Initially, it identifies the reserved operation keywords for each query and subsequently computes their TF-IDF (term frequency-inverse document frequency) values. These TF-IDF feature vectors serve as input for a random forest model utilized for classification. This model provides a predicted probability distribution regarding predefined resource cost levels for each query. Ultimately, the average of the probability distributions for all queries across the entire workload is calculated to represent the metafeature cost for the input workload. LOCAT [7] features a Spark SQL workload directly with its query composition. It then filters out some unimportant queries and uses those important queries to represent a workload when evaluating the performance of recommended configuration. ONLINETUNE [3] features a workload in terms of the query arrival rate within it and its query composition. It leverages a standard LSTM encoder-decoder network to encode each query in the workload. Subsequently, it averages the query encodings to obtain the composition feature of the workload. • Features from the query plan: In addition to the static query text, some researchers model a query through looking into its execution process. When a query is submitted to the DBMS, the system first parses the query to check its syntax and semantics. After parsing, the query is compiled into an execution plan, which outlines the steps needed to execute the query efficiently. Some researchers represent a query through parsing its query plan to aggregate key features, including the cost or categories of operators, scanned tables, and predicates, and then featurizing the workload with the query vectors. In Qtune [9], each SQL query is represented with features including query type (e.g., insert, delete, select, update), involved tables, and the estimated processing cost derived from inherent operations (e.g., scan, hash join, aggregate). Then, it merges all the query vectors into a unified workload vector according to its predefined merging strategies for each part. • Features from other query-related factors. In addition to the aforementioned query features, certain studies consider the underlying data distribution in the database as an informative factor, based on the observation that only data changes affecting the workload queries are relevant to the tuning policy. Building upon this premise, ONLINETUNE [3] extracts the database data distribution by leveraging insights from the DBMS optimizer, which encompass estimates of rows scrutinized by queries, the proportion of rows filtered by table conditions in queries, and the utilization of indexes. Subsequently, it computes the average of these three query features to derive the underlying data distribution associated with a workload. Finally, combined with the features extracted from query text, it concatenates the workload feature and underlying data feature to obtain the final contextual features. This paper is available on arxiv under CC BY 4.0 DEED. Authors: (1) Limeng Zhang, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia; (2) M. Ali Babar, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia. Authors: Authors: (1) Limeng Zhang, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia; (2) M. Ali Babar, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia. Table of Links Abstract and 1 Introduction Abstract and 1 Introduction 1.1 Configuration Parameter Tuning Challenges and 1.2 Contributions 1.1 Configuration Parameter Tuning Challenges and 1.2 Contributions 2 Tuning Objectives 2 Tuning Objectives 3 Overview of Tuning Framework 3 Overview of Tuning Framework 4 Workload Characterization and 4.1 Query-level Characterization 4 Workload Characterization and 4.1 Query-level Characterization 4.2 Runtime-based Characterization 4.2 Runtime-based Characterization 5 Feature Pruning and 5.1 Workload-level Pruning 5 Feature Pruning and 5.1 Workload-level Pruning 5.2 Configuration-level Pruning 5.2 Configuration-level Pruning 5.3 Summary 5.3 Summary 6 Knowledge from Experience 6 Knowledge from Experience 7 Configuration Recommendation and 7.1 Bayesian Optimization 7 Configuration Recommendation and 7.1 Bayesian Optimization 7.2 Neural Network 7.2 Neural Network 7.3 Reinforcement Learning 7.3 Reinforcement Learning 7.4 Search-based Solutions 7.4 Search-based Solutions 8 Experimental Setting 8 Experimental Setting 9 Related Work 9 Related Work 10 Discussion and Conclusion, and References 10 Discussion and Conclusion, and References 4 WORKLOAD CHARACTERIZATION The initial step in the tuning system involves modeling the characteristics of the target workload to facilitate the tuning algorithm in acquiring knowledge from historical data, thereby accelerating the tuning process or improving coverage to achieve optimal configurations. In this section, we present two main aspects for modeling DBMS workloads: a query-level characterization based on the queries composed in a workload and a runtime-based characterization based on DBMS runtime statistics obtained during the execution of the instance. 4.1 Query-level Characterization Query-level characterization aims at modeling a workload from its composed queries [3]. A DBMS application allows users and other applications to engage with the database by formulating and executing queries, such as SELECT, INSERT, UPDATE, DELETE—tailored to manipulate and retrieve information and accomplish various tasks. Hence, researchers have endeavored to model a workload by analyzing the composition of queries and utilizing query characteristics to represent the workload [3]. Existing work on query-level characterization can be summarized as follows: • Features from the query text: In a DBMS workload, each query will perform a certain type of operation within the DBMS. Hence, some researchers try to characterize a workload by analyzing its static query text and then merging each query vector to ultimately represent the workload. RestTune [6] analyzes the cost of a MySQL workload by examining the distribution of queries within each workload based on their reserved SQL keywords (e.g., SELECT, UPDATE, DISTINCT). Initially, it identifies the reserved operation keywords for each query and subsequently computes their TF-IDF (term frequency-inverse document frequency) values. These TF-IDF feature vectors serve as input for a random forest model utilized for classification. This model provides a predicted probability distribution regarding predefined resource cost levels for each query. Ultimately, the average of the probability distributions for all queries across the entire workload is calculated to represent the metafeature cost for the input workload. LOCAT [7] features a Spark SQL workload directly with its query composition. It then filters out some unimportant queries and uses those important queries to represent a workload when evaluating the performance of recommended configuration. ONLINETUNE [3] features a workload in terms of the query arrival rate within it and its query composition. It leverages a standard LSTM encoder-decoder network to encode each query in the workload. Subsequently, it averages the query encodings to obtain the composition feature of the workload. • Features from the query text: • Features from the query plan: In addition to the static query text, some researchers model a query through looking into its execution process. When a query is submitted to the DBMS, the system first parses the query to check its syntax and semantics. After parsing, the query is compiled into an execution plan, which outlines the steps needed to execute the query efficiently. Some researchers represent a query through parsing its query plan to aggregate key features, including the cost or categories of operators, scanned tables, and predicates, and then featurizing the workload with the query vectors. In Qtune [9], each SQL query is represented with features including query type (e.g., insert, delete, select, update), involved tables, and the estimated processing cost derived from inherent operations (e.g., scan, hash join, aggregate). Then, it merges all the query vectors into a unified workload vector according to its predefined merging strategies for each part. • Features from the query plan: • Features from other query-related factors. In addition to the aforementioned query features, certain studies consider the underlying data distribution in the database as an informative factor, based on the observation that only data changes affecting the workload queries are relevant to the tuning policy. Building upon this premise, ONLINETUNE [3] extracts the database data distribution by leveraging insights from the DBMS optimizer, which encompass estimates of rows scrutinized by queries, the proportion of rows filtered by table conditions in queries, and the utilization of indexes. Subsequently, it computes the average of these three query features to derive the underlying data distribution associated with a workload. Finally, combined with the features extracted from query text, it concatenates the workload feature and underlying data feature to obtain the final contextual features. • Features from other query-related factors. This paper is available on arxiv under CC BY 4.0 DEED. This paper is available on arxiv under CC BY 4.0 DEED. available on arxiv

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Tuning DBMS: The Secret Life of Queries and Runtime Data

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

A New Testing Platform and Testing Platform Roles

The Noonification: Feature Optimization for Price Prediction (11/26/2023)

10 Ways to Optimize Your Database

10 Essential Computer Skills for Data Mining

10 Most Evolving Big Data Technologies to Catch Up on in 2022

Top 10 JavaScript Charting Libraries for Every Data Visualization Need

A New Testing Platform and Testing Platform Roles

The Noonification: Feature Optimization for Price Prediction (11/26/2023)

10 Ways to Optimize Your Database

10 Essential Computer Skills for Data Mining

10 Most Evolving Big Data Technologies to Catch Up on in 2022

Top 10 JavaScript Charting Libraries for Every Data Visualization Need

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps