In JDBC source batch processing tasks, if table_path and partition_column are configured, the engine will dynamically partition the data. We can optimize partition intervals by analyzing sample data to avoid data skew issues. However, it has been observed that even when where_condition is configured, the dynamic partitioning algorithm still partitions the entire table. This leads to excessive time consumption in the partitioning phase, especially when retrieving a small amount of data from a large table. To address this, we need to modify the relevant processes to optimize performance. All SQL statements in the following explanations use MySQL as an example. Different data sources have specific subclass methods implemented via overwriting. Key Problem Analysis 55GB MySQL Table Case:
Original implementation took 25 minutes to read 1000 rows
Optimized version reduced to23 seconds (PR #8760) 1.1 Main Process of Data Partitioning The entry point for dynamic partitioning in the code is the splitTableIntoChunks method in the DynamicChunkSplitter class. The red-highlighted boxes in the flowchart indicate the parts that need modifications, which will be detailed in the following sub-processes. 1.1.1 Querying Minimum and Maximum Values It is necessary to include the where_condition configuration in the source and append it to the query. 1.1.2 General Column Partitioning 1. Querying the Total Number of Records Add a condition to ensure that the “yes” branch is executed only when where_condition is empty.
Modify the “no” branch to include where_condition validation and concatenate it into the corresponding query statement. The query rules are as follows:
If query is configured, use: SELECT COUNT(*) FROM (<subquery>) T Otherwise, use: SELECT COUNT(*) FROM <table_name> If where_condition is configured, append it to the end of the query. 2. Partitioning Data Ranges Refer to the following sub-process for details. 1.1.2.1 Pagination-Based Partitioning Querying the next partition boundary (nextChunkEnd) Max Query Section If where_condition is configured, append it to the LIMIT query layer. Min Query Section If where_condition is configured, append it to the query. 1.1.2.2 Sample-Based Partitioning If where_condition is configured, append it to the query. 1.1.3 Date Column Partitioning This section reuses the logic from 1.1.2.1, requiring only a single modification. 1.2 Usage of Partitions This section does not require modifications. The analysis here aims to understand how partitions are utilized, ensuring the correctness, necessity, and risks of the previous modifications. Once the data is partitioned, it is distributed to the worker’s SourceSeaTunnelTask. Finally, it is used in the open method of the JdbcInputFormat class. The primary process is as follows: From the process above, it is evident that it where_condition is appended to the generated SQL in the final step. If it where_condition is not considered during partition generation, some partitions may end up querying no data when where_condition is applied. When numerous such partitions exist, it not only impacts partitioning performance but also degrades data retrieval performance due to a high volume of ineffective queries. 1.3 Optimization Results After optimization, local testing showed that the time required to filter and retrieve 1,000 rows from a 55GB MySQL table using where_condition and write them to an Oracle table reduced from 25 minutes to 23 seconds. The corresponding PR for this optimization can be found here: PR #8760. In JDBC source batch processing tasks, if table_path and partition_column are configured, the engine will dynamically partition the data. We can optimize partition intervals by analyzing sample data to avoid data skew issues. However, it has been observed that even when where_condition is configured, the dynamic partitioning algorithm still partitions the entire table. This leads to excessive time consumption in the partitioning phase, especially when retrieving a small amount of data from a large table. To address this, we need to modify the relevant processes to optimize performance. table_path partition_column where_condition All SQL statements in the following explanations use MySQL as an example. Different data sources have specific subclass methods implemented via overwriting. Key Problem Analysis 55GB MySQL Table Case:
Original implementation took 25 minutes to read 1000 rows
Optimized version reduced to23 seconds (PR #8760) 55GB MySQL Table Case: Original implementation took 25 minutes to read 1000 rows Optimized version reduced to23 seconds (PR #8760) 55GB MySQL Table Case : 55GB MySQL Table Case Original implementation took 25 minutes to read 1000 rows 25 minutes Optimized version reduced to 23 seconds ( PR #8760 ) 23 seconds PR #8760 1.1 Main Process of Data Partitioning The entry point for dynamic partitioning in the code is the splitTableIntoChunks method in the DynamicChunkSplitter class. The red-highlighted boxes in the flowchart indicate the parts that need modifications, which will be detailed in the following sub-processes. splitTableIntoChunks DynamicChunkSplitter 1.1.1 Querying Minimum and Maximum Values It is necessary to include the where_condition configuration in the source and append it to the query. where_condition 1.1.2 General Column Partitioning 1. Querying the Total Number of Records Add a condition to ensure that the “yes” branch is executed only when where_condition is empty. Modify the “no” branch to include where_condition validation and concatenate it into the corresponding query statement. The query rules are as follows: If query is configured, use: Add a condition to ensure that the “yes” branch is executed only when where_condition is empty. where_condition Modify the “no” branch to include where_condition validation and concatenate it into the corresponding query statement. The query rules are as follows: where_condition If query is configured, use: query SELECT COUNT(*) FROM (<subquery>) T SELECT COUNT(*) FROM (<subquery>) T Otherwise, use: Otherwise, use: SELECT COUNT(*) FROM <table_name> SELECT COUNT(*) FROM <table_name> If where_condition is configured, append it to the end of the query. If where_condition is configured, append it to the end of the query. where_condition 2. Partitioning Data Ranges Refer to the following sub-process for details. 1.1.2.1 Pagination-Based Partitioning Querying the next partition boundary ( nextChunkEnd ) Querying the next partition boundary ( nextChunkEnd ) Max Query Section Max Query Section Max Query Section If where_condition is configured, append it to the LIMIT query layer. If where_condition is configured, append it to the LIMIT query layer. If where_condition is configured, append it to the LIMIT query layer. where_condition LIMIT Min Query Section Min Query Section Min Query Section If where_condition is configured, append it to the query. If where_condition is configured, append it to the query. where_condition 1.1.2.2 Sample-Based Partitioning If where_condition is configured, append it to the query. If where_condition is configured, append it to the query. where_condition 1.1.3 Date Column Partitioning This section reuses the logic from 1.1.2.1 , requiring only a single modification. 1.1.2.1 1.2 Usage of Partitions This section does not require modifications. The analysis here aims to understand how partitions are utilized, ensuring the correctness, necessity, and risks of the previous modifications. Once the data is partitioned, it is distributed to the worker’s SourceSeaTunnelTask . Finally, it is used in the open method of the JdbcInputFormat class. The primary process is as follows: SourceSeaTunnelTask open JdbcInputFormat From the process above, it is evident that it where_condition is appended to the generated SQL in the final step. If it where_condition is not considered during partition generation, some partitions may end up querying no data when where_condition is applied. When numerous such partitions exist, it not only impacts partitioning performance but also degrades data retrieval performance due to a high volume of ineffective queries. where_condition where_condition where_condition 1.3 Optimization Results After optimization, local testing showed that the time required to filter and retrieve 1,000 rows from a 55GB MySQL table using where_condition and write them to an Oracle table reduced from 25 minutes to 23 seconds . where_condition 25 minutes to 23 seconds The corresponding PR for this optimization can be found here: PR #8760 . PR #8760

Column

Oracle

This Code Change Sped Up MySQL Queries from 25 Minutes to 23 Seconds

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

AI’s Data Bottleneck Is Getting Worse—Here’s a New Way to Fix It

10 Minute Guide to Fixing Damaged SQL Databases - No Recovery Required!

10 Free Resources to Become a Health Data Scientist

10 Best Online Courses to Learn Oracle and PL/SQL for Beginners

The Noonification: A Deep Dive Into Semantic Segmentation Evaluation Metrics (1/28/2024)

The Noonification: Whats Up With AI Regulations? (1/20/2024)

AI’s Data Bottleneck Is Getting Worse—Here’s a New Way to Fix It

10 Minute Guide to Fixing Damaged SQL Databases - No Recovery Required!

10 Free Resources to Become a Health Data Scientist

10 Best Online Courses to Learn Oracle and PL/SQL for Beginners

The Noonification: A Deep Dive Into Semantic Segmentation Evaluation Metrics (1/28/2024)

The Noonification: Whats Up With AI Regulations? (1/20/2024)

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps