## Cluster Configuration

| Item | Description |
| --- | --- |
| Quantity | 3 servers |
| Specifications | Alibaba Cloud ECS 16C64G |
| Slot Mode | Static, 50 slots |
| ST Memory Config | -Xms32g -Xmx32g -XX:MaxMetaspaceSize=8g |

## Exception Issues

Since April, there have been three occurrences of cluster split-brain, each involving a node entering a split-brain state or shutting down automatically. The core logs are as follows.

Master node: the Hazelcast monitoring thread prints a Slow Operation log. After the 60s Hazelcast heartbeat timeout, we see that node 198 has left the cluster.

Worker node: the node can no longer obtain heartbeats from the Hazelcast cluster members, with timeouts exceeding 60000ms, and it keeps attempting to reconnect to the cluster.

Afterward, any request sent to this node, such as a status query or job submission, hangs and returns no status. At this point the entire cluster becomes unavailable, entering a deadlock state, and the health check interfaces we wrote for the nodes are all unreachable. Because the outage occurred during peak morning hours, once we observed in the logs that the cluster had entered a split-brain state, we quickly restarted the cluster. Even after an initial round of parameter tuning, automatic node shutdown still occurred.
## Problem Analysis

The issue likely lies in Hazelcast cluster network setup failure, with the following possible causes:

1. NTP time of the ECS instances in the cluster is inconsistent;
2. Network jitter on the ECS instances in the cluster makes them unreachable;
3. SeaTunnel experiences Full GC, stalling the JVM and causing cluster setup to fail.

The first two causes were ruled out by our operations colleagues: no network problems were identified, the Alibaba Cloud NTP service is functioning normally, and the clocks are synchronized across all three servers.

Regarding the third cause, in the last Hazelcast health check log before the anomaly on node 198, we found that the cluster time offset was -100 milliseconds, which seems to have limited impact. So during subsequent startups we added JVM GC logging parameters to observe Full GC durations, and found that the worst case lasted 27s. Since the three nodes monitor each other via ping, this easily pushes Hazelcast past the 60s heartbeat timeout. We also discovered that when synchronizing a 1.4 billion-row ClickHouse table, Full GC is likely to occur after the job has been running for some time.

## Solution

### Increase ST Cluster Heartbeat Timeout

Hazelcast's cluster failure detector is responsible for determining whether a cluster member is unreachable or has crashed. But according to the well-known FLP result, in an asynchronous system it is impossible to distinguish a crashed member from a slow one. One way around this limitation is an unreliable failure detector.
An unreliable detector allows members to suspect others of failure, typically based on liveness criteria, though it may make mistakes.

Hazelcast provides two built-in failure detectors: the Deadline Failure Detector and the Phi Accrual Failure Detector; the Deadline Failure Detector is the default. There is also a Ping Failure Detector which, if enabled, works in parallel with the detectors above and can detect failures at OSI Layer 3 (the network layer); it is disabled by default.

#### Deadline Failure Detector

Uses an absolute timeout for missed/lost heartbeats. After the timeout, a member is considered crashed/unreachable and is marked as suspect. Relevant parameters and descriptions:

```yaml
hazelcast:
  properties:
    hazelcast.heartbeat.failuredetector.type: deadline
    hazelcast.heartbeat.interval.seconds: 5
    hazelcast.max.no.heartbeat.seconds: 120
```

| Configuration Item | Description |
| --- | --- |
| hazelcast.heartbeat.failuredetector.type | Cluster failure detector mode: deadline. |
| hazelcast.heartbeat.interval.seconds | Interval at which members send heartbeat messages to each other. |
| hazelcast.max.no.heartbeat.seconds | Timeout for suspecting a cluster member if no heartbeat is received. |

#### Phi Accrual Failure Detector

Tracks the intervals between heartbeats in a sliding time window, measures the mean and variance of these samples, and calculates a suspicion level (phi). As the time since the last heartbeat grows, the phi value increases. If the network becomes slow or unreliable, the mean and variance increase, so members are suspected only after a longer time without a heartbeat.
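The suspicion level can be sketched numerically. The following is a simplified phi calculation in the spirit of the original phi-accrual paper; Hazelcast's internal implementation may differ, and the sample values are illustrative only:

```python
import math

def phi(ms_since_last_heartbeat, mean_ms, std_dev_ms, min_std_dev_ms=100):
    """Simplified phi-accrual suspicion level (a sketch, not Hazelcast's code).

    Models heartbeat inter-arrival times as a normal distribution and returns
    phi = -log10(P(the next heartbeat arrives even later than now)).
    """
    std_dev = max(std_dev_ms, min_std_dev_ms)    # floor, like min.std.dev.millis
    z = (ms_since_last_heartbeat - mean_ms) / std_dev
    p_later = 0.5 * math.erfc(z / math.sqrt(2))  # P(X > t) for a normal distribution
    return -math.log10(max(p_later, 1e-300))     # clamp to avoid log10(0)

# With 1s heartbeats (mean 1000ms, observed std dev 100ms):
print(round(phi(1000, 1000, 100), 2))  # ~0.3: heartbeat right on time, no suspicion
print(phi(2000, 1000, 100) > 10)       # a heartbeat 1s late already exceeds threshold 10
```

A lower threshold crosses sooner (more aggressive detection, more false positives), while the min.std.dev.millis floor keeps a very stable network from making the detector overly trigger-happy.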
Relevant parameters and descriptions:

```yaml
hazelcast:
  properties:
    hazelcast.heartbeat.failuredetector.type: phi-accrual
    hazelcast.heartbeat.interval.seconds: 1
    hazelcast.max.no.heartbeat.seconds: 60
    hazelcast.heartbeat.phiaccrual.failuredetector.threshold: 10
    hazelcast.heartbeat.phiaccrual.failuredetector.sample.size: 200
    hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis: 100
```

| Configuration Item | Description |
| --- | --- |
| hazelcast.heartbeat.failuredetector.type | Cluster failure detector mode: phi-accrual. |
| hazelcast.heartbeat.interval.seconds | Interval at which members send heartbeat messages to each other. |
| hazelcast.max.no.heartbeat.seconds | Timeout for suspecting a member due to missed heartbeats. With an adaptive detector, this can be shorter than the timeout used for the deadline detector. |
| hazelcast.heartbeat.phiaccrual.failuredetector.threshold | Phi threshold above which a member is considered unreachable. A lower value detects failures more aggressively but increases false positives; a higher value is more conservative but detects actual failures more slowly. Default is 10. |
| hazelcast.heartbeat.phiaccrual.failuredetector.sample.size | Number of heartbeat samples to retain in history. Default is 200. |
| hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis | Minimum standard deviation used in the phi calculation's normal distribution. |

Reference docs:

- https://docs.hazelcast.com/hazelcast/5.1/system-properties
- https://docs.hazelcast.com/hazelcast/5.1/clusters/failure-detector-configuration
- https://docs.hazelcast.com/hazelcast/5.4/clusters/phi-accrual-detector

To improve accuracy, we followed the community recommendation to use phi-accrual in hazelcast.yml, and set the timeout to 180s:

```yaml
hazelcast:
  properties:
    # Newly added parameters
    hazelcast.heartbeat.failuredetector.type: phi-accrual
    hazelcast.heartbeat.interval.seconds: 1
    hazelcast.max.no.heartbeat.seconds: 180
    hazelcast.heartbeat.phiaccrual.failuredetector.threshold: 10
    hazelcast.heartbeat.phiaccrual.failuredetector.sample.size: 200
    hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis: 100
```

## GC Configuration Optimization

SeaTunnel uses the G1 garbage collector by default. The larger the heap, the more likely it is that YoungGC/MixedGC cannot reclaim enough memory efficiently (even with multiple threads), triggering frequent Full GCs. (Java 8 performs G1 Full GC single-threaded, which is very slow.) If multiple cluster nodes trigger Full GC at the same time, the chance of cluster networking failures rises sharply. Our goal, therefore, is to let YoungGC/MixedGC reclaim as much memory as possible with multithreaded collection.

Unoptimized parameters:

```
-Xms32g -Xmx32g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g -XX:+UseG1GC -XX:+PrintGCDetails -Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
```

So we first tried increasing the allowed GC pause time:

```
-- This parameter sets the target maximum pause time. The default is 200 milliseconds.
-XX:MaxGCPauseMillis=5000
```

Mixed garbage collections use this parameter together with historical GC durations to estimate how many regions can be collected within the target pause time.
If only a few regions are collected and the desired effect is not achieved, G1 has a special strategy: after a stop-the-world (STW) pause and collection, it resumes application threads, then pauses again and runs another mixed GC to collect a further portion of the regions, repeating up to -XX:G1MixedGCCountTarget times (default is 8). For example, if 400 regions need to be collected and the pause-time limit allows only 50 regions per pause, repeating the process 8 times collects them all while avoiding one long STW pause.

JVM parameters after the first optimization:

```
-Xms32g -Xmx32g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g -XX:+UseG1GC -XX:+PrintGCDetails -Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps -XX:MaxGCPauseMillis=5000
```

Mixed GC logs:

Mixed GC pause times (this parameter is only a target value) were all within the expected range:

Full GC logs:

However, Full GCs still could not be avoided, and each took about 20 seconds; these parameters only marginally improved GC performance. The logs also showed that during MixedGC the old generation was not being collected properly: a large amount of data remained in the old generation without being cleaned up. So we tried tuning the old-generation memory and several performance-related G1 GC parameters.
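The Full GC observations above came from scanning gc.log. Pulling pause durations out of that log is easy to script; here is a minimal sketch assuming the Java 8 -XX:+PrintGCDetails line format (the sample lines and the regex are illustrative, not taken from our actual logs):

```python
import re

# Hypothetical excerpt of a Java 8 -XX:+PrintGCDetails / -XX:+PrintGCDateStamps log;
# the exact message layout varies by JVM version, so treat the regex as a sketch.
SAMPLE_LOG = """\
2024-04-20T09:12:01.123+0800: 3600.101: [GC pause (G1 Evacuation Pause) (young), 0.2345678 secs]
2024-04-20T09:40:44.555+0800: 5323.533: [Full GC (Allocation Failure)  31G->20G(32G), 27.1234567 secs]
2024-04-20T10:02:10.001+0800: 6609.979: [GC pause (G1 Evacuation Pause) (mixed), 1.0456789 secs]
"""

FULL_GC = re.compile(r"\[Full GC.*?([0-9]+\.[0-9]+) secs\]")

def full_gc_pauses(log_text):
    """Return the Full GC pause durations (seconds) found in the log text."""
    return [float(m.group(1)) for m in FULL_GC.finditer(log_text)]

pauses = full_gc_pauses(SAMPLE_LOG)
print(pauses)            # [27.1234567]
print(max(pauses) > 20)  # pauses this long easily threaten a 60s heartbeat timeout
```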
Optimized parameters were as follows:

```
-Xms42g -Xmx42g -XX:GCTimeRatio=4 -XX:G1ReservePercent=15 -XX:G1HeapRegionSize=32M
```

- Heap memory (-Xms / -Xmx) increased from 32G to 42G, indirectly raising the old-generation ceiling, which should in theory reduce Full GC frequency.
- The CPU time share granted to GC threads (-XX:GCTimeRatio) increased from 10% to 20%. The share is computed as 100/(1+GCTimeRatio), so lowering the ratio gives GC more CPU time.
- Reserved space (-XX:G1ReservePercent) increased from 10% to 15%. An evacuation failure occurs when G1 cannot allocate new regions in the heap and falls back to a Full GC; increasing the reserve helps avoid this, though it reduces the usable old-generation space, which is another reason we enlarged the heap. This adjustment helps in the following cases:
  - no free regions are available when copying live objects from the young generation;
  - no free regions are available when evacuating live objects from the old generation;
  - no contiguous space is available in the old generation for allocating large objects.
- Heap region size (-XX:G1HeapRegionSize) set to 32MB to optimize large-object collection.
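As a quick check of the GCTimeRatio arithmetic above (the formula is from the HotSpot tuning docs; the values simply mirror our settings):

```python
def gc_cpu_share_percent(gc_time_ratio):
    """Target share of CPU time granted to GC, per 100/(1+GCTimeRatio)."""
    return 100 / (1 + gc_time_ratio)

print(gc_cpu_share_percent(9))  # 10.0 -> ratio 9 allows roughly 10% GC time
print(gc_cpu_share_percent(4))  # 20.0 -> our setting doubles the GC CPU budget

# G1ReservePercent=15 on a 42g heap keeps about 6.3g of regions in reserve
# as evacuation headroom:
print(round(42 * 0.15, 1))      # 6.3
```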
JVM parameters after the second optimization:

```
-Xms42g -Xmx42g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g -XX:+UseG1GC -XX:+PrintGCDetails -Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps -XX:MaxGCPauseMillis=5000 -XX:GCTimeRatio=4 -XX:G1ReservePercent=15
-XX:G1HeapRegionSize=32M
```

After this optimization we observed a noticeable decrease in the number of Full GCs that day, but we still had not reached the ideal state of zero Full GCs. Further log analysis showed that the concurrent marking phase consumed a lot of time and was frequently aborted. We applied the following parameter tuning:

```
-XX:ConcGCThreads=12 -XX:InitiatingHeapOccupancyPercent=50
```

- The number of GC threads running concurrently with the application (-XX:ConcGCThreads) increased from 4 to 12. The lower this value, the higher the application throughput, but too low a value makes concurrent GC cycles drag on; when the concurrent cycle is too long, increasing the thread count helps. The trade-off is fewer threads for application logic and thus lower throughput, but for offline data-sync workloads avoiding Full GC matters more, making this parameter crucial.
- The concurrent marking threshold (-XX:InitiatingHeapOccupancyPercent) increased from the default 45% to 50%, so concurrent marking now starts later, at a higher heap occupancy, which improved MixedGC performance in our tests.
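The IHOP change can be sanity-checked with simple arithmetic (a sketch; G1's actual region-level accounting is more involved):

```python
def marking_trigger_gib(heap_gib, ihop_percent):
    """Heap occupancy (GiB) at which G1 starts a concurrent marking cycle."""
    return heap_gib * ihop_percent / 100

# On our 42g heap:
print(marking_trigger_gib(42, 45))  # 18.9 -> default trigger point
print(marking_trigger_gib(42, 50))  # 21.0 -> our setting: marking starts later
```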
JVM parameters after the third optimization:

```
-Xms42g -Xmx42g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g -XX:+UseG1GC -XX:+PrintGCDetails -Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps -XX:MaxGCPauseMillis=5000 -XX:InitiatingHeapOccupancyPercent=50
-XX:+UseStringDeduplication -XX:GCTimeRatio=4 -XX:G1ReservePercent=15 -XX:ConcGCThreads=12
-XX:G1HeapRegionSize=32M
```

JVM tuning references:

- https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc_tuning.html#sthref56
- https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc.html#pause_time_goal
- https://zhuanlan.zhihu.com/p/181305087
- https://blog.csdn.net/qq_32069845/article/details/130594667

## Optimization Results

Since the configuration was optimized on April 26, no cluster split-brain incident has occurred, and service availability monitoring shows all nodes back to normal. After the JVM tuning on April 30, we went through the May Day holiday with zero Full GCs across the three nodes, and the system's health check interfaces detected no further stalls or anomalies. Although this came at the cost of some application-thread throughput, it secured the stability of the cluster, laying a solid foundation for the internal large-scale rollout of Zeta.