As Zoom’s business expanded and its data scenarios grew more complex, the company’s scheduling needs evolved from traditional batch processing to unified management of streaming jobs. To address this, Zoom selected Apache DolphinScheduler as its core scheduling framework and built a unified scheduling platform that supports both batch and stream tasks, deeply customized and optimized on modern infrastructure such as Kubernetes and multi-cloud deployment. In this article, we’ll dive into the system’s architectural evolution, the key challenges Zoom encountered, how they were solved, and the team’s future plans, all based on real-world production experience.

## Background & Challenges: Expanding from Batch to Streaming

In its early stages, Zoom’s data platform focused primarily on **Spark SQL batch processing**, with tasks scheduled on **AWS EMR** using DolphinScheduler’s standard plugins.

However, new business demands led to a surge in real-time processing needs, such as:

- Real-time metrics computation using **Flink SQL**
- **Spark Structured Streaming** for processing logs and event data
- Long-running streaming tasks requiring state tracking and fault recovery

This posed a new challenge for DolphinScheduler: **how can streaming tasks be “scheduled” and “managed” just like batch tasks?**

## Limitations of the Initial Architecture

### The Original Approach

In the early integration of streaming jobs, Zoom used DolphinScheduler’s **Shell task plugin** to call the AWS EMR API and launch streaming tasks (e.g., Spark/Flink).

This implementation was simple but quickly revealed several issues (see the sketch after this list):

- **No state control:** After submission, the task exited immediately without tracking status, causing duplicate submissions or false failures.
- **No task instances or logs:** Troubleshooting was difficult due to missing logs and observability.
- **Fragmented logic:** Streaming and batch jobs followed different logic paths, making unified maintenance hard.
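To make the “no state control” failure mode concrete, here is a minimal fire-and-forget sketch in the spirit of the original approach. It is a hypothetical reconstruction in Java against the AWS EMR API (the real integration used a Shell task); the cluster ID and artifact path are placeholders:

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsResult;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class FireAndForgetSubmit {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // Launch a streaming job as an EMR step via spark-submit.
        HadoopJarStepConfig sparkSubmit = new HadoopJarStepConfig()
                .withJar("command-runner.jar")
                .withArgs("spark-submit", "--deploy-mode", "cluster",
                          "s3://my-bucket/jobs/streaming-job.jar"); // placeholder artifact

        AddJobFlowStepsResult result = emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowId("j-XXXXXXXXXXXXX")                   // placeholder cluster ID
                .withSteps(new StepConfig()
                        .withName("streaming-job")
                        .withActionOnFailure("CONTINUE")
                        .withHadoopJarStep(sparkSubmit)));

        // The task ends here. The scheduler sees only "submitted", never the job's
        // real state, so a retry would submit the same streaming job a second time.
        System.out.println("Submitted steps: " + result.getStepIds());
    }
}
```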
These issues highlighted the urgent need for a **unified batch-stream scheduling architecture**.

## System Evolution: Introducing a State Machine for Streaming Jobs

To enable stateful scheduling of streaming jobs, Zoom designed a **two-stage task model** for streaming workloads based on DolphinScheduler’s task state machine capability (a code sketch follows this section):

### 1. Submit Task – Submission Phase

- Runs on the Dolphin Worker
- Submits Flink/Spark Streaming jobs to Yarn or Kubernetes
- Considered successful once the Yarn application enters the `Running` state
- Fails immediately if the submission fails

### 2. Track Status Task – Status Tracking Phase

- Runs on the Dolphin Master
- Periodically checks the running status of Yarn/Kubernetes jobs
- Implemented as an **independent task**, similar to a dependent task
- Continuously updates job status in DolphinScheduler’s metadata center

This two-task model effectively addresses several key issues:

- Prevents duplicate submissions
- Brings streaming jobs into a unified state and logging system
- Ensures architectural consistency with batch jobs for easier maintenance and scaling
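The sketch below illustrates the two phases for a YARN deployment. It is a simplified model, not DolphinScheduler’s actual plugin code: the `MetadataStore` interface is a hypothetical stand-in for the metadata center, while the `YarnClient` calls are the standard Hadoop YARN client API.

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;

import java.util.EnumSet;

/** Hypothetical sink for job states, standing in for DolphinScheduler's metadata center. */
interface MetadataStore {
    void updateJobState(ApplicationId appId, YarnApplicationState state);
}

public class TwoStageStreamingTask {

    private static final EnumSet<YarnApplicationState> TERMINAL = EnumSet.of(
            YarnApplicationState.FINISHED,
            YarnApplicationState.FAILED,
            YarnApplicationState.KILLED);

    /**
     * Phase 1 (runs on a Worker): the job has just been submitted (e.g. via
     * spark-submit / flink run); succeed as soon as the application is RUNNING.
     * Assumes a started YarnClient (createYarnClient / init / start).
     */
    public static boolean submitPhase(YarnClient yarn, ApplicationId appId)
            throws Exception {
        while (true) {
            YarnApplicationState state =
                    yarn.getApplicationReport(appId).getYarnApplicationState();
            if (state == YarnApplicationState.RUNNING) {
                return true;                 // submission phase succeeds; tracking takes over
            }
            if (TERMINAL.contains(state)) {
                return false;                // submission failed: fail immediately
            }
            Thread.sleep(5_000);             // NEW / SUBMITTED / ACCEPTED: keep waiting
        }
    }

    /** Phase 2 (runs on the Master): poll periodically and persist each observed state. */
    public static void trackStatusPhase(YarnClient yarn, ApplicationId appId,
                                        MetadataStore store) throws Exception {
        YarnApplicationState state;
        do {
            state = yarn.getApplicationReport(appId).getYarnApplicationState();
            store.updateJobState(appId, state);   // unified state + logging system
            Thread.sleep(30_000);
        } while (!TERMINAL.contains(state));
    }
}
```

Because the two phases are separate tasks, a crash between them cannot re-run the submission: the Master simply resumes tracking the already-running application.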
## High Availability: Handling Master/Worker Failures

In large-scale production, system stability is critical, so Zoom implemented robust fault tolerance for the **DolphinScheduler Master and Worker nodes**.

### 1. Worker Failure Recovery

If a **Submit Task** is running when its Worker crashes:

- The original task instance is logically deleted
- A new task instance is created and assigned to a healthy Worker
- The previously submitted Yarn application is **not forcibly killed**

If a **Track Status Task** is running:

- No re-scheduling is needed
- Because the task runs on the Master, a Worker failure does not affect status tracking

### 2. Master Failure Recovery

- Uses **ZooKeeper + MySQL** for fault tolerance
- Multiple Master nodes are deployed, with a distributed lock for leader election
- When a Master node fails:
  - The active node is switched over automatically
  - All status-tracking tasks are reloaded and resumed
  - **Idempotent checks** and **logical deletions** are key to preventing task duplication

In summary, this architecture achieves:

- **Advantage 1:** Leverages DolphinScheduler’s workflow and task state machine features, preventing duplicate job submissions
- **Advantage 2:** Easier debugging and issue resolution; streaming jobs now have task instances and logs like batch jobs, with support for log search and fault diagnostics
- **Advantage 3:** A unified architecture for streaming and batch jobs, improving maintainability and consistency across systems

## Unified Spark and Flink Scheduling on Kubernetes

Zoom has migrated both batch and streaming jobs to Kubernetes, using the **Spark Operator** and **Flink Operator** for cloud-native task orchestration.

### Architecture Overview

- Spark/Flink jobs are submitted as `SparkApplication` or `FlinkDeployment` custom resources (CRs)
- DolphinScheduler creates and manages these CRs (see the sketch after this list)
- Task status is synced via the Operator and the Kubernetes API server
- Dolphin Master and Worker pods continuously track pod status through the state machine and reflect it in the scheduling system
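To make “DolphinScheduler creates and manages these CRs” concrete, here is a minimal sketch using the fabric8 Kubernetes client against the Spark Operator’s `SparkApplication` CRD. The namespace, job name, and image are placeholders, and this is a simplified illustration rather than Zoom’s actual integration code:

```java
import io.fabric8.kubernetes.api.model.GenericKubernetesResource;
import io.fabric8.kubernetes.api.model.GenericKubernetesResourceBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.dsl.base.ResourceDefinitionContext;

import java.util.Map;

public class SparkApplicationSubmitter {
    public static void main(String[] args) {
        // Describe the SparkApplication CRD served by the Spark Operator.
        ResourceDefinitionContext sparkAppCrd = new ResourceDefinitionContext.Builder()
                .withGroup("sparkoperator.k8s.io")
                .withVersion("v1beta2")
                .withKind("SparkApplication")
                .withPlural("sparkapplications")
                .withNamespaced(true)
                .build();

        // Build the CR. Spec values below are placeholders.
        GenericKubernetesResource app = new GenericKubernetesResourceBuilder()
                .withApiVersion("sparkoperator.k8s.io/v1beta2")
                .withKind("SparkApplication")
                .withNewMetadata()
                    .withName("streaming-job")             // placeholder job name
                    .withNamespace("spark-jobs")           // placeholder namespace
                .endMetadata()
                .addToAdditionalProperties("spec", Map.of(
                        "type", "Scala",
                        "mode", "cluster",
                        "image", "example.com/spark:3.5",  // placeholder image
                        "mainApplicationFile", "local:///opt/jobs/streaming-job.jar"))
                .build();

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Create the CR; the Operator reconciles it into driver/executor pods.
            client.genericKubernetesResources(sparkAppCrd)
                  .inNamespace("spark-jobs")
                  .resource(app)
                  .create();

            // Status tracking reads the CR back and inspects .status via the API server.
            GenericKubernetesResource current = client.genericKubernetesResources(sparkAppCrd)
                    .inNamespace("spark-jobs")
                    .withName("streaming-job")
                    .get();
            System.out.println("Status: " + current.getAdditionalProperties().get("status"));
        }
    }
}
```

The `FlinkDeployment` CRD from the Flink Kubernetes Operator can be driven the same way with its own `ResourceDefinitionContext` (group `flink.apache.org`, version `v1beta1`).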
### Multi-Cloud Cluster Scheduling

- Supports scheduling across Kubernetes clusters on multiple clouds (e.g., Cloud X / Cloud Y)
- Scheduling logic and resource management are fully decoupled across clusters
- Enables **cross-cloud, unified management** of batch and stream tasks

## Online Issues and Mitigation Strategies

### Issue 1: Task Duplication Due to Master Crash

DolphinScheduler’s distributed locks are non-blocking, which creates race conditions during Master failover.

**Fixes:**

- Add a lock-acquisition timeout
- Enforce idempotent control for Submit Tasks to avoid duplicate submissions
- Validate task status before restoring from MySQL

### Issue 2: Workflow Stuck in the `READY_STOP` State

**Cause:**

- The Dolphin API lacked optimistic locking when stopping workflows
- Race conditions during multi-threaded state updates left workflows stuck

**Improvements** (a sketch of the optimistic-locking pattern follows):

- Add optimistic locks at the API layer
- Refactor long-transaction logic
- Add multiple layers of state verification when the Master updates task status
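For context on the fix for Issue 2, here is a minimal JDBC sketch of the optimistic-locking pattern on a versioned state column. The table and column names are hypothetical, not DolphinScheduler’s actual schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OptimisticStateUpdate {
    /**
     * Attempts to move a workflow instance to a new state. Returns false if
     * another thread changed the row since we read it (version mismatch),
     * in which case the caller should re-read and re-validate the state.
     */
    public static boolean tryUpdateState(Connection conn, long instanceId,
                                         String newState, int expectedVersion)
            throws SQLException {
        String sql = "UPDATE workflow_instance "      // hypothetical table
                   + "SET state = ?, version = version + 1 "
                   + "WHERE id = ? AND version = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, newState);
            ps.setLong(2, instanceId);
            ps.setInt(3, expectedVersion);
            return ps.executeUpdate() == 1;           // 0 rows => lost the race
        }
    }
}
```

An update that matches a stale version affects zero rows, so the losing thread re-reads the row and re-validates the state instead of silently overwriting a concurrent stop request.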
## Future Plans

Zoom plans to further optimize DolphinScheduler to meet increasingly complex production demands. The main areas of focus include:

### 1. Asynchronous Task Mechanism

- Decouple submission and status-tracking logic
- Allow Worker nodes to execute tasks asynchronously, avoiding long resource blocks
- Lay the foundation for elastic scheduling and advanced dependency handling

### 2. An Upgraded Unified Batch-Stream Scheduling Platform

- Workflow templates will support mixed task types
- Fully unified logs, states, and monitoring
- Enhanced cloud-native capabilities, building a **distributed computing hub for large-scale production scheduling**

## Final Thoughts

Zoom’s in-depth practice with DolphinScheduler demonstrates the platform’s scalability, stability, and architectural flexibility as an enterprise-grade scheduler. Especially in **unified batch-stream scheduling**, **cloud-native deployment on Kubernetes**, and **multi-cluster fault tolerance**, Zoom’s architecture offers valuable lessons for the community and other enterprise users.

📢 We warmly welcome more developers to join the Apache DolphinScheduler community: share your insights and experiences, and help us build the next-generation open-source scheduler together!

GitHub: https://github.com/apache/dolphinscheduler