## I. Why Did We Abandon Azkaban?

At the beginning, we chose LinkedIn's open-source Azkaban for scheduling, mainly for two features we valued: first, the clean interface and simple operation; second, the use of "projects" to manage tasks, which felt very intuitive. At the time, the team was just starting to build the data platform, and this lightweight, clear tool matched our needs perfectly.

There were other reasons as well:

- Active community (at the time)
- Simple deployment with few dependencies (only MySQL + Web Server + Executor)
- Job file–defined dependencies, suitable for DAG scenarios

However, as the business scaled, Azkaban's shortcomings gradually surfaced.

**Lack of an automatic task failure retry mechanism**

Azkaban's retry strategy is extremely primitive: either manually rerun the task, or trigger reruns by polling status through external scripts. We once had a Hive task fail due to a temporary resource shortage, which blocked more than 20 downstream tasks and forced on-call engineers to intervene in the middle of the night.

**Coarse-grained permission control**

Azkaban's permission model only supports project-level read or write. It cannot express "User A may modify Task X but not Task Y." When multiple teams share the same scheduling platform, this permission chaos frequently leads to misoperations.

**No task version management**

Every modification of a job file overwrites history, with no rollback. We once spent two days investigating incorrect ETL results caused by a single parameter change, simply because there was no version traceability.

**Poor extensibility**

Azkaban's plugin mechanism is honestly underwhelming. Integrating enterprise WeChat alerts, syncing with an internal CMDB, or supporting Spark on K8s basically all require source-code changes. Meanwhile, community updates are slow, and GitHub issues pile up and often go unanswered.

> Reflection: Azkaban works fine for small teams with simple workloads. But once the data platform scales up and more teams join, you will quickly run into its architectural limitations, and the pain points keep piling up.

## II. Why Choose DolphinScheduler?

By the end of 2022, we began evaluating alternatives, comparing Airflow, XXL-JOB, DolphinScheduler, and other popular schedulers. We ultimately selected DolphinScheduler (hereafter DS), mainly for the following reasons.

**Rich built-in task types**

DS ships with more than a dozen task types out of the box, including Shell, SQL, Spark, Flink, DataX, and Python, and it supports custom plugins. No more writing wrapper scripts for everything.
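To give a feel for this, here is a minimal sketch using pydolphinscheduler, the Python SDK that accompanies DS. Class and parameter names (`ProcessDefinition`, `Shell`, `Sql`, `datasource_name`) follow the SDK's 3.x tutorial and may differ slightly between versions; the workflow and datasource names are made up, so treat this as illustrative rather than a drop-in definition:

```python
# Minimal sketch: a two-task workflow defined in code.
# Assumes DS 3.x with the pydolphinscheduler SDK; names follow its
# tutorial and may vary between SDK versions.
from pydolphinscheduler.core.process_definition import ProcessDefinition
from pydolphinscheduler.tasks.shell import Shell
from pydolphinscheduler.tasks.sql import Sql

with ProcessDefinition(name="demo_etl", tenant="tenant_bi") as pd:
    extract = Shell(name="extract_logs", command="sh /opt/etl/extract.sh")
    load = Sql(
        name="load_dwd",
        datasource_name="hive_prod",  # a datasource registered in the DS UI
        sql="INSERT OVERWRITE TABLE dwd.user_log SELECT * FROM ods.user_log",
    )
    extract >> load  # DAG edge: load runs only after extract succeeds
    pd.submit()      # registers the workflow with the DS API server
```

The same workflow can of course be built in the drag-and-drop UI; the SDK route is simply handy for code review and bulk migration.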
**Comprehensive failure-handling mechanism**

- Task-level retry, with configurable retry count and interval
- Failure alerts (email, DingTalk, enterprise WeChat)
- Per-task choice of "skip after failure" or "terminate workflow on failure"

**Fine-grained permission control**

Permission management in DS is very detailed: permissions can be set at the tenant, project, workflow, and even individual task level, which makes collaboration much safer, especially across multiple teams.

**Visual DAG + version management**

- Drag-and-drop DAG editing with dependencies, conditional branches, and subprocesses
- Every workflow release automatically saves a version, with rollback to any historical version

**Active Chinese community**

As an Apache top-level project, DS has a large user base in China, complete documentation, and quick responses. Several of our production issues were answered in the community within 24 hours.

## III. Real Migration Case: From Azkaban to DolphinScheduler

### Background

- Original system: Azkaban 3.80, around 150 workflows, 800+ daily tasks
- Goal: smooth migration to DS 3.1.2 with no impact on business data output

### Migration steps

**1. Task inventory and classification**

- Perform a full inventory of existing Azkaban jobs; classify by type (e.g., Shell scripts, Hive SQL, Spark jobs), then focus on identifying strong dependencies and mapping complete upstream-downstream relationships
- Mark strong dependency chains (e.g., A → B → C)
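The inventory step is easy to automate, because Azkaban `.job` files are Java-properties style (`type=command`, `dependencies=job_a,job_b`). The sketch below is a reconstruction rather than our production code; the root path and output format are illustrative:

```python
# Sketch: inventory Azkaban .job files and extract type + dependency edges.
import os
from collections import defaultdict

def parse_job_file(path: str) -> dict:
    """Parse key=value pairs from an Azkaban .job file."""
    props = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

def inventory(root: str):
    by_type = defaultdict(list)  # task type -> job names
    upstream = {}                # job name -> list of dependencies
    for dirpath, _, files in os.walk(root):
        for name in files:
            if not name.endswith(".job"):
                continue
            job = name[: -len(".job")]
            props = parse_job_file(os.path.join(dirpath, name))
            by_type[props.get("type", "unknown")].append(job)
            deps = props.get("dependencies", "")
            upstream[job] = [d.strip() for d in deps.split(",") if d.strip()]
    return by_type, upstream

by_type, upstream = inventory("/data/azkaban/projects")  # illustrative path
for job, deps in upstream.items():
    for dep in deps:
        print(f"{dep} -> {job}")  # one edge of the dependency DAG
```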
**2. DS environment deployment and testing**

- Deploy the DS cluster (Master + Worker + API Server + Alert Server)
- Create tenants, users, and projects, and configure resource queues (YARN)

**3. Task refactoring and validation**

- Convert Azkaban's `.job` files into DS workflow definitions
- Handle the key conversions: parameter passing (Azkaban uses `${}`; DS also uses `${}`, but the syntax differs slightly) and dependency logic (Azkaban uses a `dependencies` property; DS uses DAG edges)
- Run full workflows in the test environment and verify data consistency

**4. Gray release switching**

- First migrate non-core report jobs (e.g., operations daily reports)
- Observe for one week, then gradually migrate core pipelines (e.g., user behavior ETL)
- Finally switch everything over, keeping Azkaban in read-only mode for one month for traceability

### Pitfalls we encountered

**Pitfall 1: Inconsistent parameter passing**

In Azkaban, `${date}` automatically injects the current date, whereas DS requires explicitly defining global parameters or using system built-ins like `${system.datetime}`. We wrote a script to convert the parameter syntax automatically.
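A simplified reconstruction of that conversion script follows. The token mapping is illustrative (only the `${date}` → `${system.datetime}` pair comes from the example above); extend it with whatever tokens your jobs actually use, and verify each DS built-in before relying on it:

```python
# Sketch: rewrite Azkaban-style parameter tokens to DS equivalents.
import re

# Illustrative mapping; extend with the tokens your own jobs use.
TOKEN_MAP = {
    "${date}": "${system.datetime}",
}

def convert_params(script: str) -> str:
    """Replace known Azkaban tokens and report leftovers for manual review."""
    for old, new in TOKEN_MAP.items():
        script = script.replace(old, new)
    leftovers = set(re.findall(r"\$\{[^}]+\}", script)) - set(TOKEN_MAP.values())
    for token in sorted(leftovers):
        print(f"WARN: unmapped parameter {token}, needs manual review")
    return script

print(convert_params("hive -f daily.sql --hivevar dt=${date} --hivevar env=${env}"))
```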
**Pitfall 2: Resource isolation issues**

Previously, everything ran in the same YARN queue; long-running jobs hogged all the resources while small jobs queued forever. We later allocated separate users and queues per business line, and the jobs finally stopped interfering with each other.

**Pitfall 3: Alert storms**

At first, every minor task failure triggered an alert, hundreds a day, which was overwhelming. Later we tuned the strategy: core jobs alert immediately, while non-core jobs go into a daily summary instead. Much cleaner.

## IV. Practical Suggestions for Future Migrators

**Don't blindly pursue "big and comprehensive"**

If you only have a few dozen Shell tasks, Cron plus simple monitoring may well be more efficient. Scheduling systems carry operational costs; evaluate the ROI first.

**Take permissions and tenant design seriously**

Plan the tenant structure from day one (e.g., by business line), or chaos will follow later. Enable workflow approval for key task changes.

**Establish workflow health indicators**

- Task failure rate
- Average runtime fluctuation
- Dependency blocking frequency

We use Prometheus + Grafana to monitor these and detect risks early.

**Make good use of SubProcess**

Complicated DAGs quickly become a messy tangle. Pack reusable logic (e.g., data quality checks, log archiving) into subprocesses: easier to reuse, easier to maintain.

**Backup and disaster recovery are mandatory**

- Regularly back up the DS metadata database (MySQL/PostgreSQL)
- Configure multi-Master HA
- Enable cross-cluster DR for critical workflows (fail over if the primary cluster goes down)

## V. Quick Comparison Table: Azkaban vs DolphinScheduler

| Capability | Azkaban | DolphinScheduler |
| --- | --- | --- |
| Task retry | ❌ (manual rerun required) | ✅ (configurable) |
| Fine-grained permissions | ❌ (project-level only) | ✅ (task-level) |
| Version control | ❌ | ✅ |
| Built-in task types | Limited (mainly Shell) | Diverse (including Spark/Flink) |
| Community activity (2025) | Low | ✅ High (Apache project) |
| Visual DAG | Weak (dependency graph only) | ✅ |

Technology selection for a tool is not the finish line, but the starting point for continuous optimization.