In our day-to-day big-data job orchestration, Apache DolphinScheduler has become one of our most critical tools. We used to run it on bare metal (v3.1.9 still sat on physical machines), but that approach exposed gaps in elastic scaling, resource isolation, and operational efficiency. As the company's cloud-native strategy accelerated, we finally upgraded DolphinScheduler to 3.2.2 in 2025 and partially migrated it to Kubernetes.

The motivation was crystal clear: first, elastic scaling, since K8s can spin up extra Worker pods at peak load; second, resource isolation, so jobs don't clobber each other; third, automated rollout and rollback, slashing maintenance costs; and finally, and most importantly, alignment with our cloud-native direction.

## Image Build: From Source to Modules

The first step of the migration was image construction. We prepared a base image containing Hadoop, Hive, Spark, Flink, Python, etc., then built DolphinScheduler's base image on top, bundling the recompiled modules and the MySQL driver.

Note: MySQL stores DolphinScheduler's metadata, so the driver JAR must be symlinked into every module: `dolphinscheduler-tools`, `dolphinscheduler-master`, `dolphinscheduler-worker`, `dolphinscheduler-api`, and `dolphinscheduler-alert-server` (a sketch of this step follows the External MySQL section below).

Module images are customized on top of the base DS image, mainly tweaking ports and configs. To minimize later changes, we kept the image names identical to the official ones.

You can build a single module:

```bash
./build.sh worker-server
```

or batch-build everything:

```bash
./build-all.sh
```

Typical headaches: a huge base image leading to slow builds; refactored JARs not overwriting old ones; mismatched port configs and start scripts across modules. Overlook any of these and you'll suffer later.

## Deployment: From Hand-Rolled YAML to the Official Helm Chart

Early on, we hand-wrote YAMLs, which was painful for config sprawl and upgrades. We switched to the official Helm chart for centralized configs and smoother upgrades. Our K8s cluster version is v1.25. First, create the namespace and pull the chart:

```bash
kubectl create ns dolphinscheduler
helm pull oci://registry-1.docker.io/apache/dolphinscheduler-helm --version 3.2.2
```

`values.yaml` is where the dragons hide. Key snippets:

### 1. Image

```yaml
image:
  registry: my.private.repo
  repository: dolphinscheduler
  tag: 3.2.2
  pullPolicy: IfNotPresent
```

💡 Pre-push utility images to your private repo to avoid network hiccups.

### 2. External MySQL

```yaml
mysql:
  enabled: false  # disable embedded MySQL
externalMysql:
  host: mysql.prod.local
  port: 3306
  username: ds_user
  password: ds_password
  database: dolphinscheduler
```

💡 Always disable the built-in DB; prod uses external MySQL (future plan: migrate to PostgreSQL).
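Since every module talks to this metadata DB, the driver symlink called out in the image-build Note above is load-bearing, and a wrong driver path was one of our startup failures. Here is a minimal sketch of that step as a shell snippet; the install prefix, module directory names, and connector version are all assumptions, so adjust them to your own image layout:

```bash
#!/usr/bin/env bash
# Hypothetical layout: DS_HOME, the module directory names, and the connector
# version all depend on your base image; adjust before baking this into a Dockerfile.
DS_HOME=/opt/dolphinscheduler
MYSQL_JAR=/opt/soft/mysql-connector-j-8.0.33.jar

# Every service reads the metadata DB, so each module needs the driver on its classpath.
for module in tools master-server worker-server api-server alert-server; do
  ln -sfn "${MYSQL_JAR}" "${DS_HOME}/${module}/libs/$(basename "${MYSQL_JAR}")"
done
```

We bake this into the image rather than the entrypoint, so every module ships with the driver already in place.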
### 3. LDAP Auth

```yaml
ldap:
  enabled: true
  url: ldap://ldap.prod.local:389
  userDn: cn=admin,dc=company,dc=com
  password: ldap_password
  baseDn: dc=company,dc=com
```

💡 Single sign-on via corporate LDAP simplifies permission management.

### 4. Shared Storage

```yaml
sharedStoragePersistence:
  enabled: true
  storageClassName: nfs-rwx
  size: 100Gi
  mountPath: /dolphinscheduler/shared
```

💡 `storageClassName` must support ReadWriteMany, or multiple Workers can't share files.

### 5. HDFS

```yaml
hdfs:
  defaultFS: hdfs://hdfs-nn:8020
  path: /dolphinscheduler
  rootUser: hdfs
```

💡 Ensure big-data paths like `/opt/soft` exist beforehand.

### 6. Zookeeper

```yaml
zookeeper:
  enabled: false  # disable embedded ZK
externalZookeeper:
  quorum: zk1.prod.local:2181,zk2.prod.local:2181,zk3.prod.local:2181
```

💡 When using external ZK, disable the built-in one and verify version compatibility.
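With `values.yaml` settled, installation is a single Helm command. A minimal sketch, assuming the chart archive pulled above sits in the working directory (Helm names it `{chart}-{version}.tgz`) and `dolphinscheduler` as our chosen release name:

```bash
# Install the release with our overrides; the same command applies upgrades later.
helm upgrade --install dolphinscheduler ./dolphinscheduler-helm-3.2.2.tgz \
  --namespace dolphinscheduler \
  --values values.yaml

# Watch the rollout: master, worker, api, and alert-server pods should all reach Running.
kubectl --namespace dolphinscheduler get pods --watch
```

Driving both install and upgrade from one values file is exactly the centralization that let us drop the hand-rolled YAML.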
## Pitfalls & Maintenance Battles

We stepped on plenty of rakes:

**Image issues**

- Base image too fat → slow CI
- Module deps diverged → duplicate installs
- MySQL driver path wrong → startup failures
- Custom JARs that didn't overwrite old ones → runtime exceptions
- Port & script mismatches between modules

**Helm values.yaml gotchas**

- `sharedStoragePersistence.storageClassName` must be RWX-capable
- Storage size, mountPath, and config-path indentation errors
- Disable defaults you don't need (e.g., built-in ZK) and mind version requirements

**Upgrade & maintenance cost**

Every new DolphinScheduler release forces us to diff our custom patches, rebuild every module image, and re-test. Version drift in config keys makes upgrades and rollbacks fragile, stretching release cycles and burning team hours.

## Roadmap & Thoughts

To cut long-term OPEX, we're standardizing:

- Migrate the metadata DB from MySQL → PostgreSQL
- Move to vanilla community images instead of custom ones
- Shift remaining prod workloads to K8s
- Introduce full CI/CD with Prometheus + Grafana observability

K8s gives DolphinScheduler far better elasticity, isolation, and portability than bare metal ever could. Custom images and configs did hurt, but as we converge on community releases and standardized ops, the pain will fade and velocity will rise. Our end goal: a highly available, easily extensible, unified scheduling platform that truly unlocks cloud-native value.

If you're also considering moving your scheduler to K8s, hit the comments or join the DolphinScheduler community, and let's dig together!