
NetEaseMail Slashes Dev Time by More Than Half With DolphinScheduler

by William Guo | March 28th, 2025

Too Long; Didn't Read

NetEase Mail faced challenges managing massive volumes of data and complex task scheduling, so it brought in Apache DolphinScheduler to streamline the process.

With the rapid development of the Internet, email, one of the most important tools for information exchange, has seen ever-growing demands for data processing and task scheduling. As a leading email service provider in China, NetEase Mail introduced the Apache DolphinScheduler platform to better cope with the challenges of processing massive amounts of data and scheduling tasks, and has since carried out in-depth deployment and optimization work in production.

Project Background and Selection

Development History of NetEase Mail

Since its launch in 1997, NetEase Mail has gone through multiple important stages of development, evolving from 126 Mail and 163 Mail to Mail Master while continuously enriching its product lines and service offerings. Today, NetEase Mail has built a diversified business system that includes free mail, enterprise mail, VIP mail, and more, providing stable, secure, and efficient email services to a massive user base.


As part of this evolution, NetEase Mail introduced Apache DolphinScheduler in 2023.

Data Application Scenarios

In its daily operations, NetEase Mail needs to process massive amounts of business log data. This data requires permanent cold backup as well as hot storage for more than half a year, and business logs are processed and stored along two separate paths: offline (HDFS) and real-time (ClickHouse). Meanwhile, to ensure business availability, critical links such as core email sending/receiving and user login authentication require efficient data processing and task scheduling support.

Background of Selection and Advantages of DolphinScheduler

In the face of ever-growing data processing needs, the NetEase Mail team compared multiple open-source scheduling platforms. DolphinScheduler stood out due to its support for various scheduling mechanisms, strong stability, high ease of use, and rich functionality. It is able to support a variety of data scenarios, meeting the complex and diverse task scheduling requirements of NetEase Mail.


Platform Deployment and Current Usage

The Upgrade and Transformation Path Based on DS

After introducing the DolphinScheduler platform, the NetEase Mail team embarked on a continuous path of optimization and upgrade. Starting with preliminary research and selection comparisons, then resolving issues encountered during usage, and finally performing secondary development to improve the platform's ease of use and user experience, the team has steadily refined the platform. After the selection was confirmed in March 2023, the initial transformation of the alerting functionality was completed in December 2023, and by March 2024 the platform had reached stable operation with ongoing exploration.

Data Architecture and Current Usage

Currently, within the NetEase Mail department, the DolphinScheduler platform is deployed with a 3-Master, 5-Worker architecture and runs on Kubernetes (K8s). The platform supports more than 1,200 offline scheduling tasks for services such as anti-spam, risk control, and AI, with daily executions exceeding 50,000.



At the data architecture level, integrating DolphinScheduler with StreamPark combines offline scheduling with real-time Flink task processing. Meanwhile, self-developed components such as the metadata center and data portal form the data service layer, providing support for data management and services.

The introduction of the platform has significantly improved data development efficiency, reduced operation and maintenance costs, and ensured the stable output of business data, thereby strongly supporting rapid business iteration and innovation.

Task Types and Application Support

The DolphinScheduler platform has become the primary task scheduling platform within the department, offering a wide variety of task types including Spark, Shell, SQL, SeaTunnel, Python, and more. These tasks provide solid data support for downstream applications such as metadata management, BI reports, and data R&D, meeting the data processing needs of different business scenarios.



Integration of Data Distribution Functionality and Optimization Practices

Integration of Data Distribution Functionality

Regarding data scheduling and distribution, the NetEase Mail team identified several issues that needed to be addressed to improve the efficiency and capability of data processing, including:


  • Frequent data processing requests

Frequent data processing requests from non-R&D personnel require the support of data developers.


  • Low willingness to build intermediate tables

The product and QA teams show little willingness to build intermediate tables because doing so takes them a long time.


  • Data synchronization configuration has a high barrier to entry

Configuring data synchronization tasks between heterogeneous data sources has a steep learning curve and requires specialized data development support.


  • The synchronization task development process is relatively long

The complete development process for data synchronization tasks is lengthy, involving steps such as table creation, synchronization configuration creation, and scheduling task creation.


To address these issues, the NetEase Mail team integrated a data distribution functionality into the DolphinScheduler platform.



The overall approach is Intermediate Table Creation + Synchronization Task Configuration Generation + Scheduling Task Creation, which enhances the efficiency of intermediate data processing and provides a one-stop process for constructing data processing tasks.


This functionality offers two modes—quick configuration and custom configuration. It can automatically generate table DDL statements based on form parameters and execute table creation logic, simultaneously generating intermediate processing SQL and data synchronization task configurations, and finally calling internal Dolphin methods to create workflows and scheduling tasks.
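
As a rough, hypothetical sketch of the quick-configuration mode (all field names, table names, and helpers below are invented for illustration and are not NetEase Mail's actual implementation), the form-to-DDL step might look like this in Python:

```python
# Hypothetical sketch: turn quick-configuration form parameters into the DDL
# for an intermediate table. All names and fields here are illustrative only.
FORM = {
    "table": "ods_mail_login_stat",
    "columns": [("uid", "BIGINT"), ("login_time", "TIMESTAMP"), ("client", "STRING")],
    "partition_field": "dt",
}

def build_ddl(form: dict) -> str:
    """Render a CREATE TABLE statement from the submitted form parameters."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in form["columns"])
    return (
        f"CREATE TABLE IF NOT EXISTS {form['table']} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({form['partition_field']} STRING)"
    )

print(build_ddl(FORM))
# In the real feature, the generated DDL, the intermediate-processing SQL, and the
# synchronization task configuration are then passed to DolphinScheduler's internal
# workflow-creation methods to produce the scheduled task in one pass.
```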


The implementation of this functionality has reduced the average processing time for intermediate data from about 1 hour to under 20 minutes, a reduction of roughly 67%, and it has already supported more than 40 online data processing tasks, with the number of users continuing to grow.

Optimizing Offline Scheduling Methods to Enhance Fault Recovery Capability

With the original scheduling mechanism, when an upstream dependency failed, the dependent nodes would fail outright, making it impossible to rerun the incomplete data chain with one click. This increased the operational burden on data developers and prolonged fault recovery times.


To address this issue, the team redesigned the waiting and failure signal processing logic in the getModelDependResult method of the DependentExecute class in the dolphinscheduler-master module. When the dependency task status is FAILED, the status in the dependResultList is changed to WAITING, so that downstream tasks receive a WAITING status instead of FAILED.
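
The following is a small, language-agnostic sketch of that modified decision logic, written in Python for readability; the real change lives in the Java DependentExecute class, and all names below are illustrative rather than the upstream code:

```python
from enum import Enum

class DependResult(Enum):
    SUCCESS = "SUCCESS"
    WAITING = "WAITING"
    FAILED = "FAILED"

def model_depend_result(upstream_results: list) -> DependResult:
    """Aggregate upstream dependency states for a dependent node.

    Original behavior: any FAILED upstream immediately marks the node FAILED.
    Modified behavior (as described above): FAILED is downgraded to WAITING, so
    the downstream task keeps waiting and the chain can be rerun with one click.
    """
    normalized = [
        DependResult.WAITING if r == DependResult.FAILED else r
        for r in upstream_results
    ]
    if all(r == DependResult.SUCCESS for r in normalized):
        return DependResult.SUCCESS
    return DependResult.WAITING

# Example: one upstream failed; the dependent node now reports WAITING, not FAILED.
print(model_depend_result([DependResult.SUCCESS, DependResult.FAILED]))
```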


This optimization enables one-click recovery and rerun of the data chain, reducing the cost of manual intervention, enhancing the recovery efficiency and intelligence of the task chain, accelerating the fault recovery speed of the data chain, and ensuring the timely recovery and output of business data.



SeaTunnel Component Integration and Optimization, Enhancing Data Synchronization Efficiency

To meet the large-scale data synchronization needs of new business, the team introduced the SeaTunnel data integration tool and deployed it on the k8s cluster in a separated cluster mode.


By optimizing the SeaTunnel plugins, NetEase Mail offers both custom and quick configuration modes, thereby lowering the usage threshold.



Implementation Principle

In terms of implementation, the solution supports both form-based and custom configuration. The form-based mode builds the SeaTunnel (ST) configuration from form parameters and supports additional custom parameters as well as task-level JVM settings.



The backend receives the form parameters and generates the configuration needed for the task execution context. An IConfigGenerator interface defines the logic for generating the Source and Sink for each data source, and a SeatunnelConfigGenerator class assembles the final ST configuration.
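
A minimal Python sketch of this design follows; the actual backend is Java, and every class name, field, and config key below is a simplified illustration (real SeaTunnel connector options vary by version) rather than NetEase Mail's code:

```python
from abc import ABC, abstractmethod

class ConfigGenerator(ABC):
    """Illustrative analogue of the IConfigGenerator idea: one generator per data source."""

    @abstractmethod
    def source_block(self, form: dict) -> str: ...

    @abstractmethod
    def sink_block(self, form: dict) -> str: ...

class JdbcToJdbcGenerator(ConfigGenerator):
    def source_block(self, form: dict) -> str:
        return 'source {\n  Jdbc {\n    url = "%s"\n    query = "%s"\n  }\n}' % (
            form["source_url"], form["query"])

    def sink_block(self, form: dict) -> str:
        return 'sink {\n  Jdbc {\n    url = "%s"\n    table = "%s"\n  }\n}' % (
            form["sink_url"], form["sink_table"])

def build_seatunnel_config(gen: ConfigGenerator, form: dict) -> str:
    """Assemble env + source + sink into one SeaTunnel-style job config."""
    env = 'env {\n  parallelism = %d\n  job.mode = "BATCH"\n}' % form.get("parallelism", 2)
    return "\n".join([env, gen.source_block(form), gen.sink_block(form)])

# Example form submission producing a complete (simplified) job configuration.
print(build_seatunnel_config(JdbcToJdbcGenerator(), {
    "source_url": "jdbc:mysql://src/db", "query": "SELECT * FROM t",
    "sink_url": "jdbc:mysql://dst/db", "sink_table": "t_copy",
}))
```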

Deployment and Tuning

In terms of deployment and tuning, the SeaTunnel cluster adopts a 2-Master, 5-Worker architecture to achieve high availability and synchronization performance.


By optimizing the sharding logic for HDFS, HIVE, and MultiTable Source, the reader receives more balanced shards, which enhances data synchronization performance. The related optimizations have been submitted as PRs to the community.
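
As a generic illustration of the balanced-splitting idea (this is not the actual SeaTunnel PR; the greedy strategy and names below are assumptions), the assignment could work like this: sort files by size and always hand the next file to the currently lightest reader.

```python
import heapq

def assign_splits(files, readers):
    """Greedy balanced assignment: largest files first, each to the lightest reader.

    `files` is a list of (path, size_in_bytes); returns one list of paths per reader.
    """
    heap = [(0, i) for i in range(readers)]  # (accumulated_bytes, reader_index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(readers)]
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        load, idx = heapq.heappop(heap)
        assignment[idx].append(path)
        heapq.heappush(heap, (load + size, idx))
    return assignment

# Example: five files of uneven size spread across two readers end up roughly balanced.
print(assign_splits([("a", 900), ("b", 100), ("c", 400), ("d", 300), ("e", 300)], 2))
```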



In terms of parameter tuning, to meet the large-scale data synchronization requirement from HDFS to Doris, the team studied and tuned the DorisSink parameters and achieved a transfer rate of 2 million records per second, greatly enhancing data synchronization performance.



Project Practice Case: Task Migration and Resource Isolation Practice

Efficient Migration of Mammoth Platform Tasks to DolphinScheduler

The Mammoth platform is a big data platform used internally within NetEase Group, and some of the dependencies between Mammoth tasks are very complex, requiring analysis and organization. Done manually, this work would be slow and prone to omissions and mistakes.


Eventually, after researching how task dependencies are represented on the DolphinScheduler platform, the team decided to migrate these tasks to DolphinScheduler.


Regarding the choice of migration method, if a manual migration approach were adopted, it would not only be time-consuming and labor-intensive but also prone to migration errors due to the complexity of task dependencies, thereby affecting the overall stability of tasks.


Based on various considerations, after discussion and research, NetEase Mail decided to adopt an automated synchronization solution. This solution automatically collects task metadata and task lineage from the old platform, converts it into the task configuration format of the Dolphin platform, and simultaneously adds task dependencies. Finally, workflows are quickly created through the Dolphin interface. To achieve this process, NetEase Mail used the PyDolphinScheduler synchronization tool officially provided by DolphinScheduler—a Python API that allows workflows to be defined using Python code, namely, “workflow as code.”


In addition, NetEase Mail collected the metadata and lineage of Mammoth tasks through its metadata system, batch-rewrote the Mammoth tasks into DolphinScheduler tasks, and automatically added dependency nodes according to the lineage.
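
A minimal sketch of what a converted workflow can look like with PyDolphinScheduler is shown below; the workflow name, schedule, tenant, and commands are made up for illustration, and depending on the PyDolphinScheduler version the entry class may be Workflow or the older ProcessDefinition:

```python
# pip install apache-dolphinscheduler
from pydolphinscheduler.core.workflow import Workflow
from pydolphinscheduler.tasks.shell import Shell

# Hypothetical migrated chain: extract -> build intermediate table -> sync out.
with Workflow(
    name="mammoth_migrated_demo",
    schedule="0 0 2 * * ? *",  # daily at 02:00
    tenant="tenant_exists",
) as workflow:
    extract = Shell(name="extract_logs", command="echo extract")
    transform = Shell(name="build_intermediate_table", command="echo transform")
    sync = Shell(name="sync_to_clickhouse", command="echo sync")

    # Dependencies derived from the collected Mammoth lineage.
    extract >> transform
    transform >> sync

    # Register the workflow and its schedule with the DolphinScheduler API server.
    workflow.submit()
```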



This practice efficiently completed the migration of over 300 Mammoth tasks, ensuring a smooth business transition, achieving one-click batch migration, greatly saving labor costs, and providing valuable experience for similar task migration scenarios in the future.

Worker Group Isolation Practice

The next project practice is resource isolation based on Worker groups in DolphinScheduler.


In the daily use of the NetEase Mail DolphinScheduler platform, a key business scenario is supporting QA's online monitoring scheduling tasks. These tasks are scheduled at high frequency, mostly at the minute level, and they are numerous: monitoring-related tasks currently number more than 120.


These monitoring scheduling tasks are currently executed mainly through the SHELL task type, which calls a self-developed ETL task processing JAR. The JAR provides additional extensions during task execution, such as idempotency checks and timeout retries.


However, this style of invocation starts many JVM processes on the worker nodes. If critical scheduling tasks are scheduled to the same worker node and the OOM killer is triggered by memory pressure, those critical tasks, which generally use more memory, tend to receive a higher OOM score and are more likely to be killed, making their execution unstable.


In addition, some T+1 synchronization tasks executed in the early morning require a large amount of resources and should not share a node with other tasks, since they can affect the execution performance and stability of everything else on the same worker.


Therefore, the NetEase Mail team considered using a Worker group isolation solution to address the above issues.

Through Worker group isolation, the team separated high-frequency scheduling tasks such as real-time monitoring from other tasks, ensuring the stable execution of core OKR-related schedules and enhancing the overall stability and reliability of task execution.



In practical implementation, NetEase Mail created different Worker groups on the platform, each containing different Worker nodes, including groups such as Default, AI, Monitoring, Hadoop, and a group for large-resource tasks. This avoids task blocking and resource contention, optimizes resource allocation, and improves the platform's overall resource utilization and performance.


This practice effectively resolved issues such as task blocking caused by high scheduling frequency of QA monitoring tasks and worker node OOM problems triggered by the high resource demands of early morning T+1 tasks, thereby reducing operational risks.



Summary and Future Outlook

Practice Summary

By introducing the DolphinScheduler platform and conducting a series of deployment and optimization practices, the NetEase Mail team has achieved significant results in improving data development efficiency, reducing operational costs, and ensuring the stable output of business data. The platform’s secondary development closely aligns with business needs, emphasizing enhanced user experience and development efficiency, while the team continuously focuses on platform optimization and improvement and actively contributes to the community, driving the platform’s ongoing development.

Platform Value and Benefits

The DolphinScheduler platform plays an important role in NetEase Mail’s business. It not only enhances the efficiency and stability of data development but also meets the diverse needs of the business through optimized task scheduling processes, ensuring the timely output of data and strongly promoting the sustained development of the mail business.

Experience Sharing and Insights

During the process of platform deployment and optimization, the NetEase Mail team has accumulated rich practical experience. These experiences provide important reference value for other enterprises when choosing and using task scheduling platforms. The team emphasizes that secondary development should be closely integrated with actual business needs, always prioritizing user experience and development efficiency. At the same time, continuously focusing on platform optimization and improvement and actively participating in open-source community building can drive the platform’s continuous refinement and development.

Future Outlook

Looking ahead, the NetEase Mail team plans to continue exploring and advancing in the following directions:

  • Embrace AI: Integrate AI and LLM capabilities to achieve more intelligent and user-friendly data processing ETL processes, enhancing the automation and intelligence level of data processing.
  • Data Governance: Integrate DolphinScheduler scheduling data with the internal metadata center to achieve intelligent collection and analysis of data/task lineage and data maps, providing strong support for data governance.
  • Platform Optimization: Further optimize the performance and functionality of the DolphinScheduler platform to enhance its stability and reliability, better meeting the growing data processing demands.
  • Embrace DataOps: Achieve integration and consolidation between the DolphinScheduler platform and other data platform systems, promote the automation of data integration and transmission, and build a more efficient data ecosystem.

In Conclusion

The deployment and optimization practice of NetEase Mail based on the DolphinScheduler platform not only solved the current challenges of task scheduling and data processing but also laid a solid foundation for future development. Through continuous technological innovation and practical exploration, NetEase Mail will continue to provide users with higher-quality and more efficient email services, while also contributing more to the development of the open-source community.