How We Built A Cross-Region Hybrid Cloud Storage Gateway for ML & AI at WeRide

by Bin FanAugust 16th, 2020

Too Long; Didn't Read

The original content was published on Alluxio's Engineering Blog (Disclaimer: The author is a Founding Member @Alluxio). In 2020, data on the scale of terabytes is generated daily and we foresee this to grow by a factor of 10 in the following year. WeRide is a globally distributed company with offices located in multiple cities including San Jose in the US and Guangzhou, Beijing, Shanghai, and Anqing in China. The new data access architecture provides a localized cache per location to eliminate redundant requests to S3.

Company Mentioned

featured image - How We Built A Cross-Region Hybrid Cloud Storage Gateway for ML & AI at WeRide

In this blog, guest writer Derek Tan, Executive Director of Infra & Simulation at WeRide, describes how engineers leverage Alluxio as a hybrid cloud data gateway for applications on-premises to access public cloud storage like AWS S3.

The new data access architecture provides a localized cache per location to eliminate redundant requests to S3. In addition to removing the complexity of manual data synchronization, Alluxio directly serves data to engineers working with the same data in the same office, circumventing transfer costs associated with S3. The original content was published on Alluxio's Engineering Blog (Disclaimer: The author is a Founding Member @Alluxio).

Data Challenges at WeRide

WeRide is a company that creates L4 autonomous driving algorithms in the smart mobility industry. Like all self-driving cars companies, data is continuously collected from live road tests for model training, algorithm testing, and simulations.

Thus far, WeRide has accumulated two million kilometers of autonomous driving mileage and the rate of data collection will only increase as more testing vehicles are in service. In 2020, data on the scale of terabytes is generated daily and we foresee this to grow by a factor of 10 in the following year.

In addition to data collected from test drives, applications such as simulation, SIL (Software in the loop) tests, and model benchmarking also produce terabytes of data daily. As our technology advances, the output from these additional applications will also continue to grow to cover larger datasets with more corner cases to handle.

WeRide is a globally distributed company with offices located in multiple cities including San Jose in the US and Guangzhou, Beijing, Shanghai, and Anqing in China. Data is generated and consumed in parallel by different teams across offices. We use AWS S3 as the data lake to share across different offices.

When designing a new algorithm for our self-driving cars or fixing a bug in an existing one, our engineers need to test the algorithm against existing data. Given our data architecture, this caused bottlenecks such as:

Slow iteration in development: Prior to developing or debugging, developers need to download the latest data from the cloud to their local environment. This is often constrained by download speeds and network bandwidth.
High and unnecessary egress costs: Each time data is downloaded from S3, there is a charge on the egress data transfer. Typically to debug one issue, the data transfer cost adds up to $5. This cost is further multiplied if multiple people are collaborating, even though they are downloading the same data.
Error-prone Data synchronization: At WeRide, we built a custom data uploading process that copies data to the cloud and retains a local copy stored in NAS or HDFS. The local copy is necessary to give engineers faster access to data but this causes issues with data synchronization. Currently we maintain the local copies by running a cron job to clean up local data on a regular basis.

Previous architecture is shown below

A New Architecture Using Alluxio

After some investigation, we realize the following architecture will provide great benefit:

Always reference S3 as the single source of truth to eliminate data confliction across different offices.
Deploying a local caching system on top of S3 for workloads on-demand at each office to speed up the development.

However, building an in-house caching system from scratch can be expensive and unnecessary for WeRide’s business needs. We decided to explore existing technologies to meet our needs and fulfill the following requirements:

It is a low or no-cost mature technology that is battle-tested for large scale data access
It is ready-to-use with easy integration and does not introduce new ETL jobs
It allows us to scale by utilizing better hardware when budget allows

With the above criteria in mind, Alluxio became a top choice to accelerate our data access. In addition to being compatible with S3, it provides an easy access interface via its POSIX and HTTP endpoints. As an open source technology, we can incorporate it into our system without adding additional business costs.

The new architecture with Alluxio is shown below

In each office, we deployed Alluxio as a small on-premise cluster, using S3 as the source of truth. Road test data is directly uploaded to the local Alluxio cluster, which can be immediately used by the engineers in the same office. Meanwhile, Alluxio automatically uploads the road test data to S3 in the background. As engineers in other offices want to use road test data, they can make a request via their local Alluxio cluster. The data will either be returned immediately if cached by Alluxio or fetched from S3 if not. To further reduce the fetch time of new data from S3, we worked with the Alluxio team to implement a distributed load command which can open multiple simultaneous connections to download data. This feature was added in the Alluxio 2.1.0 release.

With Alluxio, application data fetched from the cloud is also cached locally. This was previously not possible if the data was not uploaded from the same office. In the common scenario where an engineer wants to review a simulation result by another engineer in the same office, the data is immediately available.

Using the new implementation with Alluxio, we observe the following improvements:

Reduced the complexity of data synchronization by having a single interface to access data and removed the need to maintain a custom locally copy
Out-of-the-box solution for in-office cache of the cloud data
Fast access to data is a critical factor of engineering productivity
Reduced S3 data-out cost of downloading redundant data

Next Steps

Data transfer via Alluxio is now a critical component of connecting in-office local data with data in the cloud. To further improve the system, we are working with the Alluxio team to add features relating to data transfer policies. Capabilities such as throttling upload bandwidth during working hours or prioritization of certain file types would help our engineers.

Conclusion

WeRide aims at delivering L4 autonomous driving technology for the future. Data access is a critical part of developing smart mobility. Adopting Alluxio as a localized cache layer eliminates redundant requests to S3 while removing the complexity of data synchronization, reduces $5 per issue per engineer in data transfer. We look forward to further collaboration with our friends at Alluxio to achieve our data access goal economically.

(Disclaimer: The author is a Founding Member @Alluxio)