I lead a Big Data team with an electric vehicle manufacturer and I have tried a fair share of the OLAP tools available on the market. Read below to hear what I think you need to know about the tools including pros and cons of numerous OLAP tools and my hands on OLAP experience.
Back in 2017, looking for an OLAP tool on the market was like seeking a tree on an African prairie—there were only a few of them. As we looked up and scanned the horizon, our eyes lingered on Apache Druid and Apache Kylin. We landed on Druid because we were already familiar with it, while Kylin, despite its impressively high query efficiency in pre-computation, had a few shortcomings.
Kylin’s shortcomings:
As for Apache Druid, it used columnar storage, supported both real-time and offline data ingestion, and delivered fast queries.
On the flip side, Druid:
In 2019 we tried TiDB. Long story short, here are its pros and cons:
Pros:
Cons:
We did our research into ClickHouse and Apache Doris. We were impressed by ClickHouse's awesome standalone performance, but stopped looking further into it when we found that:
Apache Doris, on the other hand, ticked a lot of the boxes on our requirement list:
To summarize, Apache Doris appeared to be an ideal substitute for Apache Druid + TiDB.
Here is a diagram to show you how data flows through our OLAP system:
We pool data from our business system, event tracking, devices, and vehicles into our big data platform.
We enable CDC for our business data. Any changes in such data will be converted into a data stream and stored in Kafka, ready for stream computing. As for data that can only be imported in batches, it will go directly into our distributed storage.
Instead of integrating, streaming, and batch processing, we adopted Lambda architecture. Our business status quo determines that our real-time and offline data come from different links. In particular:
Instead of using the Flink/Spark-Doris Connector, we use the Routine Load method to transfer data from Flink to Doris, and Broker Load from Spark to Doris. Data produced in batches by Flink and Spark will be backed up to Hive for usage in other scenarios. This is our way to increase data efficiency.
In terms of data services, we enable auto-generation of APIs through data source registration and flexible configuration so we can manage traffic and authority via APIs. In combination with the K8s serverless solution, the whole thing works great.
In the data application layer, we have two types of scenarios:
Like most companies, we built our own Customer Data Platform (CDP):
Usually, a CDP is made up of a few modules:
We wanted to achieve real-time + offline integration, fast grouping, quick aggregation, multi-table Join, and federated queries in our CDP. And here is how they are done:
Real-Time + Offline
We have real-time tags and offline tags and we need them to be placed together. Plus, columns on the same data might be updated at different frequencies. Some basic tags (regarding the identity of customers) should be updated in real time, while other tags (age, gender) can be updated daily. We want to put all the atomic tags of customers in one table because that brings the least maintenance costs and can largely reduce the number of required tables when we add self-defined tags.
So how do we achieve this?
We use the Routine Load method of Apache Doris to update real-time data, and the Broker Load method to batch import offline data. We also use these two methods to update different columns in the same table, respectively.
Fast Grouping
Basically, grouping is to combine a certain group of tags and find the overlapping data. This can be complicated. Doris helped speed up this process by SIMD optimization.
Quick Aggregation
We need to update all the tags, re-compute the distribution of customer groups, and analyze effects on a daily basis. Such processing needs to be quick and neat. So we divide data into tablets based on time so there will be less data transfer and faster computation. When calculating the distribution of customer groups, we pre-aggregate data at each node and then collect them for further aggregation. In addition, the vectorized execution engine of Doris is a real performance accelerator.
Multi-Table Join
Since our basic data is stored in multiple data tables, when CDP users customize the tags they need, they need to conduct multi-table Join. An important factor that attracted us to Apache Doris was its promising multi-table Join capability.
Federated Queries
Currently, we use Apache Doris in combination with TiDB. Records about customer reach will be put in TiDB, and data regarding credit points and vouchers will be processed in TiDB, too, since it is a better OLTP tool. As for more complicated analysis, such as monitoring the effectiveness of customer operation, we need to integrate information about task execution and target groups. This is when we conduct federated queries across Doris and TiDB.
This is our journey from Apache Druid, TiDB, and Apache Doris (and a short peek into ClickHouse in the middle). We looked into the performance, SQL semantics, system compatibility, and maintenance costs of each of them and ended up with the OLAP architecture we have now. If you have the same aspects of concern as us, this might be a reference for you.