The future of e-commerce lies in refined operations: industry players are leveraging data analytics to segment their customers and provide tailored services. This is especially the case with Lifease, an e-commerce brand targeting middle-class consumers.
In this post, I write about how we carry out refined operations, based on our own Data Management Platform (DMP).
Let's start from the basics.
As the raw material of our DMP, the data sources of Lifease include:
These constitute our data assets, from which we derive a number of tags to describe the customers' age, address, preferred products, the device they use, etc.
Using these tags as filters, we pick out a group of customers that match certain characteristics (the "Grouping" process). Then we observe the pattern of the target group.
From the data user's perspective, they mainly use the DMP for two purposes: tag query and grouping.
Tag Query: Sometimes they look up a certain customer (or a group of customers) and check what tags are attached to them.
Grouping: After grouping, they might want to check whether a certain customer is in a specified group, to ground their marketing decisions. They might also pull the grouping result set from the DMP into their own business systems for further development. Most of the time, though, they will directly analyze the shopping patterns of the target group.
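Conceptually, grouping by tag filters is a set intersection over tag-to-user indexes, and a grouping check is a membership test on the result. The following is a minimal sketch of that idea; all tag names and user IDs are made up for illustration.

```python
# A minimal sketch of tag-based grouping: each tag maps to the set of
# user IDs carrying it, and a group is the intersection of the chosen
# tag sets. Tag names and user IDs here are hypothetical.

tag_index = {
    "age_25_34":       {101, 102, 103, 105},
    "city_hangzhou":   {102, 103, 104},
    "bought_homeware": {101, 103, 104, 105},
}

def build_group(required_tags, index):
    """Return the user IDs that carry every tag in required_tags."""
    user_sets = [index[t] for t in required_tags]
    return set.intersection(*user_sets)

group = build_group(["age_25_34", "city_hangzhou", "bought_homeware"], tag_index)
print(sorted(group))  # the users matching all three tags

# A grouping check is then just a membership test on the result set:
print(103 in group)
```

In production the index lives in the storage engines described later, not in memory, but the filtering semantics are the same.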
Firstly, we, as data platform engineers, define the tags and the rules for grouping.
Next, we define the domain-specific language (DSL) that describes what we want to do, so we can submit computation tasks to Apache Spark.
Then, computing results will be stored in Apache Hive and Apache Doris.
Lastly, data users can perform whatever queries they need in Hive and Doris.
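Our DSL itself is internal to the platform, so purely for illustration, here is a toy sketch of the translation step: a nested tag-filter expression is converted into SQL semantics that a Spark job can execute. The expression shape, column names, and table name are all invented.

```python
# A toy illustration (NOT our production DSL) of converting a nested
# tag-filter expression into a SQL WHERE clause for Spark to run.
# Expressions are tuples: ("tag", column, value) for a leaf condition,
# or ("and"/"or", [sub_expressions]) for a boolean combination.

def dsl_to_sql(expr):
    """Recursively render a tag-filter expression as a SQL predicate."""
    op = expr[0]
    if op == "tag":
        _, name, value = expr
        return f"{name} = '{value}'"
    joiner = " AND " if op == "and" else " OR "
    return "(" + joiner.join(dsl_to_sql(sub) for sub in expr[1]) + ")"

expr = ("and", [("tag", "age_band", "25-34"),
                ("or", [("tag", "city", "Hangzhou"),
                        ("tag", "city", "Shanghai")])])
sql = "SELECT user_id FROM user_tags WHERE " + dsl_to_sql(expr)
print(sql)
```

A real implementation would also handle negation, parameter escaping, and engine-specific dialects; the point is only that the DSL decouples what the data user asks for from the SQL the engines execute.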
An abstraction of the DMP can be seen as a four-layered architecture:
Metadata Management: All meta-information about the tags is stored in the source data tables;
Computation & Storage: This layer is supported by Spark, Hive, Doris, and Redis;
Scheduling: This is like a command center of the DMP. It arranges the tasks throughout the whole platform, such as aggregating data into basic tags, converting data of basic tags into SQL semantics for queries based on the DSL rules, and passing computing results from Spark to Hive and Doris;
Service: This is where data users conduct grouping, profile analysis, and check the tags.
Tags are the most important elements of our DMP, so I'm going to take a whole chapter to introduce them.
Every tag in our DMP goes through a five-phased lifecycle:
Every day, the grouping results are updated and outdated data is cleaned up. Both processes are automatic. We have also partially automated the tag production process; our next priority is to fully automate it.
To derive tags from data, we first need to turn our raw data into a more organized and structured form:
ODS (Operational Data Store): This layer contains the user login history, event tracking logs, transaction data, and the binlogs from various databases.
DWD (Data Warehouse Details): Data processed by the ODS layer will be sent here to form user login tables, user event tables, and order information sheets.
DM (Data Market): The DM layer stores the aggregated data from DWD.
Data in DM is then combined using "and", "or", and "xor" logic to produce tags that are more comprehensible for data users.
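The boolean combination step maps directly onto set operations over the user sets behind each DM-layer aggregate. A small sketch, with invented aggregate names and user IDs:

```python
# Combining DM-layer user sets with "and", "or", and "xor" logic to
# derive new tags. The set contents and names are made up.

logged_in_last_7d    = {1, 2, 3, 4}
placed_order_last_7d = {3, 4, 5}

# "and": users who did both -> an "active buyer" tag
active_buyers = logged_in_last_7d & placed_order_last_7d

# "or": users who showed any activity at all
any_activity = logged_in_last_7d | placed_order_last_7d

# "xor": users who did one but not the other, e.g. browsed without buying
one_but_not_both = logged_in_last_7d ^ placed_order_last_7d

print(active_buyers, any_activity, one_but_not_both)
```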
Tags can be categorized differently based on different metrics:
We made a list of what would look like an ideal data warehouse for us:
It was hard to find a single tool that met all these needs, so we tried a mixture of multiple tools.
We stored part of our offline and real-time data in HBase for basic tag queries, most of the offline data in Hive, and double-wrote the rest of our real-time data into Kudu and Elasticsearch for real-time grouping and data queries. The grouping results were produced by Impala and then cached in Redis.
As you can imagine, such a complicated storage layout made maintenance tricky. Double writing also adds the risk of data inconsistency, since one of the two writes might fail.
So, in Storage Architecture 2.0, we introduced Apache Doris and Apache Spark. The whole data pipeline was a Y-shaped diagram.
We stored offline data in Hive, and the basic tags and real-time data in Doris. Then, based on Spark, we conducted federated queries across Hive and Doris. The query results were stored in Redis.
With this new architecture, we compromised a little bit of performance for much lower maintenance costs.
P.S.
For your reference, here is a summary of the applicable scenarios for the various engines that we've used or investigated.
Some of our data queries demand high performance, such as grouping checks and customer group analysis.
A grouping check determines whether certain users fall into one or several given groups. It is a two-part check:
Check against static group packets: we perform pre-computations and store the results in Redis, then use a Lua script for bulk checks to increase performance.
Check against real-time behavior groups: we extract data from contexts, APIs, and Apache Doris for rule-based judgment. Meanwhile, we increase performance through asynchronous checks, short-circuit evaluation, query statement optimization, and limiting the number of joined tables.
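For the static case, the value of the Lua script is that one server-side call answers many group memberships at once instead of making one round trip per group. The logic it implements is roughly the following; this is a pure-Python simulation with an in-memory dict standing in for Redis, and the group names and user IDs are invented.

```python
# Pure-Python simulation of the static grouping check. In production the
# precomputed group membership lives in Redis and a Lua script performs
# the bulk check server-side in a single round trip; here a dict of sets
# stands in for Redis.

precomputed_groups = {
    "new_customers": {1001, 1002},
    "high_value":    {1002, 1003},
}

def bulk_check(user_id, group_names, store):
    """Return {group_name: bool} for one user across many groups,
    the way a single Lua script invocation would."""
    return {g: user_id in store.get(g, set()) for g in group_names}

result = bulk_check(1002, ["new_customers", "high_value"], precomputed_groups)
print(result)  # user 1002 belongs to both groups
```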
Customer group analysis is done to figure out the behavioral paths of consumers. It entails join queries across the group packets and multiple tables. Apache Doris does not yet support path analysis functions, but its computing model is friendly to user-defined function (UDF) development, so we built a UDF for this purpose and it works well.
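To make the idea concrete, the core question a path-analysis function answers is whether a user's time-ordered event stream contains a target path as an ordered (not necessarily contiguous) subsequence. Our actual UDF runs inside Doris; the sketch below shows only that core logic, with invented event names.

```python
# Sketch of the core logic behind path analysis: does a user's
# time-ordered event stream contain the target path as an ordered
# subsequence? Event names and timestamps are illustrative only.

def matches_path(events, target_path):
    """events: list of (timestamp, event_name) for one user;
    target_path: the ordered sequence of steps to look for."""
    # Sort by timestamp, then consume the stream left to right,
    # advancing one target step at a time.
    it = iter(name for _, name in sorted(events))
    return all(step in it for step in target_path)

events = [(1, "view_item"), (2, "add_to_cart"), (3, "view_item"), (4, "pay")]
print(matches_path(events, ["view_item", "add_to_cart", "pay"]))  # True
print(matches_path(events, ["pay", "add_to_cart"]))               # False
```

The `step in it` idiom works because membership tests consume the iterator, so each target step must be found strictly after the previous one.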
The newly introduced data warehouse, Apache Doris, serves multiple scenarios, including point queries, batch queries, behavioral path analysis, and grouping. In point queries and small-scale join queries, it delivers over 10,000 QPS with a 99th-percentile response time (RT99) under 50 ms. Apart from its strong scalability and easy maintenance, we also benefit from the much simpler tag models brought by the integration of real-time and offline data.
In this post, I zoomed in on various parts of our DMP and explained how data tagging, storage, and queries work.
I believe that a good tag system, faster queries, and more targeted consumer operations are the recipe for successful refined operations, so our follow-up efforts will go into the following aspects.
Tag system:
Storage & computation performance:
Precision marketing:
I'm writing this piece to share our practices with the data engineering community and, hopefully, collect some valuable suggestions, so if you've got any ideas, meet me in the comment section.