Over the last decade, the growth and success of machine learning have been phenomenal, driven primarily by the availability of vast amounts of data and advanced computational power. This surge was brought about by the digitization of various sectors, leading to an explosion of digital data—everything from social media posts to online transactions and sensor data.
Advances in machine learning techniques, particularly in deep learning, facilitated the development of more complex and versatile models. Consequently, machine learning applications have become ubiquitous, contributing to improved efficiencies and capabilities across numerous sectors, including healthcare, finance, and transportation.
Working with these massive amounts of event-based data and deriving value from the events themselves and their context in time – other events happening nearby – is still difficult. Doing so in a real-time or streaming manner is harder still. You often must use complex, low-level APIs or work around the limitations of a higher-level query language designed to solve very different problems, like SQL.
To tackle these challenges, we introduce a new abstraction for time-based data, called timelines. Timelines organize data by time and entity, offering an ideal structure for event-based data with an intuitive, graphical mental model. Timelines simplify reasoning about time by aligning your mental model with the problem domain, allowing you to focus on what to compute, rather than how to express it.
In this post, we present timelines as a natural way to organize event-based data and extract value – directly, and as the inputs for machine learning and prompt engineering. We delve into the timeline concept, its interaction with external data stores (inputs and outputs), and the lifecycle of queries using timelines.
This post is the first in a series about Kaskada, an open-source event-processing engine designed around the timeline abstraction. We’ll follow up with an explanation of how Kaskada builds an expressive temporal query language on the timeline abstraction, how the timeline data model enables Kaskada to efficiently execute temporal queries, and, finally, how timelines enable Kaskada to execute incrementally over event streams.
Dealing with a large variety of events is difficult when treating each kind of event as a separate, unordered table of data – the way SQL views the world. It is difficult to understand what motivated Carla to make purchases, or what caused Aaron to engage with other users via messages.
By organizing the data – by time and user – it becomes much easier to spot patterns. Aaron sends messages after winning. Carla makes purchases when frustrated by a series of losses. We also see that Brad may have stopped playing.
By organizing the data – by time and user – it becomes much easier to spot patterns. Aaron sends messages after winning. Carla makes purchases when frustrated by a series of losses. We also see that Brad might have stopped playing.
By organizing the events in a natural way – by time and user – we were able to identify patterns. This same organization allows us to express feature values computed from the events and use those to train and apply machine learning models or compute values to use within a prompt.
Reasoning about time – for instance, cause-and-effect between events – requires more than just an unordered set of event data. For temporal queries, we need to include time as a first-class part of the abstraction. This allows reasoning about when an event happened and the ordering – and time – between events.
Kaskada is built on the timeline abstraction: a multiset ordered by time and grouped by entity. Timelines have a natural visualization, shown below. Time is shown on the x-axis and the corresponding values on the y-axis. Consider purchase events from two people: Ben and Davor. These are shown as discrete points reflecting the time and amount of the purchase. We call these discrete timelines because they represent discrete points.
The time axis of a timeline reflects the time of a computation’s result. For example, at any point in time we might ask “what is the sum of all purchases?” Aggregations over timelines are cumulative – as events are observed, the answer to the question changes. We call these continuous timelines because each value continues until the next change.
Compared to SQL, timelines introduce two requirements: ordering by time and grouping by entity. While the SQL relation – an unordered multiset or bag – is useful for unordered data, the additional requirements of timelines make them ideal for reasoning about cause-and-effect. Timelines are to temporal data what relations are to static data.
Adding these requirements mean that timelines are not a fit for every data processing task. Instead, they allow timelines to be a better fit for data processing tasks that work with events and time. In fact, most event streams (eg., Apache Kafka, Apache Pulsar, AWS Kinesis, etc.) provide ordering and partitioning by key.
When thinking about events and time, you likely already picture something like a timeline. By matching the way you already think about time, timelines simplify reasoning about events and time. By building in the time and ordering requirements, the timeline abstraction allows temporal queries to intuitively express cause and effect.
Timelines are the abstraction used in Kaskada for building temporal queries, but data starts and ends outside of Kaskada. It is important to understand the flow of data from input, to timelines, and finally to output.
Every query starts from one or more sources of input data. Each input – whether it is events arriving in a stream or stored in a table, or facts stored in a table – can be converted to a timeline without losing important context such as the time of each event.
The query itself is expressed as a sequence of operations. Each operation creates a timeline from timelines. The result of the final operation is used as the result of the query. Thus, the query produces a timeline which may be either discrete or continuous.
The result of a query is a timeline, which may be output to a sink. The rows written to the sink may be a history reflecting the changes within a timeline, or a snapshot reflecting the values at a specific point-in-time.
Before a query is performed, each input is mapped to a timeline. Every input – whether events from a stream or table or facts in a table – can be mapped to a timeline without losing the important temporal information, such as when events happened. Events become discrete timelines, with the value(s) from each event occurring at the time of the event. Facts become continuous timelines, reflecting the time during which each fact applied. By losslessly representing all kinds of temporal inputs, timelines allow queries to focus on the computation rather than the kind of input.
After executing a query, the resulting timeline must be output to an external system for consumption. The sink for each destination enables configuration of data writing, with specifics depending on the sink and the destination (see
There are several options for converting the timeline into rows of data, affecting the number of rows produced:
A full history of changes helps visualize or identify patterns in user values over time. In contrast, a snapshot at a specific time is useful for online dashboards or classifying similar users.
Including events after a certain time reduces output size when the destination already has data up to that time or when earlier points are irrelevant. This is particularly useful when re-running a query to materialize to a data store.
Including events up to a specific time also limits output size and enables choosing a point-in-time snapshot. With incremental execution, selecting a time slightly earlier than the current time reduces late data processing.
Both “changed since” and “up-to” options are especially useful with incremental execution, which we will discuss in an upcoming article.
The history – the set of all points in the timeline – is useful when you care about past points. For instance, this may be necessary to visualize or identify patterns in how the values for each user change over time. The history is particularly useful for outputting training examples to use for creating a model.
Any timeline may be output as a history. For a discrete timeline, the history is the collection of events in the timeline. For a continuous timeline, the history contains the points at which a value changes – it is effectively a changelog.
A snapshot – the value for each entity at a specific point in time – is useful when you just care about the latest values. For instance, when updating a dashboard or populating a feature store for connecting to model serving.
Any timeline may be output as a snapshot. For a discrete timeline, the snapshot includes rows for each event happening at that time. For a continuous timeline, the snapshot includes a row for each entity with that entity’s value at that time.
This blog post highlighted the importance of temporal features when creating ML models from event-based data. The time and temporal context of events is critical to seeing patterns in activity. This post introduced the timeline abstraction, which makes it possible to work with the events and the temporal context. Timelines organize data by time and entity, providing a more suitable structure for event-based data compared to multisets.
The timeline abstraction is a natural progression in stream processing, allowing you to reason about time and cause-and-effect relationships more effectively. We also explored the flow of data in a temporal query, from input to output, and discussed the various options for outputting timelines to external systems.
Rather than applying a tabular (static) query to a sequence of snapshots, Kaskada operates on the history (the change stream). This makes it natural to operate on the time between snapshots, rather than only on the data contained in the snapshot. Using timelines as the primary abstraction simplifies working with event-based data and allows for seamless transitions between streams and tables.
You can
By Ben Chambers and Therapon Skoteiniotis, DataStax