Simba Khadder is an ex-Googler and the creator of StreamSQL.io, a Feature Store for Machine Learning
I've worked on teams building ML-powered product features, everything from personalization to propensity paywalls. Meetings to find and get access to data consumed my time, other days it was consumed building ETLs to get and clean that data. The worst situations were when I had to deal with existing microservice oriented architectures. I wouldn't advocate that we stop using microservices, but if you want to fit in a ML project in an already in-place strict microservice oriented architecture, you're doomed.
In this article, I'll describe why microservice oriented architectures suck for machine learning. I'll then lay out how companies like AirBnB and Uber used a feature store like StreamSQL to manage it.
Microservices have become the de-facto architectural choice for tech companies. Microservices allow small teams in large companies to build small, independent components. Teams can solve problems and meet requirements without having to retrofit into a giant monolith. It lets teams move faster. However, when overused, turning a user session token into a user profile can trigger twenty network calls.
I saw this video on Twitter recently, and it inspired me to write that post. The video is hilarious and captures this concept perfectly:
Product Manager Why is it so hard to display the birthday date on the profile page?
Engineer First we have to call the Bingo service to get the UserID, then we call Papaya to turn it into a user session token. We validate those with LMNOP, then we can pull the user info down from Racoon. But Racoon isn't guaranteed to have [the birthdate], so next we call ...*
Example: Reddit Recommender System
Microservices are exceptionally painful when building machine learning models on complex data sources like user behavior. In these cases, to make a prediction, you need to pull from a multitude of microservices, that in turn pull from another multitude of services to get all the contextual information they need.
For example, if you were to build a personalized reddit feed, you may want to know all communities a user is a part of, the top posts in these communities, any posts they've clicked on and liked, and more details.
Adding additional inputs can provide more signal for the model to pick up on. For example, certain users may have drastically different behavior on the weekends than weekdays. Providing the model with the additional input - day of the week - allows it to pick up on those trends.
Each logical input to a machine learning model is called a feature. The process of hypothesizing, building, and testing inputs to a model is called feature engineering. Typically, feature engineering is one of the most time-consuming tasks for ML teams.
Creatively coming up with new features is only part of the process; most of the time is spent finding the data you need, learning its peculiarities and edge cases, and building data pipelines to clean and transform it into a usable form.
In a Microservice-based architecture, the only way to collect certain data is via an API call. A single model may have a wide variety of features that require API calls.
For example, the model may need to query the user service for the communities a user is a part of, then it needs to query the top posts in each of those communities, and then query the likes that each of those garnered.
Since systems are almost never built for this sort of usage, it will set in motion a domino effect of downstream network calls. Unlike a web front end, ML models are not flexible around missing features. It will have to wait for all of the requests to complete or risk providing garbage results.
ML models themselves are quite computationally heavy and very slow; mix this with microservices, and it becomes impossible to serve real-time recommendations.
Getting Training Data Sucks
Machine Learning models work by mimicking what they observe. To do this, a model requires a dataset of observed results and the inputs at the point in time of the observation to train on. Generating a training set in a microservice-based architecture sucks.
In our Reddit example, we'd want an set of user upvotes along with the feature set at the moment they performed the upvote. Though it's far from ideal, we can technically get user upvotes and all post data by scraping the user and post microservices.
However, point-in-time correctness is what makes this problem impossible in many cases. To train, we would need to know what rank the post was in the subreddit; however, the subreddit microservice is unlikely to support retroactive queries.
One option is to avoid the microservices entirely by breaking encapsulation and reading from database dumps. We could now by-pass the APIs and plug the datasets directly into Apache Spark or some other batch processing system.
We can join tables, and work with the data in a far more convenient way. It also makes it possible to generate training datasets as fast as Spark can crunch the data. While using a data lake seems reasonable at first, it leads to the ML services being dependent on the schema of the raw data. The schema will surely change over time.
Microservices are retired and replaced. Data inconsistencies, errors, and issues pile up. Eventually, each ML team has to maintain data pipelines that have devolved into archaic messes of spaghetti code to bandaid the situation. The pipelines are fragile to changes and take a vast amount of time and resources.
Another option is to utilize an event steaming platform like Apache Kafka, Apache Pulsar, or Segment to allow ML teams to subscribe to the event streams that they need. Many of the pitfalls for data lakes also apply to event streaming.
However, unlike data lake dumps, event streams tend to have higher quality data. Since event streams often power mission-critical services, teams are held to a higher standard regarding data quality and documentation. Conversely, data lakes are exclusive to ML and analytics teams and are not held to a high standard.
Event stream processing suffers from the cold start problem. Event streaming platforms are rarely configured to retain events for long, often it's for as little as one week. If you want to generate a new feature you may be stuck with only the last week of data to generate a training set. The cold start problem is further pronounced when creating a new stateful feature.
A stateful feature requires you to aggregate events over some window of time. For example, the number of posts a user made in the last week. In these situations, it could take weeks to even start generating the training data set.
Feature engineering is an iterative process. You generate a hypothesis, build an experiment, and run a test. You either merge it into the main model or scrap it.
The faster iterations happen, the faster the model gets better. If it takes weeks to perform a single test, ML teams lose their ability to perform their jobs. Teams end up wasting away plumbing data pipes and playing politics to get access to data, rather than building better models.
The microservice problem isn't new or unique. Intriguingly, many companies independently landed on the same solution to the problem. AirBnB built Zipline, Uber built Michelangelo, and Lyft built Dryft. These systems are collectively called Feature Stores.
What is a Feature Store?
A feature store provides a standardized way for data scientists to define features. The feature store handles generating training data and providing online features for serving. It abstracts away the data engineering from the ML workflow. Under the hood, it coordinates multiple big data systems to seamlessly process both incoming events and a past events. If you're interested in the specific technologies, [here's the layout of our feature store infrastructure.
In the original "Reddit" microservice architecture, each service owned its own data. The posts microservice is the source of truth for data about posts, the user microservice is the source of truth for data about users, and so on.
A feature store attempts to create its own views of this data in its own internal data structures. It does this by processing a stream of domain events into materialized state. Domain events are logical events such as when a user upvotes an article or when a new post is created. A materialized view is the result set from running a query on an event stream. So, if we wanted the model to know the number of posts that a user upvoted, then we could create a materialized view with this logic:
SELECT user, COUNT(DISTINCT item) FROM upvote_stream GROUP BY user;
All the materialized views exist in the same feature store and are preprocessed for ML use. We've merged all the microservice data that we care about into a monolithic data store. This eliminates the problems related to fetching real-time features from microservices. Now features can be fetched in one round trip. The feature store has the added benefit of being fault resistant, since the materialized views are stored in a highly-available and eventually-consistent data store. We own our own business logic in creating features since we are processing the raw events ourselves, so we are loosely coupled to the business logic in each microservice.
Event Sourcing to generate Training Data
Our features are kept consistent with incoming events. It's essentially a mirror of the microservices tables, but already preprocessed for use in ML and all in one place. Unlike the microservices, a feature store persists every incoming event indefinitely into a log. If we were to replay every single event in the log to a blank feature store, we would end up with the exact same materialized state. The event log becomes the source of truth for the system. This design pattern is called Event Sourcing.
Event Sourcing lets us generate a training dataset for our models. To illustrate how this is done, let's take the Reddit example where we want to predict the next post that a user will upvote. The relevant domain events are streamed into the feature store, which then updates the model input features. Observed outcomes should also be streamed to the feature store, in this case every user upvote.
Since the feature store maintains a log of every event and the logic to turn an event stream into state, it can get the state of the features at any point in time. To generate a training set, it loops through the observations and generates the feature set at that point. By combining the two, it ends up with a training dataset.
for observed in observations:
features = feature_store.get_features(observed.userId)
yield (features, observed.postId)
A Feature Store allows us to decouple from the microservice architecture, and own our own features. However, building and maintaining a feature store is not free. Teams should consider the following points before deploying feature store infrastructure.
System-Wide Event Streaming
A feature store requires that domain events are passed through an event streaming platform like Kafka or Pulsar. This allows the feature store to materialize its state independently of the microservices. Persisting the event log allows it to materialize the features at any point in time.
Moving a large microservice-based system to use event streams is a monumental shift. Routines have to be injected to capture important events. This may require updating old, mission-critical microservices with a new dependency and new error conditions to catch.
Another option is to use Change-Data-Capture semantics from each database to turn updates to a stream. However, the feature store is then vulnerable to schema changes inside a microservice's database.
Handling Event Schema Changes
A feature store is still dependent on the schema of the event streams. If a stream changes its schema, or a microservice misbehaves and uploads garbage data, it can incapacitate the feature store downstream. Event streams schemas should be treated with the same care as a database schema. Migration procedures should be clear and tested. Events should be written with an extensible format like Protobuf or JSON.
Storage and Compute Capacity
Processing and storing huge amounts of data is not free. In many cases, the feature store will repeat computations performed by individual microservices. A feature store trades infrastructure cost and complexity for developer velocity and ease of use. Building and maintaining a feature store requires both money and specialized engineers.
Documenting and Sharing Input Features and Data Sources
ML features are often applicable in many different use cases. Reddit may have many different ML-powered features that all use a user's activity to make decisions. Discovering and understanding features that others have built can speed up ML development time and provide inspiration in feature engineering. Since feature stores are a relatively new piece of architecture, teams will have to document how features should be published and shared.