We’ve all heard of events and event-driven programming, but in my experience, events are not used nearly enough in our (software) lives. We often don’t appreciate how powerful this tool can be in our toolbox, and consequently we don’t take advantage of it when we really should.
In this post, I’ll discuss the lifecycle of a “fix” at Stitch Fix and how we use events to model it. I’ll suggest that we should treat events as first-class citizens in our systems, because they help us both to decouple parts of the system and to reason about them independently. Lastly, I’ll talk about how common events are in the real world, and use the metaphor of software development itself to build intuition about event-based systems.
At Stitch Fix, the engineering team builds and operates more than 70 individual applications and services, serving every aspect of Stitch Fix’s business. We have applications for our Merchandising team, which is responsible for buying the clothes; our Warehouse Operations team, which stores and ships them; and our thousands-strong Styling team, which chooses them for our clients. We have applications used by our clients to schedule their Fixes, rate them, and pay for them. And we have applications used by our Client Experience team to help give our clients the best clothing purchase experience we can. Almost all of these applications and services operate on one or more of the core entities in our business — clients, items, fixes, etc. — in one way or another.
To walk through just one motivating example, a nascent fix is created when a client tells us she’d like to receive it on a particular day (scheduling). Based on where and when that fix is shipping, we assign it to one of our warehouses around the US (warehouse assignment), and we make it available to be styled by one of our 3500 stylists (stylist assignment). The stylist selects 5 items she expects the client to enjoy (styling), and we’re ready to send it out. The warehouse team picks, packs, and ships the fix (shipping), and the client receives it on her doorstep. She keeps what she wants, and returns what she doesn’t (checkout). And the cycle is complete.
We just described a moderately complicated workflow, with many individual steps, all operating on a single fix. Viewed through the lens of software engineering, the most straightforward way to model this is as a state machine, with individual events indicating that the fix has transitioned from one state to another. And that’s exactly how we implement it. Here’s a simplified representation of this workflow:
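A rough sketch follows; the state and event names are illustrative, not Stitch Fix’s actual schema. The point is simply that each event moves a fix from one well-defined state to the next.

```python
# Hypothetical sketch of the fix lifecycle as a state machine.
# Each transition is triggered by an event; the names are illustrative only.
FIX_TRANSITIONS = {
    # (current state, event)                   -> next state
    ("created",            "scheduled"):          "scheduled",
    ("scheduled",          "warehouse_assigned"): "warehouse_assigned",
    ("warehouse_assigned", "stylist_assigned"):   "stylist_assigned",
    ("stylist_assigned",   "styled"):             "styled",
    ("styled",             "shipped"):            "shipped",
    ("shipped",            "checked_out"):        "complete",
}

def apply_event(state: str, event: str) -> str:
    """Advance a fix to its next state, rejecting transitions that aren't allowed."""
    try:
        return FIX_TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}")

state = "created"
for event in ["scheduled", "warehouse_assigned", "stylist_assigned",
              "styled", "shipped", "checked_out"]:
    state = apply_event(state, event)
print(state)  # -> "complete"
```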
Several things immediately come to mind.
Notice that if we only had the standard tools of the classic 3-tier architecture at our disposal, we’d be in trouble. We’re all familiar with its three fundamental application building blocks: the presentation tier, the application (business logic) tier, and the persistence (database) tier.
I strongly believe that events represent a fourth fundamental building block, deserving the same first-class treatment as the other three.
In a [micro]services architecture like the one we have at Stitch Fix, a given application or service might be a producer of events, a consumer of events, or both. For example, the styling application listens for the _stylist_assigned event, and displays to the stylist all the information she needs to style the fix. When she is done with the fix, she clicks the “Ship It” button, which (among other things) publishes the _styled event. The warehouse services listen for that event, and can start their work, and so on.
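Here’s a rough sketch of that interaction, using a minimal in-process publish/subscribe helper as a stand-in for whatever messaging infrastructure actually carries these events; the handler and payload shapes are assumptions for illustration.

```python
from typing import Any, Callable

Handler = Callable[[dict[str, Any]], None]

# Minimal stand-in for a real message broker; only the publish/subscribe
# shape matters for this sketch.
_subscribers: dict[str, list[Handler]] = {}

def subscribe(event_name: str, handler: Handler) -> None:
    _subscribers.setdefault(event_name, []).append(handler)

def publish(event_name: str, payload: dict[str, Any]) -> None:
    for handler in _subscribers.get(event_name, []):
        handler(payload)

# The styling application consumes _stylist_assigned ...
def on_stylist_assigned(event: dict[str, Any]) -> None:
    print(f"show fix {event['fix_id']} to stylist {event['stylist_id']}")

subscribe("_stylist_assigned", on_stylist_assigned)

# ... and, when the stylist clicks "Ship It", produces _styled,
# which the warehouse services (among others) listen for.
def ship_it(fix_id: str, items: list[str]) -> None:
    publish("_styled", {"fix_id": fix_id, "items": items})
```

Note that `ship_it` neither knows nor cares who, if anyone, is listening for _styled.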
Consuming these events and producing these events are first-class parts of these applications and services; they need these events to do their jobs. So when we talk about the “interface” of one of these services, let’s make this explicit. A service interface includes:

- the synchronous request-response API it exposes
- the events it consumes
- the events it produces
More generally, a service’s interface includes any mechanism for getting data into or out of the service. As a service owner or a service consumer, we forget this at our own peril.
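One hypothetical way to make this explicit is to write the whole interface down in one place, events included. The endpoint and event names below are illustrative, not our actual API.

```python
from dataclasses import dataclass, field

# A hypothetical, explicit description of a service's full interface:
# its synchronous endpoints plus the events it consumes and produces.
@dataclass
class ServiceInterface:
    synchronous_endpoints: list[str] = field(default_factory=list)
    events_consumed: list[str] = field(default_factory=list)
    events_produced: list[str] = field(default_factory=list)

styling_interface = ServiceInterface(
    synchronous_endpoints=["GET /fixes/{id}", "POST /fixes/{id}/selections"],
    events_consumed=["_stylist_assigned"],
    events_produced=["_styled"],
)
```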
The producer of an event publishes it, and zero or more consumers subscribe to it. Maybe no one is listening; maybe one is; maybe many are. The producer does not know or care. This gives the nice property that the producer and consumer are completely decoupled from one another. We can add more consumers, remove them, or scale them up and down — without the producer being any the wiser.
Once we represent all the interesting state transitions for our entity as events, we can use those events as a record of what happened to that entity, and when. This is hugely valuable when we want to go back and see what went on. It’s common for our client experience team (think “customer support”, but with more smiles and empathy) to look up the history of a fix when they are trying to help out a client. It’s common for our data science team to use the events around a fix to optimize various aspects of our workflow. And it’s common for our engineering team to use this as a debugging and diagnosis tool.
Taking this idea to its logical conclusion, we could imagine *only* retaining the events and never bothering to store the current state in any permanent way. After all, we can always reconstruct it by playing the events forward. This is such a clever idea that people have already thought of it: it is called “event sourcing” (see the many writings of Greg Young, Chris Richardson, etc.), and it has a lot of wonderful resilience properties, particularly in distributed systems. In fact, there are entire software systems built around this idea (Event Store, Kafka, Akka Persistence, etc.). It’s also, of course, exactly how double-entry bookkeeping works in accounting, so even these (very smart) people were preempted by the Medici 700 years ago.
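Here’s a minimal sketch of the replay idea, with made-up event names and payloads: the log of events is the only thing stored, and the current view of the fix is recomputed by folding over it.

```python
# Minimal event-sourcing sketch: the event log is the source of truth,
# and the fix's current state is recomputed by replaying it.
# Event names and payloads are made up for illustration.
fix_events = [
    {"type": "scheduled",          "ship_date": "2017-06-01"},
    {"type": "warehouse_assigned", "warehouse": "Phoenix"},
    {"type": "stylist_assigned",   "stylist_id": 42},
    {"type": "styled",             "items": ["shirt", "jeans"]},
]

def current_state(events: list[dict]) -> dict:
    """Fold the event stream into the latest view of the fix."""
    state: dict = {"status": "created"}
    for event in events:
        state.update({k: v for k, v in event.items() if k != "type"})
        state["status"] = event["type"]
    return state

print(current_state(fix_events))
# {'status': 'styled', 'ship_date': '2017-06-01', 'warehouse': 'Phoenix',
#  'stylist_id': 42, 'items': ['shirt', 'jeans']}
```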
I often hear that it’s hard for developers to think in terms of events. It can feel a bit counterintuitive if you’re used to building classic three-tier applications. What can trip people up is that, at any given moment, something might have happened in one part of our system, but the effects of that action are not yet visible in other parts of the system. We use words like “eventual consistency” and “asynchrony” here, and they’ve earned their reputation for being hard to reason about. But I’d like to suggest that you have a lot more intuition about events than you think you do. If you can think about your problem as a workflow, you’re more than halfway there.
So let’s take an example that will be familiar to every software engineer. Imagine a typical modern software development process, where we write code, check it into source control, test it, stage it, and deploy it. We often talk about this as a “lifecycle” or “pipeline” or “workflow”. Well, that sounds a lot like events. Let’s see.
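Written down as a stream of events (with made-up names; your CI/CD system will have its own vocabulary), that pipeline looks something like this:

```python
# The everyday development pipeline, expressed as a stream of events.
# Event names are made up for illustration.
pipeline_events = [
    "code_committed",          # you check the change into source control
    "build_succeeded",         # CI compiles and packages it
    "tests_passed",            # the test suite runs against that build
    "deployed_to_staging",     # the artifact is promoted to staging
    "deployed_to_production",  # and finally rolled out
]

# Each stage reacts to the previous event whenever it arrives, not the
# instant you hit return in your editor; the workflow is asynchronous.
```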
This seems super-familiar — we do this every day! And it’s not just a tangential thing off in the corner; for most of us, this is our job.
Think about what would happen if this did not behave like an asynchronous workflow, and we did all of it synchronously instead. Imagine that every time you hit return in your IDE, your code were deployed to production. I’m all for continuous delivery, believe me, but this would be crazy.
So the idea that you can’t reason about systems where there is an “inconsistency” between one part of the system and another is a little silly. For much of the day, one version of your code is on your laptop and another, “stale” version is running in production, and everything works just fine.
Thanks for reading this far. If you have, I hope I’ve got you thinking that you should reach for the “event” tool in your toolbox more often. It can help you in many ways.
And if you’re having trouble reasoning about events, just think about what you do for 8, 10, 12 hours every day.
If you are interested in deepening your knowledge of events and event-driven systems, the writings of Greg Young and Chris Richardson mentioned above are a great place to start.