Events As First-Class Citizens

by Randy Shoup, January 4th, 2018

We’ve all heard of events and event-driven programming, but in my experience, events are not used nearly enough in our (software) lives. We often don’t appreciate how powerful this tool can be in our toolbox, and consequently we don’t take advantage of it when we really should.

In this post, I’ll discuss the lifecycle of a “fix” at Stitch Fix and how we use events to model it. I’ll suggest that we should think of events as a first-class citizen in our system, because they help us both to decouple parts of the system and to reason about them independently. Lastly, I’ll talk about how common it is to use events in the real world, and use the metaphor of software development itself to help develop intuition about event-based systems.

The Lifecycle of a Fix

At Stitch Fix, the engineering team builds and operates more than 70 individual applications and services, serving every aspect of Stitch Fix’s business. We have applications for our Merchandising team, which is responsible for buying the clothes; our Warehouse Operations team, which stores and ships them; and our thousands-strong Styling team, which chooses them for our clients. We have applications used by our clients to schedule their Fixes, rate them, and pay for them. And we have applications used by our Client Experience team to help give our clients the best clothing purchase experience we can. Almost all of these applications and services operate on one or more of the core entities in our business (clients, items, fixes, etc.) in one way or another.

To walk through just one motivating example, a nascent fix is created when a client tells us she’d like to receive it on a particular day (scheduling). Based on where and when that fix is shipping, we assign it to one of our warehouses around the US (warehouse assignment), and we make it available to be styled by one of our 3500 stylists (stylist assignment). The stylist selects 5 items she expects the client to enjoy (styling), and we’re ready to send it out. The warehouse team picks, packs, and ships the fix (shipping), and the client receives it on her doorstep. She keeps what she wants, and returns what she doesn’t (checkout). And the cycle is complete.

We just described a moderately complicated workflow, with many individual steps, all operating on a single fix. Looked at through the lens of software engineering, the straightforward way to model this is as a state machine, with individual events that indicate that the fix has transitioned from one state to another. And that’s exactly how we implement it. Here’s a simplified representation of this workflow:

Request a fix
-> Fix is _scheduled
Assign fix to warehouse
-> Fix is _hizzy_assigned (we call our warehouses “hizzies”; don’t ask)
Assign fix to a stylist
-> Fix is _stylist_assigned
Style the fix
-> Fix is _styled
Pick the items for the fix
-> Fix is _picked
Pack the items into a box
-> Fix is _packed
Ship the fix via a shipping carrier
-> Fix is _shipped
Fix travels (as actual atoms!) to the client
-> Fix is _delivered
Client decides what to keep and return, pays for her fix
-> Fix is _checked_out

Several things immediately come to mind:

As the fix moves happily along, different applications or services do something with it: enriching it with more (meta)data, connecting it up with something else, or doing something physical with the goods or the packaging. Doing that something takes the fix from one state to another (state transitions).
The next application or service can only do its work when the previous one has done its, and consequently needs to know when to step up (events).
We can’t skip any of these steps, or there would be something missing — in many cases, quite literally. Said another way, from a given state, it’s only possible to go to a subset of the other states (state machine).
If we just remembered the current state of things (where the fix is right now), we’d be missing a lot. We want to be able to ask where the fix is right now, but we also want to know where it has been, how long it was there, and when it moved to the next step. So if we only stored its current state in some database table and nothing else about it, we’d be stuck. Instead, we also need to record all the steps along the way.
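
To make these four observations concrete, here is a minimal Python sketch of a fix as a state machine that records every transition as an event. It is an illustration only, not our production code: the `Fix` and `FixEvent` classes, the `TRANSITIONS` table, and the initial "requested" state are all assumptions made for the example. Only the state names after "requested" come from the workflow above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Allowed transitions, following the simplified workflow above.
# "requested" is a hypothetical initial state added for this sketch.
TRANSITIONS = {
    "requested": {"scheduled"},
    "scheduled": {"hizzy_assigned"},
    "hizzy_assigned": {"stylist_assigned"},
    "stylist_assigned": {"styled"},
    "styled": {"picked"},
    "picked": {"packed"},
    "packed": {"shipped"},
    "shipped": {"delivered"},
    "delivered": {"checked_out"},
    "checked_out": set(),
}


@dataclass(frozen=True)
class FixEvent:
    """One recorded state transition of a fix."""
    fix_id: int
    from_state: str
    to_state: str
    occurred_at: datetime


@dataclass
class Fix:
    fix_id: int
    state: str = "requested"
    history: list = field(default_factory=list)

    def transition(self, to_state, at=None):
        """Move to `to_state` if the state machine allows it, and record the event."""
        if to_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {to_state}")
        event = FixEvent(
            self.fix_id, self.state, to_state, at or datetime.now(timezone.utc)
        )
        self.history.append(event)
        self.state = to_state
        return event

    def time_in_state(self, state):
        """How long the fix stayed in `state`, or None if it never entered or has not left it."""
        entered = next((e.occurred_at for e in self.history if e.to_state == state), None)
        left = next((e.occurred_at for e in self.history if e.from_state == state), None)
        if entered is None or left is None:
            return None
        return left - entered
```

Because every transition is appended to `history`, the questions "where is it now?", "where has it been?", and "how long was it there?" can all be answered from the same record, and an attempt to skip a step fails loudly instead of silently corrupting the fix.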

Events as a First-Class Citizen

Notice that if we only had the standard tools of the classic 3-tier architecture at our disposal, we’d be in trouble. We’re all familiar with these three fundamental application building blocks:

Presentation: The user interface where the user interacts with the system
Application: The “business logic” where we do the work, typically statelessly
Persistence: The place where we store things, typically in a database

I strongly believe that events represent a fourth fundamental building block:

Event: The statement that an interesting thing happened, or, according to Wikipedia, “a significant change in state”

In a [micro]services architecture like we have at Stitch Fix, a given application or service might be a producer of events, a consumer of events, or both. For example, the styling application listens for the _stylist_assigned event, and displays all the information needed to the stylist for her to style the fix. When she is done with the fix, she clicks the “Ship It” button, which (among other things) publishes the _styled event. The warehouse services listen for that event, and can start their work, etc.

Consuming these events and producing these events are first-class parts of these applications and services; they need these events to do their jobs. So when we talk about the “interface” of one of these services, let’s make this explicit. A service interface includes:

Synchronous request-response operations (for us, this is REST/JSON, but it could just as easily be gRPC, Thrift, etc.)
Events the service produces
Events the service consumes
Bulk reads and writes (e.g., an ETL that extracts data from the service into an analytics system)

More generally, a service’s interface includes any mechanism for getting data into or out of the service. As a service owner or a service consumer, we forget this at our own peril.
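
One way to keep this explicit is to write the whole interface down as data. The sketch below is only an illustration: the `ServiceInterface` class, the `consumers_of` helper, the endpoint, and the bulk extract name are hypothetical. The event names follow the workflow described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceInterface:
    """Everything that moves data into or out of a service."""
    name: str
    request_response: tuple = ()  # synchronous operations
    produces: tuple = ()          # events the service publishes
    consumes: tuple = ()          # events the service subscribes to
    bulk: tuple = ()              # bulk reads and writes, e.g. ETL extracts


styling = ServiceInterface(
    name="styling",
    request_response=("GET /fixes/{id}",),  # hypothetical endpoint
    produces=("styled",),
    consumes=("stylist_assigned",),
    bulk=("nightly analytics extract",),    # hypothetical extract
)

warehouse = ServiceInterface(
    name="warehouse",
    produces=("picked", "packed", "shipped"),
    consumes=("styled",),
)


def consumers_of(event, services):
    """Names of the services that would be affected if `event` changed."""
    return [s.name for s in services if event in s.consumes]
```

With the events listed alongside the synchronous operations, a question like "who breaks if I change the styled event?" becomes a lookup, here `consumers_of("styled", [styling, warehouse])`, rather than an archaeology project.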

Events as Decoupling

The producer of an event publishes it, and zero or more consumers subscribe to it. Maybe no one is listening; maybe one is; maybe many are. The producer does not know or care. This gives the nice property that the producer and consumer are completely decoupled from one another. We can add more consumers, remove them, or scale them up and down — without the producer being any the wiser.
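
A tiny in-process publish/subscribe sketch shows this property. The `EventBus` class here is a hypothetical stand-in for a real message broker, and the `fix_styled` event name and payload shape are assumptions for the example. A real system would deliver events asynchronously and durably, where this one simply calls the handlers in order.

```python
from collections import defaultdict


class EventBus:
    """In-process stand-in for a message broker (hypothetical, for illustration)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self._subscribers[event_name].append(handler)

    def publish(self, event_name, payload):
        # The producer only names the event; it never learns who is listening.
        for handler in list(self._subscribers.get(event_name, [])):
            handler(payload)


bus = EventBus()
received = []

# Publishing with zero subscribers is perfectly fine.
bus.publish("fix_styled", {"fix_id": 42})

# Consumers are added later without touching the producer.
bus.subscribe("fix_styled", lambda e: received.append(("warehouse", e["fix_id"])))
bus.subscribe("fix_styled", lambda e: received.append(("analytics", e["fix_id"])))
bus.publish("fix_styled", {"fix_id": 43})
```

The two `publish` calls are identical from the producer's point of view, even though the first reached no one and the second reached two consumers.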

Events as Record

Once we represent all the interesting state transitions for our entity as events, we can use those events as a record of what happened to that entity, and when. This is hugely valuable when we want to go back and see what went on. It’s common for our client experience team (think “customer support”, but with more smiles and empathy) to look up the history of a fix when they are trying to help out a client. It’s common for our data science team to use the events around a fix to optimize various aspects of our workflow. And it’s common for our engineering team to use this as a debugging and diagnosis tool.

Taking this idea to its logical conclusion, we could imagine *only* retaining the events and never bothering to store the current state in any permanent way. After all, we can always simply reconstruct it by playing the events forward. This is such a clever idea that people have already thought of it: it is called “event sourcing” (see the many writings of Greg Young, Chris Richardson, etc.), and it has a lot of wonderful resilience properties, particularly in distributed systems. In fact, there are entire software systems based on this idea (Event Store, Kafka, Akka Persistence, etc.). It’s also, of course, exactly how double-entry bookkeeping works in accounting, so even these (very smart) people were preempted by the Medici 700 years ago.
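
"Playing the events forward" is just a fold over the event log. Here is a minimal sketch, assuming a made-up log for a single fix. The timestamps, the shape of the state dictionary, and the `apply_event` and `replay` functions are all hypothetical, and none of this uses the API of Event Store, Kafka, or Akka Persistence.

```python
from functools import reduce

# Hypothetical event log for one fix: (event name, ISO timestamp).
events = [
    ("scheduled", "2018-01-02T09:00:00"),
    ("hizzy_assigned", "2018-01-02T09:05:00"),
    ("stylist_assigned", "2018-01-03T14:00:00"),
    ("styled", "2018-01-04T10:30:00"),
]


def apply_event(state, event):
    """Pure function: previous state plus one event gives the next state."""
    name, timestamp = event
    return {"status": name, "updated_at": timestamp, "version": state["version"] + 1}


def replay(log):
    """Reconstruct state by folding the events, oldest first, over an empty state."""
    initial = {"status": None, "updated_at": None, "version": 0}
    return reduce(apply_event, log, initial)


current = replay(events)          # state after all four events
as_of_jan_2 = replay(events[:2])  # state as it was after the first two events
```

Replaying a prefix of the log gives the state as of any earlier moment, which is what makes the log useful as a historical record and not only as a way to recover the present.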

Events are How the Real World Works™

I often hear that it’s hard for developers to think in terms of events. It can feel a bit counterintuitive if you’re used to building the classic three-tier architecture. What can trip people up is that at any given moment something might have happened in one part of our system, but the effects of that action are not yet visible in the other areas of our system. We use words like “eventual consistency” and “asynchrony” here, and they’ve earned their reputation for being hard to reason about. But I’d like to suggest that you have a lot more intuition about events than you think you do. If you can think about your problem as a workflow, you’re more than halfway there.

So let’s take an example that will be familiar to every software engineer. Imagine a typical modern software development process, where we write code, check it into source control, test it, stage it, and deploy it. We often talk about this as a “lifecycle” or “pipeline” or “workflow”. Well, that sounds a lot like events. Let’s see.

Write code
-> code is _submitted
Test code
-> code is _tested
Deploy it to a staging server
-> code is _staged
Deploy it to production
-> code is _deployed

This seems super-familiar — we do this every day! And it’s not just a tangential thing off in the corner; for most of us, this is our job.

Consider what would happen if this did not behave like an asynchronous workflow, and we did all of this synchronously. Imagine that every time you hit return in your IDE, your code would be deployed to production. I’m all for continuous delivery, believe me, but this would be crazy.

So this idea that you can’t reason about systems where there is an “inconsistency” between one part of the system and another is a little silly. For much of the day, your code has one version on your laptop, and another “stale” version in production, and everything works just fine.

Conclusion

Thanks for reading this far. If you have, I hope I’ve got you thinking that you should reach for the “event” tool in your toolbox more often. It can help you in many ways.

And if you’re having trouble reasoning about events, just think about what you do for 8, 10, 12 hours every day.