Mark Hopson

@markhopson

How Toyota Guides The Evolution Of Our Infrastructure (And Makes Me Appreciate Jira)

November 24th 2018

Reflecting on my company’s journey to Microservices, DevOps, and the Perfect Value Stream

In the Manufacturing industry, the Value Stream Map (VSM) is a popular tool used to measure performance. The VSM was made famous by Toyota, and is essentially a flow chart that describes the necessary steps to produce value.

For example, consider this simple VSM for a Chair Factory.

An example VSM for a Chair Factory.

This example illustrates the 4 steps our factory needs to produce a chair after receiving an Order: Welding, Painting, Assembling, and Inspection.

There are a couple of things worth pointing out here.

  1. The steps must be done sequentially. In this example, Painting must come after Welding for each chair.
  2. The steps can be done in parallel for different work items. In this example, that means Painting and Welding can occur simultaneously, but for 2 different chairs.
  3. The total time for the VSM is the sum of all the steps. In this example, each chair requires 7 days from start to finish.
  4. The output rate is how fast a work item can flow through the VSM. In this example, our factory could produce 1 chair every 3 days if each step always had a chair to work on.

Our Chair Factory working at full capacity can produce 1 chair every 3 days.

The last point is especially noteworthy because it illustrates the difference between total time and output rate. When deciding how to improve a process, the output rate explains why we should focus on the longest step (or biggest bottleneck). In this example, our workflow is stymied only by Inspection.
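
The two measures can be worked out in a few lines. The sketch below is illustrative only: the per-step durations are assumed values chosen to match the figures quoted above (7 days end to end, with Inspection as the 3-day bottleneck), and the original diagram may split the time differently.

```python
# Assumed step durations in days. Only the 7-day total and the 3-day
# Inspection step come from the example; the rest are hypothetical.
steps = {"Welding": 1, "Painting": 2, "Assembling": 1, "Inspection": 3}

# Total time: a single chair passes through every step in sequence.
total_time = sum(steps.values())

# Output rate: when every step always has a chair to work on, a finished
# chair emerges once per cycle of the slowest step.
bottleneck = max(steps, key=steps.get)
days_per_chair = steps[bottleneck]

print(f"Total time per chair: {total_time} days")
print(f"Bottleneck: {bottleneck} (1 chair every {days_per_chair} days)")
```

Shaving a day off Welding would lower the total time to 6 days, but the factory would still finish only one chair every 3 days.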

This example has been about chairs, but Value Stream Maps can also describe software delivery. Consider the VSM below; it’s similar to the one for making chairs.

An example VSM for software delivery.

In this software example, Deploy is the longest step, which means the output rate can be no better than one Feature Request every 3 days, regardless of how fast the other steps are. In other words, this VSM is telling us that Deploy is effectively our only constraint.
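
To see why the other steps barely matter, here is a rough sketch. The step names and durations other than the 3-day Deploy are hypothetical, since the diagram itself is not reproduced here, and `days_per_item` is a helper invented for this illustration.

```python
def days_per_item(steps):
    """Days between finished items when every step is kept busy."""
    return max(steps.values())

# Hypothetical delivery steps in days; only Deploy = 3 is from the example.
delivery = {"Design": 1, "Develop": 2, "Test": 1, "Deploy": 3}

# Speed up a step that is not the bottleneck, then the bottleneck itself.
faster_develop = {**delivery, "Develop": 1}
faster_deploy = {**delivery, "Deploy": 1}

print(days_per_item(delivery))        # 3
print(days_per_item(faster_develop))  # 3 (no change)
print(days_per_item(faster_deploy))   # 2 (Develop is now the bottleneck)
```

Note that fixing Deploy does not remove the constraint; it hands it to the next-slowest step.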

Both the above examples are oversimplified Value Stream Maps, but they illustrate why only the largest bottlenecks matter. And as I reflect on how my company’s architecture has evolved over the years, I’ve realized it’s these bottlenecks that have steered some of our biggest changes.

The Story Of How Our Infrastructure Was Guided By Bottlenecks

My company’s journey began with a few developers and not much else. Coordination was easy since the team was small, and there were no legacy systems to support because none existed. This provided the agility to figure out a business and product. Our Value Stream was clear.

Just a couple of developers and a new Rails project

Then our company grew. Users increased and product evolved, which meant new systems were needed. But managing these systems required us to take time away from development. This became our first bottleneck. Consequently, we introduced tools like Capistrano and Puppet to manage our systems, and so our first big bottleneck was fixed.

More systems required tools like Puppet and Capistrano to manage them.

But our company continued to grow. More developers came, along with more deployments, hot-fixes, and configuration. Our automation helped, but many tasks still required human assistance from our lone Sysadmin, who soon became overwhelmed. This became our second big bottleneck, and as a result more Sysadmins were hired and an Infrastructure (or SRE) team was formed.

More people and teams required an SRE team to deal with all their requests

At this point, we mostly used Ruby and MySQL, and our SRE team allowed us to scale. But soon we would need JavaScript and Scala, with Kafka clusters and Cassandra databases. This turned into our next bottleneck because each new technology we introduced spawned unique requirements that burdened our SRE team with system-specific procedures. Consequently, we migrated our systems onto Docker and a container orchestration (or Microservice) platform, to give our Sysadmins a common interface for supporting our varied systems as our technology choices expanded.

More diverse technologies required Docker to make them easier to support.

Containers allowed us to support a breadth of technologies, but they also encouraged developers to be more involved with the operation of their systems. This meant that an increasing number of teams were asking Sysadmins for changes to their containers. Teams were building faster, and becoming more DevOps, but this dependency on the SRE team became the new bottleneck. Recognizing this trend in our organization led to the creation of a self-serve portal for developers to apply specific changes without a Sysadmin (even in production). Initially, this portal allowed users to set CPU and memory limits for their containers, the most popular request at the time, but features were added as other common requests (or bottlenecks) emerged.

An early mock of our self-serve infrastructure portal for teams.

This portal helped drive DevOps across our company by enabling teams to work freely, but it also left us facing new challenges around cost management and access control. Needless to say, these challenges will not be the last. It’s likely that once they are solved, we’ll find new bottlenecks. The bottlenecks will never cease, but rest assured, that’s not a bad thing.

The story of our infrastructure isn’t new, but what I’ve found enlightening is recognizing the endlessness of our journey. The perfect infrastructure doesn’t exist (the same can be said of any software product), but that doesn’t mean we should stop trying to improve. Reaching the finish line isn’t as important as moving forward.

Continuous Bottlenecks Lead Towards Continuous Improvement

Having bottlenecks to fix is a good thing. Bottlenecks represent an improvement that can be made to how we work. They guide our efforts by forcing us to constantly ask ourselves how we create value, and what that value is.

In my favourite scene from The Social Network, Eduardo asks Mark when he’ll be done building Facebook so he can start monetizing it. Mark aptly replies that Facebook will never be done, similar to how fashion will never be done.

The Facebook will never be done

This scene resonated with me because I think it applies not only to Facebook, but to all forms of craft, whether it be software, art, music, or something else. Lessons can always be learned, growth can always be had, and in the case of software delivery, Value Streams uncover the bottlenecks that lead to the most meaningful improvements.

I was told once that building software is like building railroads; the process begins with pioneering, followed by building, then finally optimization. While this comparison is sometimes useful, the finality of “optimization” felt lacking. Instead, I prefer comparing software development to Star Trek and its mission of continual exploration. Like exploring the universe, building (and improving) is never done. We just need to decide which direction to go.

The Real Difference Between Building Chairs And Writing Software

Manufacturing and software share a continuous journey of improvement, but this is where the similarities end when it comes to the Value Stream Map. Manufacturing deals with the physical world. On a factory floor, you can see the work items, and the workflow itself. It might even be obvious which steps are bottlenecks and error prone. This is not so easy with software.

With software, the work is electronic, and the factory floor is a room full of computers. It’s hard to see what someone is tasked with, or what the workflow is. Software is invisible, which is what really differentiates it from manufacturing, and why tools like Jira are so important for this type of work.

Progress is easier to spot in an assembly line than an office.

Jira is a task management system for software teams to describe and track work items, their progress, ownership, time spent, and other information that’s vital to understanding inefficiencies and building a meaningful Value Stream Map. These task management systems are often thought of as being meant only for planning future work, but a large part of their value is derived from how they chronicle the past, so we can make sense of it and make better decisions in the future.

Creating a ticket to track a work item in Jira
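
As a rough illustration of how such a record of past work can feed a Value Stream Map, the sketch below averages the time tickets spend in each status and picks out the slowest one. The ticket IDs, status names, and dates are all made up, and the tuple layout is an assumption for this example rather than Jira’s actual export format.

```python
from collections import defaultdict
from datetime import date

# Hypothetical log of status changes: (ticket, status entered, date).
transitions = [
    ("FEAT-1", "Develop", date(2018, 11, 1)),
    ("FEAT-1", "Review", date(2018, 11, 3)),
    ("FEAT-1", "Deploy", date(2018, 11, 4)),
    ("FEAT-1", "Done", date(2018, 11, 7)),
    ("FEAT-2", "Develop", date(2018, 11, 2)),
    ("FEAT-2", "Review", date(2018, 11, 3)),
    ("FEAT-2", "Deploy", date(2018, 11, 5)),
    ("FEAT-2", "Done", date(2018, 11, 10)),
]

def average_days_per_status(transitions):
    """Average number of days tickets spent in each status."""
    by_ticket = defaultdict(list)
    for ticket, status, day in transitions:
        by_ticket[ticket].append((day, status))

    durations = defaultdict(list)
    for events in by_ticket.values():
        events.sort(key=lambda event: event[0])  # order by date
        # Time in a status runs until the next status is entered.
        for (start, status), (end, _) in zip(events, events[1:]):
            durations[status].append((end - start).days)

    return {status: sum(d) / len(d) for status, d in durations.items()}

averages = average_days_per_status(transitions)
bottleneck = max(averages, key=averages.get)
print(averages)
print("Bottleneck:", bottleneck)
```

With this made-up data, Deploy averages 4 days against 1.5 for Develop and Review, so it is the step worth examining first.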

A lot of work has been done to understand developer productivity and its driving factors. In the age of DevOps, this has resulted in recognizing Deployment Frequency, Change Lead Time, and other such developer metrics. However, in my experience nothing is as universally effective as simply journaling work and time. I strongly believe the benefits individuals receive from journaling and reflection apply to organizations too.

The single most important thing for any software company to focus on is understanding how it creates value. Many companies get lost choosing between technologies, or organizational structures, but none of this intrinsically matters. The only thing that should matter is how value is created, and the obstacles that get in the way. And by recording these experiences (with Jira or something else), and reflecting upon them, we’re able to not only recognize the obstacles in our way, but reveal the path forward.
