Complexity. Software. Resiliency.

Written by kfanous | Published 2021/08/28

TLDR: Complex systems display collective behavior that emerges from interactions between their parts. Software development exhibits many of the properties of complex systems: emergence, non-linearity, and adaptation. Software teams, most notably OSS projects, are typically autonomous and self-managed. More importantly, the process of building software is at its heart based on social interactions. This requires software teams to be able to adapt to their changing environment.

Complex Systems: The Beginning

I recently finished reading Complexity: The Emerging Science at the Edge of Order and Chaos by M. Mitchell Waldrop, a book I highly recommend.

The book was my first foray into the world of complex systems. It provided a good overview of the topic as well as of the Santa Fe Institute (an independent, nonprofit theoretical research institute located in Santa Fe and dedicated to the study of complex systems).

As I was reading the book, I couldn’t help but think about the properties of complex systems and their prevalence in software development, at least for large software projects.

Complex systems, not to be confused with complicated ones, are systems that possess the following properties:

  • They are non-linear, meaning that a small change to the system’s input can result in a disproportionate change to the system’s output.
  • They are emergent, meaning that the properties the whole system exhibits differ from those shown by the individual parts comprising the system.

Thus, complex systems display collective behavior that emerges from the interactions between their parts. Their other main properties are adaptation and feedback loops, which means that these systems change and adapt to their environment.
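To make non-linearity concrete, here is a minimal sketch using the logistic map, a standard textbook example of non-linear dynamics. It is my own illustration rather than something taken from the book, and the starting value, the perturbation size and the parameter r = 4.0 are all assumptions chosen for the demonstration. Two runs that start almost identically are iterated side by side while the gap between them is recorded.

```python
def gap_over_time(x0: float, eps: float, r: float = 4.0, steps: int = 50) -> list[float]:
    """Iterate x -> r * x * (1 - x) from two nearby starting points.

    Returns the absolute gap between the two trajectories after each step.
    """
    a, b = x0, x0 + eps
    gaps = []
    for _ in range(steps):
        a = r * a * (1.0 - a)
        b = r * b * (1.0 - b)
        gaps.append(abs(a - b))
    return gaps


eps = 1e-9  # hypothetical "small change to the system's input"
gaps = gap_over_time(0.2, eps)

print(f"input difference:   {eps:.2e}")
print(f"gap after one step: {gaps[0]:.2e}")
print(f"largest gap seen:   {max(gaps):.2e}")
```

In a linear system the gap would stay proportional to the input difference. Here it grows by many orders of magnitude within a few dozen steps, which is the "disproportionate change" the first property describes.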

Complex systems are very common; in fact, we live in one. Our climate is an example of a complex system. So is the entire human body, and even a single human cell.

Cities are complex systems and so is the universe.

It’s worth noting that there is a difference between complicated and complex systems.

A car engine is a complicated system, but it is not complex.

A car’s engine is composed of many parts that behave according to some specification. These parts operate in concert to ultimately provide the functionality of an engine.

There are no emergent properties to be found in a car’s engine. Nor does it adapt to its environment and, thankfully, it doesn’t exhibit non-linear behavior.

Software Development: A Complex System?

I believe, mostly based on my experience, that the process of building fairly large software is an example of a complex system. In my opinion, software development exhibits many of the properties of complex systems: emergence, non-linearity, and adaptation.

If you’ve ever worked on a software team, you might have witnessed how the team interactions are unique. A software team’s characteristics aren’t the average of the individuals comprising the team; rather, they are a unique amalgam of these individuals’ characteristics.

Software teams, most notably OSS projects, are typically autonomous and self-managed. More importantly, the process of building software is at its heart based on social interactions.

Individuals working together on a software project have to find the best way for them to interact. Examples of these interactions include knowledge sharing, the team structure and hierarchy, coding guidelines and more. In short, the team’s behavior, principles and characteristics emerge from the interactions between the individuals.

Every software project I have worked on is subject to changing requirements and assumptions.

This requires software teams to be able to adapt to their changing environment. These changes can also be immensely disruptive.

What at first appears to be a simple change can in fact result in a significant amount of work or, even worse, introduce catastrophic failures. This is yet another characteristic of complex systems: non-linearity.

Ok, so what?

Maybe software development is indeed a complex system, maybe it’s not. I’m not here to make this argument or provide irrefutable evidence to sway you that it is. However, like any system, complex or not, you probably want your software development “system” to be resilient.

“resilience determines the persistence of relationships within a system and is a measure of the ability of these systems to absorb changes of state variables, driving variables, and parameters, and still persist.” C. S. Holling

Resilience is basically the ability of a system to absorb a shock and quickly bounce back to a functional, equilibrium state.

Oftentimes, the shocks that can bring a seemingly indestructible system down are initially very minor, even inconsequential. Consider the collapse of the Soviet Union, or the collapse of several Arab regimes during the Arab Spring that began in late 2010.

Chaos Engineering: Making the Software Team Resilient

I can think of two main shocks to the software development process: changing requirements and personnel departures. I’ll be ignoring the former in this article and focusing on the latter, which, arguably, can have a larger negative impact and is one that I think is overlooked.

You can never truly plan for employee departures. They are typically abrupt, especially if the employee leaves of their own volition. Employees can leave for many reasons: moving, taking time off, a new job and many more.

Whatever the reason, an employee’s departure represents a shock to your software development system. That shock, or impact, can take many forms.

Perhaps the departing employee is an expert in one or more subsystems of the software, and their departure will result in a significant knowledge gap. This expertise is not limited to knowledge about various modules or subsystems of the software.

It can also include other skills. The departing employee could be an expert debugger, your go-to person for complex and hard-to-reproduce bugs. They could also be an excellent system designer whom your teams rely on for system design and architecture. And, obviously, when an employee leaves, the team has to pick up their work, which can delay the feature the team is working on.

Regardless of the skills that person possesses, each departure can be a shock to your software development “system”. This shock can be limited to the team the individual worked on or, for more senior engineers, the impact could be much wider.

One way to minimize the impact of shocks, like the ones induced by departing employees, is to introduce them artificially. You know that you will be losing people in any given year, so perhaps practicing for these departures will better position you and your teams to handle them when they actually do happen. No, I am not suggesting random firings!

This, in principle, is what a recent article by Dan Lebrero is trying to address. Dan introduces the concept of a Lucky Lotto, shown below. Note that Dan later modified the rules. I encourage you to read his article in full.

Welcome to Akvo’s Lucky Lotto!

Starting last week of September, we are going to start running our own Akvo’s Lucky 
Lotto. All of you will have a chance to win, and your team to enjoy the results of 
your disappearance.

Rules:
1- Every Monday a random person will win the Lucky Lotto.
2- The winner will work on some side project.
3- The winner will be completely unavailable to colleagues and to the rest of Akvo 
   for the week.
4- Everybody, including product managers, gets one ticket every week, even if you 
   don’t want it.
5- Every time that rule 3 must be broken, the winner must make a note (I will share 
   some doc to do this).

Above copied with permission.

A reasonable assumption to make, if such a process were implemented, is that the disruption it would initially cause would be almost identical to that of the real event.

Put differently, if Bob wins the Lucky Lotto of the week, then the impact on his team is almost the same as if Bob had actually left the company. However, in time and as more of these shocks are introduced to the system, the impact should diminish. Teams will adapt, learn and become resilient.

I’d imagine that the relationship between disruption to the team, resiliency and the number of shocks induced would look roughly as follows.

Initially, the disruptions are very strong and the resiliency build-up is slow. However, with more practice, the disruption starts to diminish and the resiliency starts to ramp up. Both eventually reach a plateau, or asymptote, at which a departure causes little disruption and adds little further resiliency.
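One way to picture that shape is with an S-shaped (logistic) curve. To be clear, this is a toy model I am assuming purely for illustration. The functional form, the `midpoint` and the `rate` are made-up parameters, not measurements from any team.

```python
import math


def resiliency(shocks: int, midpoint: float = 6.0, rate: float = 0.8) -> float:
    """Assumed logistic curve in (0, 1): slow start, ramp-up, then plateau."""
    return 1.0 / (1.0 + math.exp(-rate * (shocks - midpoint)))


def disruption(shocks: int, midpoint: float = 6.0, rate: float = 0.8) -> float:
    """Assumed mirror image of resiliency: high at first, then fading out."""
    return 1.0 - resiliency(shocks, midpoint, rate)


for n in range(0, 13, 2):
    print(f"shocks={n:2d}  disruption={disruption(n):.2f}  resiliency={resiliency(n):.2f}")
```

With these assumed parameters, the first few shocks barely move either curve, the middle ones move them the most, and the later ones change very little, which matches the plateau described above.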

Did any of this work?

I obviously have no data to back this up. In fact, I haven’t even tried to run a lotto system like the one Dan introduced. Dan does share a few results, quoted below.

Three months running the Lucky Lotto showed several instances of a bus factor of 
one, and gave the teams the opportunity to step up, learn and cover for the missing 
person’s skills. As an example, our one and only Android developer won the Lotto 
the same week that the team was going to fix some major performance issue on the 
communications between the app and the server. It was a great learning experience for 
the team.

For the Lucky Lotto winner, it was a very enjoyable week, to either learn something 
new (Kubernetes, backend development, our deployment pipeline, Cypress, Clojure, …), work on those long desired dev improvements that we never had time for, or to do something different from the usual churn.

These days were a great mirror into where I actually spend my time and if that is the best way to handle the tasks.

In addition to the knowledge sharing, we got some cross-pollination and broader-team 
building as some winners decided to work with the other product team during their 
Lotto week.

Above copied with permission.
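The "bus factor of one" cases that the Lotto surfaced can also be looked for ahead of time, at least roughly. The sketch below assumes you can write down who covers which skill. The team, the skill names and the function `single_points_of_failure` are all hypothetical, and a real skills inventory would be far messier than a dictionary.

```python
from collections import defaultdict


def single_points_of_failure(skills_by_person: dict[str, set[str]]) -> dict[str, str]:
    """Return {skill: sole_owner} for every skill that only one person covers."""
    owners: dict[str, set[str]] = defaultdict(set)
    for person, skills in skills_by_person.items():
        for skill in skills:
            owners[skill].add(person)
    return {
        skill: next(iter(people))
        for skill, people in owners.items()
        if len(people) == 1
    }


# Hypothetical team: who can cover what.
team = {
    "alice": {"backend", "deployment"},
    "bob": {"backend", "android"},
    "carol": {"frontend", "deployment"},
    "dave": {"frontend"},
}

risks = single_points_of_failure(team)
for skill, owner in sorted(risks.items()):
    print(f"bus factor of one: {skill} (only {owner})")
```

In this made-up team, only the android skill has a single owner. A list like this is just a starting hypothesis, though. The appeal of the Lotto is that it finds the gaps nobody thought to write down.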

I don’t know what the long-term impacts of a process like the Lucky Lotto are. Perhaps it does indeed result in more resilient teams, which I truly hope it does.

I do know that, as an industry, we spend far more time trying to make our software resilient than we spend on the much-needed resiliency of our processes and teams.

We introduce random failures to our software. We bring databases down. We pull disks out of servers. We reboot servers randomly. We do all of this to observe how our software behaves in response to failures or shocks to the system.

Perhaps it’s time we focus on making our individuals, teams, and processes more resilient too? Lastly, if you’ve tried anything remotely similar to this lotto approach, or other methods, I would love to hear from you.


Written by kfanous | VP of Engineering. Presently @ strongDM
Published by HackerNoon on 2021/08/28