How to Deal With Complexity When Designing Software Systems

What is it all about?

Every day, every moment during our engineering career, we encounter many different problems of various complexity and situations where we need to make a decision or postpone it due to lack of data. Whenever we build new services, construct infrastructure, or even form development processes, we touch a huge world of various challenges.

It is challenging, and perhaps even impossible, to list all the problems. You will encounter some of these issues only if you work in a specific niche. On the other hand, there are many that we all must understand how to solve, as they are crucial for building IT systems. With a high probability, you will encounter them in all projects.

In this article, I will share my experiences with some of the problems I have encountered while creating software programs.

What is a Cross-Cutting Concern?

If we look at Wikipedia, we will find the following definition:

In aspect-oriented software development, cross-cutting concerns are aspects of a program that affect several modules, without the possibility of being encapsulated in any of them. These concerns often cannot be cleanly decomposed from the rest of the system in both the design and implementation, and can result in either scattering (code duplication), tangling (significant dependencies between systems), or both.

It describes the idea well, but I want to extend and simplify it a little:

A cross-cutting concern is a concept or component of the system/organisation that affects (or 'cuts across') many other parts.

Typical examples of such concerns are system architecture, logging, security, transaction management, telemetry, and database design, among many others. We are going to elaborate on some of them later in this article.

On the code level, cross-cutting concerns are often implemented using techniques like Aspect-Oriented Programming (AOP), where these concerns are modularized into separate components that can be applied throughout the application. This keeps the business logic isolated from these concerns, making the code more readable and maintainable.
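To make the idea concrete, here is a minimal sketch that uses a Python decorator as a lightweight stand-in for an AOP aspect. It is only an illustration under assumptions: the names `logged` and `transfer` are hypothetical, and a real project would more likely rely on the AOP or middleware facilities of its framework.

```python
import functools
import logging

logger = logging.getLogger("audit")


def logged(func):
    """Wrap func so that every call is logged without touching its body."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("calling %s", func.__name__)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Log the failure with its traceback, then let the caller handle it.
            logger.exception("%s failed", func.__name__)
            raise
        logger.info("%s returned", func.__name__)
        return result

    return wrapper


@logged
def transfer(balance: int, amount: int) -> int:
    """Hypothetical business logic: it contains no logging code at all."""
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount
```

The logging concern lives in one place and can be applied to any function, while `transfer` stays focused on the business rule.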

Aspects Classification

There are many possible ways to classify aspects, segmenting them by properties such as scope, size, functionality, importance, target, and others, but in this article, I am going to use a simple classification by scope. By this, I mean where a specific aspect is directed: the whole organisation, a particular system, or a specific element of that system.

So, I am going to split aspects into Macro and Micro.

By Macro aspects I mainly mean considerations that apply to the whole system, such as the chosen system architecture and its design (monolithic, microservices, service-oriented architecture), the technology stack, the organisation structure, etc. Macro aspects are related mainly to strategic and high-level decisions.

Micro aspects, by contrast, are much closer to the code level and development: for instance, which framework is used for interacting with the database, the project structure of folders and classes, or even specific object design patterns.

While this classification is not ideal, it helps to structure an understanding of possible problems and the importance and impact of solutions we apply to them.

In this article, my primary focus will be on the macro aspects.

Macro Aspects

Organisation structure

When I first started learning about software architecture, I read many interesting articles about Conway’s law and its impact on organisational structure. The law states that:

Any organisation that designs a system (defined broadly) will produce a design whose structure is a copy of the organisation’s communication structure.

I had always believed that this concept was universal and treated it as a golden rule.

Then I started to learn Eric Evans’s Domain-Driven Design (DDD) approach to modelling systems. Eric Evans emphasises the importance of identifying Bounded Contexts. This concept involves dividing a complex domain model into smaller, more manageable sections, each with its own limited set of knowledge. This approach aids effective team communication, as it reduces the need for extensive knowledge of the entire domain and minimises context switching, thus making conversations more efficient. Context switching is one of the most resource-consuming things there is; even computers struggle with it. Although eliminating context switching completely is unlikely, I reckon that is what we should strive for.

Returning to Conway’s Law, I have found several issues with it.

The first issue I've encountered with Conway's Law, which suggests that system design mirrors organisational structure, is the potential for forming overly large and complex Bounded Contexts. This complexity arises when the organisational structure is not aligned with domain boundaries, leading to Bounded Contexts that are heavily interdependent and overloaded with information. This, in turn, leads to frequent context switching for the development team.

Another issue is that organisational terminology leaks into the code. When the organisational structure changes, the codebase has to change too, which consumes valuable resources.

Thus, following the Inverse Conway Maneuver, that is, shaping teams and their communication around the architecture you want, helps to build a system and an organisation that encourage the desired software architecture. However, it is worth noting that this approach does not work well with an already-formed architecture and organisational structure, since changes at that stage take a long time. It works exceptionally well in startups, which are quick to introduce changes.

Big Ball of Mud

This pattern, or rather anti-pattern, describes a system built without any architecture. There are no rules, no boundaries, and no strategy for controlling the inevitably growing complexity. Complexity is the most formidable enemy in the journey of building software systems.

To avoid constructing such a system, we need to follow specific rules and constraints.

System architecture

There are myriad definitions of Software Architecture. I like many of them since they cover different aspects of it. However, to be able to reason about architecture, we naturally need to settle on one of them in our minds, and it is worth noting that this definition may evolve. So, at least for now, I use the following description for myself.

Software Architecture is about decisions and choices you make every day that impact the built system.

To make decisions, you need to have principles and patterns in your “bag” for solving the problems that arise. It is also essential to state that understanding the requirements is key to building what a business needs. However, sometimes requirements are unclear or not defined at all; in that case, it is better either to wait for more clarification or to rely on your experience and trust your intuition. Either way, you cannot make decisions properly if you do not have principles and patterns to rely on. That brings me to the definition of Software Architecture Style.

Software Architecture Style is a set of principles and patterns that designate how to build software.

There are many different architectural styles, each focused on a different side of the planned architecture, and applying several of them at once is normal.

For instance:

Monolithic architecture
Domain-driven design
Component-based
Microservices
Pipe and filters
Event-driven
Microkernel
Service-oriented

and so on…

Of course, they have their advantages and disadvantages, but the most important thing I have learned is that architecture evolves gradually, in response to actual problems. Starting with a monolithic architecture is a great choice for reducing operational complexity, and this architecture will very likely fit your needs even after the product reaches the Product-Market Fit (PMF) stage. At scale, you may consider moving towards an event-driven approach and microservices to achieve independent deployment, a heterogeneous tech stack, and a less coupled architecture (which is also less transparent, due to the nature of event-driven and pub-sub approaches, if these are adopted). Simplicity and efficiency are closely related and have a great impact on each other. Complicated architectures usually slow down the development of new features, make existing ones harder to support and maintain, and hinder the system’s natural evolution.

However, complex systems often require complex and comprehensive architecture, which is inevitable.

To be fair, this is a very broad topic, and there are many great ideas about how to structure and build systems for natural evolution. Based on my experience, I have worked out the following approach:

Almost always begin with the monolithic architecture style, since it eliminates most of the problems that arise from the nature of distributed systems. It also makes sense to follow the modular monolith approach and focus on building components with clear boundaries. With a component-based approach, components could communicate with each other through events, but direct calls (aka RPC) simplify things in the beginning. However, it is important to track dependencies between components: if component A knows a lot about component B, perhaps it makes sense to merge them into one.
When you get closer to the point where you need to scale your development and your system, you could consider following the Strangler Fig pattern to gradually extract components that need to be deployed independently or even scaled to specific requirements.
Now, if you have a clear vision of the future, which takes a bit of incredible luck, you could decide on the desired architecture. At this point, you could decide to move towards a microservices architecture, applying Orchestration and Choreography approaches and incorporating the CQRS pattern to scale write and read operations independently, or even decide to stick with the monolithic architecture if it fits your needs.
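The gradual extraction in the second step can be sketched as a small routing facade. This is a minimal illustration under assumptions, not a production router: the names `StranglerFacade`, `legacy_monolith`, and `billing_service` are hypothetical, and in a real system the facade would usually be an API gateway or reverse proxy rather than application code.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def legacy_monolith(path: str) -> str:
    """Hypothetical stand-in for the existing monolith."""
    return f"monolith handled {path}"


def billing_service(path: str) -> str:
    """Hypothetical stand-in for a newly extracted service."""
    return f"billing service handled {path}"


class StranglerFacade:
    """Routes extracted path prefixes to new services, everything else to the monolith."""

    def __init__(self, default: Handler) -> None:
        self._default = default
        self._routes: Dict[str, Handler] = {}

    def extract(self, prefix: str, handler: Handler) -> None:
        """Register a component that now lives outside the monolith."""
        self._routes[prefix] = handler

    def handle(self, path: str) -> str:
        matches = [prefix for prefix in self._routes if path.startswith(prefix)]
        if matches:
            # The longest matching prefix wins, so nested extractions behave predictably.
            return self._routes[max(matches, key=len)](path)
        return self._default(path)


facade = StranglerFacade(default=legacy_monolith)
facade.extract("/billing", billing_service)
```

Callers keep talking to the facade, so each component can be moved out (or moved back) one route at a time without a big-bang rewrite.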

It is also vital to understand numbers and metrics like DAU (Daily Active Users), MAU (Monthly Active Users), RPS (Requests Per Second), and TPS (Transactions Per Second), since they help you make choices: the architectures for 100 active users and for 100 million active users are different.
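A rough back-of-the-envelope estimate shows how far apart those two cases are. The sketch below is only an illustration: the figure of 50 requests per user per day and the peak factor of 3 are assumptions I picked for the example, not measurements.

```python
SECONDS_PER_DAY = 86_400


def average_rps(dau: int, requests_per_user_per_day: float) -> float:
    """Average requests per second, assuming load is spread evenly over the day."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


def peak_rps(dau: int, requests_per_user_per_day: float, peak_factor: float = 3.0) -> float:
    """Peak load, assuming traffic peaks at peak_factor times the average (an assumption)."""
    return average_rps(dau, requests_per_user_per_day) * peak_factor


small = average_rps(100, 50)          # well below one request per second
large = average_rps(100_000_000, 50)  # tens of thousands of requests per second
```

With these assumed inputs, 100 users produce a fraction of one request per second, while 100 million users produce roughly 58,000 requests per second on average, before accounting for peaks. A single modest server covers the first case; the second needs horizontal scaling.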

As a final note, I would say that architecture has a significant impact on the product’s success. A poorly designed architecture in a product that needs to scale very likely leads to failure, since customers will not wait while you scale the system; they will choose a competitor. So we need to stay ahead of potential scaling. Although I admit that this is not always a lean approach, the idea is to have a scalable system, not an already scaled one. On the other hand, a very complicated and already scaled system with no customers, and no plans to get many of them, will cost your business money for nothing.

Technology stack selection

Selecting a technology stack is also a macro-level decision, since it influences hiring, the system's prospects for natural evolution, scalability, and performance.

This is the list of basic considerations for choosing a technology stack:

Project requirements and complexity. For instance, a simple web application can be built with the Blazor framework if your developers have experience with it, but since WebAssembly is still less mature, choosing React and TypeScript could be a better decision for long-term success.
Scalability and Performance Needs. If you anticipate receiving a large amount of traffic, opting for ASP.NET Core over Django could be a wise choice due to its superior performance in handling concurrent requests. However, this decision depends on the scale of traffic you expect. If you need to manage potentially billions of requests with low latency, the presence of Garbage Collection could be a challenge.
Hiring, Development Time, and Cost. In most cases, these are the factors we need to care about. Time to market, maintenance cost, and hiring stability determine whether the business can move forward without obstacles.
Team Expertise and Resources. The skill set of your development team is a critical factor. It is generally more effective to use technologies that your team is already familiar with unless there is a strong reason to invest in learning a new stack.
Maturity. A strong community and a rich ecosystem of libraries and tools can greatly ease the development process. Popular technologies often have better community support, which can be invaluable for solving problems and finding resources. Thus, you can save resources and focus mainly on the product.
Long-Term Maintenance and Support. Consider the long-term viability of the technology. Technologies that are widely adopted and supported are less likely to become obsolete and generally receive regular updates and improvements.

How could having multiple technology stacks affect business growth?

On the one hand, introducing one more stack could widen your hiring pool; on the other hand, it brings extra maintenance costs, since you need to support both stacks. So, as I said previously, in my view only a genuine need should justify incorporating more technology stacks.

But what about the principle of selecting the best tool for a specific problem?

Sometimes you have no other choice but to bring in new tools to solve a specific problem. In such cases, it makes sense to select the best solution based on the same considerations mentioned above.

Creating systems that are not tightly coupled to a specific technology can be a challenge. Still, it is worth striving for a state where the system will not die if, tomorrow, a specific framework or tool turns out to be vulnerable or is deprecated.
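One common way to reduce such coupling is to let business code depend on a small interface instead of a concrete tool. The sketch below is a minimal illustration under assumptions: `KeyValueStore`, `InMemoryStore`, and `SessionService` are hypothetical names, and the in-memory adapter only stands in for whatever real storage technology you would plug in.

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """The only storage contract the business code knows about."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Illustrative adapter; a Redis- or database-backed one would expose the same methods."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class SessionService:
    """Business logic that never imports a specific storage library."""

    def __init__(self, store: KeyValueStore) -> None:
        self._store = store

    def login(self, user_id: str) -> None:
        self._store.set(f"session:{user_id}", "active")

    def is_logged_in(self, user_id: str) -> bool:
        return self._store.get(f"session:{user_id}") == "active"


service = SessionService(InMemoryStore())
service.login("u1")
```

If the underlying tool has to be replaced, only a new adapter needs to be written; `SessionService` stays untouched.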

Another important consideration is related to open-source and proprietary software dependencies. Proprietary software gives you less flexibility and fewer possibilities for customisation. Still, the most dangerous factor is vendor lock-in, where you become dependent on a vendor's products, prices, terms, and roadmap. This can be risky if the vendor changes direction, increases prices, or discontinues the product. Open-source software reduces this risk, as it is not controlled by a single entity. Eliminating single points of failure on all levels is key to building reliable systems for growth.

Single Point of Failure (SPOF)

A single point of failure (SPOF) refers to any part of a system that, if it fails, will cause the entire system to stop functioning. Eliminating SPOFs at all levels is crucial for any system requiring high availability. Everything, including knowledge, personnel, system components, cloud providers, and internet cables, can fail.

There are several basic techniques we could apply to eliminate single points of failure:

Redundancy. Implement redundancy for critical components. This means having backup components that can take over if the primary component fails. Redundancy can be applied across different layers of the system, including hardware (servers, disks), networking (links, switches), and software (databases, application servers). If you host everything with one cloud provider and keep your backups there too, consider keeping a regular additional backup with another provider to reduce your losses in case of a disaster.
Data Centers. Distribute your system across multiple physical locations, such as data centres or cloud regions. This approach protects your system against location-specific failures like power outages or natural disasters.
Failover. Apply a failover approach for all your components (DNS, CDN, load balancers, Kubernetes, API gateways, and databases). Since issues can arise unexpectedly, it's crucial to have a backup plan that lets you swiftly replace any component with its clone as needed.
High availability services. Ensure your services are built to be horizontally scalable and highly available from the start by adhering to the following principles:
- Practice service statelessness and avoid storing user sessions in in-memory caches. Instead, use a distributed cache system, such as Redis.
- Avoid reliance on the chronological order of message consumption when developing logic.
- Minimise breaking changes to prevent disrupting API consumers. Where possible, opt for backwards-compatible changes. Also, consider cost since sometimes, implementing a breaking change may be more cost-effective.
- Incorporate migration execution into the deployment pipeline.
- Establish a strategy for handling concurrent requests.
- Implement service discovery, monitoring, and logging to enhance reliability and observability.
- Develop business logic to be idempotent, acknowledging that network failures are inevitable.
Dependency review. Regularly review and minimise external dependencies. Each external dependency can introduce potential SPOFs, so it's essential to understand and mitigate these risks.
Regular knowledge sharing. Never forget the importance of spreading knowledge within your organisation. People can be unpredictable, and relying on a single individual is risky. Encourage team members to digitise their knowledge through documentation. However, be mindful of over-documenting. Utilise various AI tools to simplify this process.
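Two of the techniques above, failover and idempotent business logic, work best together: once a client retries against a standby, the same request may arrive twice. Here is a minimal sketch under assumptions. The names `call_with_failover` and `PaymentProcessor` are hypothetical, and the in-memory dictionary is for illustration only; in a real stateless service, the processed keys would live in a shared durable store.

```python
from typing import Callable, Dict, Optional, Sequence


class AllReplicasFailed(Exception):
    """Raised when neither the primary nor any standby answered."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful answer."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as exc:
            last_error = exc  # remember the failure and move on to the next replica
    raise AllReplicasFailed("no replica answered") from last_error


class PaymentProcessor:
    """Idempotent handler: repeating a request with the same key has no extra effect."""

    def __init__(self) -> None:
        # Illustration only: a real service would keep this in a shared durable store.
        self._results: Dict[str, str] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key in self._results:
            # A retry after a network failure: return the stored receipt, charge nothing.
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}-{amount}"
        self._results[idempotency_key] = receipt
        return receipt
```

The caller generates the idempotency key once per logical operation and reuses it on every retry, so a failover never turns into a double charge.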

Conclusion

In this article, we covered several key Macro aspects and how we can deal with their complexity.

Thank you for reading! See you next time!