Hello! I want to talk about why large tech corporations are obsessed with creating proprietary solutions for their infrastructure. The answer seems obvious: it is nothing more than NIH (not-invented-here) syndrome. But this answer is far from complete, let alone objective.
I am the CTO of the Yandex Platform Engineering team. Our goal is to help engineers build out the entire development cycle, from writing code to operating services, and to make it more efficient. This includes streamlining processes: we not only develop as-a-service offerings but also help adopt them within the company. We work at the scale of Yandex: our services are used by thousands of developers across the company.
To solve a problem, we often develop proprietary tools rather than implementing ready-made ones. For example, while still a programmer on the team, I worked on a quantitative monitoring system in C++ and Python and helped scale it to tens of billions of processed metrics. So, based on my own experience, I know what motivations and development paths lead to the emergence of in-house tools: below, I will attempt to identify the fundamental reasons for their creation using our solutions as examples.
Setting the task. The goal of our internal Runtime Cloud, or RTC, is to provide internal users with simple deployment and traffic management tools. RTC users are the same engineers who develop Yandex services. They need somewhere to launch the tens of thousands of applications they have created, route user requests to them, balance the load, and deal with incidents, among other things.
The need for an internal cloud emerged in the early 2010s, when the number of services was already in the hundreds and the total number of allocated cores grew by tens of percent per year. Having dedicated servers for each service became prohibitively expensive, and we required tools that would allow us to run applications from multiple services on a single server. We had several requirements for these tools at the start:
Essentially, we needed Kubernetes (and, over time, RTC came very close). But here's the catch: k8s was only announced in 2014. Apache Mesos existed at the time, but it was in its infancy.
Implementation of basic functions. We started solving the problem with a kind of MVP - a simple set of tools, which was more like a set of building blocks that automate routine actions, for example:
Over time, it became possible to assemble a full-fledged service deployment graph from these building blocks (similar to continuous delivery). After a number of iterations, Nanny, a system for managing services running in RTC, appeared in 2013.
Another fundamental aspect of Nanny was the implementation of application isolation based on resource consumption. We initially launched applications from various services without resource isolation, which resulted in a large number of operational issues and incidents.
At the time, the only ready-made solutions were LXC, whose development had stalled by then, and Docker, which did not support IPv6-only networking and restarted all containers whenever dockerd was updated, making it impossible to update dockerd without affecting users. As a result, we began developing our
Solving utilisation problems. At the time, resources in the internal cloud were managed through commits to the repository. However, this slowed development at Yandex and conflicted with the goal of increasing utilisation: to achieve it, we needed to host our MapReduce system in the cloud, namely
To bring YTsaurus to RTC, the ability to manage pods dynamically rather than through repository commits was required. Therefore, in 2018, we created
New growing pains. During the same time period, k8s evolved into a much more mature solution, becoming one of the AWS services in 2017. But it still did not meet all of our requirements:
YTsaurus actively used the ability to create nested Porto containers rather than creating a single scheduler. Of course, we could add support for the same dual stack in k8s ourselves. However, our experience with Linux kernel development has shown that not every change can be upstreamed, and we strive to keep the delta from the upstream kernel to a minimum in order to simplify updating to new versions.
Our solution today. The architecture of the RTC is very similar to that of Kubernetes. The user declaratively describes their service in the form of some specification that describes how to launch the specified application and in which data centres. Each data centre has its own installation of Yandex Planner, which serves as a database for all cluster objects on the one hand and as a pod scheduler on the other. Each server in the data centre runs an agent that receives pod specifications from Yandex Planner and launches them using our proprietary Porto containerization system.
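The flow described above (a declarative specification, a per-data-centre scheduler, and placement of pods onto servers) can be pictured with a small toy model. This is only an illustrative sketch: the names `ServiceSpec`, `Node`, and `schedule`, the spec fields, and the first-fit placement policy are all assumptions made for the example and do not reflect the real RTC or Yandex Planner interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceSpec:
    """Declarative description of a service (hypothetical fields)."""
    name: str
    cores_per_pod: int
    replicas: dict[str, int]  # data centre name -> number of pods wanted there


@dataclass
class Node:
    """A server in one data centre, tracked by the scheduler."""
    name: str
    free_cores: int
    pods: list[str] = field(default_factory=list)


def schedule(spec: ServiceSpec, dc: str, nodes: list[Node]) -> list[str]:
    """Place the pods requested for one data centre.

    Uses first-fit: each pod goes to the first node with enough free cores.
    Returns the ids of pods that could not be placed.
    """
    unplaced = []
    for i in range(spec.replicas.get(dc, 0)):
        pod_id = f"{spec.name}-{dc}-{i}"
        for node in nodes:
            if node.free_cores >= spec.cores_per_pod:
                node.free_cores -= spec.cores_per_pod
                node.pods.append(pod_id)
                break
        else:
            # No node in this data centre had room for the pod.
            unplaced.append(pod_id)
    return unplaced


# The user only declares what should run and where.
spec = ServiceSpec(name="search", cores_per_pod=4, replicas={"dc-a": 3, "dc-b": 1})
dc_a_nodes = [Node("node-1", free_cores=8), Node("node-2", free_cores=4)]
unplaced = schedule(spec, "dc-a", dc_a_nodes)
```

In this toy run, two pods land on `node-1` and one on `node-2`. A production scheduler would also have to weigh memory, network, disk, failure domains, and rescheduling, which the sketch deliberately ignores.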
RTC currently runs tens of thousands of services, with over 5 million cores allocated across more than 100,000 servers. Every day, over 100,000 changes are made to service specifications.
Plans. What if k8s can handle our scale? Especially since the k8s ecosystem started to outperform us in terms of functionality at some point. Wouldn't it be better to switch to k8s and hope that ready-made tools will eventually support the scale we require? In practice, we remain a niche consumer for k8s, because only a small percentage of companies operate at such a large scale, and each of them has its own in-house cloud solution.
Another critical point to remember is the migration issue. According to the July 2018
In 2021, when choosing our development strategy, we estimated how much it would cost to move from one deployment system to another. Migrating Yandex to vanilla k8s would be extremely expensive, on the order of hundreds of man-years.
This is how we ended up with our internal cloud, which we are unlikely to be able to abandon in the next 5 years, even if we set such a goal.
What should be done about the functionality our internal cloud lacks compared to k8s? In practice, our customers can use Managed Kubernetes in Yandex Cloud. This option is primarily used for projects where strict compliance requirements must be met - this is a small proportion of teams, less than 1%. For the reasons stated above, the remaining teams do not see much benefit in moving.
At the same time, we are actively looking at k8s and considering how to get closer to generally accepted standards. We are already experimenting with k8s in some tasks, such as cloud bootstrapping or organising infrastructure as code at the scale of the entire Yandex. Ideally, we would like to reuse the k8s interface while maintaining our own implementation that is as tailored to our needs as possible. All that is left is to figure out how to do it in practice.
Problems and solution requirements. Our monorepository, Arcadia, shares the same primary goal as our internal cloud: to provide convenient development tools. In the case of the repository, this means an entire development ecosystem:
Arcadia emerged around the same time as Yandex's internal cloud. One of the reasons for the creation of the monorepository was the need to reuse code within Yandex. This was hampered at the time by the presence of several build systems. Working at the scale of the entire Yandex required a unified system with support for efficient distributed builds. It also had to be stable and easy to use.
Implementation of a unified build system. Our proprietary ya make build system debuted in 2013, when it supported only C++ code. Before ya make, we used CMake, but it was too slow to scale to a monorepository. The proprietary ya make worked much faster with Arcadia. There were no other open source options that could solve our problem: for example, Bazel was released much later, in 2015.
Version control system scaling. Yandex previously used SVN as its version control system. Although SVN could handle a large repository, it was still limited and difficult to maintain. Furthermore, we knew that we would eventually run into the limits of SVN's capabilities and convenience. For example, selective checkout, the ability to download only the required portion of the repository, was implemented using heuristics. As a result, in 2016, we began experimenting with other version control systems.
Mercurial was the first choice. But the main issue we had was speed. We tried for a year and a half to get Mercurial into production, but the results were disappointing. For example, we eventually had to rewrite parts of Mercurial to support FUSE, or we would have had to bring the entire repository to each developer's laptop.
Eventually, it turned out to be cheaper to write an in-house solution from scratch, and in 2019 Arc appeared: a new version control system for Arcadia users with a git-like UX. Arc is built on FUSE (Filesystem in Userspace) rather than selective checkout. In addition, YDB serves as its scalable database, which greatly simplifies operating Arc compared to Mercurial.
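One way to picture why a virtual filesystem helps at this scale is on-demand fetching: the whole tree is visible, but file contents are downloaded only when something actually reads them. The sketch below illustrates that idea only. It is not a FUSE filesystem and says nothing about how Arc is implemented; the class `LazyWorkingCopy`, its methods, and the in-memory `remote` dictionary standing in for the server are all hypothetical.

```python
class LazyWorkingCopy:
    """Toy working copy that fetches file contents on first read.

    `remote` is an in-memory stand-in for the repository server.
    """

    def __init__(self, remote: dict[str, bytes]):
        self._remote = remote
        self._cache: dict[str, bytes] = {}
        self.fetches = 0  # number of simulated downloads

    def listdir(self) -> list[str]:
        # Listing needs only metadata, so nothing is downloaded here.
        return sorted(self._remote)

    def read(self, path: str) -> bytes:
        # Download the content on first access, then serve it from the cache.
        if path not in self._cache:
            self._cache[path] = self._remote[path]
            self.fetches += 1
        return self._cache[path]


remote = {
    "a/main.cpp": b"int main() {}",
    "b/lib.py": b"x = 1",
    "c/big.bin": b"\0" * 1024,
}
wc = LazyWorkingCopy(remote)
names = wc.listdir()          # the full tree is visible
data = wc.read("a/main.cpp")  # only this file is fetched
data_again = wc.read("a/main.cpp")  # served from the local cache
```

After this run only one of the three files has been fetched, although all of them are listed. With selective checkout, by contrast, the set of directories to download has to be determined up front.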
We are often asked why we did not use git. The reason is that git also has scale and functionality limitations: even if we import only the Arcadia trunk into git, git status takes minutes at this scale. At the same time, there was no stable FUSE implementation built on top of git: VFS for Git is no longer being developed, and EdenFS was eventually turned into Sapling, but this happened much later.
Solution's current state and plans for the future. To begin development, an internal user simply needs to create a folder in our monorepository, write code, and tell ya make how to build their application by adding a build manifest. As a result, the user receives pull requests, CI configuration, and the ability to reuse any code in the company.
In terms of scale, trunk currently contains 10 million files, the repository as a whole exceeds 2 TiB, and more than 30 thousand commits are made each week.
As a result, we had to build many components of our ecosystem from scratch. However, it is now moving towards compliance with global standards. Arc, for example, supports working with Git for a predefined set of projects.
So why do big tech companies have to invent their own solutions, and why can't they be replaced with ones that adhere to a generally accepted standard?
Innovation. Large corporations are frequently required to develop solutions to problems that will only become commonplace in the future. This is how innovative solutions with the potential to become market standards may emerge.
A problem that one company solves does not always face anyone else. Sometimes big tech's experience with a specific problem helps the entire industry avoid that problem by taking a completely different development path. It is not always possible to predict market development, and as a result, different examples of proprietary solutions have had very different outcomes.
ClickHouse is an example of a truly successful innovative project that has greatly enriched the field of online analytical processing (OLAP). However, this is not the case for all projects. Porto, which began as an open source project, failed to gain traction for a variety of reasons, although some of its features, such as the ability to create nested containers, remain unique.
Scale. This point is similar to the previous one in some ways, because not every company faces the scalability issue. There was a time when 640 kbytes was more than enough for everyone, wasn't there?
In fact, the exponential increase in system load was one of the most important reasons for the development of Arcadia and the internal cloud. This is why Arc and Yandex Planner were developed. Arc was created in response to the need for a user-friendly VCS that can allow users to work with a monorepository containing tens of millions of files in a trunk without difficulty. Yandex Planner was created in response to the need to work effectively with clusters of tens of thousands of nodes and millions of pods.
Public tools continue to have scaling issues (after all, this is a relatively rare scenario, and investing in it is frequently simply unprofitable).
Inertia. Consider an in-house tool that solves a problem within a company. A company that actively uses this tool would devote resources to better tailoring it to its needs, eventually transforming it into a highly specialised tool. This process can last for years.
Now consider the possibility that, at some point, a universally accepted standard for solving that particular problem emerges. Even then, specialisation may still be an important factor in favour of the in-house solution. Consider build systems. We use ya make in Arcadia, although Google's Bazel exists. They are conceptually similar, but when you get into the details, many important scenarios are implemented differently, because the load patterns for each workload can be drastically different. As a result, the resources already expended will almost certainly have to be invested again in order to customise the new generally accepted standard.
Migrations. If the previous section addressed the issue of adapting the project to users, let us now address the issue of migrating the users themselves. In my opinion, migration should be called the next most important problem in tech after naming. If we assume we already have an in-house company tool and want to replace it with a standardised one, we will inevitably need migrations.
We know many examples of migrations from our experience developing an internal cloud. Large-scale migrations take time, so both tools must be supported concurrently for extended periods of time. If this process involves a large number of users, management issues are unavoidable. It is certainly worthwhile to try to migrate without user participation, but this is not always possible.
Business Continuity. To be frank, this point has recently gained considerable importance. Previously, far fewer companies took it seriously, and mostly out of concern about vendor lock-in. Trusting critical processes to a vendor who can terminate collaboration at any time is risky. JetBrains is a prime example of this, having restricted the use of its IDEs for certain companies. Another case in point is GitHub Enterprise, which has begun to suspend Russian-based user accounts.
In-house solutions are typically immune to this issue. On the one hand, there are still open source solutions. On the other hand, there are no guarantees that the open source model will be with you all the way: for example, Corona, Facebook's in-house developed improvement to Hadoop MapReduce scheduling software, appeared in the first place due to the inability to commit the patches required to scale Hadoop upstream.
At the same time, the legal side of the issue affects open source too: for example, contributing to Go or k8s requires signing a CLA. Will this continue to be an issue?
NIH. Yes, in addition to objective reasons, it is possible that the decisions made are not pragmatic. That is NIH syndrome at its finest.
For example, in an attempt to eliminate the influence of batch workloads on compute workloads, we tried to create our own scheduler in the Linux kernel. In practice, nothing good came of it; we could have made do with the Linux kernel's existing capabilities. However, the higher the labour costs, the more effort is put into elaborating and solving the problem, and the lower the likelihood of suffering from NIH syndrome.
To summarise, as you can see, large companies frequently need in-house solutions. Most of them will eventually merge with unified global standard solutions that have yet to mature, while the rest will become history. In any case, deciding between a proprietary solution and a ready-made one remains a difficult question that cannot be answered without first understanding the context and estimating the cost of such a project.