TL;DR: Scaling teams is hard. A platform team done right can ease the hardship.
At Condé Nast International we grew from a team of 20 engineers to just under 100 in less than a year. We found that building a system that will be used in many markets has a lot of moving parts and a lot of repetition: rebuilding the infrastructure and application configuration, adding third-party add-on software, building the application using CDN redirects, and registering and configuring DNS.
Many AWS accounts were used by many teams. Tracking usage and monitoring was a mess. Every developer needed to think about where and how to run their system, and they usually did so independently of each other. They had to think about monitoring and logging, as well as miscellaneous subsystems for queuing, logging, deploying, and routing traffic.
There were a lot of things we could have done to make the process easier and smoother. This is the primary reason we decided to build an infrastructure team and, later on, a platform team.
The Platform team
The platform team is not part of the product teams; instead it acts as an engineering efficiency team. This means that the platform team's main clientele is the product teams. That said, product teams also need to learn about the platform in general, then raise issues and give feedback for Continuous Improvement (the new CI). None of this means that the platform team is isolated from the rest of the organization. Rather, it is a vital player in the organization's success.
Infrastructure management is one of the responsibilities of the platform team. This means ensuring best practices and a deep understanding of infrastructure, whether in the cloud or on-site; for example, making sure that the infrastructure is auditable. This can be implemented in many ways, but the most common is infrastructure as code (IaC).
IaC is enabled by infrastructure as a service (IaaS). The platform team builds IaC using open source tools, which means that the platform being built is an abstraction over those tools. The tools are loosely connected, and the integration of these tools is the platform. Think of it as a platform as a service (PaaS), but closer to the business use cases.
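To make the idea of auditable infrastructure as code more concrete, here is a minimal sketch in Python. It is not how any particular tool works. It assumes a hypothetical, simplified `Resource` type and a `plan` function (both invented for illustration) that compares the desired state kept in version control with the actual state and lists the changes, without applying anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical, simplified description of one piece of infrastructure."""
    kind: str        # e.g. "dns_record" or "bucket" (illustrative names)
    name: str
    settings: tuple  # sorted (key, value) pairs so two resources can be compared


def resource(kind, name, **settings):
    """Build a Resource from keyword settings."""
    return Resource(kind, name, tuple(sorted(settings.items())))


def plan(desired, actual):
    """Compare desired state with actual state.

    Returns a list of (action, (kind, name)) tuples. Nothing is applied here;
    the plan is only a reviewable record of what would change.
    """
    desired_by_key = {(r.kind, r.name): r for r in desired}
    actual_by_key = {(r.kind, r.name): r for r in actual}
    actions = []
    for key in sorted(desired_by_key):
        if key not in actual_by_key:
            actions.append(("create", key))
        elif actual_by_key[key] != desired_by_key[key]:
            actions.append(("update", key))
    for key in sorted(actual_by_key):
        if key not in desired_by_key:
            actions.append(("delete", key))
    return actions
```

Because the desired state is plain data in a repository, every change to it can be reviewed and traced, which is what makes the infrastructure auditable.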
We knew why we were building the platform team. Now we had to lay the foundation on which it would be built. Unlike product teams, which usually have visible goals and mandates, a platform team has more non-functional requirements, so we had to define these in depth.
Here is my personal take on how to build a successful platform team.
People are prone to errors. Automation within the platform allows us to be more confident when executing a piece of code. It allows us to isolate bugs and errors within the code and then do continuous deployment.
Automated tests are important: whatever is not tested is not yet fully implemented. Many kinds of tests are needed, depending on the kind of software, for example unit, integration, end-to-end, fuzz, and penetration testing.
Security is paramount, so fuzzing and automated security testing should be a priority. This helps prevent CORS attacks, SQL injection, and other attacks. Having this in place will lessen the attack surface.
Use the principle of least privilege whenever giving access. At the same time, balance this with ease of entry. A platform that makes a developer request access every five seconds is bad for interpersonal relations. A platform team should be enablers, not barriers. This means going to great lengths to build relationships and enable efficiency in the team.
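As a rough illustration of least privilege balanced with ease of entry, the sketch below uses a default-deny check over role bundles. The role names, permission strings, and the `is_allowed` helper are all hypothetical, chosen only for this example.

```python
# Hypothetical role-to-permission mapping; names are illustrative only.
ROLE_PERMISSIONS = {
    "developer": {"deploy:staging", "logs:read"},
    "release-manager": {"deploy:staging", "deploy:production", "logs:read"},
}


def is_allowed(role, permission):
    """Default deny: grant access only if the role explicitly lists the permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

Grouping permissions into roles is one way to keep access narrow without making developers ask for each permission separately.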
Everything that has to be done twice should be automated. Keep to the DRY principle as much as possible.
The platform should be automated to remove cognitive overhead. Automation also helps us be more stable as a platform. It is not an alternative to documentation and post-mortems but rather a result of them.
A big part of automation is the deployment strategy: measuring deployments using metrics and, finally, plotting those metrics against customer adoption.
Use smart deployments and understand when they apply. Examples include blue-green deployments, A/B testing, automatic rollbacks, and zero-knowledge rollbacks.
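An automatic rollback can be sketched in a few lines. This is a simplified illustration, not a real deployment tool: `activate` and `error_rate` are hypothetical callables standing in for a deployment system and a metrics source, and the 5% threshold is an arbitrary assumption.

```python
def deploy_with_rollback(new_version, current_version, activate, error_rate,
                         max_error_rate=0.05):
    """Activate new_version, then roll back if the observed error rate is too high.

    activate(version) and error_rate() are supplied by the caller; they stand in
    for a real deployment tool and a real metrics source.
    Returns the version that is live when the function finishes.
    """
    activate(new_version)
    if error_rate() > max_error_rate:
        # Automatic rollback to the last known-good version.
        activate(current_version)
        return current_version
    return new_version
```

In practice the health check would run over a time window and over several metrics, but the shape stays the same: deploy, measure, and revert without waiting for a human.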
Building a highly efficient platform is important because it allows us to move faster. That means fixing bugs fast and building features on a necessity basis. Reusing code and creating reference implementations is key. This will help the wider business achieve a shorter time to market as well as a competitive advantage. Make sure to document any known unknowns and edge cases, as well as common problems and escalation paths.
Efficiency in the platform also means failing fast and fixing it. The platform should be as transparent as possible when showing errors, because visible errors lead to faster debugging and deployments. Efficiency lies in iterating on small features rather than shipping one big deployment.
Having a system of escalation for the knowledge base is not a hindrance. Instead, it is a place to start whenever you feel lost. Combined with good relationships, it yields productive results and more efficient cooperation. It helps teams share knowledge and gain experience with each other, which is a good way to build a highly efficient team.
Sufficient and continuous documentation is important. Developers need training, and the overhead of training new developers should be taken into account. Each new technology we adopt has an overhead, so we need to consider carefully whether that overhead is worth the value of adoption. Interactive training labs and a developer portal are useful: a place where we can discover MVPs and reference implementations. All this will help us achieve self-sufficiency.
All new engineers should build something using the platform in their first month. This can be part of the initial orientation of new hires, and it will also let us uncover issues with the self-service nature of the platform. Retraining should follow for each new part of the platform. DIY discovery within the platform is encouraged. Reinventing the wheel and using shadow IT are actively discouraged: maintaining many implementations of the same thing is wasteful and unnecessary.
Monitoring, metrics, alerting, and tracing are powerful tools. SRE can initially be a platform function embedded within the core platform team. This will help SRE understand the underlying implementation of the platform.
The most important part of the platform is that it is built for developers. We strive to balance building out best practices with fostering interpersonal communication. A self-service platform means developers need to have the know-how and understand the value of having a platform. This means that developers will sometimes be frustrated. Feedback should be taken into account while iterating on platform development, and there should be a way to give feedback to platform developers on how the platform is doing in general. Without this, the platform lives in isolation from the rest of the company, and adoption will be strenuous at best. People want to use and adopt something they feel good about using. After all, software development is a people-centric kind of project. Communication, interaction, and motivation are an important part of development. We have to perfect these together with the business requirements and deadlines. A non-existent perfect platform is of no use to anybody. A semi-functional and unsecured platform is a curse to any company.
Finally, there will always be things that are outside the platform's scope. These should always be decided on a case-by-case basis. People still need them at the end of the day, so you will need to redirect the request to another team and possibly escalate it.
In many ways, the success or failure of a platform team lies in the decisions it makes. The platform team will have to make decisions that affect other teams. This happens while building the foundations of the platform: for example, choosing the languages, tools, and frameworks we use.
The authority of the platform team lies not in the enforcement of standards but in subtly steering the development teams toward one decision or another. An example is publishing a recommendation for logging so that logs are compatible with a log parser when they are shipped. Building out hard-and-fast standards is not the responsibility of the platform team. The development teams themselves should have the prerogative to pick the tools, frameworks, and languages they deem suitable for their own use cases. Having said that, there are fundamental choices that need to be decided beforehand, for example the use of one or multiple cloud providers.
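A logging recommendation of the kind mentioned above could look like the following sketch, which uses Python's standard `logging` module to emit one JSON object per line so that a log parser can read it without custom rules. The field names and the `get_logger` helper are assumptions made for this example, not a standard we actually published.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def get_logger(name):
    """Return a logger that writes JSON lines to stderr (hypothetical helper)."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A team that adopts the helper gets parseable logs for free, and a team that prefers its own setup only needs to match the output format. That is steering by recommendation rather than by enforcement.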
Vendor lock-in is both a gift and a curse for platform teams. It is a gift in the sense that these decisions have already been made by other teams, which means teams have built their ecosystem of tools around a decision. It is also a curse, since we have to live with these decisions throughout the lifecycle of an application or take on the additional overhead of migration. A platform team should have visibility and authority across the wider organization to have a better chance of success.
Advocacy and evangelism
DevOps is a culture, not a role. The platform team should be able to evangelize this.
The usual point of failure in software development is a lack of understanding of how the application will perform under production conditions.
The first technical team to facilitate cross-team communication is usually the platform team. Advocacy for code reuse and best practices then falls to the platform team by default. Performance and reliability become its primary concerns.
Engineering efficiency is the constant advocacy of the platform team. The entire purpose of building the platform team is for engineers to build more with less cognitive overhead. The details that can be reused and automated usually fall to the platform team.
The platform team has the authority to make changes to the fundamental building blocks of each system. A bug or vulnerability in any of these will cause a cascading problem that affects the rest of the engineering organization.
Accountability as a team is important: whenever the platform team makes a breaking change, the rest of engineering must be informed.
Blameless post-mortems are a requirement so that each member feels safe to make changes, building a better system while at the same time taking ownership of it. The responsibility to push for a support model and operations then goes to the platform and SRE teams.
The experience and expertise needed on a platform team depend on the structure of the company.
For example, some companies have a functioning SRE team that takes care of the operation and operationalization of each application. This means that creating the support model is not entirely the responsibility of the platform team.
Vendor management is also a task that can be delegated to application support teams.
But generally, this is the expertise you will need within your platform team:
- container orchestration and containerization
- cloud management
- vendor management
- pipeline management
- DNS and CDN configuration
- server configuration
- Git and SCM
- observability (logging, monitoring, tracing)
- operationalization (runbooks, support escalation, post-mortems, alerting)
- soft skills and people management
- software defined infrastructure (infrastructure as code)
- collaboration with other teams and negotiation with the management
- common workflow and architecture management
- developer training and teaching
- documentation development
Ideally, your engineers will each have one domain of expertise and a good working knowledge of other domains.
My suggestion for building out the expertise is to rotate members so that there are domain-level experts, then do a handover or extreme table pair programming so that redundancy is built into the team structure.
Given the huge remit of the platform team, we can safely assume that a big team is needed to do all these activities in parallel. Some of the tasks can be delegated to the application teams, although that would add overhead for development. We could also split this team up, but that, when handled improperly, could lead to further misalignment.
A semi-flat structure with multiple senior and principal engineers who can make agile decisions is recommended. A big team like this, with many moving parts, also means that a technical lead role is unsuitable; rather, having an engineering manager for the platform and a solution architect is essential.
The solution architect could lay out the roadmap of the platform team and then coordinate it with the rest of the engineering teams. In this process we can also understand the needs of the organization and then plan what capabilities we need. Finally, the solution architect can help lead the selection of technologies to add to the platform.
The engineering manager could help with communicating and building out relationships. This is important for a number of reasons. First, because this is a truly cross-functional team, the number of requests will be high. Second, the prioritization of tasks will be crucial in building out capabilities.
The platform team is a new concept enabled by new technologies coming to market. A good example of this is Kubernetes and its ubiquity. This new team can help the business build out capabilities easily. Scaling teams is hard; having a new team of enablers will help the organization scale faster with less friction. This is my personal take, based on experience, on what needs to be at the core of that team and which expertise needs to be built into it.
Work with us. Check out this job at Condé Nast International: https://www.linkedin.com/jobs/view/839478085