Optimizing Agents in DevOps Architecture

Agents are processing units that execute the tasks within a job. They receive execution details from the server in the form of configuration and then execute the steps it describes. This pattern is widely used today, for instance by build, test, and deploy tooling such as Azure DevOps, Jenkins, and Buildkite. These tools are typically available either as SaaS or as an on-premises installation, and they let you build, test, and deploy your software and/or configuration continuously.

These tooling services need their own set of services and hardware to run the pipelines. Here is the high-level architecture:

When you push your code, a request to create a build job is sent to the CI/CD server. The CI/CD server selects an agent from its pools of agents and assigns the job to it. The server acts as an orchestrator: it schedules jobs for the agents, keeps track of jobs, retries failed jobs, reschedules jobs when an agent fails, and so on. An agent is a machine or a container that executes the job. Agents are often hosted on different operating systems, and they can be ephemeral or persistent. Based on the job definition, an agent with the appropriate OS is selected. This decoupled architecture makes scaling easier, because new agents can be added to scale up the build system.

How does a controller select an agent?

The agents poll the controller (the CI/CD server) for the next available job. The controller assigns jobs in first-in, first-out (FIFO) order to a matching agent. If you pin a job to a particular agent, the job waits for that specific agent to become available. If the job is labeled to require agents that match parameters such as the OS, a particular .NET version, a simulator, or a database, the controller waits for an agent with that configuration to become available.
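The selection rule above can be sketched in a few lines. This is a minimal illustration, not the scheduler of any particular tool: the `Job`, `Agent`, and `Controller` classes and the label keys are hypothetical names, and matching is assumed to be exact equality on every label the job requires.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    # Labels the job requires, e.g. {"os": "windows"}; empty means any agent.
    required: dict = field(default_factory=dict)


@dataclass
class Agent:
    name: str
    labels: dict = field(default_factory=dict)


class Controller:
    """Hypothetical in-memory controller holding jobs in arrival order."""

    def __init__(self):
        self.queue = deque()

    def submit(self, job):
        self.queue.append(job)

    def next_job_for(self, agent):
        """Return the oldest queued job this agent can run, or None."""
        for job in list(self.queue):  # iterate over a copy, oldest first
            if all(agent.labels.get(k) == v for k, v in job.required.items()):
                self.queue.remove(job)
                return job
        return None  # non-matching jobs keep waiting for a suitable agent


controller = Controller()
controller.submit(Job("build-win", {"os": "windows"}))
controller.submit(Job("build-linux", {"os": "linux"}))
controller.submit(Job("lint"))

linux_agent = Agent("agent-1", {"os": "linux", "jdk": "17"})
print(controller.next_job_for(linux_agent).name)  # build-linux
print(controller.next_job_for(linux_agent).name)  # lint
print(controller.next_job_for(linux_agent))       # None (build-win still waits)
```

Note that the Windows job is not lost when a Linux agent polls; it simply stays at the head of the queue until an agent with matching labels asks for work.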

Types of agents

Persistent agents: When persistent agents are created, they register themselves with the controller and are added to the agent pool. They then poll periodically for jobs. Once an agent receives a job, it executes the job and returns the job status to the server.
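The register, poll, execute, and report cycle can be sketched as follows. Everything here is an assumption made for illustration: `InMemoryController` is a hypothetical stand-in for a real server, job steps are modeled as plain Python callables, and the loop runs a fixed number of polls instead of forever.

```python
class InMemoryController:
    """Hypothetical stand-in for a CI/CD server; not a real tool's API."""

    def __init__(self, jobs):
        self.pool = []             # names of registered agents
        self.pending = list(jobs)  # (job_name, steps) pairs in arrival order
        self.results = {}          # job_name -> reported status

    def register(self, agent_name):
        self.pool.append(agent_name)

    def poll(self, agent_name):
        # Hand out the oldest pending job, or None if the queue is empty.
        return self.pending.pop(0) if self.pending else None

    def report(self, job_name, status):
        self.results[job_name] = status


def run_agent(controller, agent_name, max_polls):
    """Register once, then poll, execute, and report for max_polls rounds."""
    controller.register(agent_name)
    for _ in range(max_polls):
        job = controller.poll(agent_name)
        if job is None:
            continue  # a long-running agent would sleep here before polling again
        job_name, steps = job
        try:
            for step in steps:
                step()
            status = "succeeded"
        except Exception:
            status = "failed"
        controller.report(job_name, status)


def failing_step():
    raise RuntimeError("compiler error")


controller = InMemoryController([
    ("build", [lambda: None]),
    ("test", [failing_step]),
])
run_agent(controller, "agent-1", max_polls=3)
print(controller.pool)     # ['agent-1']
print(controller.results)  # {'build': 'succeeded', 'test': 'failed'}
```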

Persistent agents can be configured to preserve workspace data between builds, which can speed up the build process. This benefit is not guaranteed, however: even if you configure your agents to persist data, the controller cannot guarantee that it will select the same agent for an incremental build or for a different stage of the same build. Persistent agents also carry some security risk. If one job infects the agent, all subsequent jobs executed on that agent may be affected as well. Take care that no private information is stored on the agent, because the agent is shared with subsequent jobs.

Ephemeral agents: Some DevOps solutions let you run ephemeral agents from a VM image or container image that you provide. Some tooling spins these containers up on Kubernetes; other tooling uses a small binary running on a VM that spins up containers for job execution. These agents are discarded after the job finishes. Because they are ephemeral, they have to download your source code, dependent binaries, and tooling for every job, and this latency adds to the job execution time. They cannot be used for incremental builds. With these agents, the cloud service provider manages the infrastructure for you and can scale it up depending on the load.

Now that we know what agents are, let's look at different ways to configure them for better performance:

Configure and label based on their role: Agents can be labeled by configuration (Windows, Linux, or macOS; JDK version) and by role (build, test, deploy, and so on). You can customize each agent for its intended role. A build agent needs a version control client, a compiler, and a runtime environment. A dedicated build agent does not run tests, so it does not need testing tools or browsers (used for UI testing) installed. Keep only the relevant software installed and assign hardware based on need.
Assign jobs to relevant agents: Once you have separated agents by role, assign jobs to the relevant agents. Making sure that a Windows build job that needs MSBuild goes to an agent labeled [OS: Windows, runtime: MSBuild] improves performance, because that agent already has most of the dependencies installed and ready to go. The same applies to routing a Selenium test job to an agent that has a browser installed.
Monitor performance: If your build jobs take more time than your test jobs, consider adding agents to the build pool. If you can't add extra hardware, analyze the other pools and move a few agents from lightly loaded pools to the build pool.
Maintenance: For persistent agents, make sure that regular maintenance is performed and that the OS and installed software are patched.
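The monitoring step above can be turned into a simple rule of thumb. The sketch below is only one possible heuristic under stated assumptions: it uses queue wait time per pool as the load signal (an assumed metric, not one prescribed here), the function name `suggest_rebalance` and the `ratio` threshold are hypothetical, and a pool is never drained below one agent.

```python
from statistics import mean


def suggest_rebalance(wait_seconds, agent_counts, ratio=2.0):
    """Suggest moving one agent from a lightly loaded pool to the busiest one.

    wait_seconds: pool name -> list of recent queue wait times in seconds
    agent_counts: pool name -> number of agents currently in the pool
    Returns (donor_pool, busiest_pool), or None if no move looks worthwhile.
    """
    averages = {p: mean(s) for p, s in wait_seconds.items() if s}
    if len(averages) < 2:
        return None
    busiest = max(averages, key=averages.get)
    # Only pools with more than one agent can give one away.
    donors = [p for p in averages
              if p != busiest and agent_counts.get(p, 0) > 1]
    if not donors:
        return None
    donor = min(donors, key=averages.get)
    # Skip the move unless the busiest pool waits clearly longer than the donor.
    if averages[busiest] < ratio * averages[donor]:
        return None
    return donor, busiest


waits = {
    "build": [300, 420, 360],
    "test": [20, 40, 30],
    "deploy": [10, 15],
}
counts = {"build": 2, "test": 4, "deploy": 1}
print(suggest_rebalance(waits, counts))  # ('test', 'build')
```

In this example the deploy pool has the shortest waits but only one agent, so the test pool is chosen as the donor instead.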

These steps mainly apply to build tools where you manage the agents yourself. With ephemeral agents, this kind of tuning is harder. In that case, look closely at your pipeline and workspace settings to find optimizations.

Some additional best practices for agents:

Set up a retention policy: Over the long run, a persistent agent can run out of disk space. Setting a retention policy for agents guards against that risk. With a policy in place, agents typically delete data from old jobs to free disk space.
Inject parameters: Build, test, and release tooling should work on the injection principle: dependencies should be injected into a job rather than queried by it. This helps reduce the attack surface and makes it easier to migrate tooling.
Avoid polling: Polling typically creates additional overhead and wastes resources. Try to use an event-driven architecture for build, test, and release triggers.
Don't put large binary files in the source repository: Use blob storage or package management instead. Not every build, test, and release needs these files, and they can be downloaded on demand.
Use caching: Smart caching improves build and test times. Use a cache whenever possible, but be careful not to cache any secrets, and add a cache validation mechanism to protect against attacks.
Use antivirus software on agents: Since agents perform critical tasks, it is important to secure them. Occasionally antivirus software may raise false alarms, flagging test and build output as viruses. Such binaries should be audited carefully and continuously with the help of your security team.
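As an illustration of the caching advice above, here is a minimal sketch of a cache that is keyed on dependency contents and validated on restore. The layout (a `.bin` file plus a `.sig` file), the function names, and the use of an HMAC signature are all assumptions made for this example; the signing secret is assumed to be injected into the agent at runtime and is never written to the cache.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def cache_key(lockfile_bytes: bytes) -> str:
    """Key the cache on dependency contents so a changed lockfile misses."""
    return hashlib.sha256(lockfile_bytes).hexdigest()


def _sign(secret: bytes, payload: bytes) -> str:
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def save(cache_dir: Path, key: str, payload: bytes, secret: bytes) -> None:
    """Store the payload together with a signature over its contents."""
    (cache_dir / f"{key}.bin").write_bytes(payload)
    (cache_dir / f"{key}.sig").write_text(_sign(secret, payload))


def restore(cache_dir: Path, key: str, secret: bytes):
    """Return the cached payload, or None on a miss or failed validation."""
    blob, sig = cache_dir / f"{key}.bin", cache_dir / f"{key}.sig"
    if not (blob.exists() and sig.exists()):
        return None
    payload = blob.read_bytes()
    if not hmac.compare_digest(_sign(secret, payload), sig.read_text().strip()):
        return None  # treat a tampered or corrupted entry as a cache miss
    return payload


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    secret = b"injected-at-runtime"  # hypothetical value, not stored in the cache
    key = cache_key(b"libfoo==1.2.3\n")
    save(cache_dir, key, b"resolved packages", secret)
    print(restore(cache_dir, key, secret))  # b'resolved packages'
    (cache_dir / f"{key}.bin").write_bytes(b"tampered")
    print(restore(cache_dir, key, secret))  # None
```

Treating a failed check as an ordinary miss means the job falls back to a clean download instead of building on top of untrusted files.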

Conclusion:

Agents are an integral part of the build system. Agent selection can play a large role in the performance of your end-to-end build, test, and release pipeline. Understanding agents and the part they play in the build system helps you design your pipelines more efficiently. Apply these suggestions based on a careful analysis of your system requirements.