## Introduction

GitHub Actions is the go-to CI/CD tool for many teams. But when your organization runs thousands of pipelines daily, the default setup breaks down. You hit limits on scale, security, and governance — plus skyrocketing costs. GitHub-hosted runners are easy but expensive and don't meet strict compliance needs. Existing self-hosted solutions like Actions Runner Controller (ARC) or Terraform EC2 modules don't fully solve multi-tenant isolation, automation, or centralized control.

ForgeMT, built inside Cisco's Security Business Group, fills that gap. It's an open-source, AWS-native platform that manages ephemeral runners with strong tenant isolation, full automation, and enterprise-grade governance. This article explains why ForgeMT matters and how it works — a practical look at building scalable, secure GitHub Actions runner platforms.

## Why Enterprise CI/CD Runners Fail at Scale

At large organizations, scaling GitHub Actions runners encounters four key bottlenecks:

- **Fragmented Infrastructure:** Teams independently choose their CI/CD tools — Jenkins, Travis, CircleCI, or self-hosted runners — which accelerates local delivery but creates duplicated effort, configuration drift, and fragmented monitoring. Without a unified platform, scalability, security, and reliability degrade.
- **Weak Tenant Isolation:** Runners execute untrusted code across teams. Without strong isolation, one compromised job can leak credentials or escalate attacks across tenants. Poor audit trails slow breach detection and hinder compliance.
- **Scalability Limits:** Static IP pools cause IPv4 exhaustion, and manual provisioning delays runner startup. Without elastic scaling, resources are wasted or pipelines queue up, killing developer velocity.
- **Maintenance and Governance Overhead:** Uneven patching weakens security, infrastructure drift complicates troubleshooting, and audits become expensive and error-prone. Secure scaling demands centralized governance, consistent policy enforcement, and automation.

In short, enterprises fail to scale GitHub Actions runners without a platform that:

- Centralizes multi-tenancy
- Automates lifecycle management
- Provides enterprise-grade observability and governance

But beware — over-centralization can kill flexibility and introduce new challenges.
## Why GitHub Actions — And Why It's Not Enough at Enterprise Scale

GitHub Actions is popular because it offers:

- **Deep GitHub integration:** triggers on PRs, branches, and tags with no extra logins, plus automatic secret and artifact handling.
- **Extensible ecosystem:** thousands of marketplace actions simplify workflow creation.
- **Flexible runners:** GitHub-hosted runners for convenience, or self-hosted for control, cost savings, and compliance.
- **Granular security:** native GitHub Apps, OIDC tokens, and fine-grained permissions enforce least privilege.
- **Rapid scale:** pipelines at repo or org level enable smooth CI/CD growth.

However, GitHub Actions alone can't meet enterprise-scale demands. Enterprises require:

- Strong tenant isolation and centralized governance across thousands of pipelines.
- A unified platform to avoid fragmented infrastructure and scaling bottlenecks.
- Fine-grained identity, network controls, and compliance enforcement.
- Automation for onboarding, patching, and auditing to reduce operational overhead.

Cloud providers like AWS supply the identity, networking, and automation building blocks — IAM/OIDC, VPC segmentation, EC2, EKS — needed to build secure, scalable, multi-tenant CI/CD platforms.
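To make the OIDC piece concrete, here is a minimal workflow sketch. The account ID and role name are placeholders, not ForgeMT specifics: the job exchanges its GitHub-issued OIDC token for short-lived AWS credentials, so no static keys live in repository secrets.

```yaml
name: oidc-example
on: push

permissions:
  id-token: write   # allow the job to request a GitHub OIDC token
  contents: read

jobs:
  whoami:
    runs-on: ubuntu-latest
    steps:
      - name: Exchange OIDC token for AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          # Placeholder ARN: the role's trust policy must allow GitHub's
          # OIDC provider (token.actions.githubusercontent.com).
          role-to-assume: arn:aws:iam::123456789012:role/example-ci-role
          aws-region: eu-west-1
      - name: Verify the assumed identity
        run: aws sts get-caller-identity
```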
## Existing Solutions and Why They Fall Short

**Actions Runner Controller (ARC)** runs ephemeral Kubernetes pods as GitHub runners, scaling dynamically with declarative config and Kubernetes-native integration. But:

- Kubernetes namespaces alone don't provide strong security isolation.
- No native AWS IAM/OIDC integration.
- Lacks onboarding, governance, and audit automation.
- Network policy management is manual, increasing operational overhead.

**Terraform AWS GitHub Runner Module** provisions EC2 self-hosted runners with customizable AMIs and integrates well with IaC pipelines. However:

- Typically deployed per team, causing fragmentation.
- No native multi-tenant isolation.
- Requires manual IAM and account setup.
- No onboarding or patching automation.

**Commercial Runner-as-a-Service** options offer simple UX, automatic scaling, and vendor-managed maintenance with SLAs, but:

- High costs at scale.
- Vendor lock-in risks.
- Limited multi-tenant isolation.
- Often don't meet strict compliance requirements.

## Where ForgeMT Fits In

ForgeMT combines the best of these approaches to deliver an enterprise-ready platform:

- Orchestrates ephemeral runners seamlessly.
- Uses AWS-native identity and network isolation (IAM/OIDC).
- Built-in governance with full lifecycle automation.
- Designed for large, security-focused organizations.

ForgeMT doesn't reinvent ARC or EC2 modules but extends them with:

- **Strict multi-tenant isolation:** Each team runs in a separate AWS account to contain blast radius. IAM/OIDC enforces least privilege. Calico CNI manages Kubernetes network segmentation.
- **Full automation:** Tenant onboarding, runner patching, centralized monitoring, and drift remediation happen automatically, cutting manual toil and errors.
- **Centralized control plane:** One dashboard securely manages all tenants with governance, audit logs, and compliance-ready traceability.
- **Cost optimization:** Spot instances, warm pools, and autoscaling based on real-time metrics and spot prices reduce costs without sacrificing availability.
- **Open-source transparency:** 100% open source — no vendor lock-in, no license fees, full customization freedom.

## Architecture Overview

At its core, ForgeMT is a centralized control plane that orchestrates ephemeral runner provisioning and lifecycle management across multiple tenants running on both EC2 and Kubernetes.

### Key Components

- **Terraform module for EC2 runners** — provisions ephemeral EC2 runners with autoscaling, spot/on-demand capacity, and an ephemeral lifecycle.
- **Actions Runner Controller (ARC)** — manages EKS-based runners as Kubernetes pods with tenant namespace isolation.
- **OpenTofu + Terragrunt** — Infrastructure as Code managing tenant/account/region deployments declaratively.
- **IAM Trust Policies** — secure runner access with ephemeral credentials via role assumption (see the sketch after this list).
- **Splunk & Observability** — centralized logs and metrics per tenant.
- **Teleport** — secure SSH access to ephemeral runners for auditing and debugging.
- **EKS + Calico CNI** — scalable pod networking with strong tenant segmentation and minimal IP usage.
- **EKS + Karpenter** — demand-driven node autoscaling with spot and on-demand instances, plus warm pools.
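As an illustration of that trust-policy flow, a job on a ForgeMT runner can pick up ephemeral tenant credentials by assuming a tenant role instead of using static keys. The runner label below is an assumption, and the role ARN mirrors the sample tenant config shown later in this article:

```yaml
jobs:
  build:
    # Assumed ForgeMT runner label; actual labels depend on your deployment.
    runs-on: [self-hosted, forge-small]
    steps:
      - name: Assume the tenant role using the runner's ambient credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          # Matches iam_roles_to_assume in the sample tenant config below.
          role-to-assume: arn:aws:iam::123456789012:role/role_for_forge_runners
          aws-region: eu-west-1
      - name: Use the short-lived tenant credentials
        run: aws s3 ls   # runs with the tenant role's permissions
```

With no OIDC token configured, `configure-aws-credentials` falls back to the default credential chain (the EC2 instance profile here) to perform the `sts:AssumeRole` call, so the job never sees long-lived keys.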
## ForgeMT Control Plane

The control plane is the platform's brain, managing runner provisioning, lifecycle, security, scaling, and observability:

- **Centralized Orchestration:** Decides when and where to spin up ephemeral runners (EC2 or Kubernetes pods).
- **Multi-Tenant Isolation:** Isolates each tenant via dedicated AWS accounts or Kubernetes namespaces, IAM roles, and network policies.
- **Security Enforcement:** Applies hardened runner configurations, automates ephemeral credential rotation, and enforces least privilege.
- **Scaling & Optimization:** Integrates with Karpenter and EC2 autoscaling to scale runners up and down with demand and cost awareness.
- **Observability & Governance:** Streams logs and metrics to Splunk; provides audit trails and compliance dashboards.

## Runner Types and Usage

### Tenant Isolation

Each ForgeMT deployment is single-tenant and region-specific. IAM roles, policies, VPCs, and services are scoped exclusively to that tenant-region pair. This hard boundary prevents cross-tenant access, simplifies compliance, and minimizes blast radius.

### EC2 Runners

- Ephemeral VMs booted from Forge-provided or tenant-custom AMIs.
- Jobs run directly on VMs or inside containers.
- IAM role assumption replaces static credentials.
- Terminated after each job to avoid drift or leaks.

### EKS Runners

- Managed by ARC as Kubernetes pods in tenant namespaces.
- Images pulled from Forge or tenant ECR repositories.
- Scales dynamically for burst workloads.

### Warm Pools and Limits

ForgeMT supports warm pools of pre-initialized runners to minimize cold-start latency — especially beneficial for EC2 runners with slower boot times.

Per-tenant limits enforce:

- Max concurrent runners
- Warm pool size
- Runner lifetime (auto-termination after jobs)

These controls prevent resource abuse and keep costs predictable. Jobs select a runner type through `runs-on` labels, as the sketch below illustrates.
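The label and scale-set names here are assumptions that mirror the sample tenant config in the next section ("small"/"large" EC2 specs, a "k8s" ARC scale set); the exact values depend on your deployment:

```yaml
jobs:
  unit-tests:
    # Assumed label: ephemeral EC2 VM from the burstable "small" spec.
    runs-on: [self-hosted, forge-small]
    steps:
      - uses: actions/checkout@v4
      - run: make test

  release-build:
    # Assumed label: ephemeral EC2 VM from the compute-heavy "large" spec.
    runs-on: [self-hosted, forge-large]
    steps:
      - uses: actions/checkout@v4
      - run: make release

  container-job:
    # ARC scale-set runners are targeted by their scale set name.
    runs-on: k8s
    steps:
      - uses: actions/checkout@v4
      - run: make lint
```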
## Tenant Onboarding

Deploying a new tenant is straightforward and fully automated via a single declarative config file, for example:

```yaml
gh_config:
  ghes_url: ''
  ghes_org: cisco-open

tenant:
  iam_roles_to_assume:
    - arn:aws:iam::123456789012:role/role_for_forge_runners
  ecr_registries:
    - 123456789012.dkr.ecr.eu-west-1.amazonaws.com

ec2_runner_specs:
  small:
    ami_name: forge-gh-runner-v*
    ami_owner: '123456789012'
    ami_kms_key_arn: ''
    max_instances: 1
    instance_types:
      - t2.small
      - t2.medium
      - t2.large
      - t3.small
      - t3.medium
      - t3.large
    pool_config: []
    volume:
      size: 200
      iops: 3000
      throughput: 125
      type: gp3
  large:
    ami_name: forge-gh-runner-v*
    ami_owner: '123456789012'
    ami_kms_key_arn: ''
    max_instances: 1
    instance_types:
      - c6i.8xlarge
      - c5.9xlarge
      - c5.12xlarge
      - c6i.12xlarge
      - c6i.16xlarge
    pool_config: []
    volume:
      size: 200
      iops: 3000
      throughput: 125
      type: gp3

arc_runner_specs:
  dind:
    runner_size:
      max_runners: 100
      min_runners: 1
    scale_set_name: dependabot
    scale_set_type: dind
    container_actions_runner: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/actions-runner:latest
    container_requests_cpu: 500m
    container_requests_memory: 1Gi
    container_limits_cpu: '1'
    container_limits_memory: 2Gi
    volume_requests_storage_type: gp2
    volume_requests_storage_size: 10Gi
  k8s:
    runner_size:
      max_runners: 100
      min_runners: 1
    scale_set_name: k8s
    scale_set_type: k8s
    container_actions_runner: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/actions-runner:latest
    container_requests_cpu: 500m
    container_requests_memory: 1Gi
    container_limits_cpu: '1'
    container_limits_memory: 2Gi
    volume_requests_storage_type: gp2
    volume_requests_storage_size: 10Gi
```

The ForgeMT platform uses this config to:

- Provision tenant-specific AWS accounts and resources.
- Set IAM roles with least-privilege trust policies.
- Configure GitHub integration and runner specs.
- Enforce tenant limits and runner types.

This automation enables **zero-touch onboarding**, with no manual AWS or GitHub setup required by the tenant.

## Extensibility

ForgeMT lets tenants customize their environments and control runner access:

- **Custom AMIs** for EC2 runners with tenant-specific tooling.
- **Private ECR repositories** to host container images for VMs or Kubernetes.
- **Tenant IAM roles** with trust policies so ForgeMT runners assume them securely without static keys (see the sketch below).
- **Advanced access patterns** like chained role assumptions or resource-based policies for complex needs.

This lets each team tune cost, security, and performance independently without affecting core platform stability.
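To illustrate the tenant-role pattern, here is a hedged CloudFormation-style sketch of what a tenant-owned role might look like. The Forge account ID, runner-role name, and attached policy are placeholders; the real principal and permissions are defined by your deployment:

```yaml
# Hypothetical sketch of a tenant-owned role that ForgeMT runners
# can assume. All ARNs and names are placeholders.
Resources:
  ForgeTenantRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: role_for_forge_runners
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              # Assumed ARN of the ForgeMT runner role in the Forge account.
              AWS: arn:aws:iam::999999999999:role/forge-runner-role
            Action: sts:AssumeRole
      # Least-privilege example: grant only what the tenant's jobs need.
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```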
## Security Model

ForgeMT's foundation is strong isolation and ephemeral execution to reduce risk:

- **Dedicated IAM roles, namespaces, and AWS accounts** per tenant.
- **No cross-tenant visibility or access.**
- **Ephemeral runners** destroyed immediately after job completion to prevent credential or data leakage.
- **Temporary credentials via IAM role assumption** replace static AWS keys.
- **Fine-grained access control** configurable by tenants for resource permissions.
- Full audit trail of provisioning, execution, and shutdown logged via CloudWatch → Splunk.
- Meets CIS Benchmarks and internal security policies.

## Debugging in a Secure, Ephemeral World

Ephemeral runners mean persistent debugging isn't possible by design, but ForgeMT offers:

- **Live debugging with Teleport:** Keep runners alive temporarily via workflow tweaks to enable SSH into running jobs (see the sketch after this list).
- **Reproducible reruns:** Failed jobs can be rerun identically from the GitHub UI.
- **Log-based troubleshooting:** Access runner telemetry, syslogs, and job logs centrally without infrastructure exposure.
- **Kubernetes support:** The same debugging mechanisms apply to EKS runners, preserving isolation and auditability.
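The "workflow tweak" can be as simple as a guarded hold step. This is a sketch under assumptions: the runner label, hold duration, and `debug` input are illustrative, and Teleport access itself is configured by the platform rather than by the workflow.

```yaml
on:
  workflow_dispatch:
    inputs:
      debug:
        description: Hold the runner on failure for live debugging
        type: boolean
        default: false

jobs:
  build:
    runs-on: [self-hosted, forge-small]   # assumed ForgeMT runner label
    steps:
      - uses: actions/checkout@v4
      - run: make build
      - name: Keep runner alive for Teleport SSH
        if: failure() && inputs.debug
        run: sleep 3600   # one-hour window; the runner is still destroyed afterwards
```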
## Conclusion

ForgeMT is likely overkill for small teams. Start simple with ephemeral runners (EC2 or ARC), GitHub Actions, and Terraform automation, and only scale up when you hit real pain points. ForgeMT shines in multi-team environments where tenant isolation, governance, and platform automation are mission-critical; for solo teams, it just adds unnecessary complexity.

ForgeMT addresses the major enterprise challenges of running GitHub Actions runners at scale by delivering:

- Strong multi-tenant isolation
- Fully automated lifecycle management and governance
- Flexible runner types with cost-aware autoscaling and warm pools
- Secure, ephemeral environments that meet compliance needs
- An open-source, extensible platform for customization

For organizations struggling to scale self-hosted runners securely and efficiently on AWS, ForgeMT provides a battle-tested, transparent platform that combines AWS best practices with developer-friendly automation.

## Dive Into the ForgeMT Project

Ideas are cheap — execution is what counts. ForgeMT's source code is public — check it out:

👉 https://github.com/cisco-open/forge/

⭐️ If you find it useful, don't forget to drop a star!

## 🤝 Connect

Let's connect on LinkedIn and GitHub.