The Essential Guide to Kubernetes Virtual Clusters

Written by gokulchandrapr | Published 2021/05/25
Tech Story Tags: kubernetes | kubernetes-cluster | kubernetes-infrastructure | programming | coding | k8s | virtual-cluster | software-development

TLDR: Cluster sharing is common in almost all organizations using Kubernetes in domains like testing, dev, sales, etc., where different applications, teams, or even other departments run and operate in a shared environment. Multi-tenancy adds a layer of complexity to your system and comes with some restrictions for the tenants. Virtual clusters can be created through a self-service portal to get users onboarded in minutes. With vcluster, users can create k3s-based virtual clusters that run inside a single namespace on top of full-blown Kubernetes clusters.

As Kubernetes matures into critical business technology infrastructure, the need for Kubernetes access for applications and engineers is also growing. It is neither feasible nor cost-efficient to always use whole physical Kubernetes clusters for each tenant in an organization. Although cluster sharing might not be common in critical/production settings, it is common in almost all organizations using Kubernetes in domains like testing, dev, sales, etc., where different applications, teams, or even other departments run and operate in a shared environment.
Although Kubernetes cannot guarantee perfectly secure isolation between tenants, it does offer features that may be sufficient for specific use cases. Nevertheless, introducing any form of multi-tenancy adds a layer of complexity to your system and comes with some restrictions for the tenants.
In an enterprise environment, the tenants of a cluster are distinct teams within the organization. The concept of a tenant covers not only the cluster users but also the workload set constituting computing, network, storage, and other resources. Although there are constructs like role bindings, network policies, resource quotas, and other access control mechanisms to enable multi-tenancy, the fundamental cluster layout still comes down to how the cluster is shared and which entities share it.
There are tradeoffs between the different ways clusters can be provisioned. Taking the last approach from the list below, 'small single-use clusters': what if we could follow this approach while adding the missing advantages of cost-efficiency and ease of management?
Virtual clusters are Kubernetes clusters that run on top of other Kubernetes clusters. They need no resources beyond the cluster they run on, yet have their own dedicated control plane and can be installed like traditional Kubernetes components (cost-efficiency + ease of management + resilience + application security).
The core idea of virtual clusters is to provision isolated Kubernetes control planes (e.g. API servers) that run on top of "real" Kubernetes clusters. Compared to fully separate "real" clusters, virtual clusters do not have their own node pools. Instead, they schedule workloads onto the underlying cluster while keeping their own control plane. All higher-level API objects such as Deployments, StatefulSets, or any CRDs live only inside the vcluster, but the pods are scheduled onto the underlying cluster. This enables operators to provision hundreds of clusters in minutes without worrying about operational overhead and costs.
With virtual clusters, operators/platform engineers can build simple tooling that lets users request a cluster through a self-service portal and get onboarded in minutes, as there is no overhead of provisioning machines, running complex bootstrapping methods to install Kubernetes from scratch, or configuring complex policies to enable isolation. This is significantly cheaper than creating separate full-blown clusters, and it offers better multi-tenancy and isolation than regular namespaces. A general pod creation workflow involves:
However, other low-level Kubernetes resources such as Services, Ingresses, PVs, PVCs, etc. need to be synchronized to the underlying cluster. This approach introduces no limitations on the Kubernetes cluster: tenants are not technically restricted compared to users of a single-tenant cluster, nor do they have to take other tenants into account.

vcluster

Loft's vcluster was originally part of their multi-tenancy product, which had a free variant supporting a minimal number of clusters and users. Recently, Loft unbundled vcluster and open-sourced it. vcluster uses a k3s server bundled into a single pod with low resource consumption to provide a separate control plane for each virtual cluster. With vcluster, users can create k3s-based virtual clusters that run inside a single namespace on top of full-blown Kubernetes clusters.
The two entities in vcluster are the vcluster control plane and the vcluster syncer. The control-plane component is based on k3s, which provides a dedicated Kubernetes control plane, and the syncer stands in for a scheduler by copying pods and other resources to the underlying host cluster. Higher-level Kubernetes resources such as Deployments, StatefulSets, CRDs, etc. are confined to the virtual cluster, whereas low-level resources such as Pods, ConfigMaps mounted in Pods, etc. are synced to the underlying host namespace.

Try it Out

The underlying cluster (host cluster/core cluster) is a bare-metal, two-node Kubernetes cluster bootstrapped using kubeadm (with the master taint removed to allow scheduling workloads on the master). Virtual clusters can be created irrespective of the platform or managed Kubernetes service; the only requirement is kubectl access with the necessary permissions to operate in the host cluster or a namespace.
vcluster supports multiple modes of deployment: Helm, plain manifests applied with kubectl, and the vcluster CLI binary (which deploys a Helm chart under the hood). In the example topology below, vcluster is deployed using Argo CD running on the core cluster, with all the required manifests procured from a Git repository (GitOps CD).
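For reference, the non-GitOps paths look roughly like the following sketch (the chart repository, chart name, and commands follow the vcluster documentation as I understand it and may differ by version):

# Option 1: Helm - install the vcluster chart into a dedicated namespace
kubectl create namespace vcluster-red
helm install vcluster-red vcluster --repo https://charts.loft.sh -n vcluster-red

# Option 2: plain manifests - render the chart once and apply it with kubectl
helm template vcluster-red vcluster --repo https://charts.loft.sh -n vcluster-red > vcluster-red.yaml
kubectl apply -n vcluster-red -f vcluster-red.yaml

# Option 3: the vcluster CLI (deploys the same Helm chart under the hood)
vcluster create vcluster-red -n vcluster-red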
The repo below contains the Service, Role & RoleBinding, ServiceAccount, and StatefulSet manifests (under the path vcluster-red) required to create a virtual cluster in the namespace vcluster-red.
A new application is created using the repo above:
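A minimal Argo CD Application for this could look roughly like the sketch below (the repository URL is a placeholder; the path and namespace match the example):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: vcluster-red
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/vcluster-manifests.git  # placeholder repo
    targetRevision: HEAD
    path: vcluster-red                  # directory holding the vcluster manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: vcluster-red
  syncPolicy:
    syncOptions:
      - CreateNamespace=true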
As shown below, the stack deploys a StatefulSet named vcluster-red, the roles and role-bindings required for the vcluster components to perform the necessary actions on the host cluster, and a CoreDNS pod.
Every vcluster runs on top of another Kubernetes cluster, called the host cluster. Each vcluster runs as a regular StatefulSet inside a namespace of the host cluster, called the host namespace. Everything that is created inside the vcluster lives either inside the vcluster itself or inside the host namespace. Users can run multiple vclusters in the same namespace or run vclusters inside another vcluster (nested mode). The namespace 'vcluster-red' on the host cluster is deployed with all the components above.
The vcluster is associated with a role that allows it to perform all actions on most of the low-level components (pods, services, PVCs, etc.). The vcluster-red-0 pod controlled by the StatefulSet 'vcluster-red' is a multi-container pod.
The virtual-cluster container contains the API server, the controller manager, and a connection (or mount) to the data store. By default, vclusters use SQLite as the data store and run the API server and controller manager of k3s. Each vcluster has its own control plane consisting of: the Kubernetes API server (kubectl requests against the virtual cluster are pointed to this service), the data store (where the API server stores all resources and certificate information), and the controller manager (which creates objects in the data store and reconciles them as required).
As shown below, the virtual-cluster container hosts the k3s server, and once a virtual cluster is created, kubectl operations can be performed from inside this container (using the admin kubeconfig file in the cluster directory).
The syncer is the component that syncs/copies objects from the vcluster to the underlying cluster. Vclusters lack a scheduler, and the syncer copies the pods that need to be scheduled from the vcluster to the underlying cluster. The host cluster then takes care of scheduling (regular Kubernetes scheduling), and the vcluster keeps the pod on the vcluster and the host cluster in sync.
Vclusters use SQLite as the data store (by default k3s uses SQLite in place of etcd); a persistent volume is used to store all the data on the node where the vcluster StatefulSet is created. In this topology the host cluster uses the local-path provisioner as the storage controller. As shown below, a persistent volume claim and persistent volume are created in the host cluster's vcluster-red namespace.
Data stored in the persistent volume includes TLS certs and keys (similar to /etc/kubernetes in kubeadm-based clusters), tokens (node-token), credentials, and the database dump.
Apart from the syncer and vcluster (control-plane) components each vcluster has its own DNS service (CoreDNS by default) which allows pods in the vcluster to get the IP addresses of services that are also running in this vcluster.
The host cluster namespace hosting the virtual cluster will have a kubeconfig secret created in it. Users can also procure the kubeconfig of the virtual cluster from the api-server pod (/<vcluster-name>). When using the vcluster CLI binary (which deploys a Helm chart), the file is written to the root directory by default, and users can run 'vcluster connect <cluster-name> -n <cluster-namespace>'.
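Assuming the vcluster from this example, retrieving a kubeconfig could be sketched as follows (the secret name is left as a placeholder since it depends on the vcluster version):

# Let the CLI fetch the kubeconfig of the virtual cluster (written to a local file by default)
vcluster connect vcluster-red -n vcluster-red

# Or inspect the kubeconfig secret created in the host namespace
kubectl -n vcluster-red get secrets                             # locate the kubeconfig secret
kubectl -n vcluster-red get secret <kubeconfig-secret> -o yaml  # decode the data field manually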
As shown below the server address in the kubeconfig maps to localhost:6443.
The vcluster API server service is created in the host cluster namespace (in this case the vcluster-red namespace). Users can expose the service via NodePort or Ingress and use that address in the kubeconfig above (server: <host-cluster-host-ip>:<node-port>, or the ingress hostname on the host cluster) to access the cluster. The other easy way is to simply port-forward the service; the vcluster CLI's connect command uses port-forwarding, where the session must stay active to access the cluster (technically not feasible if users plan to use this in an organizational setting).
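A sketch of both options, assuming the API server service is named vcluster-red and serves on port 443 as in this topology:

# Short-lived access: port-forward the API server service (session must stay open)
kubectl -n vcluster-red port-forward svc/vcluster-red 8443:443

# Longer-lived access: switch the service to NodePort and point the kubeconfig's
# server field at https://<host-cluster-node-ip>:<allocated-node-port>
kubectl -n vcluster-red patch svc vcluster-red -p '{"spec": {"type": "NodePort"}}'
kubectl -n vcluster-red get svc vcluster-red   # note the allocated NodePort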
Using the kubeconfig above with the server mapping to a NodePort in this topology, users can use the virtual cluster the same way as any other cluster. The kubeconfig of a specific virtual cluster provides admin capabilities for that cluster, so users don't have to rely on any other special object-level permissions.
As shown above, the virtual cluster is a completely isolated cluster where users have their own control plane and need not worry about permissions or about the nodes where the workloads are scheduled. With this approach, the platform team can simply add some tooling to create the virtual cluster and hand the corresponding kubeconfig to the users/tenants; there is no need for complex configuration to enable isolation.
Nodes in Virtual Cluster
By default, the vcluster shows one node: the node that the vcluster pod is running on, even though pods synchronized from the vcluster to the underlying cluster may be scheduled on other nodes. This behavior can be customized using flags passed to the vcluster, such as --sync-all-nodes, which syncs all nodes from the host cluster. vcluster also provides --enforce-node-selector and --node-selector flags to control the scheduling of pods on the host cluster.
As the true scheduling is all done in the underlying host cluster, the vcluster creates fake nodes instead of copying the actual physical node config (by default this is set to true); this behavior can be changed by setting --fake-nodes to false.
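A sketch of how these flags could be wired into the syncer container of the vcluster StatefulSet (an excerpt only; the container layout and image are assumptions for illustration):

# Excerpt from the vcluster StatefulSet spec: flags are passed as syncer args
containers:
  - name: syncer
    image: loftsh/vcluster            # image assumed for illustration
    args:
      - --fake-nodes=false            # copy real node specs instead of creating fake nodes
      - --sync-all-nodes              # sync every host-cluster node into the vcluster, or...
      # - --node-selector=tenant=red  # ...only sync nodes matching a selector
      # - --enforce-node-selector     # and enforce that selector for scheduled pods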
Pod Scheduling and Host Cluster Sync
All high-level resources such as Deployments, StatefulSets, CRDs, etc. are confined to the virtual cluster and are not reflected in or accessible from the underlying host cluster. As shown below, the high-level resources custom-resource and deployment-1 are stored in the data store of the vcluster (SQLite by default). The syncer component in the kube-system namespace of the virtual cluster syncs the low-level objects (pods in this case) to the namespace in which the virtual cluster is running on the underlying host cluster, where they are then scheduled using the traditional Kubernetes scheduling pattern.
For example, take a sample application with Deployment, Service, and ServiceAccount objects as shown below:
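The sample manifests look roughly like this (names and the container image are placeholders chosen for illustration; only deployment-1 and the NodePort service are referenced later in the walkthrough):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: sample-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      serviceAccountName: sample-app
      containers:
        - name: web
          image: nginx:1.21           # placeholder image
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: sample-app
spec:
  type: NodePort                      # exposed as NodePort, as in the example below
  selector:
    app: sample-app
  ports:
    - port: 80
      targetPort: 80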
When the above manifests are deployed using kubectl (with the kubeconfig pointing to the virtual cluster), the virtual-cluster API server in the host cluster namespace receives the request.
As shown below, the above components are deployed on 'vcluster-red', where the requests were received by the API server of the virtual cluster without any interference with the host cluster API server. The tenants can access all the objects they have created on the virtual cluster.
As shown below, in the host cluster namespace (where the vcluster is running), none of the deployments, replica sets, or service accounts created in the vcluster are listed.
The pods controlled by the deployment are synced to the underlying namespace of the host cluster.
As shown below, the underlying cluster's host namespace now contains the pods along with the control-plane components. All objects created in the vcluster that are synced to the host cluster are placed in the namespace where the virtual cluster is running.
The syncer component takes care of naming and labeling the pods that are synced to the host cluster. The syncer removes problematic fields (such as pod labels) and also transforms certain fields of these objects before sending them to the host cluster. As shown below, the pod name is suffixed with the vcluster namespace and a random ID, and an annotation records the real name of the pod in the virtual cluster.
Services are another low-level resource synced to the host cluster namespace. As shown below, the sample application's service is synced and exposed as a NodePort; in this scenario tenants/users can use the underlying host cluster IPs or a resolvable hostname to access the service running in the vcluster. The same applies to ingress resources: vclusters can run their own ingress controller with their ingress resources synced to the host cluster, or use a shared ingress controller running in the host cluster.
Service objects that are synced will be provided with specific labels and owner-references.
As shown below, the service created on a vcluster is accessible using the master IP of the underlying host and the corresponding service NodePort. In this flow, the user/tenant just uses their vcluster kubeconfig to create an application/workload, validates it against the vcluster, and accesses the application using an ingress hostname or an admin-provided hostname. Everything else is masked from the user; they don't have to care where the cluster is running or go through any special onboarding steps to access and use the cluster.
Storage
Since the vcluster's syncer synchronizes pods to the underlying host cluster to schedule them, vcluster users can use the storage classes of the underlying host cluster to create persistent volume claims and to mount persistent volumes.
In the current topology, the host cluster runs a 'local-path-provisioner (storage-controller)'.
A storage class called 'local-path' is created in the underlying host cluster and set as the default storage class.
A sample pod is created in the vcluster which mounts a persistent volume using a PVC (persistent volume claim).
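A sketch of the objects used in this step (object names and the image are placeholders; the storage class matches the host cluster's local-path class):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path        # storage class provided by the underlying host cluster
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pvc-test
spec:
  containers:
    - name: app
      image: busybox:1.34             # placeholder image
      command: ["sh", "-c", "echo hello > /data/out.txt && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-claim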
As shown below, a Pod, PV, and PVC are created in the vcluster (the vcluster has no storage class or storage controller running). By default, the PV objects created on vclusters are fake objects that do not copy the original PV information from the host cluster; this can be changed by setting the '--fake-persistent-volumes' flag to false in the vcluster spec.
As shown below, the PVC is synchronized to the host cluster namespace.
The PV on the host cluster namespace is bound with the PVC created above.
PV and PVC objects are labelled with necessary vcluster information by the syncer component.
Users can set the --enable-storage-classes flag to true (false by default) in the vcluster spec to sync storage classes from the host cluster to the vcluster.
Networking
Resources such as Service and Ingress are synced from the virtual cluster to the host cluster in order to enable correct network functionality for the vcluster.
Traffic Between Pods and Services
As the pods are synced to the underlying host cluster, they actually run inside the host namespace of the underlying cluster. This means that these pods and services have regular cluster-internal IP addresses provided by the CNI running on the host cluster. Communication between pods, and between pods and services, works the same as in any other Kubernetes cluster with a CNI.
DNS Resolution
Each vcluster has its own DNS service (CoreDNS by default), which allows pods in the vcluster to resolve the IP addresses of services that are also running in this vcluster. The vcluster syncer ensures that the intuitive Kubernetes DNS naming for services applies, and users can connect to these DNS names, which in fact map to the IP addresses of the synchronized services in the underlying host cluster.
Ingress Controller and Ingress Traffic
Vcluster also synchronizes Ingress resources by default (this can be changed by using the '--disable-sync-resources=ingresses' flag in the syncer component spec). Users can create an ingress in a vcluster to make a service in that vcluster available via a hostname/domain.
Users can run a dedicated ingress controller per virtual cluster, or, since ingress resources are synchronized to the underlying cluster by default anyway, use a shared ingress controller running on the host cluster instead of running a separate one in each vcluster.
In the current topology, the host cluster runs a 'nginx-ingress-controller'.
Two sample applications below are created along with an Ingress object on the vcluster (here the vcluster has no ingress-controller running).
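The Ingress object could be sketched as follows (the hostname and service names are placeholders; the ingress class refers to the shared NGINX controller on the host cluster):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-ingress
  annotations:
    kubernetes.io/ingress.class: nginx   # handled by the shared controller on the host cluster
spec:
  rules:
    - host: apps.example.com             # placeholder hostname
      http:
        paths:
          - path: /app-a
            pathType: Prefix
            backend:
              service:
                name: app-a              # placeholder service for the first application
                port:
                  number: 80
          - path: /app-b
            pathType: Prefix
            backend:
              service:
                name: app-b              # placeholder service for the second application
                port:
                  number: 80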
As shown below, the Pods, SVC and Ingress objects are created on the vcluster. The Ingress object below shows the IP address procured from the underlying host cluster.
Services synced to host cluster namespace:
Ingress object created on vcluster synced to host cluster:
Logs from the ingress-controller on the underlying host cluster showing processing of synced ingress resource.
Ingress objects are labelled with necessary vcluster information by the syncer component.
As shown below, the applications running on the vcluster can be accessed through the ingress (in this case via Host-IP:<controller-port>) using the host IPs of the two Kubernetes hosts.
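For example, with the placeholder hostname from the Ingress sketch above, access from outside could look like this (the controller port depends on how the NGINX controller is exposed on the host cluster):

# Reach the applications through the shared ingress controller on either host
curl -H "Host: apps.example.com" http://<host-1-ip>:<controller-node-port>/app-a
curl -H "Host: apps.example.com" http://<host-2-ip>:<controller-node-port>/app-b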

Multi-Tenancy SIG VirtualCluster Project

VirtualCluster is a multi-tenancy Kubernetes SIG incubator project that has been under development for almost a year. The project follows an operator approach to create and manage virtual clusters on an underlying host cluster; a new CRD, VirtualCluster, is introduced to model the tenant control plane. Moving forward, the SIG plans to provide a Cluster API (CAPI) provider for virtual clusters based on this project.
The Super Master here is the underlying host Kubernetes cluster, which hosts multiple tenant masters (virtual clusters) and manages the actual node resources. Each tenant gets a dedicated control plane called the Tenant Master; VirtualCluster uses the regular upstream Kubernetes api-server, etcd, and controller-manager as control-plane components.
The tenant master lifecycle is managed by a tenant operator that runs on the super master. The tenant user operates the tenant master directly without touching any components of the super master. The syncer is a controller that copies/synchronizes objects from the tenant master to the super master, where the super master's scheduler places them on actual nodes. vn-agent is a proxy that forwards all requests from the tenant master to the node's actual kubelet. Conceptually, the super master is a pod resource provider, and all tenant object control and management happens in the tenant master. All workload controllers, extensions, CRDs, etc. are installed on the tenant master.
The control-plane stack is made up of three components: vc-manager, vc-syncer, and vn-agent. The 'vc-manager' namespace in the super master holds these components; using this control plane, users can create and operate multiple tenant clusters (virtual clusters) on the super master.
Pod creation flow
Tenant Master is the virtual cluster and the Super Master is the underlying cluster which hosts the virtual clusters.

Try Out

The super master (underlying cluster) here is the same cluster used for the other project.
VirtualCluster offers a kubectl plugin that can be used to create virtual clusters and procure their kubeconfigs. In this sample scenario, the installation is instead performed in two stages using manifests from Git applied by Argo CD running on the host cluster. The repo has two directories: one holds the control-plane manifests (vc-manager) and the CRDs required to create virtual cluster objects in the super master, and the second contains the manifests for creating the virtual cluster.
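For completeness, the plugin-based path is roughly the following (a sketch based on the project's walkthrough as I recall it; the exact flags may differ by release):

# Create a virtual cluster from a VirtualCluster spec and write its kubeconfig locally
kubectl vc create -f virtualcluster.yaml -o vc-sample-1.kubeconfig
kubectl --kubeconfig vc-sample-1.kubeconfig get namespaces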
First, the control-plane components are deployed as shown below:
The stack mainly creates three components in the vc-manager namespace: vc-manager, vc-syncer, and vn-agent. Apart from these, it creates the required roles and role-bindings.
As shown below, two new CRDs are introduced in the super master:
The virtualclusters CRD uses the component configuration from a clusterversion object to create virtual clusters. The basic spec just contains elements such as the cluster domain, PKI expiry timeline, clusterversion name, etc.
The clusterversion CRD specifies the tenant master component configuration (etcd, api-server, and controller-manager); vc-manager uses this spec when creating a virtual cluster.
As shown below, all control-plane components are deployed in the 'vc-manager' namespace of the host cluster. Optionally, users can configure vn-agent to communicate with the kubelet directly by using the same client cert/key used by the super master. By default, vn-agent works in a suboptimal mode, forwarding all kubelet API requests through the super master.
With the control plane deployed, users can now deploy any number of virtual clusters in host cluster namespaces. A vc-sample-1 cluster is created in the default namespace.
The above stack just creates two objects: a clusterversion and a virtualcluster.
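A trimmed-down sketch of the VirtualCluster object (the API group/version and field names follow the project's sample specs as I recall them and may differ by release; the referenced ClusterVersion is a much longer template of the tenant etcd/api-server/controller-manager and is best taken verbatim from the project's samples):

apiVersion: tenancy.x-k8s.io/v1alpha1
kind: VirtualCluster
metadata:
  name: vc-sample-1
  namespace: default
spec:
  clusterVersionName: cv-sample-np    # ClusterVersion object created in the first step
  clusterDomain: cluster.local        # cluster domain for the tenant master
  pkiExpireDays: 365                  # PKI expiry timeline mentioned above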
The vc-manager processes the request and creates the virtual cluster control-plane components in the default namespace of the super master.
When a namespace is created on the tenant cluster (virtual cluster), a corresponding namespace is created on the super master with the naming convention '<namespace>-<unique_id>-<actual_name_of_the_namespace_on_virtual-cluster>', and the synced objects are placed in the respective namespaces on the super master. In the above scenario, all the control-plane namespaces are created on the super master. For example, creating a 'test-namespace' on the virtual cluster creates 'default-78335e-vc-sample-1-test-namespace':
Namespace mapping
The 'apiserver-svc' in the default namespace maps to the api-server running in the tenant master; this can be exposed as a NodePort or via an ingress resource.
The namespace holds the admin-kubeconfig secret, which contains the kubeconfig for accessing the tenant master.
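Extracting it could look roughly like this (the secret key and the control-plane namespace name are assumptions inferred from the naming shown above):

# Pull the tenant-master kubeconfig out of the admin-kubeconfig secret (names assumed)
kubectl -n default-78335e-vc-sample-1 get secret admin-kubeconfig \
  -o jsonpath='{.data.admin-kubeconfig}' | base64 -d > vc-sample-1.kubeconfig
kubectl --kubeconfig vc-sample-1.kubeconfig get namespaces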
All synced resources will be labelled with required tenancy information by the syncer component.
Nodes in Virtual Cluster
In the tenant control plane, node objects only show up after tenant pods are created, and the super cluster's node topology is not fully exposed. This means VirtualCluster does not support DaemonSet-like workloads in the tenant control plane. Currently, the syncer controller rejects a newly created tenant pod if its nodeName has been set in the spec.
Sample pod scheduled on super master node:
All other constructs related to networking and storage are similar to the other project (vcluster) discussed above. VirtualCluster does not provide CoreDNS for each tenant cluster by default, so tenants should install CoreDNS in the tenant control plane if DNS is required. The syncer controller can then recognize the DNS service's cluster IP in the super cluster and inject it into any pod's spec.dnsConfig.

WG-Multi Tenancy implementation of Virtual Clusters vs vcluster

Both projects are largely homologous, with a few exceptions in how they are implemented. Some of the significant differences are:
Although virtual clusters have certain limitations that might not suit specific use cases (such as handling critical workloads in production), they mitigate the limitations introduced by namespace-based multi-tenancy, such as tenants being unable to use CRDs, install Helm charts that require RBAC, or change cluster-wide settings such as the Kubernetes version.
For example, running 50 m3.xlarge instances (a decent size for Kubernetes masters and nodes) for a year on AWS costs roughly ~$51,000 (self-managed Kubernetes on on-demand instances), ~$38,000 (self-managed with up-front payment), ~$51,000 (EKS on-demand), or ~$37,000 (EKS up-front); supporting services like EBS, load balancers, etc. are not included. In this scenario a 50-node cluster can host 100-150 virtual clusters per year, which is a significant difference compared with managing 100 separate clusters. On top of infrastructure costs, the operational costs are also significant.

Written by gokulchandrapr | Cloud Engineer
Published by HackerNoon on 2021/05/25