Kubernetes Resource Quotas

by Evgenii Demchenko, November 22nd, 2022
Too Long; Didn't Read

Cgroups is a Linux kernel mechanism that places processes in hierarchical groups for which the use of system resources can be limited. There are two versions: cgroup v1 and v2. The root cgroup holds (number of CPUs × 1024) CPU shares and distributes them to child cgroups in proportion to their cpu.shares. With cgroups v2, the special memory.min and memory.low controls can reserve memory exclusively for a container, which no one else can use. Applications (e.g. Java 15+) can be configured to respect the container's quotas rather than all the resources available on the Kubernetes node.


If you have been using Kubernetes for a while, you know what resource quotas are. But do you know them well enough? Do you know what mechanisms they are built on? If not, you soon will.

First of all, Kubernetes is a container management platform, so we will start with the mechanisms underlying containers.

CGROUPS

Cgroups (control groups) is a Linux kernel mechanism that allows you to place processes in hierarchical groups for which the use of system resources can be limited. For example, the "memory" controller limits the use of RAM, and the "cpuacct" controller accounts for the use of processor time.

There are two versions of cgroup: v1 and v2.

cgroup v1 controllers:

  • cpu – guarantees a minimum share of CPU time via "CPU shares", so that no process is starved.
  • cpuacct – generates reports on CPU usage; accounts for how much processor time a process consumes.
  • cpuset – allows you to pin processes to specific cores, so that only certain processes may run on a given core.
  • memory – monitors and limits the amount of memory (RAM) used.
  • blkio – sets limits on reading from and writing to block devices.

cgroup v2 is the next version of the Linux cgroup API. It provides a unified control hierarchy with enhanced resource management capabilities.


cgroup v2 has several improvements over cgroup v1, for example:

  • Java 15+ is cgroup v2 aware: applications can be configured to use the container's quotas rather than all the resources available on the Kubernetes node.
  • Supported by Kubernetes.
  • Enhanced resource allocation management and isolation across multiple resources.
  • Unified accounting for different types of memory allocation (network memory, kernel memory, etc.).
  • The kubelet automatically detects that the OS is running on cgroup v2 and behaves accordingly, with no additional configuration required.

There are system requirements for using cgroup v2: roughly, a sufficiently recent Linux kernel (the Kubernetes documentation recommends 5.8+) and a container runtime that supports cgroup v2.

CAPABILITIES

Capabilities are a means of managing privileges that, in traditional Unix-like systems, were available only to the root user. They are permissions for a process to make certain system calls, and there are only a few dozen of them.

Examples:

  • CAP_CHOWN – permission to change the UID and GID of a file.
  • CAP_KILL – permission to send signals (SIGTERM, SIGKILL, etc.).
  • CAP_NET_BIND_SERVICE – permission to bind ports with numbers below 1024.
  • and so on.
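In Kubernetes, capabilities are managed per container through securityContext. A minimal sketch (the pod name and image are illustrative; capability names are written without the CAP_ prefix):

```yaml
# Hypothetical example: drop all capabilities, then add back only
# NET_BIND_SERVICE so the container can bind to port 80.
apiVersion: v1
kind: Pod
metadata:
  name: capability-demo
spec:
  containers:
  - name: web
    image: nginx
    securityContext:
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]
```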

Finally, about quotas: mechanisms that let you limit resource usage for a container (not for the Pod as a whole).

  • Requests – a guaranteed amount of resources (if a node does not have enough free resources, the scheduler does not place the Pod on that node).
  • Limits – the maximum amount of a resource. Nothing is guaranteed; the total size of all limits can exceed the entire namespace quota. For example, you can set 999 trillion cores.
  • If you set only limits, then requests are automatically set equal to limits.
  • If you set only requests, limits are not set at all.

Limit defines the memory limit for the cgroup. If the container tries to use more memory than the limit, the OOM killer kills one of its processes.

Requests – with cgroups v1 they only affect Pod scheduling. With cgroups v2 there are the special memory.min and memory.low controls: memory reserved exclusively for the container, which no one else can use.

Tmpfs (ephemeral storage) counts as memory consumed by the container.

#Container resources example
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: app
    image: images.my-company.example/app:v4
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
        ephemeral-storage: "2Gi"
      limits:
        memory: "128Mi"
        cpu: "500m"
        ephemeral-storage: "4Gi"

How CPU requests work

Requests use cpu.shares. The root cgroup contains (number of CPUs × 1024) shares and distributes them to child cgroups in proportion to their cpu.shares, and so on down the hierarchy.

Shares only matter under contention: if other cgroups hold shares but are not using them, your processes are free to consume the spare CPU time.
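The mapping from a CPU request to cgroup v1 cpu.shares can be sketched as follows (this mirrors the kubelet's arithmetic of 1024 shares per core with a minimum of 2 shares; treat it as an illustration rather than the exact implementation):

```python
# Sketch of how a CPU request (in millicores) becomes a cgroup v1
# cpu.shares value: 1000m == 1 core == 1024 shares, minimum of 2 shares.
MILLI_CPU_PER_CPU = 1000
SHARES_PER_CPU = 1024
MIN_SHARES = 2

def milli_cpu_to_shares(milli_cpu: int) -> int:
    if milli_cpu == 0:
        # No request set: fall back to the minimum shares value.
        return MIN_SHARES
    shares = (milli_cpu * SHARES_PER_CPU) // MILLI_CPU_PER_CPU
    return max(shares, MIN_SHARES)

print(milli_cpu_to_shares(250))   # request "250m" -> 256
print(milli_cpu_to_shares(1000))  # request "1" (one full core) -> 1024
```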

How CPU Limits Work

Limits use cfs_period_us and cfs_quota_us ("us" stands for microseconds). Unlike requests, limits are based on time spans.

  • cfs_period_us – the time period over which quota usage is accounted. Equals 100000us (100ms).
  • cfs_quota_us – the allowed amount of CPU time, in microseconds, per period.

Scenario 1 (left picture): 2 threads and a 200ms quota. No throttling.

Scenario 2 (right): 10 threads and a 200ms quota. Throttling starts after 20ms, and the threads only receive CPU again after the remaining 80ms of the period.

Let's say you have configured a CPU limit of 2 cores; Kubernetes translates this into a 200ms quota. That means the container can use at most 200ms of CPU time per period without getting throttled.

Here is where the misunderstanding starts. As said above, the allowed quota is 200ms, which means that if you run 10 parallel threads on a 12-core machine (see the second figure) where all other pods are idle, the quota is exhausted after 20ms (10 × 20ms = 200ms), and all threads in that pod are throttled for the next 80ms. To make matters worse, the kernel scheduler had a bug that caused unnecessary throttling and prevented containers from reaching their allowed quota.
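The arithmetic above can be sketched as a small calculation (the numbers follow the scenarios described; this is an illustration, not the kernel's actual accounting):

```python
# Sketch of CFS quota accounting: a 2-core limit gives a quota of
# 2 * 100ms = 200ms of CPU time per 100ms period.
PERIOD_MS = 100  # cfs_period_us = 100000us

def throttled_after_ms(limit_cores: float, parallel_threads: int) -> float:
    """Wall-clock ms into a period until the quota is exhausted,
    assuming `parallel_threads` run flat out on distinct cores."""
    quota_ms = limit_cores * PERIOD_MS   # total CPU-time budget per period
    return quota_ms / parallel_threads   # ms until throttling kicks in

# Scenario 1: 2 busy threads, 2-core limit -> the quota lasts the whole period.
print(throttled_after_ms(2, 2))    # 100.0 (no throttling)
# Scenario 2: 10 busy threads, 2-core limit -> throttled after 20ms,
# then the container waits out the remaining 80ms of the period.
print(throttled_after_ms(2, 10))   # 20.0
```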

CPU Management Policy

The CPU Manager policy is set with the --cpu-manager-policy kubelet flag or the cpuManagerPolicy field in the kubelet configuration. To set it via the flag, edit the kubelet unit:

vim /etc/systemd/system/kubelet.service

and add the following lines:

--cpu-manager-policy=static \
  --kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi \
  --system-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi \


  • Allows you to assign dedicated cores to containers (via cpuset).
  • Works only if the pod has the Guaranteed QoS class.
  • The CPU request value must be an integer.
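As a sketch, a pod that would qualify for dedicated cores under the static policy (Guaranteed QoS with an integer CPU value; the names are illustrative):

```yaml
# Hypothetical pod eligible for exclusive cores: requests == limits
# (Guaranteed QoS) and the CPU value is an integer.
apiVersion: v1
kind: Pod
metadata:
  name: pinned-app
spec:
  containers:
  - name: app
    image: images.my-company.example/app:v4
    resources:
      requests:
        cpu: "2"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "1Gi"
```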

The role of K8S Scheduler in quotas distribution

The scheduler is responsible for placing pods on cluster nodes. It works in 2 stages:

  • Filtering – the scheduler selects suitable nodes. NodeResourcesFit is the scheduler plugin that checks resources on nodes: it determines which nodes have enough resources for the Pod. Some resources can be configured to be excluded from the check.
  • Scoring – evaluates the suitable nodes and selects the most appropriate one.

There are 3 scoring strategies to choose from:

  • LeastAllocated (default) – prefers the least utilized node.
  • MostAllocated – prefers the most utilized node (bin packing).
  • RequestedToCapacityRatio – scores nodes by the ratio of requested resources to capacity.
#Example of using scoringStrategy
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    profiles:
    - pluginConfig:
      - args:
          scoringStrategy:
            resources:
            - name: cpu
              weight: 1
            type: MostAllocated
        name: NodeResourcesFit


Storage Resource Quota

  • requests.storage – across all persistent volume claims, the sum of storage requests cannot exceed this value.
  • persistentvolumeclaims – the total number of PersistentVolumeClaims that can exist in the namespace.
  • <storage-class-name>.storageclass.storage.k8s.io/requests.storage – across all persistent volume claims associated with <storage-class-name>, the sum of storage requests cannot exceed this value.
  • <storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims – across all persistent volume claims associated with <storage-class-name>, the total number of persistent volume claims that can exist in the namespace.

For example, if an operator wants to quota storage with gold storage class separate from bronze storage class, the operator can define a quota as follows:

gold.storageclass.storage.k8s.io/requests.storage: 500Gi
bronze.storageclass.storage.k8s.io/requests.storage: 100Gi
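Wrapped into a complete ResourceQuota object, the example above might look like this (the object name is illustrative):

```yaml
# Hypothetical ResourceQuota separating gold and bronze storage classes.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
spec:
  hard:
    gold.storageclass.storage.k8s.io/requests.storage: 500Gi
    bronze.storageclass.storage.k8s.io/requests.storage: 100Gi
```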


Ephemeral storage

In release 1.8, quota support for local ephemeral storage was added as an alpha feature:

  • requests.ephemeral-storage – across all pods in the namespace, the sum of local ephemeral storage requests cannot exceed this value. This is the amount of free space that must be available on the node when the container is launched.
  • limits.ephemeral-storage – across all pods in the namespace, the sum of local ephemeral storage limits cannot exceed this value. This is the maximum amount of ephemeral storage available to the pod.
  • ephemeral-storage – same as requests.ephemeral-storage. Ephemeral storage covers emptyDir volumes (except tmpfs), container logs, and writable container layers. If one container exhausts this space, it runs out for all of them.

Less obvious quotas

  • count/<resource> – the maximum number of resources of a given type in the namespace.
  • count/widgets.example.com – an example for a "widgets" custom resource from the example.com API group.

Typical object counts:

  • count/persistentvolumeclaims
  • count/services
  • count/secrets
  • count/configmaps
  • count/replicationcontrollers
  • count/deployments.apps
  • count/replicasets.apps
  • count/statefulsets.apps
  • count/jobs.batch
  • count/cronjobs.batch

It is possible to configure the total number of objects that can exist in the namespace.
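As a sketch, an object count quota for a namespace could be declared like this (the names and values are illustrative):

```yaml
# Hypothetical ResourceQuota capping object counts in a namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    count/deployments.apps: "20"
    count/services: "10"
    count/secrets: "50"
    count/configmaps: "50"
```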

Reasons to use object count quotas:

  • Error protection – e.g. the default pod limit per node is 110.
  • To prevent bad practices.

PID limits

Limits on the number of PIDs. If a container creates a lot of PIDs, the node can run out of them.

This is a global kubelet setting, so different nodes can behave differently if their settings differ.

It makes it possible to prevent a "fork bomb":

:(){ :|:& };:
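One way to cap PIDs per pod is the podPidsLimit field of the kubelet configuration; a minimal sketch (the value is illustrative):

```yaml
# Hypothetical kubelet configuration fragment limiting each pod to 4096 PIDs.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 4096
```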


Quotas for extended resources

  • Extended resources – any resource configured by the cluster operator and supplied from outside; Kubernetes knows nothing about it and does not manage it in any way.
  • Node level – tied to a node, i.e. each node has some amount of the resource. Often managed by a Device Plugin.
  • Cluster level – shared by the entire cluster.

In quotas for extended resources you cannot use limits, only requests. Example of an extended resources quota:

#correct:
requests.nvidia.com/gpu: "4"
#not correct:
limits.nvidia.com/gpu: "4"
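Putting the correct form into a complete object, a quota for a GPU extended resource might look like this (the object name is illustrative):

```yaml
# Hypothetical ResourceQuota limiting GPU requests across the namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
spec:
  hard:
    requests.nvidia.com/gpu: "4"
```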


Network quotas and network Bandwidth

You can set the network bandwidth for a pod via annotations in spec.template.metadata.annotations to limit the container's network traffic.

If the parameters are not specified, the network bandwidth is not limited by default.

The following is an example:

apiVersion: apps/v1 
kind: Deployment 
metadata: 
  name: nginx 
spec: 
  template: 
    metadata: 
      annotations:
       # Ingress bandwidth
        kubernetes.io/ingress-bandwidth: 100M
       # Egress bandwidth
        kubernetes.io/egress-bandwidth: 1G
    spec: 
      containers: 
      - image: nginx  
        imagePullPolicy: Always 
        name: nginx 


  • The limit is enforced by a CNI plugin, so this is not a quota in the usual Kubernetes sense.
  • It is configured via pod annotations and works on the basis of a Token Bucket Filter.


Some other shared resources

  • inodes – ephemeral container storage usually lives on a shared file system, so containers share its inodes.
  • dentry cache – the file system cache that stores the relationship between files and the directories that contain them.


Conclusion

Why are quotas necessary? Because they:

  • Reduce the influence of containers on each other.
  • Provide cluster stability.
  • Ensure predictability of container performance.