A few days ago, someone shared with me a project to run video transcoding jobs in Kubernetes.
Her tests, run on a default Kubernetes installation on bare-metal servers with 40 cores & 512GB RAM, allocated 5 full CPU cores to each transcoding pod, then scaled up to 6 concurrent tasks per node, thus loading the machines at 75% (on paper). The expectation was that these jobs would run at the same speed as a single task.
The results were underwhelming: as concurrency went up, individual task performance went down. At maximum concurrency, she actually observed a 50% decrease in single-task performance.
I did some research to understand this behavior. It is referenced in several Kubernetes issues such as #10570 and #171, and easy to find via a Google search. The documentation itself sheds some light on how the default scheduler works and why performance can be impacted by concurrency on intensive tasks.
There are different methods to allocate CPU time to containers:
CPU pinning: if the host has enough CPU cores available, allocate 5 "physical cores" that will be dedicated to this pod/container;
Temporal slicing: consider that the host's N cores collectively represent an amount of compute time, which you allocate in slices to containers. 5% of CPU time means that for every 100ms, 5ms of compute are dedicated to the task.
Obviously, pinning CPUs can be interesting for some specific workloads, but it has a big scaling problem, for the simple reason that you cannot run more pods than you have cores in your cluster. As a result, Docker defaults to the second method, which also makes it possible to allocate less than 1 CPU to a container. This has an impact on performance, one that also shows up in HPC or any other CPU-intensive task.
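Under the hood, this temporal slicing maps to the kernel's CFS quota/period pair (the period defaults to 100ms). As a sketch of the arithmetic, with illustrative numbers:

```shell
# CFS gives each cgroup a quota of CPU time per scheduling period.
PERIOD_US=100000   # default CFS period: 100ms, in microseconds
CPUS="0.05"        # a 5% CPU allocation (0.05 cores)
# quota = requested cores x period: here, 5ms of compute per 100ms window
QUOTA_US=$(awk -v c="$CPUS" -v p="$PERIOD_US" 'BEGIN { printf "%d", c * p }')
echo "cpu.cfs_quota_us=${QUOTA_US}"   # -> cpu.cfs_quota_us=5000
```

Once the quota is consumed, the task is throttled until the next period starts, which is exactly why a CPU-hungry encoder under a tight quota slows down rather than bursting.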
Can we mitigate this risk? Maybe. Docker provides the cpuset option at the engine level; it is, however, not leveraged by Kubernetes. LXD containers, on the other hand, can be pinned to physical cores via cpusets in an automated fashion, as explained in this blog post by @stgraber.
This opens 2 new options for scheduling our workloads:
Let’s see how these compare!
You don’t all have the time to read the whole thing so in a nutshell:
Note: these results are in AWS, where there is a hypervisor between the metal and the units. I am waiting for hardware with enough cores to complete the task. If you have hardware you’d like to throw at this, be my guest and I’ll help you run the tests.
In this blog post, we will do the following:
For this blog post, it is assumed that
git clone https://github.com/madeden/blogposts
cd blogposts/k8s-transcode
Our benchmark is a transcoding task. It uses an ffmpeg workload, designed to minimize the time to encode by exhausting all the resources allocated to compute as fast as possible. We use a single video for the encoding, so that all transcoding tasks can be compared. To minimize bottlenecks other than pure compute, we use a relatively low-bandwidth video, stored locally on each host.
The transcoding job is run multiple times, with the following variations:
We measure for each pod how long the encoding takes, then look at correlations between that and our variables.
When I want to do something with a video, the first thing I do is call my friend Ronan (@ronan_delacroix). He knows everything about everything for transcoding!
So I asked him something pretty straightforward: I want the most CPU intensive ffmpeg transcoding one liner you can think of.
He came back with not only the one liner, but also found a very neat docker image for it, kudos to Julien for making this.
All together you get:
docker run --rm -v $PWD:/tmp jrottenberg/ffmpeg:ubuntu \
  -i /tmp/source.mp4 \
  -stats -c:v libx264 \
  -s 1920x1080 \
  -crf 22 \
  -profile:v main \
  -pix_fmt yuv420p \
  -threads 0 \
  -f mp4 -ac 2 \
  -c:a aac -b:a 128k \
  -strict -2 \
  /tmp/output.mp4
The key to this setup is -threads 0, which tells ffmpeg that it's an all-you-can-eat buffet.
For test videos, HD Trailers or Sintel Trailers are great sources. I’m using a 1080p mp4 trailer for source.
Transcoding maps directly to the notion of Job in Kubernetes. Jobs are batch processes that can be orchestrated very easily, and configured so that Kubernetes will not restart them when the job is done.
The equivalent to Deployment Replicas is Job Parallelism.
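As a sketch, a minimal Job manifest for one transcode looks like this (the name and the trimmed-down args are illustrative; the full chart below carries the complete ffmpeg command line):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: transcode-sample        # illustrative name
spec:
  parallelism: 1                # the Job equivalent of Deployment replicas
  template:
    spec:
      containers:
      - name: transcoder
        image: jrottenberg/ffmpeg:ubuntu
        args: ["-i", "/data/source.mp4", "-threads", "0", "/data/output.mp4"]
      restartPolicy: Never      # do not restart the pod once the job is done
```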
To add concurrency, I first experimented with Job parallelism. It proved a bad approach, making things more complicated than necessary when analyzing the output logs. So I built a chart that creates many (numbered) jobs, each running a single pod, so I can easily track them and their logs.
{{- $type := .Values.type -}}
{{- $parallelism := .Values.parallelism -}}
{{- $cpu := .Values.resources.requests.cpu -}}
{{- $memory := .Values.resources.requests.memory -}}
{{- $requests := .Values.resources.requests -}}
{{- $multiSrc := .Values.multiSource -}}
{{- $src := .Values.defaultSource -}}
{{- $burst := .Values.burst -}}
---
{{- range $job, $nb := until (int .Values.parallelism) }}
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ $type | lower }}-{{ $parallelism }}-{{ $cpu | lower }}-{{ $memory | lower }}-{{ $job }}
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        role: transcoder
    spec:
      containers:
      - name: transcoder-{{ $job }}
        image: jrottenberg/ffmpeg:ubuntu
        args: ["-y", "-i", "/data/{{ if $multiSrc }}source{{ add 1 (mod 23 (add 1 (mod $parallelism (add $job 1)))) }}.mp4{{ else }}{{ $src }}{{ end }}", "-stats", "-c:v", "libx264", "-s", "1920x1080", "-crf", "22", "-profile:v", "main", "-pix_fmt", "yuv420p", "-threads", "0", "-f", "mp4", "-ac", "2", "-c:a", "aac", "-b:a", "128k", "-strict", "-2", "/data/output-{{ $job }}.mp4"]
        volumeMounts:
        - mountPath: /data
          name: hostpath
        resources:
          requests:
{{ toYaml $requests | indent 12 }}
          limits:
            cpu: {{ if $burst }}{{ max (mul 2 (atoi $cpu)) 8 | quote }}{{ else }}{{ $cpu }}{{ end }}
            memory: {{ $memory }}
      restartPolicy: Never
      volumes:
      - name: hostpath
        hostPath:
          path: /mnt
---
{{- end }}
The values.yaml file that goes with this is very very simple:
# Number of // tasks
parallelism: 8
# Separator name
type: bm
# Do we want several input files
# if yes, the chart will use source${i}.mp4 with up to 24 sources
multiSource: false
# If not multi source, name of the default file
defaultSource: sintel_trailer-1080p.mp4
# Do we want to burst. If yes, resource limit will double request.
burst: false
resources:
  requests:
    cpu: "4"
    memory: 8Gi
That’s all you need.
Of course, all sources are in the repo for your usage, you don’t have to copy paste this.
Now we need to generate a LOT of values.yaml files to cover many use cases. The reachable values will vary depending on your context. My home cluster has 6 workers with 4 cores and 32GB RAM each, so I used:
In the cloud, I had 16 core workers with 60GB RAM, so I did the tests only on 1 to 7 CPU cores per task.
I didn’t do anything clever here, just a few bash loops to generate all my tasks. They are in the repo if needed.
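For illustration, such a loop might look like the following sketch (the value lists are per-environment; the file names follow the values-<parallelism>-<type>-<cpu>-<memory>.yaml pattern the run script expects, and the output directory here is a temporary stand-in for the repo's values/ folder):

```shell
# Hypothetical generator: one values file per (cpu, memory, parallelism) combination
OUT=$(mktemp -d)    # stand-in for values/ in the repo
TYPE=bm
for cpu in 1 2 4; do
  for mem in 1 2 4; do
    for para in 1 4 8; do
      # Mirror the structure of values.yaml shown above
      cat > "${OUT}/values-${para}-${TYPE}-${cpu}-${mem}.yaml" <<EOF
parallelism: ${para}
type: ${TYPE}
multiSource: false
defaultSource: sintel_trailer-1080p.mp4
burst: false
resources:
  requests:
    cpu: "${cpu}"
    memory: ${mem}Gi
EOF
    done
  done
done
ls "${OUT}" | wc -l    # 3 x 3 x 3 = 27 files
```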
The method to deploy on MAAS is the same as the one I described in my previous blog post about the DIY GPU cluster.
Once you have MAAS installed and Juju configured to talk to it, you can adapt and use the bundle file in src/juju/
juju deploy src/juju/k8s-maas.yaml
For AWS, use the k8s-aws.yaml bundle, which specifies c4.4xlarge as the default instances.
When it's done, download the configuration for kubectl, then initialize Helm with
juju scp kubernetes-master/0:config ~/.kube/config
helm init
Finally, download the source video onto each worker:
juju show-status kubernetes-worker-cpu --format json | \
  jq --raw-output '.applications."kubernetes-worker-cpu".units | keys[]' | \
  xargs -I UNIT juju ssh UNIT "sudo wget https://download.blender.org/durian/trailer/sintel_trailer-1080p.mp4 -O /mnt/sintel_trailer-1080p.mp4"
LXD on AWS is a bit special, because of the network. It breaks some of the primitives that are frequently used with Kubernetes such as the proxying of pods, which have to go through 2 layers of networking instead of 1.
As a result,
However, transcoding doesn’t require network access but merely a pod doing some work on the file system, so that is not a problem.
The least expensive path I found to resolve the issue is to deploy a specific node that is NOT in LXD but a "normal" VM or node. This node is labeled as a control plane node, and we modify the deployments for tiller-deploy and kubernetes-dashboard to force them onto that node. Making this node small enough ensures no transcoding ever gets scheduled on it.
I could not find a way to fully automate this, so here is a sequence of actions to run:
juju deploy src/juju/k8s-lxd-<nb cores per lxd>c-<max concurrency>.yaml
This deploys the whole thing and you need to wait until it’s done for the next step. Closely monitor juju status until you see that the deployment is OK, but flannel doesn’t start.
Then the LXD profile of each LXD node must be adjusted to allow nested containers. In the near future (roadmapped for Juju 2.3), Juju will gain the ability to declare the profiles it wants to use for LXD hosts. But for now, we need to do it manually:
NB_CORES_PER_LXD=4 # This is the same number used above to deploy
for MACHINE in 1 2
do
  ./src/bin/setup-worker.sh ${MACHINE} ${NB_CORES_PER_LXD}
done
If you're watching juju status, you will see that flannel suddenly starts working. All good! Now download the configuration for kubectl, then initialize Helm with
juju scp kubernetes-master/0:config ~/.kube/config
helm init
We need to identify the worker that is not an LXD container, then label it as our control plane node:
kubectl label $(kubectl get nodes -o name | grep -v lxd) controlPlane=true
kubectl label $(kubectl get nodes -o name | grep lxd) computePlane=true
Now, this is where it becomes manual: we need to successively edit rc/monitoring-influxdb-grafana-v4, deploy/heapster-v1.2.0.1, deploy/tiller-deploy and deploy/kubernetes-dashboard, to add
nodeSelector:
  controlPlane: "true"
in the definition of the manifest. Use
kubectl edit -n kube-system rc/monitoring-influxdb-grafana-v4
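In each of those manifests, the selector belongs under the pod template's spec; as a sketch, with the surrounding fields elided:

```yaml
spec:
  template:
    spec:
      nodeSelector:
        controlPlane: "true"    # matches the label applied to the non-LXD worker
```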
After that, the cluster is ready to run!
We have a lot of tests to run and we do not want to spend too long managing them, so we build a simple automation around them:
cd src
TYPE=aws
CPU_LIST="1 2 3"
MEM_LIST="1 2 3"
PARA_LIST="1 4 8 12 24 48"
for cpu in ${CPU_LIST}; do
  for memory in ${MEM_LIST}; do
    for para in ${PARA_LIST}; do
      [ -f values/values-${para}-${TYPE}-${cpu}-${memory}.yaml ] && \
      {
        helm install transcoder --values values/values-${para}-${TYPE}-${cpu}-${memory}.yaml
        sleep 60
        while [ "$(kubectl get pods -l role=transcoder | wc -l)" -ne "0" ]; do
          sleep 15
        done
      }
    done
  done
done
This will run the tests about as fast as possible. Adjust the variables to fit your local environment.
Without any tuning or configuration, Kubernetes does a decent job of spreading the load over the hosts. Essentially, all jobs being equal, it spreads them round-robin across all nodes. Below is what we observe for a concurrency of 12.
NAME                   READY   STATUS    RESTARTS   AGE   IP             NODE
bm-12-1-2gi-0-9j3sh    1/1     Running   0          9m    10.1.70.162    node06
bm-12-1-2gi-1-39fh4    1/1     Running   0          9m    10.1.65.210    node07
bm-12-1-2gi-11-261f0   1/1     Running   0          9m    10.1.22.165    node01
bm-12-1-2gi-2-1gb08    1/1     Running   0          9m    10.1.40.159    node05
bm-12-1-2gi-3-ltjx6    1/1     Running   0          9m    10.1.101.147   node04
bm-12-1-2gi-5-6xcp3    1/1     Running   0          9m    10.1.22.164    node01
bm-12-1-2gi-6-3sm8f    1/1     Running   0          9m    10.1.65.211    node07
bm-12-1-2gi-7-4mpxl    1/1     Running   0          9m    10.1.40.158    node05
bm-12-1-2gi-8-29mgd    1/1     Running   0          9m    10.1.101.146   node04
bm-12-1-2gi-9-mwzhq    1/1     Running   0          9m    10.1.70.163    node06
The same spread holds for larger concurrencies: at 192, we observe 32 jobs per host in every case. Here are some screenshots of KubeUI and Grafana from my tests:
KubeUI showing 192 concurrent pods
Compute Cycles at different concurrencies
LXD pinning Kubernetes workers to CPUs
Ouch! About 100% load on the whole machine
This is where it becomes a bit tricky. We could use an ELK stack and extract the logs there, but I couldn’t find a way to make it really easy to measure our KPIs.
Looking at what Docker does in terms of logging, you need to go onto each machine and look into /var/lib/docker/containers/<uuid>/<uuid>-json.log
Here we can see that each job generates exactly 82 lines of log, but only some of them are interesting:
{"log":"ffmpeg version 3.1.2 Copyright (c) 2000-2016 the FFmpeg developers\n","stream":"stderr","time":"2017-03-17T10:24:35.927368842Z"}
{"log":"Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/data/sintel_trailer-1080p.mp4':\n","stream":"stderr","time":"2017-03-17T10:24:35.932373152Z"}
{"log":"[aac @ 0x3a99c60] Qavg: 658.896\n","stream":"stderr","time":"2017-03-17T10:39:13.956095233Z"}
For advanced performance geeks, line 64 also gives us the transcode speed per frame, which can help profile the complexity of the video. For now, we don’t really need that.
The raw log name is only a Docker uuid, which does not help us much in understanding which job it relates to. Kubernetes graciously creates links in /var/log/containers/, mapping pod names to Docker uuids:
bm-1-0.8-1gi-0-t8fs5_default_transcoder-0-a39fb10555134677defc6898addefe3e4b6b720e432b7d4de24ff8d1089aac3a.log
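Each json line carries an RFC3339 timestamp, and a job's duration is just the difference between the last and first lines' timestamps. As a minimal sketch of the timestamp-to-epoch conversion (sed-based rather than jq, assuming GNU date; the sample line is abbreviated from the logs above):

```shell
# One Docker json-file log line (log field abbreviated)
LINE='{"log":"ffmpeg version 3.1.2 ...","stream":"stderr","time":"2017-03-17T10:24:35.927368842Z"}'
# Pull out the "time" field without a JSON parser
TS=$(echo "$LINE" | sed -n 's/.*"time":"\([^"]*\)".*/\1/p')
# Convert to epoch seconds (GNU date understands RFC3339 with the Z suffix)
EPOCH=$(date --date="$TS" +%s)
echo "$EPOCH"   # -> 1489746275
```

Doing this for the first and last log lines of each pod, then subtracting, is exactly what the extraction script below automates.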
So here is what we do:
1. Download the logs and the pod-name links from each node
for i in $(seq 0 1 ${MAX_NODE_ID}); do
  [ -d stats/node0${i} ] || mkdir -p node0${i}
  juju ssh kubernetes-worker-cpu/${i} "ls /var/log/containers | grep -v POD | grep -v 'kube-system'" > stats/node0${i}/links.txt
  juju ssh kubernetes-worker-cpu/${i} "sudo tar cfz logs.tgz /var/lib/docker/containers"
  juju scp kubernetes-worker-cpu/${i}:logs.tgz stats/node0${i}/
  cd node0${i}/
  tar xfz logs.tgz --strip-components=5 -C ./
  rm -rf config.v2.json host* resolv.conf* logs.tgz var shm
  cd ..
done
2. Extract the important log lines (adapt per environment for the number of nodes…)
ENVIRONMENT=lxd
MAX_NODE_ID=1
echo "Host,Type,Concurrency,CPU,Memory,JobID,PodID,JobPodID,DockerID,TimeIn,TimeOut,Source" | tee ../db-${ENVIRONMENT}.csv
for node in $(seq 0 1 ${MAX_NODE_ID})
do
  cd node0${node}
  while read line; do
    echo "processing ${line}"
    NODE="node0${node}"
    CSV_LINE="$(echo ${line} | head -c-5 | tr '-' ',')" # note: it's -c-6 for logs from bare metal or aws, -c-5 for lxd
    UUID="$(echo ${CSV_LINE} | cut -f8 -d',')"
    JSON="$(sed -ne '1p' -ne '13p' -ne '82p' ${UUID}-json.log)"
    TIME_IN="$(echo $JSON | jq --raw-output '.time' | head -n1 | xargs -I {} date --date='{}' +%s)"
    TIME_OUT="$(echo $JSON | jq --raw-output '.time' | tail -n1 | xargs -I {} date --date='{}' +%s)"
    SOURCE=$(echo $JSON | grep from | cut -f2 -d"'")
    echo "${NODE},${CSV_LINE},${TIME_IN},${TIME_OUT},${SOURCE}" | tee -a ../../db-${ENVIRONMENT}.csv
  done < links.txt
  cd ..
done
Once we have all the results, we load them into a Google Spreadsheet and look into the results…
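If you'd rather stay in the shell, the per-job duration is just TimeOut minus TimeIn. As a sketch over a hypothetical two-row extract of the CSV built above (column layout as in the header line; file name and values are made up):

```shell
# Hypothetical sample of db-<env>.csv: header plus two jobs
cat > /tmp/db-sample.csv <<'EOF'
Host,Type,Concurrency,CPU,Memory,JobID,PodID,JobPodID,DockerID,TimeIn,TimeOut,Source
node00,bm,12,1,2gi,0,x,y,z,1489746275,1489747153,sintel_trailer-1080p.mp4
node01,bm,12,1,2gi,1,x,y,z,1489746275,1489747253,sintel_trailer-1080p.mp4
EOF
# Duration = TimeOut ($11) - TimeIn ($10); print the average over all jobs
awk -F',' 'NR>1 { d = $11 - $10; sum += d; n++ } END { printf "avg=%ds\n", sum/n }' /tmp/db-sample.csv
# -> avg=928s
```

Grouping the same computation by the Concurrency and CPU columns gives the heatmaps discussed below.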
Once the allocation is above what ffmpeg needs to transcode a video, memory is, to a first approximation, a non-impacting variable. At a finer grain, we can see a slight increase in performance, in the range of 0.5 to 1%, between 1 and 4GB allocated.
Nevertheless, this factor was not taken into account.
RAM does not impact performance (or only marginally)
Regardless of the deployment method (AWS or Bare Metal), there is a change in behavior when allocating less or more than 1 CPU “equivalent”.
Running CPU allocation under 1 gives the best consistency across the board. The graph shows that the variations are contained, and what we see is an average variation of less than 4% in performance.
Running jobs with CPU request <1 is optimal for concurrency
Interestingly, the heatmap shows that the worst performance is reached when (Concurrency * CPU count) ~ 1. I don't know how to explain that behavior. Ideas?
2. Being above the line
As soon as you allocate more than one CPU, concurrency directly impacts performance. Regardless of the allocation, there is an impact, with a concurrency of 3.5 leading to roughly a 10 to 15% penalty. Using more workers with fewer cores increases the impact, up to 40-50% at high concurrency.
As the graphs show, not all concurrencies are made equal. The below graphs show duration as a function of concurrency for various setups.
AWS with or without LXD, 2 cores / job
and 5 cores / job
When concurrency is low and the workload is well profiled, slicing hosts via LXD CPU pinning is always a valid strategy.
By default, LXD CPU-pinning in this context will systematically outperform the native scheduling of Docker and Kubernetes.
It seems a concurrency of 2.5 per host is the point where Kubernetes allocation becomes more efficient than forcing the spread via LXD.
However, unbounding CPU limits for the jobs will let Kubernetes use everything it can at any point in time, and result in an overall better performance.
When using this last strategy, the performance is the same regardless of the number of cores requested for the jobs. The below graph summarizes all results:
All results: unbounding CPU cores homogenizes performance
Concurrency impacts performance. The below table shows the % of performance lost because of concurrency, for various setups.
performance is impacted from 10 to 20% when concurrency is 3 or more
In the context of transcoding or another CPU intensive task,
Finally, and to open up a discussion, a next step could also be to use GPUs to perform this same task. The limitation will be the number of GPUs available in the cluster. I'm waiting for some new nVidia GPUs and Dell hardware; hopefully I'll be able to put this to the test.
There are some unknowns that I wasn’t able to sort out. I made the result dataset of ~1900 jobs open here, so you can run your own analysis! Let me know if you find anything interesting!