I am experiencing a lot of CPU throttling (see nginx graph below, other pods often 25% to 50%) in my Kubernetes cluster (k8s v1.18.12, running 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux).
Due to backports, I do not know whether my cluster contains the Linux kernel bug described in https://lkml.org/lkml/2019/5/17/581. How can I find out? Is there a simple way to check or measure?
If I have the bug, what is the best approach to get the fix? Or should I mitigate otherwise, e.g. not use CFS quota (`--cpu-cfs-quota=false` or no CPU limits) or reduce `cfs_period_us` and `cfs_quota_us`?
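To make the last option concrete: a CPU limit is enforced as a quota of CPU time per CFS period. The minimal sketch below assumes the usual relation quota = limit × period (limit in millicores, divided by 1000) and a default period of 100 ms; the function name `cfs_quota_us` is purely illustrative. It suggests that shrinking the period shrinks the quota proportionally, so the CPU share stays the same while a throttled task waits at most one (shorter) period.

```python
def cfs_quota_us(limit_millicores: int, period_us: int = 100_000) -> int:
    """CPU time (in microseconds) a container may use per CFS period.

    Assumption: quota = limit * period, e.g. a 200m limit with a
    100 ms period gives 20 ms of CPU time per period.
    """
    return limit_millicores * period_us // 1000


if __name__ == "__main__":
    # Compare the assumed default period with a period ten times shorter.
    for period_us in (100_000, 10_000):
        quota_us = cfs_quota_us(200, period_us)
        print(f"period={period_us}us quota={quota_us}us "
              f"share={quota_us / period_us:.0%}")
```

With a 200m limit, both periods give the same 20% share; only the granularity of the budget changes.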
CPU Throttling Percentage for nginx (scaling horizontally around 15:00 and removing CPU limits around 19:30):
3 Answers
Since the fix was backported to many older kernel versions, I do not know how to easily look up whether e.g. `4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux` has the fix.

But you can measure whether your CFS is working smoothly or is throttling too much, as described in https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1:
run the given `cfs.go` with suitable settings for its sleeps and iterations as well as for CFS, e.g. `docker run --rm -it --cpu-quota 20000 --cpu-period 100000 -v $(pwd):$(pwd) -w $(pwd) golang:1.9.2 go run cfs.go -iterations 100 -sleep 1000ms`, and check whether every `burn` took 5ms. If not, your CFS is throttling too much. This could be due, for example, to the original bug 198197 (see https://bugzilla.kernel.org/show_bug.cgi?id=198197) or to the regression introduced by the fix for bug 198197 (for details see https://lkml.org/lkml/2019/5/17/581).

This measurement approach is also taken in https://github.com/kubernetes/kops/issues/8954, showing that Linux kernel `4.9.0-11-amd64` is throttling too much (however, with an earlier `Debian 4.9.189-3+deb9u1 (2019-09-20)` than your `Debian 4.9.189-3+deb9u2 (2019-11-11)`).

The CFS bug was fixed in Linux 5.4. Execute `kubectl describe nodes | grep Kernel`, or go to any of your Kubernetes nodes and execute `uname -sr`; that will tell you the kernel release you are running.

Recently I have been debugging the CPU throttling issue. With the following 5 tests, I reproduced the bug in the kernel (Linux version 4.18.0-041800rc4-generic).
This test case is intended to hit 100% throttling: the test runs for 5000 ms, which at 100 ms per period is 50 periods. A kernel without this bug should report a CPU usage of about 500 ms.
Maybe you can try these tests to check whether your kernel is throttled in the same way.
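The arithmetic behind the expected figure, and one way to compare it with what the kernel reports, can be sketched as follows. This is only a sketch under assumptions: the quota of 10 ms per 100 ms period is inferred from the 500 ms figure rather than stated, the `cpu.stat` format is the cgroup v1 one with `nr_periods` and `nr_throttled` fields, and the function names and sample text are hypothetical, not taken from the tests.

```python
def expected_cpu_ms(duration_ms: float, period_ms: float, quota_ms: float) -> float:
    """CPU time a constantly throttled workload should still get: one quota per period."""
    periods = duration_ms / period_ms
    return periods * quota_ms


def parse_cpu_stat(text: str) -> dict:
    """Parse 'key value' lines as found in a cgroup cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_percent(stats: dict) -> float:
    """Share of CFS periods in which the cgroup was throttled."""
    if stats.get("nr_periods", 0) == 0:
        return 0.0
    return 100.0 * stats["nr_throttled"] / stats["nr_periods"]


if __name__ == "__main__":
    # 5000 ms / 100 ms = 50 periods, each with an assumed 10 ms quota.
    print(expected_cpu_ms(5000, 100, 10))

    # Hypothetical sample text; on a real node the file is typically found
    # under the cgroup mount, e.g. /sys/fs/cgroup/cpu/cpu.stat (path may differ).
    sample = "nr_periods 50\nnr_throttled 50\nthrottled_time 4500000000\n"
    print(throttled_percent(parse_cpu_stat(sample)))
```

A measured CPU usage well below the expected value, while nearly every period is throttled, would be consistent with the behaviour these tests are looking for.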
[Multi Thread Test 1]
[Result]
[Multi Thread Test 2]
[Result]
[Multi Thread Test 3]
[Result]
For the following Kubernetes tests, we can use `kubectl logs pod-name` to get the result once the job is done.
[Multi Thread Test 4]
[Result]
[Multi Thread Test 5]
[Result]
Feel free to leave a comment; I'll reply as soon as possible.