Does a context switch factor of 10+ guest vs. host indicate some problem already?
Posted: 28. Dec 2017, 17:48
I have a problem with some workloads in one of my VMs, which result in very high CPU load and leave the VM almost unusable: SSH connections are established very slowly, character input in the terminal lags, and using the cursor keys in apps like Midnight Commander is slow. When the problem occurs, the context switches per second on the physical host get pretty high, easily 125'000 to 175'000, with spikes up to 250'000. Before switching to KVM and reconfiguring the VM to use default settings, things were even worse, with 500'000 and more context switches per second. One thing I noticed is that the high number of context switches on the host is not directly reflected in the guest:
That physical host runs 3 VMs, 2 of which are more or less idle all the time and only host a special OpenVPN server; both show ~25 context switches per second. The 3rd VM hosts a lot of services like web servers, databases and other things, so it is constantly doing at least a little work, resulting in ~500 context switches per second. Most of the time its assigned CPUs are idle, though. The load on the physical host is pretty low as well, but its context switches are ~5'000 per second, which seems high to me. I would have expected the host rate to be roughly the sum of the VMs' rates (about 550 per second), maybe with some additional overhead, but not a factor of almost 10.
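For reference, the host-wide numbers above could be sampled with something like the following minimal sketch (an assumption on my side, not the only way to do it: it reads the cumulative ctxt counter from /proc/stat on Linux and prints the per-second delta, which should roughly match the "cs" column of vmstat):

[code]
#!/usr/bin/env python3
# Minimal sketch: print host-wide context switches per second.
# Reads the cumulative "ctxt" counter from /proc/stat (Linux) once per second
# and prints the difference between samples.
import time

def read_ctxt():
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                return int(line.split()[1])
    raise RuntimeError("no ctxt line in /proc/stat")

prev = read_ctxt()
while True:
    time.sleep(1)
    cur = read_ctxt()
    print(f"{cur - prev} context switches/s")
    prev = cur
[/code]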
If I put some problematic workload on the 3rd VM, e.g. restarting the ClamAV daemon, I can see that the context switches on the host increase while staying almost the same in the VM: ~500-600 in the VM vs. ~10'000-20'000, sometimes 40'000-50'000, on the host. All the while ClamAV is eating 100% CPU and takes minutes instead of the expected seconds to restart. If additional workload is applied on top, e.g. by using the web services, the context switches rise even further and the VM starts to become unusable.
The host has 12 CPU cores with Hyper-Threading enabled, so 24 logical cores. 12 cores are assigned to the VMs as 1/1/10. The problematic VM had only 6 cores assigned before and the problem was exactly the same, so I'm not convinced yet that overcommitting CPUs is the problem here. Especially because another physical host with the same hardware, and more VMs consuming 12 cores in theory as well, doesn't show the same problem when restarting ClamAV, even though it already sits at ~7'500 context switches per second by default. Restarting ClamAV in a VM on that host doesn't increase the context switches that dramatically over a long period of time, but only to ~10'000 for a few seconds, and that's it.
So, what is the expected context switch rate of all VMs vs. the physical host? Is 5'000 or 7'500 per second on a system showing very little load in htop etc. already a sign of a problem? Can I somehow attribute the context switches on the host to individual VMs, or to something completely unrelated?
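Regarding the last question: I assume the per-thread counters under /proc/<pid>/task/*/status could at least attribute switches to the QEMU/KVM processes, but I'm not sure they capture everything KVM does on the host. A rough sketch of what I have in mind (the "qemu" name filter is just an assumption about how the VM processes show up):

[code]
#!/usr/bin/env python3
# Rough sketch: sum voluntary/nonvoluntary context switch counters over all
# threads of processes whose name contains "qemu" (assumption about naming).
# The counters are cumulative, so sampling twice and diffing gives a rate.
import glob, re

def ctxt_switches(pid):
    total = 0
    for status in glob.glob(f"/proc/{pid}/task/*/status"):
        try:
            with open(status) as f:
                text = f.read()
        except OSError:
            continue  # thread may have exited in the meantime
        for m in re.finditer(r"(?:non)?voluntary_ctxt_switches:\s+(\d+)", text):
            total += int(m.group(1))
    return total

for comm_path in glob.glob("/proc/[0-9]*/comm"):
    pid = comm_path.split("/")[2]
    try:
        with open(comm_path) as f:
            name = f.read().strip()
    except OSError:
        continue
    if "qemu" in name:
        print(pid, name, ctxt_switches(pid))
[/code]

Is that the right approach, or are the remaining host context switches coming from somewhere else entirely?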
Thanks!