Unusual hangs during sleep() calls

Discussions about using Linux guests in VirtualBox.
xorbe
Posts: 39
Joined: 4. Apr 2013, 02:50
Primary OS: MS Windows other
VBox Version: OSE other
Guest OSses: openSUSE Tumbleweed

Unusual hangs during sleep() calls

Post by xorbe »

Basically just as the thread title states. A simple loop such as:

Code: Select all

while (1) {
  t0 = gettime();
  sleep(1);
  t1 = gettime();
  print t1-t0;
}
Will sleep upwards of 10 seconds occasionally (but more typically less than 2 seconds). In this case, it's Win10 host, openSUSE TW guest, on a Ryzen 4/8 cpu, with 2 cpus selected for the vm. Both host and guest are effectively idle, except for this tiny script that's timing each sleep(1) call. This has been a problem for quite a while actually. vbox 6.1.12, kernel 5.7.11, up-to-date Win10. Samsung 1TB nvme Pro storage.

Yes, sleep() is an imprecise beast, but it should not be sleeping 2-3 seconds on an otherwise idle system. What you'd normally expect is +/- 10ms perhaps at worst.

Known problem? Any known solutions?

(I'm kind of wondering how many other things this affects in a booted vbox Linux environment in general, it can't be limited to just my test script!)

An ideal solution would be to recompile the kernel due to a known offending kernel configuration wrt vbox.

Here's some actual code that measures sleep(1ms) into buckets:

Code: Select all

#!/usr/bin/perl

use Time::HiRes qw( nanosleep time );
$| = 1;
@c = (0,0,0, 0,0,0, 0,0,0, 0);
$max = 0;
$total = 0;

while (1) {
  for ($x = 0; $x<1000; ++$x) {
    $delta = time();
    nanosleep(1000000);
    $delta = int(1000*(time() - $delta));
    if    ($delta <   2.0) { ++$c[0]; }
    elsif ($delta <   4.0) { ++$c[1]; }
    elsif ($delta <   8.0) { ++$c[2]; }
    elsif ($delta <  16.0) { ++$c[3]; }
    elsif ($delta <  32.0) { ++$c[4]; }
    elsif ($delta <  64.0) { ++$c[5]; }
    elsif ($delta < 128.0) { ++$c[6]; }
    elsif ($delta < 256.0) { ++$c[7]; }
    elsif ($delta < 512.0) { ++$c[8]; }
    else                   { ++$c[9]; }
    if ($delta > $max) { $max = $delta; }
  }
  $total += $x;
  print "$c[0] $c[1] $c[2] $c[3] $c[4] $c[5] $c[6] $c[7] $c[8] $c[9] max:$max total:$total\n";
}
Output from VBox session:

Code: Select all

 <2ms  <4   <8  <16  <32  <64 <128 <256 <512 >=512ms
92845 4856 783  576  824   42  28   28  18    0      max:368 total:100000
Output from metal:

Code: Select all

 <2ms  <4 <8 <16 <32 <64 <128 <256 <512 >=512ms
 100000 0  0   0   0   0    0    0    0     0   max:1 total:100000
Yes, a VM will never be as fast as metal. But a 1ms sleep should never take 368ms on an idle machine even under virtualization.
fth0
Volunteer
Posts: 5690
Joined: 14. Feb 2019, 03:06
Primary OS: Mac OS X other
VBox Version: VirtualBox+Oracle ExtPack
Guest OSses: Linux, Windows 10, ...
Location: Germany

Re: Unusual hangs during sleep() calls

Post by fth0 »

First of all, there is no idle system. If you look at the Windows Task Manager and Linux top/htop, you'll probably see a lot of (small) activities at any time. When VirtualBox (a type 2 hypervisor) is running a VM, it creates several OS threads (e.g. 42 ;)), 1 for each vCPU, and a lot more for everything else. The host OS then schedules these OS threads as it sees fit. Additionally, due to the nature of virtualization (keyword VM exit), the vCPU OS threads will typically only be running continuously for significantly less than 1 ms. Why am I telling you all this? To give you some basic background information and to show that on the low level, even an idle system is quite complex.

Regarding measurements using your perl script, I got similar results at first. When providing the HPET (High Precision Event Timer) to the VM, the results got much better and more uniform, so this is something worth a try:

Code: Select all

VBoxManage modifyvm "VM name" --hpet on
xorbe wrote:Yes, a VM will never be as fast as metal. But a 1ms sleep should never take 368ms on an idle machine even under virtualization.
While this expectation seems reasonable at first sight, a type 2 hypervisor cannot guarantee anything, because it is only running as an application on the host OS.
xorbe wrote:Known problem?
Is it a problem? ;)
xorbe
Posts: 39
Joined: 4. Apr 2013, 02:50
Primary OS: MS Windows other
VBox Version: OSE other
Guest OSses: openSUSE Tumbleweed

Re: Unusual hangs during sleep() calls

Post by xorbe »

(Thanks for the explanation, but you are accidentally preaching to the choir ...) Before adjusting the hpet setting, is there a way to query what the value of the current setting is? Ah, got it:
"VBoxManage list vms --long"
Looks like your suggestion is spot on, thanks! Man, it's working great. (Why isn't hpet setting in the gui, lol.) Is there a downside to enabling hpet for a vm? There's not very many useful search results for "virtualbox hpet".

Code: Select all

76933 19897 3170 0 0 0 0 0 0 0 max:7 total:100000
Hmm, it looks like hpet is slower on average (16% more 1ms sleeps fall into the 2-4ms bucket) BUT it's bounded to less than 8ms overall in this sample set. An immediate wild guess is that hpet possibly trades throughput for latency, but I'll have to further check on that. Apparently hpet generates a lot more interrupts.

To me, this seems like a big hammer for a small problem. sleep(1) shouldn't ever delay for 10 seconds even with the other less frequent timers, on an idle host/guest. I am still going to look around at kernel config options, but I'm not hopeful about that. Another test I can run is the same VM but under openSUSE TW host (vs Win10), that might be really interesting. At a minimum, --hpet on shows that the problem is not the host Win10 OS directly, but still could be related to how vbox is implemented on Win10 without hpet enable.
arQon
Posts: 235
Joined: 1. Jan 2017, 09:16
Primary OS: MS Windows 7
VBox Version: VirtualBox+Oracle ExtPack
Guest OSses: Ubuntu 16.04 x64, W7

Re: Unusual hangs during sleep() calls

Post by arQon »

xorbe wrote:Is there a downside to enabling hpet for a vm?
The same as there is to enabling it on bare metal. If the OS handler for it is crap - and some (like Windows) certainly are - you're hurting performance with a thousand papercuts. Adding that overhead to a VM when almost no "typical" software benefits from HPET anyway, with all the additional ring transitions, makes things much worse.

I'm surprised your sleeps were anything like as bad as you say though. I haven't used sleep in serious code in decades (it's hopelessly imprecise for almost anything written in this century) but yeah, that's crazy. I'd suspect Something Strange is going on, but I have no idea what, so as long as your workaround is working well enough for you I wouldn't bother chasing it unless you're very bored at some point. :)
xorbe
Posts: 39
Joined: 4. Apr 2013, 02:50
Primary OS: MS Windows other
VBox Version: OSE other
Guest OSses: openSUSE Tumbleweed

Re: Unusual hangs during sleep() calls

Post by xorbe »

arQon wrote:I'm surprised your sleeps were anything like as bad as you say though. I haven't used sleep in serious code in decades (it's hopelessly imprecise for almost anything written in this century)
And your suggestion for an alternative to sleep(...) for less 1s or less is ... (and by sleep, I also mean usleep and nanosleep, they all have the same problem.) I'm thinking this might actually be bad linux kernel behavior while running within vbox.
arQon
Posts: 235
Joined: 1. Jan 2017, 09:16
Primary OS: MS Windows 7
VBox Version: VirtualBox+Oracle ExtPack
Guest OSses: Ubuntu 16.04 x64, W7

Re: Unusual hangs during sleep() calls

Post by arQon »

xorbe wrote:And your suggestion for an alternative to sleep(...) for less 1s or less is ... (and by sleep, I also mean usleep and nanosleep, they all have the same problem.)
Sorry: I just meant that sleep(), the call, is inherently unusable for most real code, and usleep/nanosleep is where I'd be. Because I was half-asleep myself (no pun intended) and didn't notice you were really talking about nanosleep anyway. :)
However, my point remains: do not use sleep(). Because of its terrible granularity, unless the underlying implementation is using a better call anyway you SHOULD expect it to suck, even on bare metal.

Since you were smart enough to post test code, and have made the effort to dig into this on your own, I was encouraged to give you a second set of data to work with, from a VM that does NOT have HPET enabled:

Code: Select all

10214 5913 3867 6 0 0 0 0 0 0 max:8 total:20000
Not brilliant (the machine is absolutely ancient at this point), but certainly a lot better than your 300ms+ peaks, and within what I would consider acceptable loss for a VM.
So, HPET is not (really) the problem. Something is broken with either the host or the VM. I can't help you find out what, but if you continue to investigate and make it easy for others to test/validate what you find, I'll do what I can.
xorbe
Posts: 39
Joined: 4. Apr 2013, 02:50
Primary OS: MS Windows other
VBox Version: OSE other
Guest OSses: openSUSE Tumbleweed

Re: Unusual hangs during sleep() calls

Post by xorbe »

I'll try with Ubuntu guest like you did (and my same Win10 host). Your results are great for a free VM on old hardware in my opinion, 8ms or better accuracy. I'd be happy with 1s +/- 0.008s, heh.

Here's a fresh openSUSE TW install running within TW, hpet off:

Code: Select all

91500 3721 3055 1111 235 176 123 60 19 0 max:457 total:100000
And a fresh openSUSE TW install running within TW, hpet on:

Code: Select all

64878 34419 703 0 0 0 0 0 0 0 max:6 total:100000
So, on this PC at least, the host OS doesn't seem to make a difference for guest TW, same basic results Win10 or TW host.

openSUSE TW install running within TW, hpet off, vanilla 5.7.15 kernel: (yikes!)

Code: Select all

95959 2018 409 459 1100 20 12 11 5 7 max:851 total:100000
Ubuntu 16.04 / kernel 4.15.0-112 (hpet off):

Code: Select all

99117 684 197 2 0 0 0 0 0 0 max:9 total:100000
Working on Ubuntu 18 / 20 results. I fear it's actually a kernel issue, and that recent kernels aren't scheduling delays correctly under virtualbox.
arQon
Posts: 235
Joined: 1. Jan 2017, 09:16
Primary OS: MS Windows 7
VBox Version: VirtualBox+Oracle ExtPack
Guest OSses: Ubuntu 16.04 x64, W7

Re: Unusual hangs during sleep() calls

Post by arQon »

"God helps those who help themselves". Occurred to me I have a 20.04 VM kicking around - kernel 5.4.0, much closer to what you're using - and that turned out to be VERY interesting... :)

Code: Select all

994 6 0 0 0 0 0 0 0 0 max:2 total:1000
1977 23 0 0 0 0 0 0 0 0 max:2 total:2000
2960 39 1 0 0 0 0 0 0 0 max:5 total:3000
So far, so good amazing - absolutely crushing the previous run. And then it stopped. For so long I thought it had somehow crashed!
When it finally spat out the next line:

Code: Select all

3330 274 102 102 151 26 10 2 3 0 max:314 total:4000
3681 500 234 222 295 41 17 6 4 0 max:333 total:5000
4636 529 236 230 298 42 19 6 4 0 max:333 total:6000
Much less impressive. :(

The 1604 VM (kernel 4.15, for reference) is constantly running, but is usually the only one these days, so I switched back to that to see if perhaps the cause was simply running multiple concurrent VMs:

Code: Select all

15490 3519 989 2 0 0 0 0 0 0 max:9 total:20000
Clearly not.

I diffed the two XMLs too, in case something hidden was different in the 6-ish year gap (like X2APIC: you can find the details on that in here somewhere) but the only difference at all for CPU was

Code: Select all

        <PAE enabled="true"/>
        <PAE enabled="false"/>
and I doubt that's meaningful.

So, your kernel hunch certainly looks like the most probable culprit to me - and that's probably an unmanageable bisection at this point. I don't have an 1804 VM to halve the starting point for you, but I expect you'll have done that yourself not long after this post goes up anyway.

Effectively, although HPET isn't the REAL cause of the problem, from your testing it does look like a functional workaround for it, at least. So that's something...

gl
arQon
Posts: 235
Joined: 1. Jan 2017, 09:16
Primary OS: MS Windows 7
VBox Version: VirtualBox+Oracle ExtPack
Guest OSses: Ubuntu 16.04 x64, W7

Re: Unusual hangs during sleep() calls

Post by arQon »

Incidentally, it feels to me (and is borne out by the numbers) that on the problem VM there are simply windows during which the process stalls. Using time rather than count to split the outputs would give a different view that might be helpful to see: my guess is you'd generally see 800-ish (or whatever) cycles in most batches, but then suddenly have that drop to e.g. 400 every Nth (3-10) batch, or something like that. Won't help with fixing it, but may make the reality easier to visualize, if you need that.
xorbe
Posts: 39
Joined: 4. Apr 2013, 02:50
Primary OS: MS Windows other
VBox Version: OSE other
Guest OSses: openSUSE Tumbleweed

Re: Unusual hangs during sleep() calls

Post by xorbe »

Hey, we almost overlooked the problem. It's also in Ubuntu 16. Configure your VM with 1 vs 2/4 vCPUs. There must be a bug-a-boo with virtuabox's mutli-cpu. Setting openSUSE TW to 1 vCPU produces good latency numbers. This particular cpu is 4C/8T for reference. So basically, somehow '--hpet on' hides the vbox multi-cpu timing bug. I think there's enough info here to file a VirtualBox bug.

Hah, I tried 8 vCPUs. Up to 15 seconds per 1000ms delay. The script crawls, lol.
fth0
Volunteer
Posts: 5690
Joined: 14. Feb 2019, 03:06
Primary OS: Mac OS X other
VBox Version: VirtualBox+Oracle ExtPack
Guest OSses: Linux, Windows 10, ...
Location: Germany

Re: Unusual hangs during sleep() calls

Post by fth0 »

Interesting findings. If you file a bug, please post the link here.

BTW, if you want to scale up even more, you can configure up to 64 vCPUs (using VBoxManage modifyvm "VM name" --cpus 64).
xorbe
Posts: 39
Joined: 4. Apr 2013, 02:50
Primary OS: MS Windows other
VBox Version: OSE other
Guest OSses: openSUSE Tumbleweed

Re: Unusual hangs during sleep() calls

Post by xorbe »

fth0 wrote:Interesting findings. If you file a bug, please post the link here.
https://www.virtualbox.org/ticket/19809
fth0 wrote:BTW, if you want to scale up even more, you can configure up to 64 vCPUs (using VBoxManage modifyvm "VM name" --cpus 64).
I tried this, but it only let me set it to 32. I did this on a much faster / beefier PC than the previous results, so I shouldn't post the results, but they are still ugly.

Just thinking outside of the box ... I should rewrite the same latency test in C/C++ to make sure there's nothing buggy about the perl Time::HiRes package. As far as I know, the library is just a wrapper around C syscalls, but one never knows what might go wrong. Edit: the perl vs C++ results are indistinguishable.

Why is 6.1.10 the latest selectable version on the bug tracker? 6.1.12 has been out a while.
arQon
Posts: 235
Joined: 1. Jan 2017, 09:16
Primary OS: MS Windows 7
VBox Version: VirtualBox+Oracle ExtPack
Guest OSses: Ubuntu 16.04 x64, W7

Re: Unusual hangs during sleep() calls

Post by arQon »

xorbe wrote:Hey, we almost overlooked the problem. It's also in Ubuntu 16. Configure your VM with 1 vs 2/4 vCPUs.
No need - unfortunately. This guest (the 16 VM) is already 4 vCPUs - and, funnily enough, also on a 4C/8T CPU. So, nice idea, but a miss. :(

But, what IS worth you trying (I'm going to bed!) is using taskset in your problem VM. I'd expect that to change things.
xorbe
Posts: 39
Joined: 4. Apr 2013, 02:50
Primary OS: MS Windows other
VBox Version: OSE other
Guest OSses: openSUSE Tumbleweed

Re: Unusual hangs during sleep() calls

Post by xorbe »

Well, I already have "--hpet on" as a work-around. It really does point to some kind of vbox <-> guest kernel timing/wake-up issue. I bet kernel sched debug output might have something interesting to say, but that's not my forte.

https://www.kernel.org/doc/html/latest/ ... no_hz.html
This page talks about tickless scheduling and jitter issues across cores. So I'm building a kernel without tickless to see how it fairs in vbox.
xorbe
Posts: 39
Joined: 4. Apr 2013, 02:50
Primary OS: MS Windows other
VBox Version: OSE other
Guest OSses: openSUSE Tumbleweed

Re: Unusual hangs during sleep() calls

Post by xorbe »

Guess what worked! Turning off the tickless/nohz Linux scheduler! And it works better without hpet.

Code: Select all

 <2ms  <4ms  <8ms <16ms <32ms <64ms  <128  <256  <512  >512
96004  3350   646     0     0     0     0     0     0     0 max:7ms iters:100000
So, it's an unwanted interaction between vbox and nohz scheduler, but I have no idea who to point the finger at.

Compiling custom kernels is a pita ... easier to just use distro kernel + hpet ... good enough ...
Post Reply