nanosleep() works until it doesn't

Posted: 4. Apr 2013, 03:15
by xorbe
I am developing an audio application, and one of my machines is a quad-core Trinity 4600M with 8GB dual-chan mem using vbox 4.2.10 (Host: Win7 SP1 x64, Guest: openSUSE 12.2 x64). It's plugged into the wall for power for max performance. All "power saving" (performance robbing) features are disabled. The host is not running any notable processes or background processes, except for vbox.

I have a small test program that reads the time, nanosleep()s for 10 ms { .tv_sec=0, .tv_nsec=10000000 }, and then checks the time again. Generally there is about 1 ms of delay beyond the requested 10 ms, and that can wander up to 4 ms occasionally (no big deal for devel work in a virtualized environment), but after about 30~120 seconds of these continuous 10 ms nanosleeps, it suddenly doesn't return for 200~3800 ms! This does NOT happen on bare metal (typically just 80-110 us extra delay beyond 10 ms, max observed 1.4 ms). I tried backing up to vbox 4.2.6 since I saw a nanosleep/SIGALRM fix for 4.2.8, but that didn't help.

VirtualBox otherwise does not freeze during this period -- it's doing fine. The cores are only 10~20% utilized when this happens. The stall doesn't last forever, so I can't get a kernel debugger on it. I tried nanosleep, clock_nanosleep with CLOCK_MONOTONIC and with/without TIMER_ABSTIME, and the portable select(0, NULL, NULL, NULL, &tv) method. All produce the same issue. If I use nanosleep(80ms) then the issue doesn't seem to happen. Also, I can omit the nanosleep and let the thread burn in a hot loop, and all is well.
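
For readers who want to reproduce the measurement, here is a minimal sketch of a test along the lines described above. It is not the original test program: the names `now_ns` and `run_lag_test` are hypothetical, and the output format is only modeled on the log lines quoted later in the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Read CLOCK_MONOTONIC as a nanosecond count (hypothetical helper). */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Sleep for sleep_ns nanoseconds `iterations` times, print the overshoot
 * ("lag") of each sleep, and return the largest lag seen, in nanoseconds.
 */
static int64_t run_lag_test(int iterations, long sleep_ns)
{
    assert(sleep_ns >= 0 && sleep_ns < 1000000000L); /* tv_nsec must stay below 1 s */

    const struct timespec req = { .tv_sec = 0, .tv_nsec = sleep_ns };
    int64_t max_lag = 0;

    for (int i = 0; i < iterations; i++) {
        int64_t before = now_ns();
        nanosleep(&req, NULL);  /* not restarted on EINTR; acceptable for a sketch */
        int64_t lag = (now_ns() - before) - sleep_ns;
        if (lag > max_lag)
            max_lag = lag;
        printf("lag: %10.6fms (max:%10.6fms)\n", lag / 1e6, max_lag / 1e6);
    }
    return max_lag;
}
```

Calling `run_lag_test(12000, 10000000L)` from `main` would cover roughly two minutes of back-to-back 10 ms sleeps, which is the window in which the stall was reported.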

Possible vbox bug around small timers?

Re: nanosleep() works until it doesn't

Posted: 4. Apr 2013, 08:55
by noteirak
There are inherent timing issues with virtualization - anything timing-related will most likely fail, just as you pointed out.

Re: nanosleep() works until it doesn't

Posted: 4. Apr 2013, 10:41
by xorbe
Well that's unfortunate. The same thing seems to work okay on my other desktop with the same versions of vbox / Windows host / Linux guest (edit: actually saw a 90ms hitch). I guess I'll stick to Linux on metal. It was convenient to code on the go with my Windows laptop. Hard to believe that nanosleep() / usleep() blocking for extra seconds is legit.

Re: nanosleep() works until it doesn't

Posted: 4. Apr 2013, 10:44
by noteirak
I can't speak to delays of whole seconds, which does seem like a lot, but the timing issue is a fact.
Maybe mpack can give more insight on this.

Re: nanosleep() works until it doesn't

Posted: 4. Apr 2013, 11:48
by mpack
The guest can't time things to greater accuracy than the host, and Windows hosts are not noted for their real-time responsiveness. I believe the standard tick rate of the system clock on NT hosts is still several milliseconds, and any wait would be rounded to some multiple of that. The inaccuracy has nothing to do with VirtualBox, it's the host that limits timer granularity, and IMHO VBox can't change the physical host tick rate without affecting the host.
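
The rounding effect described above can be illustrated with a little arithmetic. This is only a sketch under stated assumptions: it assumes the wait is rounded up to the next whole tick (so a sleep never returns early), the function name `round_up_to_tick` is hypothetical, and the tick lengths mentioned below are illustrative values, not measurements of any particular host.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round a requested wait up to the next whole multiple of the timer tick.
 * Both values are in microseconds. Assumes round-up behaviour.
 */
static int64_t round_up_to_tick(int64_t wait_us, int64_t tick_us)
{
    assert(wait_us >= 0 && tick_us > 0);
    return ((wait_us + tick_us - 1) / tick_us) * tick_us;
}
```

With a hypothetical 1 ms tick a 10 ms request stays at 10 ms, while with a hypothetical 4 ms tick the same request becomes 12 ms.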

As to the sudden change of behaviour, that does look like a bug. I'd bet that VirtualBox is implementing a queue of timer events, and sooner or later it falls behind and the queue overflows, and some time-to-expiry value goes negative. Total guess, but as a developer myself it has that feel, that's where I'd be looking. Bottom line: try to get fancy with VM timing and it will probably break. The devs may not give this a high priority unless their paying customers have similar problems.

Re: nanosleep() works until it doesn't

Posted: 4. Apr 2013, 17:41
by xorbe
Yeah I'm not surprised about the reduction in timing accuracy within VirtualBox, and it's good enough (0.5~2 ms typical lag, occasional 3-6ms spikes) even for audio development on the go. The deal breaker is when it just doesn't come back for several seconds under minimal load -- the queue thing and negative delay is what I had in mind, after much nanosleep() bug research.

edit #1: I adjusted one thread of the program to use sem_wait / sem_post (which was the plan all along) instead of polling + nanosleep, which cured the one thread. But the other thread is truly time based (wait 15 ms, send event, wait 5 ms, send event, etc).
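
A minimal sketch of the sem_wait / sem_post pattern mentioned in edit #1, for readers unfamiliar with it. The names `events_init`, `event_post` and `event_wait` are hypothetical and this is not the actual application code; it only shows a consumer blocking on a POSIX semaphore instead of polling with short sleeps.

```c
#include <errno.h>
#include <semaphore.h>

/* Counts events that have been posted but not yet consumed. */
static sem_t events;

/* Initialise the semaphore with no pending events. Returns 0 on success. */
static int events_init(void)
{
    return sem_init(&events, 0, 0);
}

/* Producer side: signal that one more event is ready. */
static void event_post(void)
{
    sem_post(&events);
}

/*
 * Consumer side: block until an event is available, with no timer involved.
 * Retries if the wait is interrupted by a signal. Returns 0 on success.
 */
static int event_wait(void)
{
    int rc;
    do {
        rc = sem_wait(&events);
    } while (rc == -1 && errno == EINTR);
    return rc;
}
```

The consumer thread calls `event_wait()` in its loop and the producer calls `event_post()` whenever work is queued, so the consumer no longer depends on short timed sleeps. On older glibc versions this may need to be linked with `-pthread`.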

Code:

lag:   0.240570ms (max:  5.758074ms)
lag:   0.993890ms (max:  5.758074ms)
lag:   0.351900ms (max:  5.758074ms)
lag: 338.950916ms (max:338.950916ms)
lag:   0.342136ms (max:338.950916ms)
lag:   0.292581ms (max:338.950916ms)
lag:   0.318937ms (max:338.950916ms)

edit #2, with work-around: it chugged for about 45 minutes with at most a 9.2ms lag spike and an average lag of 0.48ms, but then hitched for 345ms -- that's a lot better than hitching every 5~30 seconds. Perhaps the detail below will point developers in the direction of the bug:

Code:

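delay = when - now;   // Pseudocode: now is the current time, when is the target wake-up time.
if (delay > 0) {
  sched_yield();      // Yielding right before nanosleep appears to be a decent work-around.
  delay = when - now; // Recompute from a fresh reading of now, because it has probably changed.
  if (delay > 0) nanosleep(delay);
}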
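
The snippet above is pseudocode. A compilable sketch of the same yield-then-sleep idea might look like the following, assuming deadlines are expressed as nanoseconds on CLOCK_MONOTONIC. The names `now_ns` and `sleep_until_with_yield` are hypothetical, not taken from the original program.

```c
#define _POSIX_C_SOURCE 200809L
#include <sched.h>
#include <stdint.h>
#include <time.h>

/* Read CLOCK_MONOTONIC as a nanosecond count (hypothetical helper). */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Sleep until the monotonic deadline when_ns, yielding the CPU first. */
static void sleep_until_with_yield(int64_t when_ns)
{
    if (when_ns - now_ns() <= 0)
        return;                         /* deadline already passed */

    sched_yield();                      /* the work-around: yield right before sleeping */

    int64_t delay = when_ns - now_ns(); /* re-read the clock after the yield */
    if (delay > 0) {
        struct timespec ts = {
            .tv_sec  = delay / 1000000000LL,
            .tv_nsec = delay % 1000000000LL,
        };
        nanosleep(&ts, NULL);           /* not restarted on EINTR in this sketch */
    }
}
```

A time-based thread would compute its next deadline (for example, the previous deadline plus 15 ms) and call `sleep_until_with_yield(deadline)` before sending each event.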

mpack wrote: The devs may not give this a high priority unless their paying customers have similar problems.
Hmm, I'd not want to be a paying customer and then see reported performance bugs like this brushed aside ... but I digress, since I've enjoyed vbox for free so far. Paying customers might not realize they are suffering from unnatural delays, or might not know how to isolate the problem or bother reporting it. I have a private 45KB test case if a dev wants to pursue this.