Linux guest clock slow, HDD controller gets lost

Posted: 10. Mar 2013, 22:16
by jimklimov
We have a number of old demo server installations, based on RHEL3-RHEL5, that were moved into VirtualBox VMs running on OpenSolaris (also old: SXCE snv_117). Finally, the VirtualBox software on the host is old as well - 3.0.12. All in all, this had been a well-performing setup for showing off our old projects, unchanged for several years.

Several months ago the VM guests' clocks began to slow down, to the extent that nothing moves them along at the proper pace: not an every-minute crontabbed "rdate -s clockhost" inside a VM, not NTP in the VM, not the VBox Guest Additions with attempts to set up time sync with the host (maybe misconfigured attempts, though) - none of these "fixes", alone or combined, helps. Sometimes tens of real minutes pass while one virtual minute goes by.
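
For reference, the in-guest cron sync is just a one-line entry along these lines (the time server name "clockhost" and the rdate path are as on our RHEL guests; adjust to taste):

Code: Select all

# /etc/crontab: force-set the guest clock from the LAN time server every minute
* * * * *  root  /usr/bin/rdate -s clockhost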

Sometimes a VM clock stalls altogether, cycling over the same 2-3 second range over and over; it can stay stuck like that for days.

The only thing that (temporarily) fixes the problem is to power the VM off and boot it again. A virtual "reset" without killing the VM process does not help - the clock remains buggy.

The VM logs have hundreds of lines like these:

Code: Select all

99:19:42.563 TM: Giving up catch-up attempt at a 60 005 350 962 ns lag; new total: 60 005 350 962 ns
99:21:00.851 TM: Giving up catch-up attempt at a 60 007 957 425 ns lag; new total: 120 013 308 387 ns
...
190:08:56.941 TM: Giving up catch-up attempt at a 60 008 089 399 ns lag; new total: 231 315 151 564 550 ns

These are logged roughly every 75-80 seconds of real time, each apparently adding a 60-second lag to the total. The "new total" is the accumulated lag: by hour 190 of VM uptime it had reached about 231,315 seconds, i.e. roughly 2.7 days.

Usually this kicks in after several hours of VM uptime, though problems can occur as early as startup, or a week can pass without any.

We've tried to "renice" the VBoxHeadless processes to have a higher-than-usual priority on the host, though now they are depressed into the lowest (19) - because a VM with clock problems consumes a whole CPU core. The problem might be related to the host's ZFS storage becoming pool and slower, at least it seems to occur more often (though not exclusively) when the host is scrubbing its pools regularly on weekends.

Something new began occurring recently: virtual HDDs began to time out, maybe related to the VM clock and/or host I/O lags. Ultimately the Linux guest drops its HDDs and the virtual OS becomes unusable; here too a virtual reboot/reset does not help, and only a poweroff+poweron lets the guest find its virtual disks and their controllers again.

Any ideas for a fix or understanding of the problem are welcome.

PS: We've recently tried a very different setup - VirtualBox 4.2.6 on a Windows 2008R2 host with VT-x CPU acceleration, running Solaris 10 VMs - and these also exhibit regular clock problems requiring regular VM reboots :(

Re: Linux guest clock slow, HDD controller gets lost

Posted: 10. Mar 2013, 22:33
by jimklimov
Example HDD errors (from the virtual serial console of the Linux guest):

Code: Select all

mptscsih: ioc0: attempting task abort! (sc=d7549d00)
sd 0:0:0:0:
mptscsih: ioc0: task abort: FAILED (sc=d7549d00)
mptscsih: ioc0: attempting target reset! (sc=d7549d00)
sd 0:0:0:0:
mptscsih: ioc0: target reset: SUCCESS (sc=d7549d00)
mptscsih: ioc0: attempting task abort! (sc=d7549d00)
sd 0:0:0:0:
mptscsih: ioc0: task abort: FAILED (sc=d7549d00)
mptscsih: ioc0: attempting bus reset! (sc=d7549d00)
sd 0:0:0:0:
mptscsih: ioc0: bus reset: SUCCESS (sc=d7549d00)

mptscsih: ioc0: attempting task abort! (sc=d7555d00)
sd 0:0:0:0:
mptscsih: ioc0: task abort: FAILED (sc=d7555d00)
mptscsih: ioc0: attempting target reset! (sc=d7555d00)
sd 0:0:0:0:
mptscsih: ioc0: target reset: SUCCESS (sc=d7555d00)
mptscsih: ioc0: attempting task abort! (sc=d7555d00)
sd 0:0:0:0:
mptscsih: ioc0: task abort: FAILED (sc=d7555d00)
mptscsih: ioc0: attempting target reset! (sc=d7555d00)
sd 0:0:0:0:
mptscsih: ioc0: target reset: SUCCESS (sc=d7555d00)
mptscsih: ioc0: attempting task abort! (sc=d7555d00)
sd 0:0:0:0:
mptscsih: ioc0: task abort: FAILED (sc=d7555d00)
mptscsih: ioc0: attempting bus reset! (sc=d7555d00)
sd 0:0:0:0:
mptscsih: ioc0: bus reset: SUCCESS (sc=d7555d00)
mptscsih: ioc0: attempting task abort! (sc=d7555d00)
sd 0:0:0:0:
mptscsih: ioc0: task abort: FAILED (sc=d7555d00)
mptscsih: ioc0: Attempting host reset! (sc=d7555d00)
mptbase: ioc0: ERROR - Enable Diagnostic mode FAILED! (00h)
mptbase: ioc0 NOT READY WARNING!
mptbase: WARNING - (-1) Cannot recover ioc0
end_request: I/O error, dev sda, sector 24683737
Buffer I/O error on device sda5, logical block 462848
lost page write due to I/O error on sda5
Buffer I/O error on device sda5, logical block 462849
lost page write due to I/O error on sda5
Buffer I/O error on device sda5, logical block 462850
lost page write due to I/O error on sda5
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sda6, logical block 338338
lost page write due to I/O error on sda6
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sda7, logical block 299009
lost page write due to I/O error on sda7
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sda1, logical block 1789
lost page write due to I/O error on sda1
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
end_request: I/O error, dev sda, sector 24683809
Buffer I/O error on device sda5, logical block 462857
lost page write due to I/O error on sda5
Aborting journal on device sda5.
Aborting journal on device sda6.
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sda6, logical block 338338
lost page write due to I/O error on sda6
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sda6, logical block 346390
lost page write due to I/O error on sda6
Aborting journal on device sda7.
Aborting journal on device sda1.
journal commit I/O error
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sda5, logical block 462848
lost page write due to I/O error on sda5
EXT3-fs error (device sda5) in ext3_ordered_writepage: IO failure
sd 0:0:0:0: rejecting I/O to offline device
ext3_abort called.
EXT3-fs error (device sda5): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
journal commit I/O error
ext3_abort called.
EXT3-fs error (device sda6): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
ext3_abort called.
EXT3-fs error (device sda7): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:0:0:0: rejecting I/O to offline device
Aborting journal on device sda2.
ext3_abort called.
EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
EXT3-fs error (device sda6): ext3_find_entry: reading directory #458757 offset 0
sd 0:0:0:0: rejecting I/O to offline device
printk: 58 messages suppressed.
Buffer I/O error on device sda7, logical block 295816
lost page write due to I/O error on sda7
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sda7, logical block 557058
lost page write due to I/O error on sda7
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sda7, logical block 820137
lost page write due to I/O error on sda7
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sda1, logical block 32836
lost page write due to I/O error on sda1
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sda1, logical block 32839
lost page write due to I/O error on sda1
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
EXT3-fs error (device sda2): ext3_get_inode_loc: unable to read inode block - inode=672023, block=1343498
SysRq : Power Off
Synchronizing SCSI cache for disk sda:
Power down.
acpi_power_off called

======

My guess is that VM clock jumps during Guest Additions time sync might be responsible: an I/O request is issued at time X, then the "hardware" clock jumps forward by several seconds, and the request is deemed obsolete and expired by timeout.

This began to happen more often after we started using smaller timesync thresholds, to enforce clock steps rather than gradual adjustments; possibly this confuses the Linux kernel?.. Also, maybe the property names are set incorrectly - we saw many *different* examples on the Internet and tried most of them, but the "Giving up catch-up attempt" log lines remain in place.

Code: Select all

        <GuestProperty name="/VirtualBox/GuestAdd/VBoxService/--timesync-set-threshold" value="500" timestamp="1362172872598404000" flags=""/>
        <GuestProperty name="/VirtualBox/GuestAdd/VBoxService/timesync-set-threshold" value="15000" timestamp="1357824947002125000" flags=""/>
        <GuestProperty name="/VirtualBox/GuestAdd/VBoxService/--timesync-interval" value="1000" timestamp="1362172845982429000" flags=""/>
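
On the host these were set with VBoxManage, roughly as below (the VM name here is illustrative); if I read the scattered examples right, only the property names carrying the literal "--" prefix match what VBoxService actually reads, which is part of what I'd like confirmed:

Code: Select all

VBoxManage guestproperty set "demo-rhel5" "/VirtualBox/GuestAdd/VBoxService/--timesync-interval" 1000
VBoxManage guestproperty set "demo-rhel5" "/VirtualBox/GuestAdd/VBoxService/--timesync-set-threshold" 15000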

Re: Linux guest clock slow, HDD controller gets lost

Posted: 10. Mar 2013, 23:46
by Perryg

Re: Linux guest clock slow, HDD controller gets lost

Posted: 11. Mar 2013, 11:06
by jimklimov
Thanks :)
So it seems that I've used the correct config value names, but to no avail. I have now fixed the threshold at 15000 ms, but I've done that before.
Is there something in particular that could be done on the Linux guest side to nudge the kernel's "sense" of the clock? Perhaps a regular "hwclock -w", or some "/etc/adjtime" trickery?

For example, the "hardware clock" of a VM barely ticks even though the "soft time" is sometimes moved forward, with lags of "just" a few minutes. Here are samples from two VMs, taken at the same moment, 12:58:30 on the host's wall clock:

Code: Select all

(VM booted on March 03 00:31)# date; hwclock -r
Mon Mar 11 12:53:08 MSK 2013
Tue 05 Mar 2013 02:25:16 PM MSK  -0.839153 seconds

# cat /proc/uptime 
223081.12 163063.89

# uptime
 12:53:14 up 2 days, 13:58,  1 user,  load average: 4.30, 3.78, 2.63

=====

(VM booted on March 07 13:26)# date; hwclock -r
Mon Mar 11 12:57:34 MSK 2013
Fri 08 Mar 2013 10:19:27 AM MSK  -0.873034 seconds

# uptime
 12:57:35 up 20:53,  1 user,  load average: 6.71, 3.91, 2.08

# cat /proc/uptime 
75227.30 71414.79


Re: Linux guest clock slow, HDD controller gets lost

Posted: 11. Mar 2013, 14:52
by Perryg
From my experience this kind of issue is hard to track down, but most of the time I find it to be caused by overcommitted resources or by the guest's kernel parameters.
Setting the timer source to jiffies sometimes helps, as does setting the clock (HZ) to 100 instead of 1000. Excessive load on the host can cause this as well, but you would be the only person that could tell that.

Re: Linux guest clock slow, HDD controller gets lost

Posted: 15. Mar 2013, 06:10
by jimklimov
I have verified that "divider=10" is set on all those problematic Linux VMs, while their 2.6.18-based kernels are configured with the default HZ=1000. Now I've added "noapic nolapic nolapic_timer clocksource=pit" (as recommended in VMware forums) to each VM's boot options and restarted them. So far (half an hour after VM boot) I see the VM hardware clocks (via "hwclock -r") working correctly; however, the OS clocks ("top", "date", etc.) still drift back in time.
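
For the record, the kernel line in each guest's /boot/grub/grub.conf now looks roughly like this (the kernel version and root device are illustrative; the boot options are the actual ones):

Code: Select all

# GRUB legacy, RHEL-style; divider=10 cuts the effective tick rate from HZ=1000 to 100
kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/VolGroup00/LogVol00 divider=10 noapic nolapic nolapic_timer clocksource=pit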

Should the Guest Additions kick in to update the "software clock" sometime soon, or should I enable NTP or rdate-via-crontab to set the clock? Would such a mechanism conflict with the Guest Additions, or can (should?) they coexist?
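
If NTP turns out to be the way to go, I gather from hypervisor timekeeping guides (VMware's, mostly) that ntpd in a VM should be told not to exit on large offsets - something along these lines in /etc/ntp.conf (our "clockhost" again; untested here so far):

Code: Select all

tinker panic 0           # accept arbitrarily large time steps instead of exiting
server clockhost iburst  # our internal time server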

Thanks for the tips so far ;)
//Jim

Re: Linux guest clock slow, HDD controller gets lost

Posted: 15. Mar 2013, 14:22
by jimklimov
Alas, these tweaks did not definitively help. Our in-house watchdog has already detected stalls and rebooted the VMs within the past few hours.

As of now, one of the VMs' hardware clocks ("hwclock -r") has ticked 22 virtual minutes over almost 2 hours of real uptime (the age of the VBoxHeadless process on the host). The crontabbed "rdate -s clockhost" moves the software clock ("date") forward by almost 3 minutes at the end of each virtual minute - that is, the guest clock is running at roughly a quarter of real speed.

Another VM lost and offlined its vHDDs, so I can't really say what its HW clock shows; but a still-running "top" showed the correct soft time ticking at a pace that feels normal (1-second intervals). It had not been rebooted over those 8 hours, however (to our watchdog it seems the services respond).

A third VM has been rebooted by our watchdog, but its HW clock shows the correct time at this moment; crontab+rdate clock sync is not enabled there, so the software clock lags by some 4 minutes over a 2.5-hour uptime.

---

Are there any ways to further debug VirtualBox's timesync attempts (host side and Guest Additions side), to see whether it really tries to fix the VM's clock - how often, by how much, in what way (skew or set), etc.? Right now I don't have a reliable way to tell whether it even tries.
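
One thing I may try myself: stopping the Additions service in a guest and running it in the foreground with verbose output, e.g. as below - though I am not sure which of these options, or which init script name, exist in our 3.0-era Guest Additions:

Code: Select all

# inside the guest: stop the service, then run it interactively to watch timesync
/etc/init.d/vboxadd-service stop
VBoxService --foreground --verbose --only-timesync --timesync-interval 10000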

Re: Linux guest clock slow, HDD controller gets lost

Posted: 15. Mar 2013, 14:40
by Perryg
I really think you should find out what the issue actually is. You stated that up until a few months ago this did not happen and that it had been working well for a while.
You mention VMware, and that raises a red flag for me. Were these transplants?

Re: Linux guest clock slow, HDD controller gets lost

Posted: 15. Mar 2013, 15:02
by jimklimov
No, the VMs were initially on hardware and were later copied (tar'ed over ssh) into prepared template VirtualBoxes (same 3.0.12). They worked well for a while.

Since then their OpenSolaris-based host has picked up more tasks over time, and it does sometimes lag noticeably (possibly due to the pool filling up and more ZFS fragmentation to combat, e.g. during regular scrubs or at night when many hosts dump their backups onto it). This is not fatal for other, more native jobs - they are just slower to respond and complete. The host CPU is regularly about 10-15% busy (mostly kernel time), with about 1000 processes running and a load average ranging from 1.5 to 7 in peaks. There is a fair amount of context switching going on ("cs" is 7k/sec to 70k/sec according to "vmstat 1"), probably as it iterates the process list and/or launches many watchdog scripts to test service availability regularly.

VMware came into the picture only as another hypervisor with similar problems (and hopefully solutions), and with more archived discussions about them.

Re: Linux guest clock slow, HDD controller gets lost

Posted: 16. Mar 2013, 14:37
by jimklimov
After all the changes outlined above, I set the VMs yesterday to start with a "niceness" of "-2" and rebooted them (poweroff, start). Two seem to have been rebooted by the watchdog an hour later, but all of them have now been running for 8-9 hours with no vHW clock lag - despite a ZFS scrub running on the host since about the same time (maybe that's why those two VMs were rebooted).

The crontab+rdate trick (NTP would likely be better, it is just not installed on these VMs) is still useful: the VM without it has its software clock a bit behind its vHW clock, though far less than before - it lost just over a minute over the 9-hour uptime. Perhaps it would now be wiser to sync from hwclock/adjtime rather than rdate or NTP (and/or use the local vHW clock as one of the NTP sources)?..
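
If we go the hwclock route, it would be as simple as a cron entry like this (just the idea, untested on these guests yet):

Code: Select all

# sync the kernel clock from the (now correctly ticking) virtual RTC every minute
* * * * *  root  /sbin/hwclock --hctosys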

However, one of the VMs (the one with its clocks completely in sync) has lost its vHDDs anyway. This might be due to I/O timeouts caused by the host scrubbing, but it now seems more certain that it is not because of clock skew. I'll have to experiment more with the HDD scheduler settings in the VMs; so far "noop" with a timeout of 240 seconds(?) did not suffice to keep the disks in place.
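
For reference, those knobs are currently set at runtime roughly like this (sysfs paths as seen on our 2.6.18-based guests; sda is the affected disk):

Code: Select all

# inside the guest: no-op elevator plus a long SCSI command timeout for sda
echo noop > /sys/block/sda/queue/scheduler
echo 240 > /sys/block/sda/device/timeout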

The reason for demoting the VMs' priority in the first place was that when the clock skew got too big and/or the vHDDs were lost, the VM process consumed a whole CPU core (no vSMP, no VT-x on this host - single-CPU VMs). At the lowest priority these runaway VMs at least did not cause much havoc on the host (other tasks won whenever they needed CPU time). This likely took a heavy toll on the vHW clock, however, masking the effect of my later changes to the clocksource settings in the VMs' Linux kernels. This time the problematic VM did not eat a CPU, though it did not spew I/O error messages to the console either (earlier, that spewing seemed to coincide with the CPU hogging - perhaps the logging itself was the hog).

I see that the VM processes are multi-threaded. I wonder if it is possible to run different threads at different priorities (perhaps implemented by VirtualBox rather than the host OS), so that timesync and virtual clock interrupts would perform in near real time on the host (with the benefit passed on to the VMs), even while the main VM computing/IO workloads are throttled down?

Re: Linux guest clock slow, HDD controller gets lost

Posted: 16. Mar 2013, 17:32
by Martin
The accuracy of the virtual "hw" clock is always a problem in every virtualization environment.
There are too many factors influencing the clock ticks.
Best practice is to use NTP or the time sync options of the virtualization extensions (Guest Additions in VBox) to adjust the time regularly.

Re: Linux guest clock slow, HDD controller gets lost

Posted: 16. Mar 2013, 20:53
by jimklimov
Thanks for the suggestion - I guess it mostly speaks to my idea of a higher-priority timer interrupt on the virtualization host?..

However, the kinds of problems I've hit are also beyond any acceptable level:
* a virtual clock slower than real time by 5x a few hours after boot;
* a virtual clock stalled so that it cycles over the same few seconds for days (if not noticed in time);
* virtual HDDs timing out and getting lost, possibly due to such bad clocking (though I am now less sure the two are well correlated...)

In all of the above cases, attempts to sync with NTP, Guest Additions and so on failed, at least until I had dug into the problem deep enough to play with ACPI, APIC, clocksource, VirtualBox timesync settings, etc. - all of them barely documented (or I failed to find the proper docs). And I still can't be sure (I know of no way to test) that the timesync options of the VirtualBox engine and the Guest Additions had any influence on my moderate success (as it now seems to be).

If clock skew is a well-known problem with virtualization (and it has been for a decade or so), the workarounds and best (i.e. least-worst) solutions should be detailed in the product docs, I think. If the "what guest OS will your VM run?" templates are as detailed as VirtualBox's (sometimes down to particular releases), then such quirks should certainly either be worked around transparently or pointed out to the admin/user of the solution (e.g. "when you use a Linux VM with a pre-2.6.21 kernel, use clocksource=X in the kernel boot options; for a newer kernel, recompile with NO_HZ and configure time-syncing options X, Y, Z", or something along those lines). Bits of this knowledge can be found on users' blogs, while they really belong in the mainstream product docs. For example, I still haven't found an equivalent solution for the Solaris VMs on the other host, which luckily - though inexplicably - had no problems this week.

PS: Over the past 6 hours, the VM without "soft time" sync via crontab+rdate, which had lagged only a minute by the end of its first 9 hours of uptime, has lost an additional 10 minutes relative to its "vHW clock", which (still) ticks properly.

Re: Linux guest clock slow, HDD controller gets lost

Posted: 24. Mar 2013, 13:58
by jimklimov
As a follow-up: the storage host was relieved of the duty of *running* the VMs in question; they are now executed by a different machine, which mounts the disk images over NFS (on a jumbo-framed VLAN) from the original server like this:

Code: Select all

# cat /zones/_svr4/demostand/summit-vboxes/root/etc/auto_direct 
/Vbox    -noforcedirectio,hard,intr  thumper-jumbo:/Vbox/summit-vboxes

The rest of the settings remain the same (virtual network with etherstubs on the host, VM settings including timesync, and so on); the OpenSolaris zone containing the execution environment for the VMs was copied onto another server with the same OS release and mightier CPUs. So there have been nearly no configuration changes, save for the switch from local disks to NFS and execution on a less-abused machine - with pretty good results.

This has been running for 5 days now, and the discrepancy between the virtual "hwclock -r" and "date" in three of the VMs is under 2 seconds. One VM lags about 6 seconds behind "real" time on the new host; the other clocks run on time (within eye-detectable precision). It has also survived the weekend scrubs on the storage host well. Apparently, vHDD delays are of much lesser consequence to these systems (reasonable, since they mostly just start up a JVM and serve cached data from RAM) than busy physical CPUs and the resulting rareness of timer interrupts, or whatever :)

Running the VMs on the storage host meant competing with many processes, context switches and interrupts, and overall the VMs lost out even with boosted "niceness" - even though on average the CPUs had tens of percent of idle time. I wrote earlier that the CS column in vmstat often reported about 70000 context switches per second on the storage host. Without these three VMs it is usually an order of magnitude smaller: 6-9k, with rare peaks of about 30k once a minute (when many cron jobs spawn scripts to check subsystem statuses).

So... the timing problem is there in VirtualBox, and it can be worked around by separating the storage and execution hosts - if one can afford this generosity. I hope future versions will propose solutions to improve VM timing even on such "busy" single systems. Pushing the clock along is IMHO a relatively cheap task in terms of processing, so it could reasonably be boosted in priority closer to real time - and it should be, because there are bad consequences when it can't work properly despite its "cheapness".

//Jim Klimov

Re: Linux guest clock slow, HDD controller gets lost

Posted: 5. Apr 2013, 17:03
by jimklimov
As a new data point: after about 2 weeks of uptime on the new host (with the HDD images stored remotely on the old host), I looked at the guests again. They do continue running without fatal hiccups (like losing their disks), but their clocks have skewed greatly.

Here's a time sample taken at about Apr 5 18:49:20 (+/- 2 sec) local time; the guests were all booted at the same time about 2 weeks ago:

Code: Select all

# cat /proc/uptime ; uptime; hwclock -r; date
1335017.04 1227851.32
 18:48:38 up 15 days, 10:50,  1 user,  load average: 0.49, 0.42, 0.39
Thu 04 Apr 2013 07:17:19 PM MSK  -0.479847 seconds
Fri Apr  5 18:48:38 MSK 2013

# cat /proc/uptime ; uptime; hwclock -r; date
1332376.78 1261598.22
 18:47:48 up 15 days, 10:06,  1 user,  load average: 0.05, 0.08, 0.07
Thu 04 Apr 2013 10:17:49 AM MSK  -0.494735 seconds
Fri Apr  5 18:47:49 MSK 2013

# cat /proc/uptime ; uptime; hwclock -r; date
1331313.12 1262361.72
 18:48:54 up 15 days,  9:48,  1 user,  load average: 0.06, 0.09, 0.09
Thu 04 Apr 2013 03:43:28 PM MSK  -0.184119 seconds
Fri Apr  5 18:48:55 MSK 2013

Also, in the console logs I see messages like:

Code: Select all

set_rtc_mmss: can't update from 58 to 5
set_rtc_mmss: can't update from 58 to 6
set_rtc_mmss: can't update from 59 to 9
set_rtc_mmss: can't update from 2 to 53

As can be seen, the hwclocks are now 1-1.5 days behind: for example, the guest whose RTC shows Thu 10:17 against the real Fri 18:47 has lost about 32.5 hours over its ~15.4 days of uptime. In "top" I can see the seconds slowly ticking, and the Guest Additions, crontab+rdate and NTP (all enabled) do manage to keep the software (kernel) clocks within a couple of minutes of real time so far. Basically, since rdate is called every minute at 0 seconds, I can roughly estimate that the VM clocks currently run up to 2 times slower than real time.

While this is acceptable for these particular guests (which are just demo plumbing and are only expected to stay up and render their static web pages), it would be a very poor situation for other types of guests that depend on real time being real (DNS sync, Kerberos and other security/timestamp-related stuff, as well as email and whatever else)...

I guess I'll reboot them now for "cleanness" anyway...

Re: Linux guest clock slow, HDD controller gets lost

Posted: 5. Apr 2013, 22:27
by xorbe
I have a recent thread here about nanosleep() hitching, and I've also noticed my VBox losing 15 minutes of wall-clock time in not too many hours. Perhaps nanosleep() is not overflowing a queue or going negative; rather, if the VM's time stops progressing properly, then the guest never reaches the end of the nanosleep() in a timely fashion...