Solaris 11.4 SRU 43: Sudden Reboots due to Kernel Panics

Discussions about using Solaris guests in VirtualBox.
Post Reply
Steffen M.
Posts: 17
Joined: 12. Sep 2013, 16:56

Solaris 11.4 SRU 43: Sudden Reboots due to Kernel Panics

Post by Steffen M. »

Hi all,

from time to time, I run Solaris 11.4 in VirtualBox on my Linux workstation, mainly for developing and testing applications we use on our Solaris 11.4 x64 server infrastructure. At least for some months, I've been observing sudden reboots of the Solaris VM caused by kernel panics of Solaris.

So more information about my system:

Host:
  • Mainboard: Gigabyte WRX80-SU8-IPMI
  • CPU: AMD Ryzen Threadripper PRO 3955WX (16 cores, 32 threads)
  • RAM: 128 GB, DDR4, ECC, Registered
  • Storage: Samsung SSD 860 PRO 1TB
  • OS: OpenSUSE Tumbleweed, currently kernel: 5.17.1-1-default
  • VirtualBox 6.1.32
Guest:
  • CPU: 16 vCPUs (also happens with less vCPUs)
  • Memory: 24 GB
  • Storage: SATA, Type Normal (VDI), 256 GB dynamically allocated
  • Network: Bridged Adapter, Intel PRO/1000 MT Desktop (82540EM)
  • OS: Solaris 11.4, fully-patched, currently to SRU 43.
I've seen exactly the same behavior running Solaris VMs on my older workstation (which had 2 CPUs of AMD Opteron 4386 on a Tyan S8225), which means it was a much older host platform), and at the mid of last year, when I upgraded the workstation, I hoped that the problems might be gone, but unfortunately, this did not happen. I've seen the same problem with older SRUs of Solaris, so it is definitely not a problem of SRU 43. Unfortunately, I cannot state when it happened for the first time, especially, I am not sure if we had the problem back in the days of Solaris 11.3.

The kernel panics are especially triggered when there is load (e.g., caused by compilation of software) within the VM. Today, it happened when building JDK 9 till 17 for Solaris OS after about three hours of successful operation.

Solaris reports:

Code: Select all

Apr  8 12:53:47 solaris savecore: [ID 570001 daemon.notice] reboot after panic: assertion failed: tl.tv_nsec >= 0 (0xfffffffffffff541 >= 0x0), file: ../../../common/hrt/i386/timestamp_cmn.c, line: 187
Apr  8 12:53:47 solaris savecore: [ID 702911 daemon.warning] System dump time: Fri Apr  8 12:37:32 2022
Apr  8 12:53:47 solaris savecore: [ID 702911 daemon.warning] Dump found on /dev/zvol/dsk/rpool/dump
Apr  8 12:53:47 solaris savecore: [ID 702911 daemon.error] Saving compressed system crash dump files in directory /var/crash/data/52d6fcbb-a47d-4317-8285-d5a5efa7881d
The VM's log shows:

Code: Select all

0:02:10.507592 Display::i_handleDisplayResize: uScreenId=0 pvVRAM=00007f9ee9bff000 w=800 h=600 bpp=32 cbLine=0xC80 flags=0x1 origin=0,0

...Solaris 11.4 SRU 43 working fine in headless mode...

03:37:11.159066 AHCI#0: Port 0 reset
03:37:11.163862 VD#0: Cancelling all active requests
03:37:37.856165 OHCI: USB Reset
03:37:37.856275 OHCI: Software reset
03:37:37.858014 EHCI: USB Suspended
03:37:37.868026 EHCI: Hardware reset
03:39:35.802846 AHCI#0: Port 0 reset
03:39:35.804911 VD#0: Cancelling all active requests
03:40:35.906136 AHCI#0: Port 0 reset
03:40:35.908205 VD#0: Cancelling all active requests
03:41:36.006911 AHCI#0: Port 0 reset
03:41:36.008970 VD#0: Cancelling all active requests
03:42:36.110119 AHCI#0: Port 0 reset
03:42:36.112190 VD#0: Cancelling all active requests
03:43:36.221787 AHCI#0: Port 0 reset
03:43:36.223856 VD#0: Cancelling all active requests
03:43:56.330074 AHCI#0: Port 1 reset
03:44:36.374050 AHCI#0: Port 0 reset
03:44:36.376124 VD#0: Cancelling all active requests
03:44:56.482047 AHCI#0: Port 1 reset
03:45:36.526085 AHCI#0: Port 0 reset
03:45:36.528316 VD#0: Cancelling all active requests
03:45:56.633240 AHCI#0: Port 1 reset
03:46:36.678072 AHCI#0: Port 0 reset
03:46:36.680149 VD#0: Cancelling all active requests
03:47:36.780569 AHCI#0: Port 1 reset
03:48:36.822028 AHCI#0: Port 1 reset
03:49:36.866033 AHCI#0: Port 0 reset
03:49:36.868099 VD#0: Cancelling all active requests
03:50:36.966031 AHCI#0: Port 0 reset
03:50:36.968088 VD#0: Cancelling all active requests
03:50:41.787007 OHCI: Software reset
03:50:41.796065 OHCI: USB Reset
03:50:41.796132 OHCI: USB Operational
03:50:41.995441 EHCI: Hardware reset
03:50:42.005559 EHCI: USB Operational
03:51:22.826993 OHCI: USB Reset
03:51:22.827073 OHCI: Software reset
03:51:22.827121 EHCI: USB Suspended
03:51:22.837133 EHCI: Hardware reset
03:51:22.857367 ACPI: Reset initiated by ACPI
03:51:22.860087 Changing the VM state from 'RUNNING' to 'RESETTING'
At first I thought, it could be related to a timing issue, maybe the system time inside the VM is not forwarded, maybe caused by load spikes on the host or by a change of the CPU clock speed. But nevertheless, the time should never be negative, and the assertion is for ">=0" not ">0".

Attached, there is the full "VBox.log". I've also attached the configuration file of the VM.

My question: Does anybody have an idea what could be causing the core dumps (and therefore the reboots of the VM)? Has anyone seen this before? Is there any setting wrong within the VM's config? Should I log an SR with Oracle for Solaris to ask for analyzing the core dump?

I've never seen this behavior before on our physical, non-virtualized Solaris servers.

I would be very glad for getting any helpful hints. Thank you very much in advance!

Kind regards,
Steffen
Attachments
Solaris 11.4.vbox.bz2
VirtualBox config file of the affected Solaris 11.4 VM
(2.04 KiB) Downloaded 616 times
VBox.log.bz2
VirtualBox log file when the crash occurred.
(49.4 KiB) Downloaded 607 times
Steffen M.
Posts: 17
Joined: 12. Sep 2013, 16:56

Re: Solaris 11.4 SRU 43: Sudden Reboots due to Kernel Panics

Post by Steffen M. »

The behavior does not change depending on the state of "HPET" option. In my first tests, HPET was inactive, but after having activated it, I still ran into this problem, yesterday.
fth0
Volunteer
Posts: 5690
Joined: 14. Feb 2019, 03:06
Primary OS: Mac OS X other
VBox Version: VirtualBox+Oracle ExtPack
Guest OSses: Linux, Windows 10, ...
Location: Germany

Re: Solaris 11.4 SRU 43: Sudden Reboots due to Kernel Panics

Post by fth0 »

Some random thoughts (from someone who doesn't really know Solaris ;)):

The Linux kernel 5.17.1 is newer than the latest VirtualBox 6.1.32, and the VirtualBox 6.1.30 change log "Introduced initial support for kernel 5.16". There could be an incompatibility with the newer Linux kernel (happened regularly in the past).

If you don't use nested virtualization, you could disable System > Processor > Nested VT-x/AMD-V.

I don't know what a Solaris guest expects here, but: What's configured in System > Acceleration > Paravirtualization Interface?

Regarding the logs: The clocks of the host and the guest do not look synchronized enough to me to be sure about the sequence of events:

a) If the difference of the clocks amounted to only one or two minutes, it would indicate that the AHCI port resets happened first, and the time difference kernel panic in the Solaris guest happened 15 minutes later.

b) If the difference of the clocks amounted to ~15 minutes, the time difference kernel panic in the Solaris guest happened first, and saving the crash dump led to the AHCI port resets.

Depending on what assumption is correct, the issue would be related either to disk I/O or the clocks.
Steffen M.
Posts: 17
Joined: 12. Sep 2013, 16:56

Re: Solaris 11.4 SRU 43: Sudden Reboots due to Kernel Panics

Post by Steffen M. »

Hi!

Thank you very much for your ideas and thoughts… Please find my comments below…
fth0 wrote:Some random thoughts (from someone who doesn't really know Solaris ;)):

The Linux kernel 5.17.1 is newer than the latest VirtualBox 6.1.32, and the VirtualBox 6.1.30 change log "Introduced initial support for kernel 5.16". There could be an incompatibility with the newer Linux kernel (happened regularly in the past).
Good point. In the meantime, I've updated to VirtualBox 6.1.34 (and the Solaris guest to SRU 44) and still do see this problem. As I've seen this for months on my older machine, I am quite sure it has also happened with VirtualBox versions which have been newer than the kernel versions.
fth0 wrote: If you don't use nested virtualization, you could disable System > Processor > Nested VT-x/AMD-V.
I actually don't need nested virtualization. So I disabled it. When I configured the VM, I thought that I would also have to work with Solaris kernel zones, but as long as I don't need them, I can safely disable it. Unfortunately, it did not make the problem go away...
fth0 wrote: I don't know what a Solaris guest expects here, but: What's configured in System > Acceleration > Paravirtualization Interface?
I've set it to "Default". In general, VirtualBox should choose the virtualization provider which fits best for the guest operating system when this option is set to "Default". Nevertheless, this is a very good question! I've checked the logs. It seems that VirtualBox communicates "None". Wow!

Just a note: A Solaris 11.4 guest also seems to boot fine when it is either set to "KVM" or to "Minimal". I haven't checked "Hyper-V", yet, but I cannot imagine that this Windows-typical provider is suitable for Solaris.
fth0 wrote:
Regarding the logs: The clocks of the host and the guest do not look synchronized enough to me to be sure about the sequence of events:
I've done a lot of checks regarding the clocks. They seem to be synchronized and deviate at a maximum of 1 second. I think we should consider the following when looking at my logs: VirtualBox's host logging always starts its "timer" at "00:00:00.000000". The guest kernel crashed on Fri Apr 8 12:37:32 2022, after the VM has been running for 3 hours and 37 minutes. I think it was by accident that we see the 37 twice. This might have been a bit confusing, at least it was to me. :-)
fth0 wrote: a) If the difference of the clocks amounted to only one or two minutes, it would indicate that the AHCI port resets happened first, and the time difference kernel panic in the Solaris guest happened 15 minutes later.

b) If the difference of the clocks amounted to ~15 minutes, the time difference kernel panic in the Solaris guest happened first, and saving the crash dump led to the AHCI port resets.
What I can say quite for sure now is that the kernel panic (assertion failed: tl.tv_nsec >= 0 (0xfffffffffffff541 >= 0x0), file: ../../../common/hrt/i386/timestamp_cmn.c, line: 187) occurs in the first place. After then, the system reboots (as it is configured to). When rebooting, the VM host logs the AHCI port reset.
fth0 wrote: Depending on what assumption is correct, the issue would be related either to disk I/O or the clocks.
What I can say so far: Enabling/disabling HPET does not solve it. Reducing the number of cores seems to make the problem occur less, but it even happens with 2 cores. Haven't checked a single-core configuration, yet. I've changed the storage controller type (AHCI, LSI, LSI SAS), but it does not change the behavior. I've also changed the chipset from PIIX3 to ICH9, which also does not help. Changing the network controller and the network connectivity is also without any effect.


What could probably help: Today's morning, I tried to deactivate CPU frequency scaling using "cpufrequtils" at the host side as best as possible, so it should run always at maximum speed – but I think it is not working perfectly, as there might be still turbo modes in modern CPUs. Interestingly, the VM is still up and running and compiling Java JDKs 9 .. 18, which is a task, which typically leads to crashes. In System > Acceleration > Paravirtualization the interface mode is still "Default", resulting, as we have just learned in "None". If I do understand the documentation correctly, "None" does not even expose the TSC and APIC frequency to the guest operating system. At least, I conclude it from two parts of the documentation:

https://www.virtualbox.org/manual/ch08.html tells us:

Code: Select all

--paravirtprovider none|default|legacy|minimal|hyperv|kvm: Specifies which paravirtualization interface to provide to the guest operating system. Specifying none explicitly turns off exposing any paravirtualization interface. The option default selects an appropriate interface when starting the VM, depending on the guest OS type. This is the default option chosen when creating new VMs. The legacy option is used for VMs which were created with older Oracle VM VirtualBox versions and will pick a paravirtualization interface when starting the VM with Oracle VM VirtualBox 5.0 and newer. [i]The minimal provider is mandatory for Mac OS X guests.[/i] kvm and hyperv are recommended for Linux and Windows guests respectively. These options are explained in Section 10.5, “Paravirtualization Providers”. 
and https://www.virtualbox.org/manual/ch10.html#hwvirt tells us:

Code: Select all

 Oracle VM VirtualBox provides the following interfaces:
    Minimal: Announces the presence of a virtualized environment. Additionally, reports the TSC and APIC frequency to the guest operating system. This provider is mandatory for running any Mac OS X guests.
    KVM: Presents a Linux KVM hypervisor interface which is recognized by Linux kernels version 2.6.25 or later. Oracle VM VirtualBox's implementation currently supports para-virtualized clocks and SMP spinlocks. This provider is recommended for Linux guests.
    Hyper-V: Presents a Microsoft Hyper-V hypervisor interface which is recognized by Windows 7 and newer operating systems. Oracle VM VirtualBox's implementation currently supports para-virtualized clocks, APIC frequency reporting, guest debugging, guest crash reporting and relaxed timer checks. This provider is recommended for Windows guests. 
So I presume that "None" does not even expose any property of cores running at different clock speeds to the guest. And can indeed lead to strange effects in a guest, if there are several cores running at different clock speeds while the guest does nothing know about this fact! Of course, this can cause an assertion to fail if the kernel seems to see a negative jump in time when switching from one core to another one! This could also explain the fact why the problem occurs more often the more cores I've assigned to the guest.

So my conclusions so far:
  • I'll try to continue running the whole thing with fixed clocks at the host side and compile and compile and compile… :-)
  • If this turns out to be stable, I am rebooting the host in order to have the original CPU frequency scaling behavior back. I'll see if I can reproduce the crashes (I am very sure that I can - as they occur after a few hours of load, it should not be too hard).
  • After then, I explicitly change the para-virtualization interface mode from "Default" (aka "None" for Solaris) to "Minimal" – and later to "KVM".
  • If one of these steps solves the problem (especially changing the paravirtualization mode), I am going to file a bug report to VirtualBox that the VirtualBox developers should change the behavior of "Default" if a guest is Solaris 11.
  • If the problem still occurs, I am going to log a service request for Solaris 11.4.
So far – thank you very much for your help, especially for the hint which made me look more deeply into the para-virtualization mode. I assumed (and was even very sure) that VirtualBox uses KVM for Solaris as it does for Linux when the mode is configured to Default.

Your post triggered me to check this!

Kind regards,
Steffen
Steffen M.
Posts: 17
Joined: 12. Sep 2013, 16:56

Re: Solaris 11.4 SRU 43: Sudden Reboots due to Kernel Panics

Post by Steffen M. »

All these nice ideas, I wrote about in my last post, can be directly abandoned. Just had a crash after two hours of running Solaris 11.4 SRU 44 in para-virtualization mode "KVM" and fixed frequencies…
fth0
Volunteer
Posts: 5690
Joined: 14. Feb 2019, 03:06
Primary OS: Mac OS X other
VBox Version: VirtualBox+Oracle ExtPack
Guest OSses: Linux, Windows 10, ...
Location: Germany

Re: Solaris 11.4 SRU 43: Sudden Reboots due to Kernel Panics

Post by fth0 »

Regarding the times and the clocks, I somehow didn't see the crash time staring at me from your Solaris log excerpt. ;)

BTW, the 2nd log message in the VBox.log file shows the host's UTC time when either the VM was started or this log messages was created, I don't know (yet), but with the ~10 ms difference, it doesn't really matter here. Starting the VM at the full hour explains the coincidence of the 37 minutes.

Regarding the CPU frequency scaling, it shouldn't matter (but who knows): Your CPU has an invariant TSC (Time Stamp Counter), which is supposed to increment independent from the CPU frequency or power.

I'm not sure if it would really help (or enlarge the confusion ;)), but there's a method to watch parts of the VirtualBox internal timer magic, see my post in Re: Understanding "Giving up catch-up attempt at".
Steffen M.
Posts: 17
Joined: 12. Sep 2013, 16:56

Re: Solaris 11.4 SRU 43: Sudden Reboots due to Kernel Panics

Post by Steffen M. »

I've just logged an SR for Oracle Solaris 11.4: SR 3-29594519391
CasperDik
Posts: 1
Joined: 12. Mar 2024, 17:53

Re: Solaris 11.4 SRU 43: Sudden Reboots due to Kernel Panics

Post by CasperDik »

We have seen this at Oracle as well as soon you have more than one vCPUs.

The workaround is using a different way of how VirtualBox hands out the TSC:

$ VBoxManage setextradata VM-name VBoxInternal/TM/TSCMode RealTSCOffset


In the log you should see:

one CPU:
00:00:02.500205 TM: cTSCTicksPerSecond=2 491 819 003 (0x948623fb) enmTSCMode=3 (Dynamic)
> 1 CPU
00:00:02.425865 TM: cTSCTicksPerSecond=2 491 819 003 (0x948623fb) enmTSCMode=1 (VirtTscEmulated)

and it appears that the VirtTscEmulated does not create the same TSC on all the CPUs which confuses Solaris.

The default setting for Solaris in VirtualBox causes a VM-Exit so the TSC latency is then around 1us (microseconds); with the
setting "RealTSCOffset" uses the standard mechanism of TSC+Offset (or really, -Offset as the TSC in the VirtualBox guest
would be lower. The latency of rdtscp would then be around 10-20ns (nano seconds)
klaus
Oracle Corporation
Posts: 1436
Joined: 10. May 2007, 14:57

Re: Solaris 11.4 SRU 43: Sudden Reboots due to Kernel Panics

Post by klaus »

The better approach would be to get the "paravirt provider" going for Solaris. At the moment the default for Solaris is mapped to "none" (for Linux it's KVM, for Windows it's Hyper-V) and having a proper hook would be much preferred.

It is possible that the TSC values are a bit jittery without such 'help', but I haven't heard that they're going backwards even across CPUs. By design this should not happen.
Post Reply