Page 1 of 1

Trying to diagnose poor start/stop reliability with VBoxManage

Posted: 30. Nov 2020, 14:29
by rdghickman
Hi all,

I am starting/stopping a lot of VMs (up to 60) using scripts that call out to VBoxManage. The VMs have been created prior to the script and are in a powered off state. For startup, I do 2 at a time, waiting 30 seconds between batches. For shutdown I do similar, sending an ACPI power off command via VBoxManage.
This is on a Windows host and generally I have the VBox GUI up as well. Quite frequently (usually 1/10 VMs) one of the VMs will get stuck at power on with a black screen. It is not even getting to the GRUB boot loader. On shut down very frequently one or more VMs will get stuck in "powering off" and the VBox GUI will also hang and become inoperative. If this happens then the only option is to reboot the PC, at which point the affected VMs are in "aborted" state. This seems far more likely to happen if I interact with the GUI while the operations are being performed.

I am assuming this is a VBox problem because both VBoxManage and the VBox GUI hang indefinitely if you try and interact with one of the VMs which it got stuck with powering off. This makes the whole automation process very fraught and unreliable and requires reboots. If you slowly do it one at a time by hand from the GUI there are no problems.

Has anyone got any suggestions or ideas I can try to diagnose the problem?

Re: Trying to diagnose poor start/stop reliability with VBoxManage

Posted: 2. Dec 2020, 02:03
by scottgus1
rdghickman wrote:one of the VMs will get stuck at power on with a black screen. It is not even getting to the GRUB boot loader.
When this happens, close the Virtualbox window for the VM with the Power Off option set.

Right-click the VM in the main Virtualbox window's VM list, choose Show Log. Save the far left tab's log, zip it, and post the zip file, using the forum's Upload Attachment tab.

Also, I don't know that a simple timer will be the most optimum way to smoothly tart guests from a script. Things happen, and a recovery system s needed.

I ran backup & restart scripts on the office server for 5 VMs. I set up a subroutine script that would launch a VM, then monitor that VM for some form of feedback, whether a flag file in a host shared folder, or a guestproperty toggle, something suitable to the capabilities of the VM OS. I put a script in each VM that ran when the OS booted up and logged in, which caused the flag file or toggle. For shutdown I had another subroutine script that would send acpipowerbutton, also some keyboard scan inputs for shutdown commands, then monitor the output of 'vboxmanage showvminfo', to see if the VM shut down, as well as Virtualbox's CPU metrics to see if the guest stopped processing but the VM's Virtualbox window never closed, then send a vboxmanage poweroff.

This may be overkill for your needs, but it was built up over a few times of coming in Monday morning with the sever down because the backup routine seized up on a VM that didn't shut down... :shock:

Re: Trying to diagnose poor start/stop reliability with VBoxManage

Posted: 2. Dec 2020, 16:19
by rdghickman
At least one of the other developers has reported a fault today which appears to be a -450 (VERR_NET_DOWN) failure.
I'll attach the logs here at least.

Starting to get a bit concerned that VBox is not the right solution for what we're trying to do here. I'll see if I can make the scripts back off some more and see if that helps reduce errors further.

Re: Trying to diagnose poor start/stop reliability with VBoxManage

Posted: 2. Dec 2020, 17:11
by rdghickman
Additionally here is the log for where it gets stuck on boot.

The process is this:
- VM is cloned
- VM is started
- Some network settings are changed on the VM
- The VM is rebooted by sending a "reboot" command to the Linux guest OS
- The guest VM goes to a black screen and nothing else happens (waited > 5 minutes)

The last part of the log is:

00:01:47.592093 TM: Switching TSC mode from 'Dynamic' to 'RealTscOffset'
00:01:48.579465 GIM: KVM: Enabled wall-clock struct. at 0x000000002f9e8008 - u32Sec=1606921375 u32Nano=660980153 uVersion=2
00:01:48.601670 PIT: mode=2 count=0x12a5 (4773) - 249.98 Hz (ch=0)
00:01:48.602326 APIC0: Switched mode to x2APIC
00:01:48.733144 PIT: mode=0 count=0x10000 (65536) - 18.20 Hz (ch=0)
00:01:48.843711 OHCI: Software reset

You can then see I power it off to get the logs more than five minutes later:

00:01:48.843711 OHCI: Software reset
00:06:59.972369 Console: Machine state changed to 'Stopping'