Errors when programmatically stopping Windows VMs on 6.1.28

Discussions related to using VirtualBox on Linux hosts.
challengeursli
Posts: 20
Joined: 21. Oct 2021, 14:52

Errors when programmatically stopping Windows VMs on 6.1.28

Post by challengeursli »

Since we have upgraded to 6.1.28, we have massive issues programmatically powering off headless Windows VMs running on a Linux host.

We have tried with both:

Code: Select all

controlvm <XYZ> poweroff
commands as well as hard stopping the vboxheadless process by ID:

Code: Select all

kill -2 <PID>
When we do the former, we get errors like this:

Code: Select all

Failed to get a console object from the direct session (VBOX_E_INVALID_OBJECT_STATE)
VBoxManage: error: Details: code VBOX_E_VM_ERROR (0x80bb0003), component MachineWrap, interface IMachine, callee nsISupports
VBoxManage: error: Context: "LockMachine(a->session, LockType_Shared)" at line 388 of file VBoxManageControlVM.cp
In these cases, we were still able to restore and start the VM again.

In another case, the poweroff command did not return, and these were that last lines in the machine's log:

Code: Select all

00:03:18.324978 ************** End of Guest state at power off ***************
00:03:18.371894 Console: Machine state changed to 'Stopping'
00:03:18.372607 Console::powerDown(): A request to power off the VM has been issued (mMachineState=Stopping, InUninit=0)
00:03:18.396602 PDMR3PowerOff: 71 595 566 ns run time
00:03:18.396623 Changing the VM state from 'POWERING_OFF' to 'OFF'
00:03:18.396745 VMR3PowerOff:
00:03:18.396759  RUNNING -> POWERING_OFF, SUSPENDED -> POWERING_OFF, DEBUGGING -> POWERING_OFF, LOAD_FAILURE -> POWERING_OFF, GURU_MEDITATION -> POWERING_OFF, FATAL_ERROR -> POWERING_OFF, CREATED -> POWERING_OFF, RUNNING_LS -> POWERING_OFF_LS, DEBUGGING_LS -> POWERING_OFF_LS, GURU_MEDITATION_LS -> POWERING_OFF_LS, FATAL_ERROR_LS -> POWERING_OFF_LS failed, because the VM state is actually OFF
00:03:18.396832 VMSetError: /home/vbox/vbox-6.1.28/src/VBox/VMM/VMMR3/VM.cpp(3342) int vmR3TrySetState(PVM, const char*, unsigned int, ...); rc=VERR_VM_INVALID_VM_STATE
00:03:18.396845 VMSetError: VMR3PowerOff failed because the current VM state, OFF, was not found in the state transition table (old state FATAL_ERROR_LS)
00:03:18.399344 ERROR [COM]: aRC=VBOX_E_VM_ERROR (0x80bb0003) aIID={872da645-4a9b-1727-bee2-5585105b9eed} aComponent={ConsoleWrap} aText={Could not power off the machine. (Error: VERR_VM_INVALID_VM_STATE)}, preserve=false aResultDetail=-1907
If we do the latter (hard stop via SIGINT), the VM gets locked in the "Stopping" state immediately.
Any action then fails with:

Code: Select all

vboxmanage snapshot "W7 Analyzer 1" restore "Current"
VBoxManage: error: Code NS_ERROR_FAILURE (0x80004005) - Operation failed (extended info not available)
VBoxManage: error: Context: "LockMachine(a->session, LockType_Shared)" at line 338 of file VBoxManageSnapshot.cpp
There is no process left for this machine, so the only way to get the machine back is to fully kill the VBoxSVC process (which kills other VMs as well).

We haven't had issues like this before at all (on 6.1.18).
I have downgraded VirtualBox to 6.1.24 and 6.1.26 and tested on both versions. The problem does not happen there.
So it must be related to the latest release 6.1.28.

Has anybody experienced the same?

I have analyzed the VBoxSVC log as well as the VM logs, there is not more information available than given here.
scottgus1
Site Moderator
Posts: 20965
Joined: 30. Dec 2009, 20:14
Primary OS: MS Windows 10
VBox Version: PUEL
Guest OSses: Windows, Linux

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by scottgus1 »

A bug has crept in that leaves one of the support processes for headless running: viewtopic.php?f=3&t=103980 Try a test build if you want, or revert back to the version you were running.
challengeursli
Posts: 20
Joined: 21. Oct 2021, 14:52

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by challengeursli »

@scottgus thanks for the information and link.

Somehow it looks to me like something was broken in 6.1.26 and (maybe) fixed in 6.1.28, but at the same time creating a new bug which we are facing now.
6.1.26 seems to run fine here.
klaus
Oracle Corporation
Posts: 1115
Joined: 10. May 2007, 14:57

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by klaus »

This is very mysterious... we didn't actually change anything with the VM state transitions in 6.1.28 (at least not intentionally), and that's where the problem seems to be.

The only change (to address https://www.virtualbox.org/ticket/20574 actually, a regression with VM termination on Windows) was I believe unrelated, because that problem happened only on Windows systems with 6.1.26 (and it was a really odd corner case related to hardening).

Why you have trouble killing the VM process is even more mysterious.
challengeursli
Posts: 20
Joined: 21. Oct 2021, 14:52

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by challengeursli »

Thanks @klaus, I have been running a bit on 6.1.26, and the machines started going into limbo after a while as well (we stop and revert machines every few minutes).

I have now switched to 6.1.24.

I have to debug this further in our environment and will update the ticket when I have more information.
challengeursli
Posts: 20
Joined: 21. Oct 2021, 14:52

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by challengeursli »

I debugged further. What happens exactly is this:
I run some of our workflow cycles on a VM. At the end of each cycle, the VM is stopped via

Code: Select all

kill -2 <PID>
The process is terminated, but the VM is stuck in "Stopping" state.

This is the end of the VM's log:

Code: Select all

00:03:35.063788 E1000#0: TX frames up to 16384: 356
00:03:35.063791 E1000#0: TX frames up to 32768: 563
00:03:35.063794 E1000#0: Larger TX frames    : 2125
00:03:35.063797 E1000#0: Max TX Delay        : 0
00:03:35.145109 Changing the VM state from 'DESTROYING' to 'TERMINATED'
00:03:35.145133 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={872da645-4a9b-1727-bee2-5585105b9eed} aComponent={ConsoleWrap} aText={The object is not ready}, preserve=false aResultDetail=0
00:03:35.149719 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={872da645-4a9b-1727-bee2-5585105b9eed} aComponent={ConsoleWrap} aText={The object is not ready}, preserve=false aResultDetail=0
00:03:35.307951 VBoxHeadless: exiting
Result when trying to run a command on the machine:
vboxmanage snapshot "W7 Analyzer 1" restore "Current"
VBoxManage: error: Code NS_ERROR_FAILURE (0x80004005) - Operation failed (extended info not available)
VBoxManage: error: Context: "LockMachine(a->session, LockType_Shared)" at line 338 of file VBoxManageSnapshot.cpp
I have tried to reproduce this issue using a bash script with a loop of start - wait 30 seconds - kill - revert, but didn't succeed. It seems to only happen if the machines run for some time or are interacted with (at least I have no other explanation at the moment).

The issue does not happen at all with VirtualBox <= 6.1.24.
challengeursli
Posts: 20
Joined: 21. Oct 2021, 14:52

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by challengeursli »

Could this maybe be related to this change:
VBoxHeadless: Running VM will save its state on host shutdown
in the changelog for version 6.1.26?
klaus
Oracle Corporation
Posts: 1115
Joined: 10. May 2007, 14:57

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by klaus »

Think we need to get more information from you to hunt this down. First of all it'd be good to see a complete VBox.log and not just the last few lines. Maybe there is a hint about the cause of the hang earlier (because I can't see any pausing which you mentioned). After that it would help getting a core dump of VirtualBoxVM and VBoxSVC processes using the "gcore" approach mentioned in getting core dumps of VirtualBox. The core dumps would be most useful from VirtualBox 6.1.28, because that would exclude running into some very long standing API bug which was very recently reported which can cause hanging VMs with surprising ease.

Just to mention it (should be no actual secret even though it's not advertised in the documentation): for killing a VM you could also use:

Code: Select all

VBoxManage startvm "vmname" --type emergencystop
This is equivalent to kill -9, so it is really the last measure to stop a VM. The bonus of this command is that your script doesn't need to figure out the process ID.
challengeursli
Posts: 20
Joined: 21. Oct 2021, 14:52

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by challengeursli »

@klaus thanks a lot!

I mixed up "Pausing" and "Stopping", actually I never paused the VM.
The undocumented command looks very useful, I'll try that out.

I'll try to collect the information you asked for.
klaus
Oracle Corporation
Posts: 1115
Joined: 10. May 2007, 14:57

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by klaus »

challengeursli wrote:Could this maybe be related to this change:
VBoxHeadless: Running VM will save its state on host shutdown
in the changelog for version 6.1.26?
6.1.26 does modify the signal handling of VBoxHeadless on Linux/macOS/Solaris (and does the equivalent on Windows), yes. For HUP, INT, TERM, USR1 and USR2 (i.e. all the friendly "go away" signals which are routinely used by logout/shutdown sequences) this triggers saving state for the VM. Which may take a few seconds, and would only kick in if your earlier VM poweroff attempt (which also needs a little reaction time, nothing in this world happens instantly) didn't work out which might indicate some other trouble, i.e. a VM which is hung already.
challengeursli
Posts: 20
Joined: 21. Oct 2021, 14:52

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by challengeursli »

Using SIGKILL instead of SIGINT absolutely does the job!
This correlates perfectly with the change in version 6.1.26.

So the issue is resolved for me.
vushakov
Oracle Corporation
Posts: 29
Joined: 18. Sep 2012, 16:13

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by vushakov »

I think the signal changes in VBoxHeadless are a red herring. When you call controlvm and it fails - that already indicates that the machine is in a weird state. A signal just causes more or less the same API call to be performed from inside VBoxHeadless instead of from VBoxManage.
klaus
Oracle Corporation
Posts: 1115
Joined: 10. May 2007, 14:57

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by klaus »

Correct - if your reporting of the issue is in any way accurate then the VM goes into a unexpected state before you try the "VBoxManage controlvm foo poweroff", and that's something which should be investigated. Doing a kill -9 instead is not a real replacement for powering off a VM (it kills the process, which means that the usual poweroff logic is never executed), so it would still be very much appreciated if you can provide the log file at least, ideally with a time correlated description which VBoxManage command (or kill) happens when.
challengeursli
Posts: 20
Joined: 21. Oct 2021, 14:52

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by challengeursli »

The thing is that we have been using "kill -2" without a previous "poweroff" command for a long time already, because "controlvm poweroff" used to fail (seemingly) randomly also on older versions of VirtualBox.
But I understand that it would be interesting to investigate what exactly goes wrong with the VM.

I'll try to provide that info.
challengeursli
Posts: 20
Joined: 21. Oct 2021, 14:52

Re: Errors when programmatically stopping Windows VMs on 6.1.28

Post by challengeursli »

@klaus attached please find the log file (split into two because of max upload size) for a VM where I:
* first ran the "vboxmanage controlvm poweroff" command. The VM went into "Stopping" state then. The vboxheadless process was still there. (VBox.log)
* then ran "pkill -f <name of VM>" which terminated the process, but kept the VM in "Stopping" state. (VBox-sigkill.log, see next post)
I can resolve this situation and get the VM back into a usable state only by killing all virtualbox processes.
Attachments
VBox.log
(106 KiB) Downloaded 9 times
Last edited by challengeursli on 28. Oct 2021, 15:09, edited 1 time in total.
Post Reply