Virtual machine freezes

AleksandrLykov
Posts: 3
Joined: 27. Apr 2021, 07:49

Virtual machine freezes

Post by AleksandrLykov »

Good day.

Situation:
There are 12 virtual machines on a single server. While only 5 of them were running, there were no problems. As soon as all of them were started, some machines stopped working (the status is "Working", but the machines themselves froze and did not allow a reboot or anything else).
If I choose "Poweroff" or "Reset", the machine hangs with the status "Stopping". When I try to "Kill" the VM process, the HOST machine freezes, and only a hard reset recovers it. At the same time, memory is fine (about 100GB is still free, swap is empty, and the total CPU load is about 50%). PostgreSQL 12 is installed on the VMs. I didn't find anything in the logs.

HOST machine: Ubuntu 20.04, 256GB ram, 12TB ssd, VirtualBox version: 6.1.18.
sysctl.conf:

Code:

vm.swappiness=5
vm.zone_reclaim_mode=0
vm.overcommit_memory=2
vm.overcommit_ratio=90
vm.vfs_cache_pressure=100
vm.min_free_kbytes=16384

kernel.shmmax=17179869184
kernel.shmmni=6144
kernel.sched_migration_cost_ns=5000000
kernel.sched_autogroup_enabled=0

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
The virtual machines all have roughly the same configuration: Ubuntu 20.04, about 16GB RAM, 512GB disk.

The logs are in the attached files (approximate crash time: 2021.04.26 11:27:00; I restarted the HOST machine at 11:30:00):
1) KartyFarmani_PG.vbox - config of the failing virtual machine
2) syslog.HOST.log - logs of the HOST machine for the 26th
3) syslog.VM.log - logs of the virtual machine
4) VBox.log - VirtualBox logs
5) PG - PostgreSQL logs on the VM
Attachments
logs.rar (231.09 KiB)
Last edited by AleksandrLykov on 28. Apr 2021, 17:30, edited 1 time in total.
fth0
Volunteer
Posts: 5668
Joined: 14. Feb 2019, 03:06
Primary OS: Mac OS X other
VBox Version: PUEL
Guest OSses: Linux, Windows 10, ...
Location: Germany

Re: Virtual machine hang

Post by fth0 »

On the host, the two Intel® Xeon® Gold 6230 Processors together have "only" 40 cores with 2 threads each. Do I understand correctly that you're trying to emulate 12 * 16 = 192 vCPUs on those 40/80 cores/threads?

On the host, you should update the VirtualBox Extension Pack to the same version as VirtualBox. On the guest, you should reinstall the VirtualBox Guest Additions and pay attention to resulting error messages.
AleksandrLykov
Posts: 3
Joined: 27. Apr 2021, 07:49

Re: Virtual machine hang

Post by AleksandrLykov »

fth0 wrote: Do I understand correctly that you're trying to emulate 12 * 16 = 192 vCPUs on those 40/80 cores/threads?
Yep. I did not find anything in the documentation that says this is impossible. Is this a bad idea?
fth0 wrote: On the guest, you should reinstall the VirtualBox Guest Additions and pay attention to resulting error messages.
Reinstalled. No change.

Removed unnecessary sysctl parameters:

new sysctl.conf

Code:

vm.swappiness=5
vm.overcommit_memory=2
vm.overcommit_ratio=90
That did not help.
Could huge pages on the virtual machines be the problem? We use them for PostgreSQL.
AleksandrLykov
Posts: 3
Joined: 27. Apr 2021, 07:49

Re: Virtual machine hang

Post by AleksandrLykov »

I found new information.

Right before the machine freezes, the following appears in the logs:

Code:

06:53:50.683254 AHCI#0: Port 2 reset
06:53:51.971624 AHCI#0: Port 0 reset
06:53:51.975301 VD#0: Cancelling all active requests
06:53:51.975319 VD#0: Request{0x007fc2606b2580}:
06:53:51.975320     Type=FLUSH State=ACTIVE Id=0x13 SubmitTs=24855370 {30251} Flags=0x2
06:53:51.975323     Offset=0 Size=0 Left=0 BufSize=0
06:54:07.297692 AHCI#0: Port 0 reset
06:54:07.300109 VD#0: Cancelling all active requests
06:54:07.300146 VD#0: Request{0x007fc2606b2ac0}:
06:54:07.300150     Type=FLUSH State=ACTIVE Id=0x0 SubmitTs=24885937 {15008} Flags=0x2
06:54:07.300156     Offset=0 Size=0 Left=0 BufSize=0
06:54:07.300166 VD#0: Request{0x007fc2606b2580}:
06:54:07.300169     Type=FLUSH State=CANCELED Id=0x13 SubmitTs=24855370 {45575} Flags=0x2
06:54:07.300174     Offset=0 Size=0 Left=0 BufSize=0
06:54:22.690907 AHCI#0: Port 0 reset
06:54:22.692656 VD#0: Cancelling all active requests
06:54:22.692678 VD#0: Request{0x007fc2606b2ac0}:
06:54:22.692681     Type=FLUSH State=CANCELED Id=0x0 SubmitTs=24885937 {30401} Flags=0x2
06:54:22.692686     Offset=0 Size=0 Left=0 BufSize=0
06:54:22.692693 VD#0: Request{0x007fc2606b2200}:
06:54:22.692695     Type=FLUSH State=ACTIVE Id=0x0 SubmitTs=24901255 {15083} Flags=0x2
06:54:22.692697     Offset=0 Size=0 Left=0 BufSize=0
06:54:22.692703 VD#0: Request{0x007fc2606b2580}:
06:54:22.692704     Type=FLUSH State=CANCELED Id=0x13 SubmitTs=24855370 {60968} Flags=0x2
06:54:22.692706     Offset=0 Size=0 Left=0 BufSize=0
06:54:54.159033 AHCI#0: Port 0 reset
06:54:54.160737 VD#0: Cancelling all active requests
06:54:54.160751 VD#0: Request{0x007fc2606b2ac0}:
06:54:54.160752     Type=FLUSH State=CANCELED Id=0x0 SubmitTs=24885937 {61869} Flags=0x2
06:54:54.160754     Offset=0 Size=0 Left=0 BufSize=0
06:54:54.160757 VD#0: Request{0x007fc2606b2200}:
06:54:54.160758     Type=FLUSH State=CANCELED Id=0x0 SubmitTs=24901255 {46551} Flags=0x2
06:54:54.160759     Offset=0 Size=0 Left=0 BufSize=0
06:54:54.160775 VD#0: Request{0x007fc2606b2e40}:
06:54:54.160775     Type=FLUSH State=ACTIVE Id=0x0 SubmitTs=24916666 {31140} Flags=0x2
06:54:54.160776     Offset=0 Size=0 Left=0 BufSize=0
06:54:54.160779 VD#0: Request{0x007fc2606b2580}:
06:54:54.160780     Type=FLUSH State=CANCELED Id=0x13 SubmitTs=24855370 {92436} Flags=0x2
06:54:54.160781     Offset=0 Size=0 Left=0 BufSize=0
06:54:58.728794 VMMDev: vmmDevHeartbeatFlatlinedTimer: Guest seems to be unresponsive. Last heartbeat received 4 seconds ago
Then (apparently) the VM tries to reach the host machine:

Code:

08:13:57.642681 PDMR3Suspend: after 62763 ms, 2 cycles: 1 asynchronous tasks-ahci/0
08:13:58.712880 PDMR3Suspend: after 63834 ms, 3 cycles: 1 asynchronous tasks-ahci/0
08:13:59.762847 PDMR3Suspend: after 64884 ms, 4 cycles: 1 asynchronous tasks-ahci/0
08:14:00.812360 PDMR3Suspend: after 65933 ms, 5 cycles: 1 asynchronous tasks-ahci/0
08:14:01.859061 PDMR3Suspend: after 66980 ms, 6 cycles: 1 asynchronous tasks-ahci/0
08:14:02.951592 PDMR3Suspend: after 68072 ms, 7 cycles: 1 asynchronous tasks-ahci/0
08:14:03.858405 PDMR3Suspend: after 68979 ms, 8 cycles: 1 asynchronous tasks-ahci/0
08:14:04.854918 PDMR3Suspend: after 69976 ms, 9 cycles: 1 asynchronous tasks-ahci/0
08:14:05.858340 PDMR3Suspend: after 70979 ms, 10 cycles: 1 asynchronous tasks-ahci/0
08:14:06.859576 PDMR3Suspend: after 71980 ms, 11 cycles: 1 asynchronous tasks-ahci/0
08:14:07.859589 PDMR3Suspend: after 72980 ms, 12 cycles: 1 asynchronous tasks-ahci/0
08:14:08.858298 PDMR3Suspend: after 73979 ms, 13 cycles: 1 asynchronous tasks-ahci/0
08:14:09.858283 PDMR3Suspend: after 74979 ms, 14 cycles: 1 asynchronous tasks-ahci/0
08:14:10.864868 PDMR3Suspend: after 75986 ms, 15 cycles: 1 asynchronous tasks-ahci/0
08:14:11.861183 PDMR3Suspend: after 76982 ms, 16 cycles: 1 asynchronous tasks-ahci/0
08:14:12.858438 PDMR3Suspend: after 77979 ms, 17 cycles: 1 asynchronous tasks-ahci/0
08:14:13.858293 PDMR3Suspend: after 78979 ms, 18 cycles: 1 asynchronous tasks-ahci/0
...
Then, when I try to shut down the VM, I get the following:

Code:

08:21:41.120429 ERROR [COM]: aRC=VBOX_E_IPRT_ERROR (0x80bb0005) aIID={755e6bdf-1640-41f9-bd74-3ef5fd653250} aComponent={KeyboardWrap} aText={Could not send all scan codes to the virtual keyboard (VERR_PDM_NO_QUEUE_ITEMS)}, preserve=false aResultDetail=-2807
08:21:41.120532 ERROR [COM]: aRC=VBOX_E_IPRT_ERROR (0x80bb0005) aIID={755e6bdf-1640-41f9-bd74-3ef5fd653250} aComponent={KeyboardWrap} aText={Could not send all scan codes to the virtual keyboard (VERR_PDM_NO_QUEUE_ITEMS)}, preserve=false aResultDetail=-2807
08:21:41.120619 ERROR [COM]: aRC=VBOX_E_IPRT_ERROR (0x80bb0005) aIID={755e6bdf-1640-41f9-bd74-3ef5fd653250} aComponent={KeyboardWrap} aText={Could not send all scan codes to the virtual keyboard (VERR_PDM_NO_QUEUE_ITEMS)}, preserve=false aResultDetail=-2807
08:21:41.120679 ERROR [COM]: aRC=VBOX_E_IPRT_ERROR (0x80bb0005) aIID={755e6bdf-1640-41f9-bd74-3ef5fd653250} aComponent={KeyboardWrap} aText={Could not send all scan codes to the virtual keyboard (VERR_PDM_NO_QUEUE_ITEMS)}, preserve=false aResultDetail=-2807
08:21:41.120788 ERROR [COM]: aRC=VBOX_E_IPRT_ERROR (0x80bb0005) aIID={755e6bdf-1640-41f9-bd74-3ef5fd653250} aComponent={KeyboardWrap} aText={Could not send all scan codes to the virtual keyboard (VERR_PDM_NO_QUEUE_ITEMS)}, preserve=false aResultDetail=-2807
Please help me, what could be the problem?
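
To see how long each FLUSH request in the excerpt above stayed outstanding, a minimal Python sketch like the following could be run over VBox.log. It assumes the number in braces is the request's age in milliseconds; that is only an inference from the excerpt, where the value for request 0x13 grows from 30251 to 45575 while the log timestamps advance by about 15.3 seconds. The function name is hypothetical and the sample lines are copied from the excerpt.

```python
import re

# Matches request detail lines such as:
# 06:53:51.975320     Type=FLUSH State=ACTIVE Id=0x13 SubmitTs=24855370 {30251} Flags=0x2
REQUEST_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+)\s+Type=(?P<type>\w+)\s+State=(?P<state>\w+)"
    r"\s+Id=(?P<id>0x[0-9a-fA-F]+)\s+SubmitTs=(?P<submit>\d+)\s+\{(?P<age>\d+)\}"
)


def longest_outstanding(log_lines):
    """Map each (SubmitTs, Id) pair to the largest brace value logged for it.

    Assumption: the brace value is the request age in milliseconds.
    """
    worst = {}
    for line in log_lines:
        match = REQUEST_RE.match(line)
        if match:
            key = (int(match["submit"]), match["id"])
            worst[key] = max(worst.get(key, 0), int(match["age"]))
    return worst


# Sample lines copied from the log excerpt in this post.
sample = [
    "06:53:51.975320     Type=FLUSH State=ACTIVE Id=0x13 SubmitTs=24855370 {30251} Flags=0x2",
    "06:54:07.300150     Type=FLUSH State=ACTIVE Id=0x0 SubmitTs=24885937 {15008} Flags=0x2",
    "06:54:54.160780     Type=FLUSH State=CANCELED Id=0x13 SubmitTs=24855370 {92436} Flags=0x2",
]

result = longest_outstanding(sample)
for (submit_ts, request_id), age_ms in sorted(result.items()):
    print(f"SubmitTs={submit_ts} Id={request_id}: up to {age_ms} ms outstanding")
```

If that reading of the brace value is right, a flush that stays pending for tens of seconds points at the host storage path rather than at the guest itself.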
fth0
Volunteer
Posts: 5668
Joined: 14. Feb 2019, 03:06
Primary OS: Mac OS X other
VBox Version: PUEL
Guest OSses: Linux, Windows 10, ...
Location: Germany

Re: Virtual machine freezes

Post by fth0 »

AleksandrLykov wrote:
fth0 wrote: Do I understand correctly that you're trying to emulate 12 * 16 = 192 vCPUs on those 40/80 cores/threads?
Yep. I did not find anything in the documentation that says this is impossible. Is this a bad idea?
I cannot really answer that question, but I'd only say that it was a good idea after seeing it working for myself. ;)

Only very few VirtualBox users coming to the VirtualBox forums have server-grade hardware with many CPU cores, so we have gained very little experience with such setups (*). Typical problems reported were CPU performance scalability and disk I/O issues (and configuration errors).

Regarding the CPU topic, what you're trying to do is somewhat comparable to simultaneously running 192 processes on your host OS. That's approximately 5 processes per CPU core, or a mean CPU core usage of 20% per process. Note that Intel once estimated the performance gain of hyper-threading to be up to 30% (only), so IMHO a conservative calculation should ignore hyper-threading and consider it as an emergency reserve. In addition to that, VirtualBox will create and run a few hundred additional processes on your host OS to emulate the 12 VMs. Up to here, I have only considered the CPU resources. Then there is memory throughput, disk I/O throughput and latency, and so on ... How did you estimate the resources that your host needs?
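
The arithmetic behind those numbers can be written out as a small Python sketch. The VM count, vCPUs per VM, and core/thread counts are the figures quoted in this thread; the function name is made up for illustration, and the result is only a rough estimate that ignores VirtualBox's own helper processes.

```python
def vcpus_per_unit(vms: int, vcpus_per_vm: int, units: int) -> float:
    """Return how many vCPUs compete for each physical core (or hardware thread)."""
    return (vms * vcpus_per_vm) / units


VMS = 12            # virtual machines on the host
VCPUS_PER_VM = 16   # vCPUs assigned to each VM
CORES = 40          # 2 x Xeon Gold 6230, physical cores
THREADS = 80        # same CPUs, counting hyper-threading

total_vcpus = VMS * VCPUS_PER_VM
per_core = vcpus_per_unit(VMS, VCPUS_PER_VM, CORES)
per_thread = vcpus_per_unit(VMS, VCPUS_PER_VM, THREADS)
mean_core_share = 1 / per_core  # fraction of one core available per vCPU

print(f"total vCPUs: {total_vcpus}")
print(f"vCPUs per physical core: {per_core:.1f}")
print(f"vCPUs per hardware thread: {per_thread:.1f}")
print(f"mean core share per vCPU: {mean_core_share:.0%}")
```

Counting only physical cores gives the conservative figure; the per-thread figure is the optimistic bound if hyper-threading is counted in full.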

Regarding your potential disk I/O issues, you could check whether enabling Storage > Controller: SATA > Use Host I/O Cache makes any difference.
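
With 12 VMs it may be easier to script that setting than to click through the GUI. The Python sketch below only builds and prints the VBoxManage storagectl command lines and does not run them. The VM name is guessed from the attached .vbox file name, and the controller name "SATA" is an assumption (it has to match the controller name shown in each VM's storage settings). Storage controller settings can usually only be changed while the VM is powered off.

```python
import shlex
from typing import List


def host_io_cache_command(vm_name: str, controller: str = "SATA",
                          enable: bool = True) -> List[str]:
    """Build the VBoxManage call that toggles the host I/O cache for one controller."""
    return [
        "VBoxManage", "storagectl", vm_name,
        "--name", controller,
        "--hostiocache", "on" if enable else "off",
    ]


# Hypothetical VM list; only the first name is suggested by the thread.
vm_names = ["KartyFarmani_PG"]

commands = [host_io_cache_command(name) for name in vm_names]
for command in commands:
    # Print instead of executing, so the lines can be reviewed first.
    print(shlex.join(command))
```

Running VBoxManage showvminfo on a VM first shows the actual controller names to pass in.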

(*) Note that the forum volunteers and moderators are mostly not affiliated with Oracle or the VirtualBox development team, and we therefore have no knowledge of the VirtualBox experience that Oracle customers, who have their own support channels, have gained with large-scale setups.