Performance decrease when running 30+ VMs simultaneously

Posted: 31. Jan 2017, 19:12
by wbenny
Hi there,
Let me describe the host machine first, so you don't think I'm trying some nonsense on a low-cost laptop.
It is a blade server with:
- 2x Xeon E5-2699 v3
- 630 GB RAM (2133 MHz)

Part of the RAM (200 GB) is used as a RamDisk (tmpfs).
Currently I'm using VirtualBox 5.1.8 (r111374)
I have created 2 virtual machines, Win7x86 & Win7x64. They have the same settings:
- 2 CPU cores
- 4 GB RAM
- "paravirtualization interface" set to "legacy", because I've created them in VBox 4.x before
- host-only net adapter

Everything else is left as default.

Each of the machines is link-cloned 40x. All of the 82 machines reside in the RamDisk.
I run and use only the cloned machines.
They are part of a dynamic malware analysis framework called cuckoo ( cuckoosandbox org ), you might've heard of it.
Because of that, the machines are frequently in the cycle of start, poweroff, revert.

Now to the problem - I'm experiencing a significant performance decrease over a few minutes.
At first, when every VM is powered off and the framework starts them all at once, there is apparently no problem. They run smoothly.
Average time of the "startvm" command is about 7-10 seconds.
A few minutes and a few reverts/starts later, the "startvm" command starts to struggle and its duration goes up to minutes; 6-10 minutes is not rare.
The number of concurrently running machines drops from 80 to 15 (sometimes even fewer, like 6).
More strangely, the CPU is literally bored during those long lasting "startvm" commands, as the load is at ~12.
Note that no disk I/O is performed (I'm monitoring it), everything is happening in the RAM.
Another observation: when the duration of "startvm" goes up to a few minutes, the waiting machines all finish starting at once.
For example: 10 machines are running, 50 machines are in the "startvm" state, and after 6 minutes 30 of them suddenly start.
I have no idea why this is happening. It looks to me like some mutex/lock (in VBoxSVC?) is preventing them from starting.
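
A minimal sketch of how the "startvm" durations described above could be recorded, assuming Python 3 on the host. The `--type headless` option and the helper names `startvm_command` and `time_command` are assumptions for illustration, not part of the setup described in this post.

```python
import subprocess
import time


def startvm_command(vm_name):
    """Build the VBoxManage call for one VM (headless start is an assumption)."""
    return ["VBoxManage", "startvm", vm_name, "--type", "headless"]


def time_command(cmd):
    """Run cmd and return (elapsed_seconds, exit_code)."""
    start = time.monotonic()  # monotonic clock is unaffected by wall-clock changes
    completed = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.monotonic() - start, completed.returncode
```

Calling `time_command(startvm_command("some-clone"))` for each clone (the VM name here is hypothetical) and logging the result with a timestamp would show when the 7-10 second starts turn into multi-minute ones.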

Has anyone else experienced - or, even better, solved - this problem? Or is it a limitation of, or a bug in, VirtualBox?

If anyone is interested in more details, I'd be pleased to provide them.

Re: Performance decrease when running 30+ VMs simultaneously

Posted: 31. Jan 2017, 20:15
by socratis
When you notice this, pick one VM that is slow. Shut it down. Right-click on the VM in the VirtualBox Manager. Select "Show Log...". Save it (just the first log), ZIP it and attach it in your response (see the "Upload attachment" tab below the reply form).

Re: Performance decrease when running 30+ VMs simultaneously

Posted: 31. Jan 2017, 22:45
by wbenny
Thanks for your response. But the issue is not that the VMs are slow; they perform just fine. The problem is that the process of starting them is slow.
Does there exist any way of enabling some detailed log of a VM being started?

Re: Performance decrease when running 30+ VMs simultaneously

Posted: 31. Jan 2017, 23:40
by socratis
wbenny wrote: Does there exist any way of enabling some detailed log of a VM being started?
Plenty. You wanna start with that? No you don't. You start with baby steps. I want to see for example what's taking so much time for the VM to start-up, compared to other VMs. Some sort of I/O? Just because it's a RAM disk doesn't mean there's no congestion...

Re: Performance decrease when running 30+ VMs simultaneously

Posted: 1. Feb 2017, 12:58
by frank
Could you provide a VBox.log file of one VM you are running? And a few words about the host and your VM, i.e. what workload etc?

Re: Performance decrease when running 30+ VMs simultaneously

Posted: 1. Feb 2017, 13:20
by klaus
You spent so much effort on the problem description (which is very clear, thanks!), but you forgot to say a word about your host OS (and the exact VirtualBox package flavor - both pieces of information we could take from VBox.log). It plays a big role in finding out which tools and processes are suitable.

Reading between the lines I guess it's some Linux kernel with an unknown distro - so it'd be useful to get an overview of the CPU usage with 'top' (so that we can tell if the service processes are showing unusual load - the VMs themselves wouldn't be that interesting initially, I'm after CPU usage of VBoxSVC and VBoxXPCOMIPCD). Next would be getting a core of VBoxSVC when it shows the "twiddling thumbs" behavior, using the gcore utility which usually comes with gdb. gcore can get a core dump without crashing the process. I would hope that from the core we can make some guess what's causing the delay...
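
A rough sketch of the gcore step suggested above, assuming Python 3, `pgrep` and `gcore` (from gdb) on the host, and enough permissions to attach to the processes. The helper names and the `/tmp` output directory are hypothetical.

```python
import subprocess


def find_pids(process_name):
    """Return PIDs whose process name matches exactly (pgrep -x)."""
    result = subprocess.run(
        ["pgrep", "-x", process_name], capture_output=True, text=True
    )
    return [int(pid) for pid in result.stdout.split()]


def gcore_command(pid, out_prefix):
    """Build the gcore call; gcore writes the dump to '<out_prefix>.<pid>'."""
    return ["gcore", "-o", out_prefix, str(pid)]


def dump_service(process_name, out_dir="/tmp"):
    """Dump every matching process without killing it; return the core paths."""
    dumped = []
    for pid in find_pids(process_name):
        prefix = f"{out_dir}/{process_name}.core"
        subprocess.run(gcore_command(pid, prefix), check=True)
        dumped.append(f"{prefix}.{pid}")
    return dumped
```

Running `dump_service("VBoxSVC")` and `dump_service("VBoxXPCOMIPCD")` while the starts are stalled would capture both service processes at the interesting moment.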

Re: Performance decrease when running 30+ VMs simultaneously

Posted: 1. Feb 2017, 16:10
by wbenny
Host OS: Ubuntu Server 16.04.1 LTS
Kernel: 4.4.0-59-generic

I'm sending logs & videos recorded at the moment this occurred.
In the log files I've noticed that there is always a lag between these lines:

00:00:00.025792 SUP: Opened VMMR0.r0 (/usr/lib/virtualbox/VMMR0.r0) at 0xffffffffc0a53020.
00:00:49.010595 Guest OS type: 'Windows7_64'

In the videos you can see times of startvm/snapshot restorecurrent commands, htop with VBoxSVC & VBoxXPCOMIPCD visible, atop with nearly zero I/O, and VMState of the machines.

A dump of VBoxXPCOMIPCD is attached. The dump of VBoxSVC isn't included, because it is 3.2 GB.
If you'd like to get it, just tell me and I will upload it.

Link: mega.nz/#!UpAQiB6a!DeJW89No-5efWbAXEM6Q-gfrJZo3Jw-8F5LQMW9roIQ

Re: Performance decrease when running 30+ VMs simultaneously

Posted: 2. Feb 2017, 17:24
by klaus
Very interesting observation... the "SUP: ..." line is from relatively early in creating the VM machinery, and the "Guest OS Type: ..." line happens a bit later, after some internal API calls to VBoxSVC. So VBoxSVC clearly plays a significant role.
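
To check whether the same gap shows up in every affected VBox.log, the largest jumps between consecutive timestamps can be listed automatically. This is a minimal sketch, assuming Python 3 and the `HH:MM:SS.ffffff` prefix seen in the two lines quoted earlier; the function names are hypothetical.

```python
import re

# Matches the relative timestamp at the start of a VBox.log line.
TIMESTAMP = re.compile(r"^(\d+):(\d{2}):(\d{2})\.(\d+)\s")


def parse_seconds(line):
    """Return the line's timestamp in seconds, or None if it has none."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    hours, minutes, seconds, fraction = match.groups()
    return (int(hours) * 3600 + int(minutes) * 60 + int(seconds)
            + float("0." + fraction))


def largest_gaps(lines, top=3):
    """Return up to `top` (gap_seconds, earlier_line, later_line) tuples."""
    gaps = []
    prev_time, prev_line = None, None
    for line in lines:
        current = parse_seconds(line)
        if current is None:
            continue  # skip continuation lines without a timestamp
        if prev_time is not None:
            gaps.append((current - prev_time, prev_line.rstrip(), line.rstrip()))
        prev_time, prev_line = current, line
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]
```

Passing an open log file, for example `largest_gaps(open("VBox.log", errors="replace"))`, should put the pause between the "SUP: Opened VMMR0.r0" and "Guest OS type" lines at the top if that is where the time goes.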

Yes, we need the core of VBoxSVC, there's no other good option to get a clear picture. 3GB is a bit bigger than what I'd expect from a setup which uses a relatively low number of VMs, but maybe it's due to your scripts beating relatively hard on the API. Let's not assume straight away it's a memory leak (which we couldn't easily investigate directly anyway, at least in the general case). From past experience a core of VBoxSVC compresses very well, so the 7z file shouldn't be big.

The alternative to investigating the core would be to enable API call logging, which would decrease performance further and would produce gigantic VBoxSVC.log files which means we'd have to find a needle in a haystack.

From your videos I can't spot anything totally unusual. VBoxXPCOMIPCD consumes a little more CPU time than I would expect (which hints that there's a lot of API activity). VBoxSVC CPU usage is usually not high, but goes into the 30% region every now and then. Again a little high, but not immediately hinting that there's a bottleneck.

How are you triggering the snapshot resets/VM starts? Using VBoxManage? Or directly using the API from some scripting language? If it's the latter you might be polling too frequently in some places, slowing down progress.

Re: Performance decrease when running 30+ VMs simultaneously

Posted: 2. Feb 2017, 19:31
by wbenny
Communication with VirtualBox is made entirely via VBoxManage.

https://mega.nz/#!Vth1wIDL!JU_wquF8UHTu ... HEBV31vSI4

In the "d1" folder, there are dumps of VBoxSVC (the bigger file) and VBoxXPCOMIPCD (the smaller file) which I took during the video recording posted previously.
In the "d2" folder there is a dump of VBoxSVC which I took a few seconds before the attached screenshot was made.
The 23rd core is high because 7z is compressing the dumps.

I find it noteworthy that the Load went up to 450-650 several times while CPUs were almost idling.
I'm not implying that this might be unusual, but I'm trying to give you any info I find relevant.
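
For context on the load figures: on Linux the load average counts tasks in uninterruptible sleep (state D) in addition to runnable ones, so a very high load with idle CPUs usually means many tasks are blocked inside the kernel rather than computing. A rough sketch, assuming Python 3 and a Linux /proc filesystem, that tallies task states; the function names are hypothetical.

```python
import glob
from collections import Counter


def parse_state(stat_line):
    """Extract the state letter from the contents of a /proc stat file.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so the state is the first field after the LAST ')'.
    """
    return stat_line.rpartition(")")[2].split()[0]


def count_task_states(proc_root="/proc"):
    """Count threads per scheduler state (R, S, D, ...) across all processes."""
    states = Counter()
    for path in glob.glob(f"{proc_root}/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as stat_file:
                states[parse_state(stat_file.read())] += 1
        except (OSError, IndexError):
            continue  # the task exited while we were reading it
    return states
```

Sampling `count_task_states()["D"]` during a stall and comparing it with the reported load would show whether the 450-650 figure is made up of blocked tasks.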

Re: Performance decrease when running 30+ VMs simultaneously

Posted: 2. Feb 2017, 20:48
by klaus
Will see when we find time to dig through the cores. The 2nd VBoxSVC core is even bigger, which adds a little plausibility to the theory that you might be triggering a memory leak (but memory leaks shouldn't immediately result in performance decreases, which makes this investigation less urgent).