
Mass starting of virtual machines causes weird DHCP issues

Posted: 30. Jun 2010, 18:28
by nmitchell
Everyone,

I'm attempting to set up a test system for boot management software that I'm developing. My goal is to set up ten virtual guests that can be used to simulate a cluster to be managed.

I have been able to successfully do this with the following environment:
  * All machines are on vboxnet0 with host-only networking, host address 192.168.56.1
  * All host services are running on interface 192.168.56.1
  * A conserver daemon is running, allowing console access to each VM
  * A DHCP server is running on the host with static blocks for each VM
  * TFTP and HTTP daemons are running, providing data for the VMs
  * and finally the boot management software is running on the host
Now the problem seems to occur when I attempt to boot all ten VMs at once. They all start successfully and begin the PXE process. They all issue DHCP requests and receive addresses from my server. Then they TFTP the gPXE boot ROM and retrieve the boot script from my management software. This script causes them to run another DHCP request. They all begin issuing requests, but no response ever comes. The DHCP server logs show both the original boot requests and the gPXE requests. Offers are made for both of them, but the second time around no VM ever gets as far as an ACK. I then attached a trace to the VMs' NICs and saw that while the VMs made the requests, they never saw any offer packets, leading me to believe that gPXE is not at fault here.
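For anyone wanting to check the same symptom in their own server logs, here is a minimal sketch that counts DHCP messages per client MAC and lists clients that were sent more offers than they followed up on. It assumes ISC dhcpd-style log lines (the thread does not say which DHCP server is in use), and the sample MAC and addresses are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: ISC dhcpd-style log lines such as
#   DHCPDISCOVER from 08:00:27:aa:bb:01 via vboxnet0
#   DHCPOFFER on 192.168.56.101 to 08:00:27:aa:bb:01 via vboxnet0
# Other DHCP servers log differently and would need another pattern.
LINE_RE = re.compile(
    r"(DHCPDISCOVER|DHCPOFFER|DHCPREQUEST|DHCPACK)\b.*?"
    r"\b(?:from|to) ((?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)

# Hypothetical sample: the first exchange completes, the second stops
# after the offer, which is the pattern described above.
SAMPLE_LOG = [
    "DHCPDISCOVER from 08:00:27:aa:bb:01 via vboxnet0",
    "DHCPOFFER on 192.168.56.101 to 08:00:27:aa:bb:01 via vboxnet0",
    "DHCPREQUEST for 192.168.56.101 from 08:00:27:aa:bb:01 via vboxnet0",
    "DHCPACK on 192.168.56.101 to 08:00:27:aa:bb:01 via vboxnet0",
    "DHCPDISCOVER from 08:00:27:aa:bb:01 via vboxnet0",
    "DHCPOFFER on 192.168.56.101 to 08:00:27:aa:bb:01 via vboxnet0",
]


def count_messages(lines):
    """Return {mac: {message_type: count}} for every recognised log line."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group(2).lower()][match.group(1).upper()] += 1
    return counts


def stalled_clients(counts):
    """MACs that were sent more offers than they answered with requests."""
    return sorted(
        mac for mac, c in counts.items()
        if c["DHCPOFFER"] > c["DHCPREQUEST"]
    )

# Usage: stalled_clients(count_messages(SAMPLE_LOG))
# or pass the lines of a real log file instead of SAMPLE_LOG.
```

A client showing up in the result only says the offer was logged as sent and never answered; it does not say where the offer was lost.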

What leads me to believe this is some sort of resource contention is the fact that if I only start one VM, everything works perfectly: the first and second DHCP requests both go through with no problems.

Has anyone else seen similar behavior with VirtualBox, many VMs, and network booting?

-- Nathan

Re: Mass starting of virtual machines causes weird DHCP issues

Posted: 30. Jun 2010, 22:16
by fixedwheel
nmitchell wrote:Now the problem seems to occur when I attempt to boot all ten vms at once. ...

What leads me to believe this is some sort of resource contention ...
What CPU do you have, and how many?

Re: Mass starting of virtual machines causes weird DHCP issues

Posted: 30. Jun 2010, 22:23
by nmitchell
It's a fairly low-budget machine. It has a single Intel Core2 Duo, E4600, 2.4 GHz. The hardware virtualization features were disabled by HP, but I was hoping that wouldn't manifest any problems until later in the boot sequence. Do you think it's an interrupt issue, or is the virtualization hardware far more important than I thought it was? The long-term goal was, once it was up and running, to image the machine and move it to a more powerful platform. Perhaps I'll need to do that sooner rather than later.

-- Nathan

Re: Mass starting of virtual machines causes weird DHCP issues

Posted: 30. Jun 2010, 22:32
by Sasquatch
Yes, it does sound like an out-of-resources error. Have you ever checked your host's CPU usage? Whenever a VM boots, it can be close to 100%. Now imagine if you do that 10 times. That means that you have 5 VMs per core and that is overkill. You can run two at a time at best without slowing things down (as long as the Host stays idle). Due to the heavy CPU abuse, the packets that offer the IP address are lost on the way. That's what I think is happening. Get a system with a lot more CPU power or start one or two VMs at a time.

Re: Mass starting of virtual machines causes weird DHCP issues

Posted: 30. Jun 2010, 22:40
by fixedwheel
Sasquatch wrote:...(as long as the Host stays idle).
but Nathan's host has to do some work at the same time: DHCP server and the like, and that is what fails

Re: Mass starting of virtual machines causes weird DHCP issues

Posted: 30. Jun 2010, 22:46
by Sasquatch
fixedwheel wrote:
Sasquatch wrote:...(as long as the Host stays idle).
but Nathan's host has to do some work at the same time: DHCP server and the like, and that is what fails
That's why I said 'run two at most without slowing things down'. This means one core per VM. When you run more VMs than you have cores, things will slow down. DHCP needs almost no CPU, so having 2 VMs running at the same time does not interfere in any way. Having 10 is another story.

Re: Mass starting of virtual machines causes weird DHCP issues

Posted: 30. Jun 2010, 22:55
by fixedwheel
Sasquatch wrote:That's why I said ...
sorry, yes: full ACK to what you said

Still, I think it could be a bit more failsafe to run the DHCP, TFTP and other services off an external box.

Re: Mass starting of virtual machines causes weird DHCP issues

Posted: 30. Jun 2010, 22:56
by nmitchell
Yeah, that's what I thought.

However, I have observed a strange behavior, and maybe this also explains it.

I tried ramping up the VMs from one, to two, etc...

One always starts fine.
Two sometimes works; other times one VM will go and the other hangs.
By three and higher I always get failures, though sometimes one or two VMs make it.

So I tried dropping to the gPXE command line on one that failed and waiting for several minutes until the others had settled down.
After repeated attempts, I never got it to successfully get an address from DHCP. It seems that if it fails once, it won't ever work until it is reset.

Does this make any sense?

-- Nathan

Re: Mass starting of virtual machines causes weird DHCP issues

Posted: 30. Jun 2010, 23:02
by Sasquatch
It does make sense, somewhat. It would be interesting if you could test it by booting them one at a time, starting the next one only once the previous one settles down (or just pause it when it's about to go/already going). That should give the DHCP ACK enough resources to complete.

Re: Mass starting of virtual machines causes weird DHCP issues

Posted: 30. Jun 2010, 23:04
by nmitchell
Also, are there any log files that might indicate that packets are being dropped? The DHCP logs state that the offer was sent, but it never makes it to the VM's NIC. I'm hoping that there is something in between I can look at to find where things go wrong.
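One way to narrow down where the offers vanish is to capture DHCP traffic at two points (for example on vboxnet0 and at the guest NIC) and compare which message types appear at each. The sketch below only covers the decoding step: given a raw UDP payload, however it was captured, it reads the client MAC and the DHCP message type (option 53) following the BOOTP/DHCP layout from RFC 2131 and RFC 2132. The helper that builds a test payload and the MAC it uses are purely illustrative.

```python
DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"
BOOTP_FIXED_LEN = 236  # fixed BOOTP header size before the magic cookie
MESSAGE_TYPES = {
    1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
    5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM",
}


def dhcp_message_type(payload):
    """Return the DHCP message type name from a UDP payload, or None."""
    start = BOOTP_FIXED_LEN + len(DHCP_MAGIC_COOKIE)
    if len(payload) < start:
        return None
    if payload[BOOTP_FIXED_LEN:start] != DHCP_MAGIC_COOKIE:
        return None
    i = start
    while i < len(payload):
        code = payload[i]
        if code == 0:        # pad option, single byte
            i += 1
            continue
        if code == 255:      # end option
            break
        if i + 1 >= len(payload):
            break            # truncated option header
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == 53 and len(value) == 1:
            return MESSAGE_TYPES.get(value[0])
        i += 2 + length
    return None


def client_mac(payload):
    """Return the client hardware address (chaddr field) as a string."""
    if len(payload) < BOOTP_FIXED_LEN:
        return None
    hlen = min(payload[2], 16)  # hlen field, chaddr is at most 16 bytes
    return ":".join(f"{b:02x}" for b in payload[28:28 + hlen])


def build_test_payload(mac, msg_type):
    """Illustrative helper: a minimal DHCP payload for trying the parser."""
    header = bytearray(BOOTP_FIXED_LEN)
    header[0] = 2 if msg_type in (2, 5, 6) else 1  # op: reply or request
    header[1] = 1                                  # htype: Ethernet
    header[2] = 6                                  # hlen: 6-byte MAC
    header[28:34] = mac                            # chaddr
    options = bytes([53, 1, msg_type, 255])
    return bytes(header) + DHCP_MAGIC_COOKIE + options
```

If OFFER payloads for a given MAC show up at the host-side capture point but not at the guest side, the loss is somewhere between the two, which is the gap being asked about here.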

-- Nathan

Re: Mass starting of virtual machines causes weird DHCP issues

Posted: 30. Jun 2010, 23:11
by Sasquatch
Unless the internal DHCP is used, it probably doesn't show in the normal VM logs. You can always strace the VM process, write its output to a file, and stop the VM when you encounter the problem so you can examine the strace output.

Re: Mass starting of virtual machines causes weird DHCP issues

Posted: 30. Jun 2010, 23:29
by nmitchell
Okay, ran the experiment. It does appear to make a significant difference if I wait a few moments before starting the next VM. If I wait until the DHCP process is over and the machine is fetching the kernel/initrd, I can start the next VM with no problems. The CPU usage is still 100% the entire time, however. Each VM just gets smaller and smaller slices, but they all work this way.
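The manual staggering described above could be scripted roughly as follows. This is only a sketch: the VM names and the 20-second delay are placeholders rather than values from this setup, and a fixed delay is a crude stand-in for "wait until DHCP has finished". It relies on `VBoxManage startvm <name> --type headless` to launch each guest.

```python
import subprocess
import time


def start_staggered(vm_names, delay_s, run=subprocess.run, sleep=time.sleep):
    """Start each VM in turn, pausing delay_s seconds between launches.

    run and sleep are injectable so the function can be tried out
    without VirtualBox installed.
    """
    started = []
    for index, name in enumerate(vm_names):
        if index > 0:
            sleep(delay_s)  # give the previous VM time to get past DHCP
        run(["VBoxManage", "startvm", name, "--type", "headless"], check=True)
        started.append(name)
    return started

# Usage (hypothetical VM names, delay chosen by trial and error):
#   start_staggered([f"node{i:02d}" for i in range(1, 11)], delay_s=20)
```

The delay would need tuning per host; too short and the same DHCP failures should reappear.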

I think I will need to move to beefier hardware though. My boot management software doesn't support the fine-grained staggering that this process needs. If I enabled boot staggering in my software, one VM would need to fully build before the next would start. It's unfortunate, but I need more 'cluster'-like performance out of my VM farm... Hopefully an 8-core machine will be more capable of handling the load.

As for the strace, do you think I should strace the dhcpd server, or would the drops be happening in the kernel driver that's running the virtual interface? I'm not sure how to debug that unless I wanted to add debugging to the kernel module code and rebuild it.

Re: Mass starting of virtual machines causes weird DHCP issues

Posted: 3. Jul 2010, 12:19
by Sasquatch
Since the DHCP server does see the request and sends a reply, it's best to strace the VM; hopefully it will show network data too, in one way or another.