I ran some network throughput benchmarks today (mainly interested in maximum packets per second) and noticed a large discrepancy between a VM assigned one CPU and a VM assigned two or more CPUs. My setup (all 64-bit):
Host: Ubuntu 11.04, kernel 2.6.38-13-generic #53-Ubuntu SMP; 3 GHz Intel i7 950 (4 physical cores + HT); 12 GB RAM; Ethernet: Intel 82574L Gigabit
Guest: Ubuntu 11.10, kernel 3.0.0-12-server; 1 GB RAM assigned
VirtualBox version: 4.1.8
Below are the results of some basic testing with netserver/netperf against a bridged network interface (bridged to the Intel NIC above). The following commands were run on the guest against its own eth0 address (not the loopback):
netserver -4 (starts an IPv4 TCP/UDP server)
netperf -H <IP_address_of_eth0> -t TCP_CRR (runs a TCP connect/request/response transaction benchmark)
1-CPU VM: ~17-18k transactions per second
2-CPU VM with eth0 interrupts and netserver/netperf all pinned to the same core
Confirmed very few scheduling interrupts during the benchmark (by watching /proc/interrupts).
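To make the pinning and the /proc/interrupts check in this scenario repeatable, here is a minimal shell sketch, assuming a Linux guest with a POSIX sh and awk. The helper names pin_irq_to_cpu, irq_counts and irq_delta are hypothetical, and the optional path arguments exist only so the helpers can be exercised against copies of the /proc files. On a real guest, writing smp_affinity needs root.

```shell
#!/bin/sh
# Sketch: pin an IRQ to one CPU and compare per-CPU interrupt counts
# between two /proc/interrupts snapshots. Helper names are hypothetical.

# pin_irq_to_cpu IRQ CPU [PROC_DIR]
# smp_affinity takes a hex CPU bitmask: CPU0 -> 1, CPU1 -> 2, CPU3 -> 8.
pin_irq_to_cpu() {
    proc=${3:-/proc}
    printf '%x\n' $((1 << $2)) > "$proc/irq/$1/smp_affinity"
}

# irq_counts LABEL [FILE]
# Print the per-CPU counts of the first row whose first field is "LABEL:"
# (e.g. RES) or whose last field is LABEL (e.g. eth0).
irq_counts() {
    awk -v label="$1" '
        NR == 1 { ncpu = NF; next }   # header row: one "CPUn" column per CPU
        $1 == (label ":") || $NF == label {
            for (i = 2; i <= ncpu + 1; i++)
                printf "%s%s", $i, (i <= ncpu ? " " : "\n")
            exit
        }' "${2:-/proc/interrupts}"
}

# irq_delta LABEL BEFORE_FILE AFTER_FILE
# Print the per-CPU increase in LABEL's counts between the two snapshots.
irq_delta() {
    printf '%s\n%s\n' "$(irq_counts "$1" "$2")" "$(irq_counts "$1" "$3")" |
        awk 'NR == 1 { for (i = 1; i <= NF; i++) b[i] = $i; next }
             { for (i = 1; i <= NF; i++) printf "%d%s", $i - b[i], (i < NF ? " " : "\n") }'
}
```

For example: read the eth0 IRQ number from the first column of its /proc/interrupts row, run pin_irq_to_cpu <irq> 0 as root, start the benchmark with taskset -c 0 netserver -4 and taskset -c 0 netperf -H <IP_address_of_eth0> -t TCP_CRR, and bracket the run with cat /proc/interrupts > /tmp/before and cat /proc/interrupts > /tmp/after. Then irq_delta RES /tmp/before /tmp/after shows the rescheduling interrupts each CPU took during the run. If irqbalance is running it may rewrite the affinity mask, so it may need to be stopped first.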
2-CPU VM with the second core disabled via hotplugging
Disabled the second CPU with echo '0' > /sys/devices/system/cpu/cpu1/online and confirmed this via /proc/interrupts and other system tools.
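The hotplug step can be wrapped so that it also verifies the result. This is a minimal sketch, assuming a Linux guest; set_cpu_online is a hypothetical helper, and its optional third argument only exists so it can be run against a copy of the sysfs tree. On a real guest it needs root, and cpu0 usually cannot be taken offline.

```shell
#!/bin/sh
# Sketch: take a CPU offline (0) or bring it back online (1) through the
# sysfs hotplug interface, then read the flag back to confirm the change.

# set_cpu_online CPU 0|1 [SYSFS_CPU_DIR]
set_cpu_online() {
    dir=${3:-/sys/devices/system/cpu}
    case $2 in 0|1) ;; *) echo "state must be 0 or 1" >&2; return 2 ;; esac
    echo "$2" > "$dir/cpu$1/online" || return 1
    # Succeeds only if the online flag now holds the requested state.
    [ "$(cat "$dir/cpu$1/online")" = "$2" ]
}
```

Running set_cpu_online 1 0 on a 2-vCPU guest should leave /sys/devices/system/cpu/online listing only CPU 0, which is a quick cross-check alongside /proc/interrupts.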
It's also worth noting that the same test on the host yields around 26k TPS. Netfilter/conntrack is disabled on both host and guest.
So even with the second CPU disabled, I'm seeing roughly a 50% performance drop compared to the single-CPU VM. Results with more than two CPUs were very similar to the 2-CPU case.
I'd like to understand why this happens (I'm certainly no virtualization expert). Are additional extensions or emulation layers loaded when starting a multicore guest? I had a quick look at VBox.log, and the main difference I noticed was that HwVirtExtForced is enabled when more than one CPU is configured. Could this be the cause of the degradation, and if so, where can I read more about these extensions?
Any insight greatly appreciated.
I repeated the same test on an OS X 10.6 host with similar hardware (quad-core Intel i7), also running VirtualBox 4.1.8, and got the same results.
I also extended the test to something CPU-bound and ran a single-threaded Linpack benchmark; its results are unaffected by the number of vCPUs (which is good). I then ran a disk read benchmark using hdparm, which was likewise unaffected, so the problem seems confined to network performance for now.