VirtualBox and Intel Nehalem/Westmere + Intel SMT
Posted: 6. Jun 2010, 18:08
I did a bit of benchmarking, testing and tuning of VirtualBox and how it interacts with Intel Nehalem and Westmere plus Intel SMT (simultaneous multithreading, e.g. Hyper-Threading).
I did so on the following hardware and OS combinations:
Sun/Oracle x6270 Blades - 2x Intel Nehalem X5570 = 8 Physical, 16 Virtual Cores + OpenSUSE 11.1/11.2, SLES 11, Oracle Linux 5.5, CentOS 5.4/5.5, Windows Server 2008 R2 x64
Sun/Oracle x4170 Servers - 2x Intel Nehalem X5570 = 8 Physical, 16 Virtual Cores + OpenSUSE 11.1/11.2, SLES 11, Oracle Linux 5.5, CentOS 5.4/5.5, Windows Server 2008 R2 x64
HP Z800 Workstations - 2x Intel Nehalem W5580/X5570 = 8 Physical, 16 Virtual Cores + OpenSUSE 11.1/11.2, SLES 11, Oracle Linux 5.5, CentOS 5.4/5.5, Windows Server 2008 R2 x64
HP Z800 Workstations - 2x Intel Westmere = 12 Physical, 24 Virtual Cores + OpenSUSE 11.1/11.2, SLES 11, Oracle Linux 5.5, CentOS 5.4/5.5, Windows Server 2008 R2 x64
One common thread between all of these HW configurations is that the "Optimal Defaults" in the system BIOS enable SMT. Benchmarking and testing outside of VirtualBox yielded an overall observation: unless an application was heavily threaded or otherwise written to take advantage of a multi-core architecture, there were no benefits to be had. However, having Intel SMT on did not hurt the system's baseline performance or existing applications that were single-threaded or not multi-core aware. Sparing the details, actual performance differences across the board really boiled down to the efficiency of the OS scheduler itself.
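For reference, here is a quick sanity check I would run on the Linux hosts to confirm the BIOS SMT setting actually took effect (a minimal sketch, assuming a kernel that exposes /proc/cpuinfo):

    # If "siblings" is twice "cpu cores", SMT/Hyper-Threading is enabled
    # in the BIOS; if the two values match, it is off.
    grep -E 'siblings|cpu cores' /proc/cpuinfo | sort -u

    # Count the logical processors the scheduler sees (16 on the Nehalem
    # boxes above with SMT on, 8 with it off).
    grep -c '^processor' /proc/cpuinfo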
Taking this data into account, much the same behavior carried over to multi-core guests. As expected, I found that much like other hypervisors (e.g. KVM/VMware), one can always "over-subscribe" cores, real or virtual, as sketched below. Just bear in mind that in doing so, your mileage may vary on how CPU cycles are actually shared between the guests, based mostly on the ability of your host OS scheduler. The scheduler's overall awareness of the underlying NUMA architecture and its ability to context switch and efficiently schedule between cores will be your main focus when over-subscribing cores, even in the case of Intel SMT.
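As a concrete sketch of over-subscription: the guest names and the CPU list below are placeholders, and the pinning step assumes the VM was started headless. On the 8-physical/16-logical Nehalem hosts above, three 8-vCPU guests together over-subscribe the real cores.

    # Three 8-vCPU guests on an 8-physical-core host rely on the host
    # scheduler to time-slice the real cores between them.
    # --ioapic on is required by VirtualBox for SMP guests.
    VBoxManage modifyvm "guest1" --cpus 8 --ioapic on
    VBoxManage modifyvm "guest2" --cpus 8 --ioapic on
    VBoxManage modifyvm "guest3" --cpus 8 --ioapic on

    # To help a NUMA-unaware scheduler, a running VM process can be pinned
    # to one socket's logical CPUs (the 0-3,8-11 list is an assumption;
    # verify your host's topology first).
    taskset -cp 0-3,8-11 $(pgrep -f 'VBoxHeadless.*guest1')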
A side note, somewhat but not directly related: the newer Intel Nehalem/Westmere/Core architectures include an interesting animal known as "Turbo Boost". Turbo Boost complements Intel SMT. In general, Intel Turbo Boost will raise the clock of a particular core (or cores) when the OS requests the P0 power state for the core(s) to which the process thread has been scheduled. Again, the efficiency of the OS scheduler plays into when a P0 state kicks in for a particular process or core. In general, VirtualBox guest processes would only kick the associated virtual or real core into a P0 state when the guest core count was allocated optimally for the application's needs within the guest. Basically, this says that if you have processors sitting around 50% idle in a guest, it may be prudent to trim back the core count for that guest; see the sketch below. Your mileage may vary on this based on the actual application load or purpose within the guest. It will take some benchmarking, testing and tuning with your particular guest(s) and associated workloads on your target HW/base OS configuration to capitalize on this and get the most out of it.
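A minimal sketch of how to watch for this on a Linux host (the guest name is a placeholder, and VBoxManage modifyvm requires the VM to be powered off):

    # Watch per-core clocks while the guest workload runs; cores that never
    # reach the top advertised frequency (the P0 state Turbo Boost engages
    # from) suggest the guest has more vCPUs than its applications keep busy.
    watch -n1 'grep MHz /proc/cpuinfo'

    # If guest CPUs sit around 50% idle, trim the vCPU count.
    VBoxManage modifyvm "guest1" --cpus 4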