Curious about exact behavior of rdtsc -- not what I expected


Post by rdmartin

Background: I am experimenting with TSC values in VirtualBox. My current understanding of how VirtualBox emulates rdtsc is that it makes rdtsc a privileged instruction in the guest, traps the resulting exception, and simulates the value by executing rdtsc on the host and subtracting an offset (which would be the host's TSC value when virtualization started). The advantage is that the guest's TSC advances with wall-clock time in the expected manner; the disadvantage is that a process may perceive rdtsc as taking longer than expected. For instance, in simple code like this:

Code:

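uint64_t x = rdtsc();
uint64_t y = rdtsc();
uint64_t z = y - x;   /* TSC ticks elapsed between the two reads */
printf("%llu\n", (unsigned long long)z);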
When this runs in the guest, z may be larger than expected because of the wall-clock cost of trapping rdtsc. It would be even worse if the host OS descheduled the VirtualBox process between the two calls.
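To make my mental model of this first scheme concrete, here is a toy sketch in C. It only models my understanding as described above and is not VirtualBox's actual implementation; the struct, the function names, and the trap cost are all made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model state; none of these names come from VirtualBox. */
typedef struct {
    uint64_t host_tsc;     /* simulated host TSC */
    uint64_t start_offset; /* host TSC value when the VM started */
    uint64_t trap_cost;    /* assumed host cycles consumed by one trapped rdtsc */
} OffsetTscModel;

/* One trapped rdtsc: the trap burns host cycles, then the guest
 * is handed (host TSC - start offset). */
static uint64_t offset_model_rdtsc(OffsetTscModel *m)
{
    assert(m->host_tsc >= m->start_offset);
    m->host_tsc += m->trap_cost;
    return m->host_tsc - m->start_offset;
}

/* The host runs something else while the VM process is descheduled. */
static void offset_model_descheduled(OffsetTscModel *m, uint64_t cycles)
{
    m->host_tsc += cycles;
}
```

In this model, two back-to-back reads differ by the full trap cost, and any time the VM spends descheduled between them is added on top, which is exactly the inflation of z described above.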

According to the VirtualBox manual (Change TSC Mode), there is an alternative virtualization technique that is supposed to simulate the TSC directly. As I understand it, the offset then accounts only for the time during which the guest OS actually uses the CPU. The advantage is that, with respect to the cycles available to the guest, the TSC behaves exactly as it would on real hardware. The downside is that the TSC drifts away from wall-clock time, because there are "missing cycles" that the guest OS is not aware of.
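Again as a toy sketch of my understanding only (not how VirtualBox actually implements it, and with hypothetical names throughout), the second scheme would look something like this: the virtual TSC counts only cycles the guest executed, so it falls behind wall-clock time whenever the guest is not running.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model state; none of these names come from VirtualBox. */
typedef struct {
    uint64_t wall_cycles;  /* host cycles elapsed since the VM started */
    uint64_t guest_cycles; /* cycles during which the guest actually ran */
} ExecTiedTscModel;

/* The guest executes: both wall-clock time and the virtual TSC advance. */
static void exec_model_guest_runs(ExecTiedTscModel *m, uint64_t cycles)
{
    m->wall_cycles += cycles;
    m->guest_cycles += cycles;
}

/* The guest is not running: only wall-clock time advances. */
static void exec_model_guest_paused(ExecTiedTscModel *m, uint64_t cycles)
{
    m->wall_cycles += cycles;
}

/* What the guest would read from rdtsc in this model. */
static uint64_t exec_model_rdtsc(const ExecTiedTscModel *m)
{
    return m->guest_cycles;
}

/* The "missing cycles": how far the virtual TSC lags wall-clock time. */
static uint64_t exec_model_drift(const ExecTiedTscModel *m)
{
    assert(m->wall_cycles >= m->guest_cycles);
    return m->wall_cycles - m->guest_cycles;
}
```

Here a pause between two reads does not show up in their difference at all; it only shows up as drift.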

My goal: I am trying to set VirtualBox up to use the second option. I want to emulate the short-term behavior of rdtsc as precisely as possible, as if it were running on real hardware, and I don't care if it doesn't match wall-clock time. I am fully aware that this is not "reliable" on SMP; this is for experimentation, not enterprise software.

What I did: First I wrote a simple test program that calls rdtsc repeatedly, then prints the results:

Code:

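#include <stdint.h>
#include <stdio.h>

/* Read the 64-bit time-stamp counter (x86 only; the result is in EDX:EAX). */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

int main(void)
{
    int i;
    uint64_t val[8];

    /* Eight back-to-back reads, unrolled so no loop overhead sits between them. */
    val[0] = rdtsc();
    val[1] = rdtsc();
    val[2] = rdtsc();
    val[3] = rdtsc();
    val[4] = rdtsc();
    val[5] = rdtsc();
    val[6] = rdtsc();
    val[7] = rdtsc();

    /* Print each value and its delta from the previous read, both in hex. */
    for (i = 0; i < 8; i++) {
        printf("rdtsc (%2d): %llX", i, (unsigned long long)val[i]);
        if (i > 0) {
            printf("\t\t (+%llX)", (unsigned long long)(val[i] - val[i - 1]));
        }
        printf("\n");
    }
    return 0;
}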
I ran this program on my host machine and then in my VirtualBox guest. The deltas between successive rdtsc reads were essentially identical; the only difference was that the absolute value on my host was about 30T higher. Example output:
rdtsc ( 0): 334F2252A1824
rdtsc ( 1): 334F2252A1836 (+12)
rdtsc ( 2): 334F2252A1853 (+1D)
rdtsc ( 3): 334F2252A1865 (+12)
rdtsc ( 4): 334F2252A1877 (+12)
rdtsc ( 5): 334F2252A1889 (+12)
rdtsc ( 6): 334F2252A18A6 (+1D)
rdtsc ( 7): 334F2252A18B8 (+12)
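The deltas above flip between 0x12 and 0x1D, so a single pair of reads is a noisy estimate. One common way to get a steadier figure is to take the minimum delta over many pairs. A minimal sketch follows; the counter is passed in as a function pointer so that the rdtsc() wrapper from my test program can be plugged in, and fake_counter is a made-up stand-in that just replays a step pattern like the one I observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Any monotonically increasing counter source, e.g. an rdtsc() wrapper. */
typedef uint64_t (*counter_fn)(void);

/* Smallest difference between two adjacent reads, taken over `samples` pairs. */
static uint64_t min_adjacent_delta(counter_fn read_counter, size_t samples)
{
    assert(read_counter != NULL && samples > 0);
    uint64_t best = UINT64_MAX;
    for (size_t i = 0; i < samples; i++) {
        uint64_t a = read_counter();
        uint64_t b = read_counter();
        if (b - a < best) {
            best = b - a;
        }
    }
    return best;
}

/* Hypothetical stand-in for rdtsc: advances by 0x12, 0x1D, 0x12, ... per call. */
static uint64_t fake_counter(void)
{
    static const uint64_t steps[] = { 0x12, 0x1D, 0x12 };
    static uint64_t now = 0;
    static unsigned calls = 0;
    now += steps[calls++ % 3];
    return now;
}
```

On real hardware the call would be min_adjacent_delta(rdtsc, n) for some reasonably large n; the minimum filters out reads that were delayed by interrupts or scheduling.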
Then, I changed the TSCTiedToExecution flag in VirtualBox, which I thought was supposed to ignore wall-clock-time in favor of more precise virtual cycle counting. I got this from the manual page I mentioned above:

Code:

./VBoxManage setextradata "HelloWorld" "VBoxInternal/TM/TSCTiedToExecution" 1
However, this gave me unexpected results. The program in the guest now printed:
rdtsc ( 0): F2252A1824
rdtsc ( 1): F2252A1836 (+B12)
rdtsc ( 2): F2252A1853 (+B1D)
rdtsc ( 3): F2252A1865 (+AFF)
rdtsc ( 4): F2252A1877 (+B13)
rdtsc ( 5): F2252A1889 (+AF2)
rdtsc ( 6): F2252A18A6 (+B1D)
rdtsc ( 7): F2252A18B8 (+B0C)
With TSCTiedToExecution on, each rdtsc seems to take roughly 0xB00 cycles (about 2,800 in decimal), compared with 0x12 to 0x1D before.
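Since the deltas are printed in hexadecimal, it helps to convert them before comparing the two runs. The small sketch below averages the deltas copied from the two outputs above; the function and array names are just ones I picked for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Integer mean (rounded down) of n deltas. */
static uint64_t mean_delta(const uint64_t *deltas, size_t n)
{
    assert(deltas != NULL && n > 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += deltas[i];
    }
    return sum / n;
}

/* Deltas copied from the first output (default TSC mode). */
static const uint64_t deltas_default[] = {
    0x12, 0x1D, 0x12, 0x12, 0x12, 0x1D, 0x12
};

/* Deltas copied from the second output (TSCTiedToExecution set to 1). */
static const uint64_t deltas_tied[] = {
    0xB12, 0xB1D, 0xAFF, 0xB13, 0xAF2, 0xB1D, 0xB0C
};
```

mean_delta(deltas_default, 7) comes out to 21 and mean_delta(deltas_tied, 7) to 2829, so each rdtsc became more than a hundred times more expensive from the guest's point of view.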

Question: First, I am wondering why I got this behavior. It seems almost the opposite of what I would expect, and it certainly does not match my understanding of how this is implemented.

Second, I am wondering how I can accomplish my original goal of having the TSC advance with each virtual cycle, as if it were running on hardware.

My Setup: I am running on an 8x Intel(R) Xeon(R) CPU X5550 @ 2.67GHz. VirtualBox has VMX and nested paging enabled. I compiled it from source, version 4.1.2_OSE r38459.

Thanks in advance!