HyperThreading and performance


As I mentioned in a previous post, I ran into some unexpected results during benchmarking, and they were linked to hyperthreading.

Let’s start with a brief overview of what hyperthreading is. In a few words, it is the ability of a CPU core to support two threads of execution simultaneously (Intel speaks of physical threads). It looks like having two cores instead of one, but does it taste like it too? From an OS perspective, yes: the CPU appears to have twice as many virtual cores as it has physical cores.

The catch lies in the fact that instructions are still executed sequentially (i.e. one after the other*)! So why bother? It turns out that you can actually expect some performance gain. This technology exploits the wait times generated by memory accesses (and, to a lesser extent, cache accesses): while one physical thread is stalled until its data is fetched into the L1 cache, the other physical thread can still execute instructions, provided those operate on data that is already cached.
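To put rough numbers on that idea, here is a back-of-the-envelope sketch. It is a toy model of my own, not a description of how the hardware really arbitrates: it assumes both threads behave identically, that a thread is either executing or stalled on memory, and that the core switches between them at no cost. The name `smt_speedup` and the `busy_fraction` parameter are made up for the illustration.

```python
def smt_speedup(busy_fraction: float) -> float:
    """Toy estimate of the throughput gain from running two threads on one core.

    busy_fraction is the share of time a single thread actually executes
    instructions; the rest is assumed to be spent stalled on memory.
    ASSUMPTION: both threads are identical and switching between them is free.
    """
    if not 0.0 < busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be in (0, 1]")
    # Two threads ask for 2 * busy_fraction of the core,
    # but the core cannot deliver more than 100% of itself.
    core_utilisation = min(1.0, 2.0 * busy_fraction)
    # Gain relative to a single thread running alone on the core.
    return core_utilisation / busy_fraction


if __name__ == "__main__":
    for fraction in (0.25, 0.5, 0.8, 1.0):
        print(f"busy {fraction:.0%} of the time -> speedup x{smt_speedup(fraction):.2f}")
```

In this model a thread that never stalls (`busy_fraction` of 1.0) gains nothing from a second thread on the same core, while a thread that waits on memory half of the time could, at best, double the throughput.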

I definitely suggest you read ‘What Every Programmer Should Know About Memory’ to fully understand how this works. As you will see, the performance gain, if any, depends on a delicate balance between the cache hit ratios of both threads.

The experiment

Our benchmark was made of CPU-only tasks built on intensive integer computation; as such it generates no wait time, hence offers no opportunity for a performance gain with hyperthreading. The benchmark results confirmed that; before commenting on them, let me explain how hyperthreading is managed by the OS.

A bit of theory

Hyperthreading comes with a caveat regarding thread/process scheduling: the OS must favor scheduling work on a free physical core before it starts to ‘reuse’ a busy physical core by scheduling on its second ‘virtual core’. Otherwise, you will get disappointing performance, probably worse than without HT.

[Figure: HT Core. Identification of virtual cores]

On Windows (NT) kernel variants, the first ‘virtual core’ of each physical core is enumerated first, then the second one of each.

You should care about that when you plan to exploit processor affinity, which is the ability to restrict a process (or even a thread) so that it can only run on one or more specific cores. It lets you cap the CPU capacity allocated to a specific process/thread.

On a CPU with 8 virtual cores, if you limit a process to cores 0 and 4, both virtual cores sit on the same physical core, so you allocate roughly 25% of the total execution power. But if you limit it to cores 0 and 1, which sit on two different physical cores, you end up allocating something like 50% of the processing power.
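The arithmetic above can be sketched in a few lines of Python. This is only an illustration under the enumeration described here (virtual cores i and i + N/2 share a physical core, N being the number of virtual cores); the numbering can differ from one system to another, so check the actual topology of your machine before relying on it. The helper names `affinity_mask` and `physical_share` are mine, not part of any API.

```python
def affinity_mask(cores):
    """Build the bitmask form of an affinity set: bit i stands for virtual core i."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def physical_share(cores, virtual_count):
    """Fraction of the physical cores reachable from the given virtual cores.

    ASSUMPTION: virtual cores i and i + virtual_count // 2 share a physical
    core, i.e. the first virtual core of every physical core is numbered
    first, then the second ones.
    """
    physical_count = virtual_count // 2
    # Map every virtual core to the physical core it lives on, without duplicates.
    used_physical = {core % physical_count for core in cores}
    return len(used_physical) / physical_count


if __name__ == "__main__":
    for cores in ([0, 4], [0, 1]):
        print(cores, bin(affinity_mask(cores)), f"{physical_share(cores, 8):.0%}")
```

Two virtual cores are enabled in both cases, yet the share of the machine you really get is not the same: that is the whole trap.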

The brutal facts

We faced this harsh reality when working on our TPL vs in-house library benchmark.

We use processor affinity to test for scalability by executing the benchmark with a variable number of enabled cores: at first all of them, then we ‘disable’ one, then another one, and so on. For each run we measure the execution time (how long the bench ran) and the CPU time used. The test ran on a dual quad-core hyperthreaded Xeon, offering us 16 virtual cores (2*2*4).
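For readers who want to reproduce the two measurements, here is a minimal sketch in Python. It is not our benchmark harness: `measure` and `integer_work` are hypothetical helpers written for this post, and restricting the process to a given set of cores is left out because it is OS specific. Note that `time.process_time` reports the CPU time of the current process only, not of child processes.

```python
import time


def measure(task):
    """Run task() once and return (result, wall-clock seconds, CPU seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = task()
    cpu_used = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    return result, wall_elapsed, cpu_used


def integer_work(iterations=200_000):
    """Hypothetical stand-in for a CPU-only task: pure integer arithmetic."""
    total = 0
    for i in range(iterations):
        total = (total + i * i) % 1_000_003
    return total


if __name__ == "__main__":
    _, wall, cpu = measure(integer_work)
    print(f"execution time: {wall:.4f}s, CPU time: {cpu:.4f}s")
```

Comparing how these two numbers evolve as cores are enabled or disabled is what reveals whether added cores do useful work or merely look busy.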

The results we got were disconcerting at first:

  • From 1 to 8 cores we observed a linear reduction of the execution time for an almost constant CPU time. That was the expected result and showed good scalability: if you have 100 tasks to do, you expect to have them done twice as fast with two CPUs processing 50 tasks each, but the total amount of work stays the same (100 tasks to do).
  • But from 8 to 16 cores, the execution time remained flat while the CPU time kept increasing. This was pure heresy. Our benchmark was no longer scaling at all and, even worse, it started wasting CPU. It was as if each added core was just spinning busy, doing no useful work.
  • The reassuring piece of news was that the behavior did not depend on the library used; so something was fishy here.

And then it dawned on me: Hyperthreading fucked with us

Since our benchmark was computation intensive and did no memory access at all (beyond executable code), we could not get any benefit from hyperthreading. So when the benchmark started engaging the second virtual cores, it just meant that each Xeon physical core had to split its execution capacity in two. The total execution capacity was not increased in any way.

The aforementioned 100 tasks were still split into two stacks of 50 tasks each, but each virtual core was now twice as slow as before (as the virtual cores take turns executing on the physical core).

That explained why the execution time was no longer decreasing. But what about the consumed CPU time? How come the benchmark seemed to lose efficiency?

It comes from the fact that the Windows kernel measures the time spent by virtual cores, not by physical cores. So it just saw that two virtual cores were constantly busy processing tasks; it is unable to take into account the fact that each of those cores was half as efficient.
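This explanation can be turned into a small model. The sketch below is an idealization, not a measurement: it assumes a purely compute-bound workload, tasks handed out dynamically so that every enabled virtual core stays busy until the end, an OS that fills free physical cores first, and CPU time accounted per virtual core. The function `expected_times` and its parameters are names I made up for the illustration.

```python
def expected_times(total_work, enabled_virtual, physical_cores):
    """Return (execution_time, reported_cpu_time) under the idealized model.

    total_work is the time one physical core would need to run the whole
    benchmark alone.
    ASSUMPTION: compute-bound tasks, so two virtual cores sharing a physical
    core each run at half speed and add no capacity.
    """
    if not 1 <= enabled_virtual <= 2 * physical_cores:
        raise ValueError("enabled_virtual must be between 1 and 2 * physical_cores")
    # Real capacity stops growing once every physical core has one busy thread.
    busy_physical = min(enabled_virtual, physical_cores)
    execution_time = total_work / busy_physical
    # The kernel bills every busy virtual core for the whole run.
    reported_cpu_time = enabled_virtual * execution_time
    return execution_time, reported_cpu_time


if __name__ == "__main__":
    for enabled in (1, 2, 4, 8, 12, 16):
        wall, cpu = expected_times(100.0, enabled, physical_cores=8)
        print(f"{enabled:2d} virtual cores: execution {wall:6.2f}, CPU {cpu:6.2f}")
```

With 8 physical cores the model gives a constant CPU time up to 8 enabled cores, then a flat execution time and a CPU time that grows with every extra virtual core, which is the shape of the curves we observed.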

Recommendations

  1. Do not mess with affinity on hyperthreaded CPUs!
  2. Do not try to extrapolate from results you get on hyperthreaded CPUs. They do not scale linearly.
  3. And as always, measure, measure, measure and accept the result.