The danger of microbenchmarking


Performance measurement is a hot topic in IT, and it has always been so. How fast is the new hardware, which RDBMS has the highest TPS, which implementation has the lowest memory footprint…

Nowadays, pretty much anyone with a decent knowledge of a programming language can write a benchmark and publish the results, so benchmarks flourish everywhere.

Please read them with the proverbial grain of salt: measurement is hard and, more often than not, the benchmark includes some bias favoring the author’s expectations. The bias will probably be accidental and of limited impact, but it will be there nonetheless.

But I want to discuss microbenchmarking, with a focus on concurrency-related benchmarks.

Quick reminder: microbenchmarks measure a very small part of a program or library. It is at most a simple use case, but it can be a single line of code, such as trying to measure the cost of a virtual method call.

Those are harder to write, because the measured code has a smaller impact. You are entering the quantum world and starting to face uncertainty. This of course adds to the inherent complexity of writing a benchmark in the first place.

Let me list some pitfalls:

  • if you measure execution time for a given iteration count, make sure the clock precision can be neglected. E.g.: a precision of 10 ms is an uncertainty of 1% on a duration of 1 second, which is acceptable; on any running time below 100 ms it exceeds 10%, and the result can be considered biased.
  • if you count calls (throughput) over a specified duration, make sure the clock precision can be neglected; also, measure the actual execution time rather than relying on the requested duration: a duration-based run implies asynchronism, which in itself brings variance.
  • make sure there is ample time for the system to be stable before starting to measure, so that:
    • all dependencies are properly loaded
    • code is compiled, and no longer interpreted
    • background threads are up
  • make sure to include any side effects the benchmark may have:
    • external resource consumption: creation of files, handles, opening of ports, starting of services/daemons
    • impact of the virtual machine: memory pressure
    • impact on caches
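
The first pitfalls in the list above can be sketched in a few lines. This is a minimal illustration in Python, not a benchmark harness: the function names (`relative_clock_error`, `timed_runs`, `throughput`) and the default run counts are assumptions made for the example. It shows the clock-precision arithmetic, a warm-up phase that is discarded before measuring, and a duration-based run that divides by the elapsed time actually measured instead of the requested duration.

```python
import time


def relative_clock_error(resolution_s: float, duration_s: float) -> float:
    """Worst-case fraction of a measured duration that one clock tick represents."""
    return resolution_s / duration_s


def timed_runs(func, iterations: int, warmup_runs: int = 3, measured_runs: int = 5):
    """Run `func` `iterations` times per run; discard warm-up runs, time the rest."""
    for _ in range(warmup_runs):  # let imports, caches and lazy init settle
        for _ in range(iterations):
            func()
    samples = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        for _ in range(iterations):
            func()
        samples.append(time.perf_counter() - start)
    return samples  # seconds per measured run


def throughput(func, requested_duration_s: float) -> float:
    """Calls per second over a duration-based run, using the actual elapsed time."""
    calls = 0
    start = time.perf_counter()
    deadline = start + requested_duration_s
    while time.perf_counter() < deadline:  # this clock read is itself overhead
        func()
        calls += 1
    elapsed = time.perf_counter() - start  # usually a bit more than requested
    return calls / elapsed


if __name__ == "__main__":
    # The 10 ms example from the list: 1% over 1 s, 10% over 100 ms.
    print(f"{relative_clock_error(0.010, 1.0):.0%}")
    print(f"{relative_clock_error(0.010, 0.100):.0%}")

    resolution = time.get_clock_info("perf_counter").resolution
    samples = timed_runs(lambda: sum(range(100)), iterations=10_000)
    worst = relative_clock_error(resolution, min(samples))
    print(f"clock resolution {resolution:.1e} s, worst relative error {worst:.2e}")
    print(f"throughput: {throughput(lambda: sum(range(100)), 0.2):.0f} calls/s")
```

The resolution reported by the runtime is only a lower bound on the real uncertainty, and the spread between the samples is usually the more honest indicator.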

Convinced yet?
I must confess that memory pressure is a pet peeve of mine: look for yourself, it is almost never taken into account in benchmarking, and even less in microbenchmarking. But do you think adding several megabytes of memory pressure on the GC has no impact on performance? Of course it has, even if it is deferred.
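
As a rough sketch of what reporting memory pressure alongside timing could look like, here is a Python example. Python's `tracemalloc` and `gc` modules are only stand-ins for whatever your own VM offers, and the function names and the workload are assumptions made for the illustration. Timing and allocation tracing happen in separate passes, because tracing slows the code down.

```python
import gc
import time
import tracemalloc


def gc_collections() -> int:
    """Total number of collections the cyclic GC has run so far."""
    return sum(generation["collections"] for generation in gc.get_stats())


def measure_time_and_memory(func, iterations: int) -> dict:
    """Time `func` untraced, then rerun it under tracemalloc to see its allocations."""
    gc.collect()  # start from a comparable heap state
    collections_before = gc_collections()
    start = time.perf_counter()
    for _ in range(iterations):
        func()
    elapsed = time.perf_counter() - start
    collections_during = gc_collections() - collections_before

    # Separate pass: tracing slows execution, so it must not overlap the timing.
    tracemalloc.start()
    for _ in range(iterations):
        func()
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "elapsed_s": elapsed,
        "gc_collections": collections_during,
        "peak_traced_bytes": peak_bytes,
    }


if __name__ == "__main__":
    # Hypothetical workload: builds a list of 10,000 small lists on each call.
    report = measure_time_and_memory(lambda: [[i] for i in range(10_000)], iterations=50)
    print(report)
```

Publishing the allocation figures next to the timings at least lets readers judge how much deferred cost the benchmark leaves behind.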

And even if the benchmark appears to be well implemented, you cannot translate its results to your use case. Just refer to my hyperthreading and performance post for a blatant example; also, how do you convert a rate of 1M small messages per second into a probable rate for large messages?

We would all love the components we use and code to behave in a predictable way according to a couple of parameters, such as size and rate. But they don’t, because they are sensitive to many other parameters such as the system state, and because those parameters are more or less coupled, which gives you nonlinear behavior (think CPU cache).

Platform

A good benchmark properly lists the attributes of the platform it was run on. That means computer details:

  • CPU (technical reference and frequency)
  • memory (quantity and speed)
  • OS (including version)

Network infrastructure, if relevant:

  • network card
  • cabling and switches

Middleware (messaging, database…), including variant and version number.
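
Part of this list can be captured automatically and published with the results. The sketch below uses Python's standard `platform` and `os` modules; the function name `describe_platform` and the choice of fields are assumptions for the example. Memory quantity and speed, network hardware and middleware versions are not available from the standard library and still have to be written down by hand.

```python
import json
import os
import platform


def describe_platform() -> dict:
    """Collect the platform attributes the standard library can report."""
    return {
        "machine": platform.machine(),        # e.g. the CPU architecture
        "processor": platform.processor(),    # may be empty on some systems
        "logical_cpus": os.cpu_count(),       # may be None if undetermined
        "os": platform.system(),
        "os_release": platform.release(),
        "os_version": platform.version(),
        "runtime": f"{platform.python_implementation()} {platform.python_version()}",
    }


if __name__ == "__main__":
    # Emit the description in a form that can be pasted next to the figures.
    print(json.dumps(describe_platform(), indent=2))
```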

And remember that the platform is of utmost importance for microbenchmarks: when I tried Martin Thompson’s latest benchmark on my computer (an early 2011 MacBook Pro, Core i7 2 GHz), I got significantly different results: where Martin got a threefold performance increase, I got a measly +20% (StampedLock vs LockFree). My assumption: Ivy Bridge CPUs have been significantly improved on cache synchronization and therefore offer greater performance than my Sandy Bridge (an assumption that you should not trust, since I have not tried to prove it in any way).

By the way, Martin’s benchmark is exemplary for how to do benchmarks right:

  • explicit test cases
  • a clear description of the test platform
  • code provided so that others can replicate (this is scientific)

Full disclosure: I do not think the comparison was completely fair from a certain perspective (see the comments section).

Tuning a benchmark

More often than not, a benchmark requires tuning: adjusting the count of runs, tweaking the dataset size, and so on…
Like all tuning, this is iterative work that continues until you are satisfied with the result.

Oops, wait a second…

Did you see the problem right there? In a scientific experiment, the testing protocol is clearly established beforehand and then rigorously implemented. But this is rarely done in a benchmark, because none of us is an actual scientist and it is pretty time consuming to proceed rigorously.

So you end up with a confirmation bias in your benchmark and you publish the figures that match your assumption. Wrong! I understand this was not your intention in the first place, but that is what is going to happen.

It happened to me (see again Hyperthreading and performance) and I saw this elsewhere too; but I will not name anyone, as it happens to everyone, and I am convinced all of them were trying to be as fair as possible!
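
One modest way to limit this bias is to write the protocol down before looking at any figure, and then to publish every sample rather than the flattering ones. The sketch below is only an illustration under that assumption: the `Protocol` class, its fields and the `run` function are hypothetical names, and freezing the parameters does not replace discipline, it merely makes silent retuning a little harder.

```python
import statistics
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    """Benchmark parameters, decided before looking at any result."""
    iterations: int
    warmup_runs: int
    measured_runs: int


def run(protocol: Protocol, func) -> dict:
    """Apply the protocol as written and report every sample, not a selection."""
    for _ in range(protocol.warmup_runs):
        for _ in range(protocol.iterations):
            func()
    samples = []
    for _ in range(protocol.measured_runs):
        start = time.perf_counter()
        for _ in range(protocol.iterations):
            func()
        samples.append(time.perf_counter() - start)
    return {
        "protocol": protocol,          # published together with the figures
        "samples_s": samples,          # all runs, including the outliers
        "median_s": statistics.median(samples),
    }


if __name__ == "__main__":
    protocol = Protocol(iterations=10_000, warmup_runs=3, measured_runs=7)
    print(run(protocol, lambda: sum(range(100))))
```

If the protocol has to change after a first look at the results, the honest option is to say so and publish both sets of figures.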

Conclusion

  • Understand what is measured
  • Assess the quality of the measurement
  • Understand that the platform on which the benchmark was run probably has a significant impact on performance
  • Do not extrapolate results
  • Mind the gap between the benchmark’s and your use cases