The problem with locks (2)

In Why locks suck? I stressed several issues with locks. Today, I will elaborate on their main issue, which is that they do not compose. But first I need to finish my list, as I skipped the performance-related issues.

Locks hurt performance badly,

it is a universally known fact. But what is less known is how badly they hurt performance, and why. The main myth is that locks are expensive because they are kernel objects and the user/kernel transition is expensive.

This argument still holds some truth, but both Intel and AMD have worked on improving the situation in their recent CPU lines, so the transition cost (back and forth) is now less than 300 cycles, the same order of magnitude as an access to non-cached memory.

But the main reason locks hurt performance is that they simply stall a core: under contention they trigger a context switch, which trashes various CPU caches. A context switch is quite a catastrophic event from a performance point of view. Here is a complete list of what it may cost:

  • Kernel/user transition (always)
  • Stacking and un-stacking of the core state: all registers must be saved for the current context, then restored for the target context (always)
  • Loss of execution context (very likely, unless the target context runs the same code)
    • loss of branch prediction caches
    • flush of the execution pipeline
    • stack entries are lost from the cache
  • Loss of cache entries for recently accessed data (likely, unless switching to a similar context within the same process)
  • Loss of cache entries for recently accessed code (likely, unless switching to a similar context within the same process)
  • Loss of TLB entries (likely). As a reminder, TLB stands for Translation Look-aside Buffer; it is used for the address translation required to implement virtual memory. This happens if you switch to a different process.
  • Scheduling cost: you have to factor in the execution time needed to elect the next thread to be run.

When you take all of those into account, you realize that the actual latency cost of a context switch is a matter of thousands of cycles, far above the user/kernel transition cost.
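To get a feel for these orders of magnitude, here is a rough C# sketch (names and iteration counts are mine, purely illustrative) that contrasts taking an uncontended lock with taking a heavily contended one, where waiting and context switches come into play. The absolute numbers vary wildly by machine and runtime; only the ratio is interesting.

using System;
using System.Diagnostics;
using System.Threading;

class LockCostSketch
{
    static readonly object Gate = new object();
    static long _counter;

    static void Main()
    {
        const int iterations = 1_000_000;

        // Uncontended: a single thread takes and releases the lock.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            lock (Gate) { _counter++; }
        }
        sw.Stop();
        Console.WriteLine($"Uncontended: {sw.ElapsedMilliseconds} ms");

        // Contended: four threads fight over the same lock, forcing
        // waits and, past the initial spinning phase, context switches.
        _counter = 0;
        var threads = new Thread[4];
        sw = Stopwatch.StartNew();
        for (int t = 0; t < threads.Length; t++)
        {
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < iterations / 4; i++)
                {
                    lock (Gate) { _counter++; }
                }
            });
            threads[t].Start();
        }
        foreach (var t in threads) t.Join();
        sw.Stop();
        Console.WriteLine($"Contended:   {sw.ElapsedMilliseconds} ms");
    }
}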

They do not compose

I assume most of you have had first-hand experience with some ‘thread-safe’ framework. I can confidently state, however, that I have yet to see a ‘thread-friendly’ one. Most thread-safe frameworks rely on locks and offer you events or subscription mechanisms which are basically deadlock pitfalls.

You want to see a thread-safe framework fail? Try to unsubscribe while being notified on that very same subscription. At best, it will raise an exception, but it will probably simply deadlock.
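Here is a minimal sketch of how this kind of deadlock typically arises (class and member names are mine, purely for illustration): the publisher holds its own lock while raising the event, the subscriber holds its own lock while unsubscribing, and the two locks end up taken in opposite orders.

using System;
using System.Threading;

class Publisher
{
    private readonly object _lock = new object();
    private Action _handlers = delegate { };

    public void Subscribe(Action h)   { lock (_lock) { _handlers += h; } }
    public void Unsubscribe(Action h) { lock (_lock) { _handlers -= h; } }

    public void Notify()
    {
        lock (_lock)        // lock held while calling back into user code!
        {
            _handlers();
        }
    }
}

class Subscriber
{
    private readonly object _lock = new object();
    private readonly Publisher _pub;

    public Subscriber(Publisher pub) { _pub = pub; _pub.Subscribe(OnEvent); }

    public void OnEvent()
    {
        lock (_lock) { /* handle the notification */ }
    }

    public void Shutdown()
    {
        lock (_lock)                   // thread B takes the subscriber lock...
        {
            _pub.Unsubscribe(OnEvent); // ...then blocks on the publisher lock
        }
    }
}

// Thread A: publisher.Notify()    -> holds the publisher lock, waits for the subscriber lock
// Thread B: subscriber.Shutdown() -> holds the subscriber lock, waits for the publisher lock
// A classic lock-ordering deadlock: neither component is wrong in isolation;
// it is their composition that deadlocks.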

So why do locks not compose? Simply because you need to order them properly! And any significant chunk of code relies on some event/observer/inversion of control/interface implementation/virtual method override. We use those as extension points to alter and complete the behavior of existing objects.
And to develop those extensions, one needs to be aware of the locks used by those classes. The basic situation, where the classes are not thread-aware and use no locks, is typically called non-thread-safe. It can be a blessing, but it means you have to wrap them somehow. Yet more extra work.

Then you have the so-called thread-safe classes, which typically use ‘synchronized’ (in Java) or ‘lock’ (in C#) extensively; you need to be careful when using their extension points. But the truth is, even if you are careful enough, the next version of the library, or the next developer maintaining the codebase, will probably introduce deadlocks, or even worse, race conditions.
I have been that initial ‘clever’ developer; I have also been the next one, and even the nth. I have made many, many mistakes, enough of them to be considered an expert in the multithreading field.

The bottom line is: whatever the locking model, there is no way you can make it foolproof; even sadder, there is no way you can document it properly. The best you can do is run long knowledge transfer sessions that cover both how the locks are used and the general threading model, so that newcomers can add their own locks. And pray for the day when a ‘clever’ one tries to refactor the model!

Key takeaway: locks are a dead end!
I came to this gloomy conclusion 7 years ago. I understood then that we needed a radically new approach and, as none was available at the time, I decided to act upon it!

.Net’s Monitor pattern

C# brought us two interesting tools: the Monitor and the Threadpool. Neither of them was anything new, especially the threadpool.
The monitor is a powerful concept, a generalization of the semaphore whose versatility was demonstrated years ago. A monitor lets you implement conditions that can be signaled, allowing threads to wait for a specific state and then perform their tasks atomically!

The general approach is to associate a variable, or a consistent set of variables, with a monitor instance, and then use this pattern:

...
lock(_monitor)
{
  while(!conditionIsMet)
  {
    Monitor.Wait(_monitor);
  }
  // here we know the conditions are met
  // and that we have exclusive access
  // so do some work
  ...
  // signal the other waiters if we changed the variables (otherwise we can skip this)
  Monitor.PulseAll(_monitor);
}


Its strength lies in the Wait method, which atomically releases the lock and then waits for a notification, something that is error-prone to implement manually. When one is used to working with Mutexes and Events, it takes some time to get accustomed to. But the rewards are great, because most synchronization patterns can be implemented safely, easily and with maintainable code.
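As an illustration, here is a minimal sketch of a bounded blocking queue built on this pattern (my own example, not a framework class): Enqueue waits while the queue is full, Dequeue waits while it is empty, and each side signals the other with PulseAll().

using System.Collections.Generic;
using System.Threading;

class BlockingQueue<T>
{
    private readonly object _monitor = new object();
    private readonly Queue<T> _items = new Queue<T>();
    private readonly int _capacity;

    public BlockingQueue(int capacity) { _capacity = capacity; }

    public void Enqueue(T item)
    {
        lock (_monitor)
        {
            while (_items.Count == _capacity)  // condition checked in a loop
            {
                Monitor.Wait(_monitor);
            }
            _items.Enqueue(item);
            Monitor.PulseAll(_monitor);        // wake any waiting consumers
        }
    }

    public T Dequeue()
    {
        lock (_monitor)
        {
            while (_items.Count == 0)
            {
                Monitor.Wait(_monitor);
            }
            T item = _items.Dequeue();
            Monitor.PulseAll(_monitor);        // wake any waiting producers
            return item;
        }
    }
}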

Alas, the .Net team missed an opportunity to have a great API that pushes developers into the pit of success.

    The design should push toward using PulseAll() instead of Pulse()!

They should also probably have enforced the use of while(!condition) Monitor.Wait();.

Why all this criticism of an otherwise fine piece of API?
Well, using Pulse can lead to deadlocks! My first Can You Spot gave a good example of the risks, but it was an obvious one. The next will be devious, while still relying on that same weakness.
The problem with Pulse is that you must be sure that exactly one thread has to be woken up. If this assumption holds true, you are golden. Otherwise, you are one deadlock away from misery.
The good news is that it is often true, but not always. This shortcut also led to a slightly degenerated monitor pattern:

...
lock(_monitor)
{
  if(!conditionIsMet) // notice it was 'while' before. 
  {
    Monitor.Wait(_monitor);
  }
  // here we know the conditions are met
  //  and that we have exclusive access
  // so do some work
  ...
  // signal one waiter if we changed the variables (otherwise we can skip this)
  Monitor.Pulse(_monitor);
}

The difference may look subtle, but the implicit assumption is now that the condition is true whenever some other thread signals the monitor, as it is no longer checked after the call to Wait. That may well not be the case, and then you start facing a really nasty bug. The rule is: check your condition in a while loop, not a simple if.
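To make the failure concrete, here is the Dequeue of the blocking queue sketch above rewritten with an if. Run it with two consumers and a producer that enqueues a single item: both consumers can be woken, the first one takes the item, and the second one dequeues from an empty queue and throws.

public T BrokenDequeue()
{
    lock (_monitor)
    {
        if (_items.Count == 0)      // 'if' instead of 'while'
        {
            Monitor.Wait(_monitor);
        }
        // Two consumers can both reach this point after a single item was
        // enqueued: the first dequeues it, the second then calls Dequeue()
        // on an empty queue and gets an InvalidOperationException.
        return _items.Dequeue();
    }
}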

Back to the original topic: PulseAll() vs Pulse(). You may be wondering why .Net provides two variants of that mechanism, especially after my comments regarding the dangers of Pulse(). Well, PulseAll() is probably more expensive than a simple Pulse(), but not by a wide margin either. Looking at the Rotor source code, PulseAll() loops through all waiting threads to wake them all up instead of just the first one. As far as I know, there is nothing fundamentally wrong with PulseAll(); you have one annoyance: it does not ensure fairness among threads. But that is rarely an issue.

Takeaway: use the Monitor pattern, it is an excellent synchronization tool, but use it properly, or it will bite you. Well, what doesn’t, when you really think about it 🙂

In conclusion, for me it is a missed opportunity for a ‘pit of success’ design. I would have had Pulse() wake all threads and offered a PulseOne() method for the rare cases where waking a single thread is known to be safe.

When a lock-free algorithm spins out of control

Yesterday I had the privilege of diagnosing a difficult bug. I have yet to identify the full story, but based on the evidence accumulated so far, I am pretty confident I found the culprit: a low-level method implemented using “lock-free” techniques. I will post the full story in a few days. As a teaser, let me say that it took us two months from first occurrence to cause identification; the problem was that the system went up to 100% CPU usage for a variable length of time (up to 30 minutes).

One last thing: when I say “lock-free”, I do not mean lock freedom in the formal sense; I mean code that does not rely on any OS/kernel object for synchronization, but on your own mechanisms, usually some form of atomic operations.
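For readers who have not met this style before, here is a minimal sketch of the kind of code in question (my own illustrative example, not the method from the bug): a counter updated with a compare-and-swap retry loop instead of a lock. Note that the retry loop is exactly the place where such code can start spinning out of control and peg a core at 100% CPU.

using System.Threading;

class LockFreeCounter
{
    private int _value;

    public int Increment()
    {
        while (true)
        {
            int current = _value;       // read the current value
            int next = current + 1;
            // Publish 'next' only if nobody changed _value in between;
            // otherwise loop and retry.
            if (Interlocked.CompareExchange(ref _value, next, current) == current)
            {
                return next;
            }
        }
    }
}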

So, stay tuned for further developments.

Next Part

Multithreading issues 2: Fun with locks

Locks and deadlocks

Remember the Scrabble game used to illustrate race conditions? Today we will fix the mess I left the players in, using exclusivity. But I will also show you the serious limitations of this approach.

Step one: locks and exclusivity

So, we left the game when everybody was playing simultaneously, leading to a messy board full of meaningless words, plus angry players around it. Well, fixing this is easy: we just need to ensure that players take turns! Now everything falls back into place. No more conflicts, only real words on the board. If we look a bit closer at the process, we realize that a player first waits until nobody is playing before laying down his own word, and that no other player can proceed until he is finished. Another way to express this is that a player needs exclusive access to the board before playing, and this is acquired whenever a player states ‘it’s my turn’.

Let’s diverge a bit from the standard rules…

Now the players are playing one after the other, nicely. To be honest, there is one point that I will not cover today, which is ensuring they play in the proper order; we will get back to that later.

Now assume I want the players to fill the board as fast as possible. One thing is clear: having only one player laying letters at any given time means I do not leverage having multiple players. Even worse, as each player has to check whether he can play, it just slows the game down. I would fare better with a single player.

So if I want to reap the benefits of having multiple players, I must make sure that more than one can act at the same time, meaning it is time to reassess the rules (= algorithms) we put into place in the previous paragraph, starting with the sub-steps:

    A. read the game board
    B. look at one’s letters
    C. identify the best word
    D. lay down the letters
    E. pick random new letters

Only steps (A) and (D) require access to the board, and step (A) only requires ‘read’ access. We can try to leverage that analysis and change the rule to: only one player can be doing either (A) or (D) at any given time.

Let’s evaluate this proposal:

  1. The rule is more complex (bad)
  2. The exclusive part has been reduced to roughly 40% (2 steps out of 5) of the activity (good)
  3. The players can do stupid things (bad)

A word of explanation is probably required for item 3. Here is an example:

  1. Alice (player one) reads the board.
  2. Bob (player two) reads the board too.
  3. Alice lays down her word.
  4. Bob tries to lay down his but cannot, because Alice took his spot (insert sexist comment here :-)).

In the real world, we could expect Bob to figure out another word or another location, but computer algorithms are not that smart yet, so they would probably ‘butcher’ the board by forcibly replacing the letters, or raise an exception.

The obvious fix is to keep exclusive access not only during steps A and D, but from A through D. This ensures consistency between the moment the player looks at the board and the moment he/she lays down his/her word. But we can release the lock for step E. Exclusive access now covers 80% of the activity, while only 40% strictly required it. At least, while Alice is picking up her new letters, Bob can start thinking: Alice’s E step can happen simultaneously with Bob’s A step.
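Translated into code, the rule could look like the following sketch (the board and player types are hypothetical, purely for illustration): the lock is held from reading the board through laying the word, and released before picking new letters.

using System.Collections.Generic;

// Minimal stand-ins for the game, purely illustrative.
class Board
{
    private readonly List<string> _words = new List<string>();
    public IReadOnlyList<string> Read() => _words;
    public void Lay(string word) => _words.Add(word);
}

class Player
{
    private readonly object _boardLock;   // shared by all players
    private readonly Board _board;
    private string _letters = "SCRABBLE";

    public Player(Board board, object boardLock)
    {
        _board = board;
        _boardLock = boardLock;
    }

    public void PlayOneTurn()
    {
        lock (_boardLock)                       // exclusive from step A...
        {
            var snapshot = _board.Read();       // A: read the game board
            var word = FindBestWord(snapshot);  // B + C: think while the
                                                //        board cannot change
            _board.Lay(word);                   // D: lay down the letters
        }                                       // ...through step D
        PickRandomLetters();                    // E: done outside the lock
    }

    private string FindBestWord(IReadOnlyList<string> snapshot) => _letters;
    private void PickRandomLetters() { /* refill _letters */ }
}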

From a performance perspective, we have reaped a 20% speed gain with 2 players instead of one. But adding more players will not give us any further boost, as they will just be waiting on one another.

This brings us to an important element of parallel programming: Amdahl’s law, which states how much performance you can gain by running an algorithm on multiple CPUs. We will discuss it later.
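For reference, the usual statement of the law: if a fraction P of the work can run in parallel on N processors, the overall speedup is

    Speedup(N) = 1 / ((1 - P) + P / N)

With the lock held from A through D, only step E (20% of a turn) runs in parallel, so even with an unlimited number of players the speedup is capped at 1 / (1 - 0.2) = 1.25, consistent with the modest gain observed above.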

Conclusion

In this post, I gave a quick view of what locks are all about: removing concurrency-related risks by granting exclusive access to critical data. There were also some hints about the inherent complexity of locks, something I will dig into further next.

Multithreading issues 1: race conditions

I was planning to post about the context and assumptions of this blog, but I realized that a page is a better format, as I anticipate it will need completion and amendments.


Today, we will go through the 101 on multithreading issues.


The first ones every developer is bound to encounter are race conditions.


The term covers all cases where two (or more) threads are competing for a variable (or object, or any resource for that matter): as threads are ruthless, this often ends in a gory way.


Let’s use Scrabble as an analogy: a player first identifies a word fitting the current game, then he puts down the required letters. Finally, he picks as many new letters as he has used.
The normal rule is that players play one after the other. But suppose we remove that rule, so that players now play simultaneously. There will of course be situations where the word a player has selected no longer fits the game because of some other player’s action. And if two players are laying letters in the same area at the same time, it will get really messy; we can probably assume that those players would start to argue, and the game would be over, leaving a board with both meaningful and nonsense words on it.


That perfectly describes a race condition: the players represent the various threads and the board represents the application memory; the players’ argument is an exception which leads to program termination.
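In code, the canonical minimal example (mine, for illustration) is two threads updating a shared counter without any synchronization; _counter++ is a read-modify-write sequence, so increments can be lost.

using System;
using System.Threading;

class RaceDemo
{
    static int _counter;

    static void Main()
    {
        var t1 = new Thread(Increment);
        var t2 = new Thread(Increment);
        t1.Start(); t2.Start();
        t1.Join();  t2.Join();

        // Expected 2000000, but typically prints less: both threads
        // sometimes read the same old value, add 1, and write it back,
        // losing one of the two increments.
        Console.WriteLine(_counter);
    }

    static void Increment()
    {
        for (int i = 0; i < 1_000_000; i++)
        {
            _counter++;   // read + add + write: not atomic
        }
    }
}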


Well, of course, actual players would find out fast that something is not right, but race conditions in a program may take some time to surface. The reason is simple: an actual Scrabble board has 225 cells and a human takes several seconds to lay down a word of five to seven letters, while a computer program needs a fraction of a millisecond to modify a hundred bytes within a multi-megabyte space!


What does it mean for the users of the system? That sometimes the application behaves strangely, displays inconsistent data or garbage, or simply crashes.


For the developer, it just means that he is entering a world of pain trying to chase the bug, because race conditions are as difficult to track down as the dreaded memory bugs, possibly more so.
This is something that must be avoided at all costs, as no tolerance mechanism can be implemented against it. You need to be scared to death of race conditions.


I have faced my share of them, directly or indirectly; most dragged on for weeks or months, usually until the code was significantly refactored.


The next post will cover the traditional solution to race conditions, namely locks. It will also give us the opportunity to talk about deadlocks.