Real-world multithreading issues: resolution


See the first post and the second post for full context.

Now, I had the explanation for the technical symptoms.

The server was reaching 100% CPU because a lot of threads were busy waiting. Actually, there were even more threads spinning uselessly than there were available cores, so the machine was just going in circles.

The question is: why the hell did it happen? Let us look once again at the responsible code:

// acquire: spin until we atomically swap synchroObject from 0 to 1
while (Interlocked.Exchange(ref synchroObject, 1) == 1)
{
}
// critical section: append the node at the tail of the list
this.last.Next = node;
this.last = node;
// release: reset the flag so another waiter can enter
Interlocked.Exchange(ref synchroObject, 0);

Thanks to WinDbg and SOS (!CLRStack -p), I was able to confirm that all threads were trying to acquire the very same resource. A code review revealed that this resource was a global activity counter used for monitoring statistics.
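For reference, the kind of WinDbg session involved looks roughly like this; it is only a sketch, and the module passed to .loadby depends on the CLR version (mscorwks for CLR 2.0, clr for .NET 4):

$$ load the SOS debugging extension from the CLR module
.loadby sos mscorwks
$$ dump the managed call stack, with arguments, for every thread
~*e !CLRStack -p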


I double-checked, but no other code was referring to ‘synchroObject’, and this.last is never null, so no exception should be raised. Besides, if the problem had been caused by an exception, the system would have been stuck forever, as ‘synchroObject’ would never have been reset; the production system was able to recover from its crisis, so this was not the issue.

How can this situation occur? What could have somehow interrupted the execution of those measly six lines of code? Well, the Windows kernel did. A simple thread-scheduling decision occurred and froze any progress on that thread. Even worse, since there was a significant thread over-allocation, the kernel may have scheduled another thread that got (busy) stuck on that very same resource. Then we have several threads sitting in the ready state, waiting to be scheduled, and spending their whole quantum (30 ms) just doing a useless loop.

Then, I guess that the thread owning the lock was interrupted for a long time, probably because it started to contribute to a GC. At that point, all cores were scheduling threads that were either doing endless loops or trying to perform a GC!

Having busy threads of course slowed down the GC, meaning that the thread that held the lock took a longer time before giving it away. And then there was a lock convoy, which made things even worse.
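In hindsight, even a spin-based design could have been made much less harmful if the loop yielded the processor instead of burning whole quanta. Here is a minimal sketch of that idea, with hypothetical type and field names (QueueNode, ActivityQueue), using the standard SpinWait helper, which spins briefly and then falls back to yielding to the scheduler:

using System.Threading;

// Hypothetical stand-ins for the real types; the names are mine.
public sealed class QueueNode
{
    public QueueNode Next;
}

public sealed class ActivityQueue
{
    private int synchroObject;                 // 0 = free, 1 = held
    private QueueNode last = new QueueNode();

    public void Append(QueueNode node)
    {
        var spinner = new SpinWait();
        // Acquire: same atomic flag as the original code, but SpinWait
        // yields the core (Thread.Yield / Sleep) after a few iterations,
        // so a waiter no longer burns its whole quantum in a useless loop.
        while (Interlocked.Exchange(ref synchroObject, 1) == 1)
        {
            spinner.SpinOnce();
        }
        try
        {
            this.last.Next = node;
            this.last = node;
        }
        finally
        {
            // Release the flag even if the critical section throws.
            Interlocked.Exchange(ref synchroObject, 0);
        }
    }
}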

Lessons learned

  • Well, the usual rule of thumb for spin loops is that they must not be used on single-core machines, where you should revert to a kernel lock. But when you think about it, the real rule should be ‘spin locks must not be used if you are not sure the lock holder will make progress’, which is far trickier to implement than the first one (see the sketch after this list).
  • Make sure you can easily generate dumps of your live systems, as it may be the only means you have to understand what is happening.
  • When you have a large number of threads on a significant number of cores, a very, very, very unlikely event can become pretty much usual…
  • But above everything else: you are never clever enough to implement adequate scheduling primitives. Whatever your motives are, just think TDD: how are you going to write the test demonstrating the safety of your approach?
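For the first lesson, the boring alternative I would consider is a plain kernel-backed lock, which parks waiters instead of letting them spin, so a preempted or GC-suspended owner cannot waste everyone else's CPU. A minimal sketch, reusing the hypothetical QueueNode type from the earlier sketch:

using System.Threading;

public sealed class ActivityQueueWithLock
{
    private readonly object gate = new object();
    private QueueNode last = new QueueNode();

    public void Append(QueueNode node)
    {
        // Monitor spins only very briefly, then blocks the waiting thread
        // in the kernel; the scheduler can run useful work in the meantime.
        lock (gate)
        {
            this.last.Next = node;
            this.last = node;
        }
    }
}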

I am pretty sure this assertion may not be shared by every developer, so comments are welcome and even requested.
