Can You Spot the Deadlock?

Hi everyone! I had been considering putting some code on this blog for a while, but I was lacking the proper motivation. Now I have found one: a trivia game!

So welcome to the first instalment of

Can You Spot the Deadlock?

The rules are simple: I publish some code that exhibits a concurrency-related problem and ask a few questions about it, such as:

  • Which concurrency issue(s) are present in the code? (deadlock, race condition, livelock…)
  • Where does it occur, and what are the impacts?
  • How can it be fixed?

Please post your answer(s) in the comment section; I will provide the expected answers one week after the first submission. Of course, debugging is not allowed here.
So without further ado, here is the first one, quite easy for a starter.

Where is the deadlock?

Which single line needs to be changed to fix this?

The problem is now closed; please see the solution.

using System.Threading;

class Program
{
  /// Main function
  static void Main(string[] args)
  {
    DeadLock1 sample1 = new DeadLock1();
    sample1.Start();
  }
}
/// First instance of 'Can You Spot the Deadlock?'
/// Let's start easy to put things in motion.
internal class DeadLock1
{
  private Thread _firstRunner;
  private Thread _secondRunner;
  private bool _stopSignal = false;
  private readonly object _synchro = new object();

  public DeadLock1()
  {
    _firstRunner = new Thread(FirstRunner);
    _secondRunner = new Thread(SecondRunner);
  }

  // start your engines
  public void Start()
  {
    _firstRunner.Start();
    _secondRunner.Start();
    Thread.Sleep(100);
    lock (_synchro)
    {
      _stopSignal = true;
      Monitor.Pulse(_synchro);
    }
    _firstRunner.Join();
    _secondRunner.Join();
  }
  // first thread logic
  private void FirstRunner()
  {
    lock (_synchro)
    {
      if (!_stopSignal)
      {
        Monitor.Wait(_synchro);
      }
    }
  }
  // second thread logic
  private void SecondRunner()
  {
    lock (_synchro)
    {
      if (!_stopSignal)
      {
        Monitor.Wait(_synchro);
      }
    }
  }
}

Several code-quality remarks could be made about this code, but please focus on the multithreading issues. Bonus points if the second design weakness is identified!

PS: I’ll wait a few days before approving comments, so as not to disclose the solution. I already have at least one good (but not perfect) answer and one wrong one. Keep commenting!

Edited to explicitly initialize _stopSignal for clarity.

Real-world multithreading issues: resolution

See the first post and the second post for the full context.

Now, I had the explanation for the technical symptoms.

The server was reaching 100% CPU because a lot of threads were busy waiting. Actually, there were even more threads busy doing silly stuff than there were available cores, so the machine was just going in circles.

The question is: why the hell did it happen? Let us look once again at the responsible code:

// Hand-rolled spin lock: loop until the previous value was 0, i.e. until this
// thread is the one that flipped synchroObject from 0 to 1.
while (Interlocked.Exchange(ref synchroObject, 1) == 1)
{
}
// Critical section: append the node at the tail of the linked list.
this.last.Next = node;
this.last = node;
// Release: reset the flag so another thread can enter.
Interlocked.Exchange(ref synchroObject, 0);

Thanks to WinDbg and SOS (!CLRStack -p), I was able to confirm that all these threads were trying to acquire the very same resource. A code review revealed that this resource was a global activity counter used for monitoring statistics.


I double-checked, but no other code was referring to ‘synchroObject’, and this.last is never null, so no exception should be raised there. Besides, if the problem had been caused by an exception, the system would have been stuck forever, as ‘synchroObject’ would never have been reset; the production system was able to recover from its crisis, so this was not the issue.
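To make that reasoning concrete, here is a sketch (my own, not the application's code, reusing the field names from the snippet above) of what an exception-safe version would have looked like. Since the real code has no such guard, an exception between the acquire and the release would indeed have left the flag set forever:

// Sketch only: protecting the release so the flag is always reset.
// 'synchroObject', 'last' and 'node' are the names assumed from the snippet above.
while (Interlocked.Exchange(ref synchroObject, 1) == 1)
{
}
try
{
    this.last.Next = node;
    this.last = node;
}
finally
{
    // Always release, even if the append throws.
    Interlocked.Exchange(ref synchroObject, 0);
}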

So how can this situation occur? What could have somehow interrupted the execution of those measly six lines of code? Well, the Windows kernel did. A simple scheduling decision preempted the thread and froze any progress on it. Even worse, since there was a significant thread over-allocation, the scheduler may then have picked another thread that got (busy) stuck on that very same resource. We ended up with several threads in the ready state waiting to be scheduled, and threads spending their whole quantum (30 ms) just running a useless loop.

My guess is that the thread owning the lock was then interrupted for a long time, probably because it started to contribute to a GC. At that point, all cores were running threads that were either doing endless loops or trying to perform a GC!

Having busy threads of course slowed down the GC, meaning that the thread holding the lock took even longer before giving it away. And then there was a lock convoy, which made things even worse.
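For comparison, here is a minimal sketch (mine, not from the code base, reusing the same assumed fields) of the same acquire loop written with SpinWait, which spins only briefly and then starts yielding the processor, precisely what the hand-rolled loop never does:

// Sketch: same protocol, but the waiter backs off instead of burning its whole quantum.
var spinner = new SpinWait();
while (Interlocked.Exchange(ref synchroObject, 1) == 1)
{
    // After a few hot iterations, SpinOnce starts yielding the CPU,
    // giving a preempted lock holder a chance to run again and release the flag.
    spinner.SpinOnce();
}
this.last.Next = node;
this.last = node;
Interlocked.Exchange(ref synchroObject, 0);

This would not have removed the contention, but it would have avoided entire cores being monopolized by waiters.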

Lessons learned

  • Well, the usual rule of thumb for spin loops is that they must not be used on a single-core machine, and that you should fall back to a kernel lock in that case. But when you think about it, the real rule should be ‘spin locks must not be used if you are not sure the lock holder will make progress’, which is far trickier to apply than the first one.
  • Make sure you can easily generate dumps of your live systems, as it may be the only means you have to understand what is happening.
  • When you have a large number of threads on a significant number of cores, a very, very, very unlikely event can become pretty much commonplace…
  • But above everything else: you are never clever enough to implement adequate synchronization primitives yourself. Whatever your motives are, just think TDD: how are you going to write the test demonstrating the safety of your approach? (See the sketch right after this list.)
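To illustrate that last point, here is a minimal sketch (assuming a hypothetical '_appendLock' field and the same 'last'/'node' append as in the snippet above) of the same critical section expressed with the framework's lock statement. Monitor spins very briefly on contention and then blocks, so a preempted or GC-suspended holder does not turn every waiter into a CPU burner:

// Sketch only: the append protected by a plain Monitor-based lock.
// '_appendLock' is a hypothetical field: private readonly object _appendLock = new object();
lock (_appendLock)
{
    this.last.Next = node;
    this.last = node;
}

And if the structure really is a simple producer queue, a ConcurrentQueue<T> would remove the hand-written linked list, and the hand-written lock, altogether.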

I am pretty sure this assertion is not shared by every developer, so comments are welcome and even requested.

Real-world threading issue: analysis

A couple of weeks ago, I encountered an interesting production issue. A real-time application was going crazy about once a month: CPU usage went up to 100%, but without any productive output.
As this system runs on 8 servers, it basically occurred twice a week. The first occurrences happened at system startup, during the initialization phase, and systematically led to an OOM crash, probably related to some kind of starvation. Since this is a .Net application, one can imagine that the GC was unable to perform cleanup in an appropriate time frame. For context, I am in charge of the application, but I inherited it through a reorganization and had so far relied on the existing team lead for all technical issues.

Since the problem was important and it was eluding the team, I decided to step in technically. I did extensive code review and log analysis to no significant avail: I identified various potential culprits, but testing proved them innocent each time.
We therefore took several memory-saving actions and made peripheral adjustments (such as to the logging strategy) to try to improve the situation. And then things changed: occurrences diminished by 50%, and the system survived the crisis most of the time. Performance statistics provided by our infrastructure team revealed that we had 100%-CPU crises lasting between 10 and 40 minutes, during which the system was almost stuck (little progress in the logs), but it was able to restore its functions afterward.
Once again, we checked what other applications were running on the machines. The conclusion was that this problem, whatever it was, originated from our system and nothing else.
We decided to deploy ProcDump to capture an image of the process at crisis time.

Of course, once this was in place, the problem did not occur. A couple of weeks later, we missed an occurrence because the system scheduling had been changed. Ultimately, we did get a full dump captured during a crisis.

Time to get my hands on WinDbg and SOS. The bad news was that this service created 170 threads (on an 8-core machine), most of them harmless, typically waiting on some external resource.

I dumped all stacks using !CLRStack, and the native k command when not enough information was available, typically when some stack frames related to the garbage collector/allocation or to synchronization. I painstakingly described all threads in a spreadsheet, noting their purpose, run time, probable owning component and current location.

I progressively realized that many threads were stuck on the same line. Once the full cartography was done, the result was that 13 of them were more or less on the same line(s) of code.

while (Interlocked.Exchange(ref synchroObject, 1) == 1)
{
}
this.last.Next = node;
this.last = node;
Interlocked.Exchange(ref synchroObject, 0);

You see the problem?

Threads are busy waiting for a resource!

The objective is clear: ensure the waiting thread can access the resource ASAP. The underlying assumption is that the lock (synchroObject having a value of one) is held for a very short time. Judging by the code above, you can see this is true… well, sort of.
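As a side note, and to be clear this is not the application's code: the 'hold the lock for just a few instructions' intent that this loop is reinventing is roughly what the BCL's System.Threading.SpinLock type provides, with back-off built in. A minimal sketch, where 'Node', '_last' and '_spinLock' are names I introduce only for the illustration:

// Sketch: the same intent expressed with the framework's SpinLock.
// 'Node', '_last' and '_spinLock' exist only for this illustration.
private SpinLock _spinLock = new SpinLock();
private Node _last;

private void Append(Node node)
{
    bool taken = false;
    try
    {
        // Enter spins, but with built-in back-off instead of a hot loop.
        _spinLock.Enter(ref taken);
        _last.Next = node;
        _last = node;
    }
    finally
    {
        if (taken)
        {
            _spinLock.Exit();
        }
    }
}

Note that this only changes how the waiting is done; whether spinning is appropriate here at all is a different question.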

To be continued…