Real-world threading issue analysis

A couple of weeks ago I encountered an interesting production issue. A real-time application was going crazy about once a month: CPU usage went up to 100%, but without any productive output.
As this system runs on 8 servers, that meant an incident roughly twice a week overall. The first occurrences happened at system startup, during the initialization phase, and systematically led to an OOM crash, probably related to some kind of starvation. Since this is a .NET application, one can imagine that the GC was unable to perform its cleanup in an appropriate time frame. For more context: I am in charge of the application, but I inherited it through a reorganization and had so far relied on the existing team lead for all technical issues.

Since the problem was important and was eluding the team, I decided to step in technically. I did extensive code review and log analysis, to no significant avail: I identified various potential culprits, but testing proved them innocent each time.
We therefore took several memory-saving actions and made peripheral adjustments (such as to the logging strategy) to try to improve the situation. And then things changed: occurrences diminished by 50% and the system survived the crisis most of the time. The performance statistics provided by our infrastructure team revealed 100% CPU crises lasting between 10 and 40 minutes, during which the system was almost stuck (little progress in the logs). But it was able to restore its functions afterward.
Once again we made sure that no other application was running on the machines. The conclusion was that this problem, whatever it was, originated from our system and nothing else.
We decided to deploy ProcDump to capture an image of the process at crisis time.

Of course, once it was in place, the problem did not occur. A couple of weeks later, we missed an occurrence because the system scheduling had been changed. Ultimately, we got a full dump, captured during a crisis.

Time to get my hands dirty in WinDbg and SOS. The bad news was that this service had created 170 threads (on an 8-core machine), most of them harmless, typically waiting on some external resource.

I dumped all stacks using !CLRStack, falling back to k when not enough information was available, typically when some stack frames were related to the garbage collector/allocation or to synchronization. I painstakingly described all threads in a spreadsheet, noting their purpose, run time, probable owning component and current location.

I progressively realized that many threads were stuck on the same line. Once the full cartography was done, the result was that 13 of them were more or less on the same line(s) of code.

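    // Spin until the flag is acquired (0 = free, 1 = taken).
    while (Interlocked.Exchange(ref synchroObject, 1) == 1)
        ; // empty body: busy wait
    this.last.Next = node;
    this.last = node;
    Interlocked.Exchange(ref synchroObject, 0); // release the flag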

You see the problem?

Threads are busy waiting for a resource!

The objective is clear: ensure the waiting thread can access the resource ASAP. The underlying assumption is that the lock (synchroObject having a value of 1) is held for a very short time. Judging by the code above, you can see this is true… well, sort of.
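
To make the pattern easier to play with, here is a minimal, self-contained sketch of it. Only synchroObject, last, node and the Interlocked.Exchange calls come from the snippet above; the Node and SpinQueue types, the Enqueue and Count methods, the sentinel node and the empty loop body are my assumptions for illustration, not the production code.

```csharp
using System;
using System.Threading;

// Hypothetical types: the real production classes are not shown in this post.
class Node
{
    public int Value;
    public Node Next;
}

class SpinQueue
{
    private int synchroObject;        // 0 = free, 1 = taken
    private Node last = new Node();   // sentinel node, so last is never null
    private readonly Node head;

    public SpinQueue()
    {
        head = last;                  // remember the sentinel to walk the list later
    }

    public void Enqueue(Node node)
    {
        // Busy wait: keep writing 1 until the previous value was 0,
        // which means this thread is the one that took the flag.
        while (Interlocked.Exchange(ref synchroObject, 1) == 1)
        {
            // Intentionally empty: the thread burns CPU while it waits.
        }

        this.last.Next = node;
        this.last = node;

        // Release the flag so another spinning thread can get in.
        Interlocked.Exchange(ref synchroObject, 0);
    }

    public int Count()
    {
        int count = 0;
        for (Node n = head.Next; n != null; n = n.Next)
        {
            count++;
        }
        return count;
    }
}

static class Program
{
    static void Main()
    {
        var queue = new SpinQueue();
        var threads = new Thread[4];

        for (int t = 0; t < threads.Length; t++)
        {
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < 10000; i++)
                {
                    queue.Enqueue(new Node { Value = i });
                }
            });
            threads[t].Start();
        }

        foreach (var thread in threads)
        {
            thread.Join();
        }

        // Every append happened under the flag, so no node should be lost.
        Console.WriteLine(queue.Count());
    }
}
```

With 4 threads appending 10,000 nodes each, this should print 40000. As long as the two assignments are all that happens while the flag is taken, the spinning is short; the point to keep in mind is that each waiting thread occupies a core for the whole time it waits.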

To be continued…

