When a lock-free algorithm spins out of control


Yesterday I had the privilege of diagnosing a difficult bug. I have yet to piece together the full story, but based on the evidence accumulated so far, I am pretty confident I have found the culprit: a low-level method implemented using “lock-free” techniques. I will post the full story in a few days. As a teaser, let me say that it took us two months to get from the first occurrence to identifying the cause. The symptom was that the system went up to 100% CPU usage for a variable length of time (up to 30 minutes).

One last thing: when I say “lock-free”, I do not mean lock freedom in the formal sense (a progress guarantee). I mean code that does not rely on any OS/kernel object for its synchronization needs, but on its own mechanisms, usually some form of atomic operations.
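
To make that definition concrete, here is a minimal sketch of the kind of code I have in mind. It is not the method involved in the bug. `SpinLock` and `run_demo` are hypothetical names made up for this illustration, and a spinlock built on `std::atomic_flag` is only one of many possible forms such code can take. Note that a waiting thread never sleeps in the kernel: it loops until the flag is free, so it stays runnable the whole time.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical user-space lock: no mutex, no kernel object, only an atomic flag.
class SpinLock {
public:
    void lock() {
        // test_and_set returns the previous value. Loop until this thread is
        // the one that flipped the flag from clear to set.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // Only a scheduling hint: the thread remains runnable and keeps
            // retrying for as long as the lock is held by someone else.
            std::this_thread::yield();
        }
    }

    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Illustrative driver: several threads increment a shared counter under the lock.
long run_demo(int thread_count, int increments_per_thread) {
    assert(thread_count > 0 && increments_per_thread >= 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&lock, &counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++counter;  // protected by the spinlock
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Calling `run_demo(4, 10000)` should return 40000 if the lock does its job. The point of the sketch is the waiting strategy: under contention, or if the lock holder is delayed, the waiters consume CPU instead of blocking.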

So stay tuned for further developments.
