Yesterday I had the privilege of diagnosing a difficult bug. I do not have the full story yet, but based on the evidence accumulated so far, I am fairly confident I have found the culprit: a low-level method implemented using “lock-free” techniques. I will post the full story in a few days. As a teaser, let me say that it took us two months from the first occurrence to identify the cause. The problem was that the system would go up to 100% CPU usage for a variable length of time (up to 30 minutes).
One last thing: when I say “lock-free”, I do not mean lock freedom in the formal sense (the guarantee that at least one thread always makes progress). I mean code that does not rely on any OS/kernel object for its synchronization needs, but on its own mechanisms, usually some form of atomic operations.
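To make that definition concrete, here is a minimal sketch of the kind of code I have in mind. It is not the method involved in the bug; it is only an assumed illustration, and the names `SpinLock` and `run_counter_demo` are hypothetical. It protects a counter with nothing but an atomic flag, so a thread that has to wait spins instead of sleeping on a kernel object, and the waiting shows up as CPU usage.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical illustration: a lock built only on an atomic flag,
// with no OS/kernel synchronization object involved.
class SpinLock {
public:
    void lock() {
        // test_and_set atomically sets the flag and returns its previous value.
        // While another thread holds the lock, this loop keeps a core busy.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy-wait: no sleeping, no kernel wait queue
        }
    }

    void unlock() {
        // Clearing the flag lets exactly one spinning thread get through.
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Starts thread_count threads that each increment a shared counter
// `iterations` times under the spin lock, then returns the final value.
long run_counter_demo(int thread_count, int iterations) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;  // protected by the spin lock
                lock.unlock();
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter;
}
```

Calling `run_counter_demo(4, 1000)` should return 4000 if the lock does its job. Note that this sketch is exactly the kind of code that is “lock-free” in my loose sense without being lock-free in the formal sense: if the thread holding the flag is delayed, every other thread spins.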
So, stay tuned for further developments.