Concurrency Paradigms: Data parallelism

Data parallelism is the approach where the same program/algorithm is applied to multiple subsets of the data in parallel. This is not a new approach, and it comes in several variants:

  • vector computing: data are expressed as vectors, often of fixed size, and all entries are processed in parallel. This technique is used in all current GPUs and exists in most current CPUs (the SSEx instructions on Intel, for example); a small sketch follows this list
  • massive parallelism: systems using many dumb nodes with dedicated memory and small processing capabilities

[Figure: CRAY-1, the famous SIMD computer (Photo credit: Wikipedia)]
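
To make the vector computing idea more concrete, here is a minimal sketch (the method and array contents are illustrative, not taken from any particular codebase) using .NET's System.Numerics.Vector<T>, which the JIT maps onto SSE/AVX registers on supported hardware:

    using System.Numerics;

    static class VectorSample
    {
        // Adds two arrays lane by lane; each Vector<float> operation processes
        // several entries at once (4, 8, ... depending on the CPU).
        public static void Add(float[] a, float[] b, float[] result)
        {
            int width = Vector<float>.Count;
            int i = 0;
            for (; i <= a.Length - width; i += width)
            {
                var va = new Vector<float>(a, i);
                var vb = new Vector<float>(b, i);
                (va + vb).CopyTo(result, i);
            }
            for (; i < a.Length; i++)       // scalar tail for the leftover entries
                result[i] = a[i] + b[i];
        }
    }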

In C#, the Parallel.For method (and its variants) helps implement data parallelism.
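
For instance, a minimal sketch (the workload here is an arbitrary per-element computation) could look like this:

    using System;
    using System.Threading.Tasks;

    class DataParallelismSample
    {
        static void Main()
        {
            var input = new double[1_000_000];
            var output = new double[input.Length];

            // Each index is independent, so the runtime is free to split the
            // range across all available cores.
            Parallel.For(0, input.Length, i =>
            {
                output[i] = Math.Sqrt(input[i]) * 2.0;
            });

            Console.WriteLine("done");
        }
    }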

Pros:

  • Scales well as the dataset is usually far larger than the number of available execution units (think thousands versus tens)
  • Many helpful frameworks/libraries exist
  • Primitives are friendly
  • Can be used for horizontal scalability as well

Cons:

  • Only makes sense at the algorithm level
  • Does not work for all algorithms

Bottom line: Dominant model for scientific/computationally intensive workloads.

Concurrency Paradigms: Task parallelism

Task parallelism is the approach where the application execution is broken down into smaller independent tasks. Those tasks are typically executed by a thread pool; as long as they are independent, they can be executed concurrently. If any synchronization or mutual exclusion is needed, one can still use locks or events as usual; but you need to remain cautious, as thread pools and synchronization may not play well together and can lead to tricky deadlock or livelock situations.

The strongest attribute of this paradigm is the fact that it scales very well, assuming the program is able to provide enough independent tasks to feed the thread pool.

This approach has gained a lot of traction since the new millennium, especially as it fits very well the basic needs of HTTP servers: they expose a stateless service, and each request is short-lived.
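
As an illustration, here is a minimal sketch (ExpensiveComputation is a placeholder for some CPU-bound work) of posting independent tasks to the thread pool and collecting their results:

    using System;
    using System.Linq;
    using System.Threading.Tasks;

    class TaskParallelismSample
    {
        static async Task Main()
        {
            // Three independent, short-lived pieces of work, each posted to the thread pool.
            Task<int>[] tasks = Enumerable.Range(1, 3)
                .Select(n => Task.Run(() => ExpensiveComputation(n)))
                .ToArray();

            int[] results = await Task.WhenAll(tasks);
            Console.WriteLine(string.Join(", ", results));
        }

        // Placeholder for some CPU-bound work.
        static int ExpensiveComputation(int n) => n * n;
    }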

Pros:

  • Scales well
  • Relatively easy to master

Cons:

  • No help regarding synchronization needs
  • Implies tasks are short-lived.

Bottom line: THE PARADIGM OF CHOICE for server-side applications, but you need to think ahead about your synchronization/mutex needs.

The problem with locks (2)

In ‘Why locks suck?’ I stressed several issues with locks. Today, I will elaborate on their main issue, which is that they do not compose. But I need to finish my list first, as I skipped the performance-related issues.

Locks hurt performance badly; it is a universally known fact. But what is less known is how badly they hurt performance, and why. The main myth is that locks are expensive because they are kernel objects and the user/kernel transition is expensive.

This argument still holds some truth, but the fact is that both Intel and AMD have worked on improving this situation in their recent CPU lines, so the transition cost (back and forth) is now less than 300 cycles, the same order of magnitude as an access to non-cached memory.

But the main reason locks hurt performance is that they simply stall a core: in case of contention they trigger a context switch, trashing various CPU caches along the way. Basically, a context switch is quite a catastrophic event from a performance point of view. Here is a list of what it may cost:

  • Kernel/user transition (always)
  • Stacking and un-stacking of the core state: all registers to be stored for current context + restore for target context (always)
  • Loss of execution context (very likely, unless target context uses same code)
    • loss of branch prediction caches
    • flush of the execution pipeline
    • stack entries are lost from the cache
  • Loss of cache entries for recently accessed data (likely, unless switching to a similar context within the same process)
  • Loss of cache entries for recently accessed code (likely, unless switching to a similar context within the same process)
  • Loss of TLB entries (likely). As a reminder, TLB stands for Translation Look-aside Buffer; it is used for the address translation required to implement virtual memory. This loss happens if you switch to a different process.
  • Scheduling cost: you have to factor in the execution time needed to elect the new thread to be run.

When you take all those into account, you realize that the actual latency cost of a context switch is a matter of thousands of cycles, far above the user/kernel transition cost.
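
To put a rough number on it yourself, here is a micro-benchmark sketch (not a rigorous measurement; the thread count and iteration count are arbitrary, and figures vary wildly with hardware and runtime) comparing an uncontended lock with a heavily contended one:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class LockCostSketch
    {
        const int Iterations = 1_000_000;
        static readonly object Sync = new object();
        static long _counter;

        static void Main()
        {
            // Uncontended: a single thread takes and releases the lock repeatedly.
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < Iterations; i++)
                lock (Sync) _counter++;
            Console.WriteLine($"Uncontended: {sw.ElapsedMilliseconds} ms");

            // Contended: four threads fight over the same lock; some of them end up
            // blocked by the kernel and pay the context-switch price described above.
            _counter = 0;
            sw.Restart();
            var threads = new Thread[4];
            for (int t = 0; t < threads.Length; t++)
            {
                threads[t] = new Thread(() =>
                {
                    for (int i = 0; i < Iterations; i++)
                        lock (Sync) _counter++;
                });
                threads[t].Start();
            }
            foreach (var thread in threads) thread.Join();
            Console.WriteLine($"Contended (4 threads): {sw.ElapsedMilliseconds} ms");
        }
    }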

They do not compose

I assume most of you have had first-hand experience with some ‘thread-safe’ framework. Therefore I can confidently state that I have yet to see a ‘thread-friendly’ one. Most thread-safe frameworks rely on locks and offer you events or subscription mechanisms which are basically deadlock pitfalls.

You want to see a thread-safe framework fail? Try to unsubscribe while being notified on that very same subscription. At best, it will raise an exception, but it will probably simply deadlock.
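
Here is a minimal sketch of the pattern (the Publisher class is a hypothetical stand-in for such a framework), showing why this innocent-looking call goes wrong:

    using System;
    using System.Collections.Generic;

    // Hypothetical lock-based 'thread-safe' publisher, typical of such frameworks.
    class Publisher
    {
        private readonly object _sync = new object();
        private readonly List<Action> _handlers = new List<Action>();

        public void Subscribe(Action handler)   { lock (_sync) _handlers.Add(handler); }
        public void Unsubscribe(Action handler) { lock (_sync) _handlers.Remove(handler); }

        public void Notify()
        {
            // The lock is held while calling back into user code: the classic trap.
            lock (_sync)
                foreach (var handler in _handlers)
                    handler();
        }
    }

    class Demo
    {
        static void Main()
        {
            var publisher = new Publisher();
            Action handler = null;
            // The handler unsubscribes itself while it is being notified.
            handler = () => publisher.Unsubscribe(handler);
            publisher.Subscribe(handler);

            // C#'s Monitor is reentrant, so Unsubscribe gets in from the same thread
            // and mutates the list during enumeration: InvalidOperationException.
            // With a non-reentrant lock (SemaphoreSlim, SpinLock), the same call deadlocks.
            publisher.Notify();
        }
    }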

So why do locks not compose? Simply because you need to order them properly! And any significant chunk of code relies on some event/observer/inversion of control/interface implementation/virtual method override. We use those as extension points to alter and complete the behavior of existing objects.
And to develop those extensions, one needs to be aware of the locks used by those classes. The basic situation, where the classes are not thread-aware and use no locks, is typically called non-thread-safe. It can be a blessing, but it means you have to wrap them somehow, which is extra work.

Then you have the so-called thread-safe classes, which typically use ‘synchronized’ (in Java) or ‘lock’ (in C#) extensively; you need to be careful when using their extension points. But the truth is, even if you are careful enough, the next version of the library, or the next developer who maintains this codebase, will probably encounter deadlocks, or even worse, race conditions.
I have been the initial ‘clever’ developer, I have also been the next one, and even the nth. I have made many, many mistakes, enough of them to be considered an expert in the multithreading field.

The bottom line is: whatever the locking model is, there is no way you can make it foolproof; even sadder, there is no way you can document it properly. The best you can do is run long knowledge-transfer sessions that cover both how the locks are used and the general threading model, so that newcomers can add their own locks. And pray for the day when a ‘clever’ one will try to refactor the model!

Key takeaway: locks are a dead end!
I came to this gloomy conclusion 7 years ago. I understood then that we needed a radically new approach, and as none was available at the time, I decided to act upon this!

Paradigms: threaded concurrency

Threaded concurrency derives from the previous one (process concurrency): all the sequential programs can be hosted within the same process, each running on its own thread.

Advantages

  • Easier to implement
  • Easier to debug.

Drawbacks

  • No fault tolerance
  • Still some overhead due to the use of kernel objects, although user-space locks can be used
  • No dynamic control of parallelism level.

One typical implementation is creating one thread per client in a server. One famous implementation is Node.js, where developers write single-threaded code while the runtime uses multiple threads under the hood when it needs to.
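
A minimal sketch of the thread-per-client model (the port and the echo protocol are arbitrary) could look like this:

    using System.IO;
    using System.Net;
    using System.Net.Sockets;
    using System.Threading;

    class ThreadPerClientServer
    {
        static void Main()
        {
            var listener = new TcpListener(IPAddress.Any, 5000);
            listener.Start();

            while (true)
            {
                TcpClient client = listener.AcceptTcpClient();
                // One dedicated thread owns this client for the whole session.
                new Thread(() => HandleClient(client)) { IsBackground = true }.Start();
            }
        }

        static void HandleClient(TcpClient client)
        {
            using (client)
            {
                var stream = client.GetStream();
                var reader = new StreamReader(stream);
                var writer = new StreamWriter(stream) { AutoFlush = true };

                // Simple echo loop: blocking I/O, no shared state between clients.
                string line;
                while ((line = reader.ReadLine()) != null)
                    writer.WriteLine(line);
            }
        }
    }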

My opinion: Still the dominant model. The first implementations were transpositions of the multi-process model, over which it brings greatly improved performance. It is also well suited for session-oriented servers/services where each thread is in charge of a session/client.
It does not scale that well, because you need to ensure you have at least one thread per core to get any benefit, and one often ends up with a system creating far too many threads. Still, it is a pragmatic solution when metrics fall within the correct range, which is around 50-300 simultaneous clients/requests.