[ad_1]

With the performance of modern CPU cores plateauing recently, the main performance gains are with multiple cores and multithreaded applications. Typically, a fast GPU is only so mind-bogglingly quick because thousands of cores operate in parallel on the same set of tasks. So, it would seem prudent for our applications to try to code in a multithreaded fashion to take advantage of this parallelism. Or so it would seem, but as [Marc Brooker] illustrates, it’s not as simple as one would assume, and it’s very easy to end up with far worse overall performance and no easy way to fix it.

[Marc] was rerunning an old experiment to calculate the expected number of birthdays in a shared group of people using brute force. The experiment was essentially a tight loop running a pseudorandom number generator, the standard libc rand() function. [Marc] profiled the code for single-thread and multithreaded versions and noted the runtime dramatically increased beyond two threads. Something fishy was going on. Running perf, [Marc] noted that there were significant L1 cache misses, but the real killer for performance was the increase in expensive context switches.  Perf indicated that for four threads, the was an overhead of nearly 50% servicing spin locks. There were no locks in the code, so after more perf magic, the syscalls taking all the time were identified.  Something in there was using a futex (or fast userspace mutex) a whole lot.

After delving into the glibc source code, a comment said it all:

/* POSIX.1c requires that there is mutual exclusion for the `rand' and
`srand' functions to prevent concurrent calls from modifying common
data. */

__libc_lock_lock (lock);
 (void) __random_r (&unsafe_state, &retval);
 __libc_lock_unlock (lock);

By replacing the call to rand() with random_r(), the program’s performance with four threads improved dramatically. The runtime was reduced to a theoretical quarter of the single-thread version. As Marc summarizes, multi-threaded programming is not always as straightforward as one might think. While performance can be significantly worse in some cases, improvements are possible. However, this is not guaranteed to be the case in every situation.

The art of debugging and profiling code is complex, so here’s how to use Valgrind to look for problems you might not even know about. Even the humble Linux pipe needs to be thought out to get decent performance. What a surprise!

[ad_2]

Leave a Reply

Your email address will not be published. Required fields are marked *