
Concurrency issues are among the hardest to investigate and fix, mainly because of the nondeterministic nature of threads and processes. Every time you run concurrent code, it may execute in a slightly different order, which affects the correctness and performance of a program. As a result, the most popular techniques used for investigating sequential code, like log analysis or examining code with a debugger, are insufficient.

You could obviously say that there are multiple high-level solutions, like Node.js-based frameworks, Java application server frameworks, or simply databases, which implement all the concurrency machinery internally so that you don’t need to care about it at all. That sounds fair for enterprise applications or web services, but not necessarily for embedded systems, where you often reach for low-level mechanisms to deliver the expected efficiency.

This article presents methods and tools which can be quite helpful in dealing with nondeterminism and the typical problems of multi-threaded applications.

Prerequisites

In this article I mainly use tools available on the Linux platform with kernel release 5.19.0. All the examples have been written in C++ and compiled with GCC 12 on an Intel(R) Core(TM) processor. However, the presented techniques are universal, so you can find analogous tools for other platforms and programming languages.

Analyzing deadlocks with “post-mortem” debugging

Let’s start with the first typical anomaly that occurs in a multi-threaded environment: deadlock. A deadlock is a situation where two or more threads try to acquire the same resources and wait for each other in a way that puts them in a deadly embrace. As a result, none of the threads can proceed and do useful work.

To illustrate the issue, it’s worth recalling the dining philosophers problem. In a nutshell, you have a finite number of philosophers sitting at the same table, each with a plate. There is exactly one fork between each pair of plates, and each philosopher can do one of two actions: eat or think.

Naturally, a philosopher needs two forks to eat (the left and the right one). Now imagine the situation where every philosopher wants to eat at the same time and grabs one of the available forks. As a result, all the forks are occupied and none of the philosophers can eat, because they block each other. That’s exactly what a deadlock is.

Surprisingly, this class of problems is quite simple to investigate and fix. All you actually need is a debugger 🙂 You already know the theory, so you can jump into Example 1.

#include <iostream>
#include <chrono>
#include <thread>
#include <mutex>


using namespace std::chrono_literals;

static std::mutex m1, m2;

void Alice()
{
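    // Alice locks m1 first, then m2.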
    std::lock_guard<std::mutex> lock1{m1};
    std::this_thread::sleep_for(1s);
    std::lock_guard<std::mutex> lock2{m2};
}

void Bob()
{
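    // Bob locks m2 first, then m1: the reverse order, which makes the deadlock possible.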
    std::lock_guard<std::mutex> lock2{m2};
    std::this_thread::sleep_for(1s);
    std::lock_guard<std::mutex> lock1{m1};
}


int main()
{
    std::thread aliceThread(Alice);
    std::thread bobThread(Bob);
    
    aliceThread.join();
    bobThread.join();
    return 0;
}

You can compile Example 1 and run it as follows:

$ g++ -ggdb -o deadlock deadlock.cpp -pthread
$ ./deadlock

The code is compiled with a flag that adds debugging symbols for the gdb debugger, which is the tool that will be used in the analysis. After running the above program, you can observe that it freezes: you have a classical deadlock. The very first step in dealing with such a case is reading the PID of the hanged process and sending the SIGABRT signal to it, which forces the Linux kernel to generate a core dump file, crucial for further analysis (make sure that the core file size is not set to 0 on your Linux OS; you can check it with the ulimit -c command). The following commands do the job:

$ ps x | grep "deadlock"
   5767 pts/0    Sl+    0:00 ./deadlock
   5785 pts/1    S+     0:00 grep --color=auto deadlock
$ kill -SIGABRT 5767

On most Linux distributions core dump files are stored in the current working directory, but some distributions define a specific global path for all processes, so you need to check how it works on your platform (the default core file path can be changed in the /proc/sys/kernel/core_pattern file).
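For instance, you can check both settings as follows (the output below is illustrative; it depends on your configuration):

$ ulimit -c
unlimited
$ cat /proc/sys/kernel/core_pattern
core

Having a core dump file, you can run gdb on it: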

$ gdb ./deadlock core_dump

Now you can display information about threads:

(gdb) info threads
  Id   Target Id                         Frame 
* 1    Thread 0x7f2ec07043c0 (LWP 6179)  __futex_abstimed_wait_common64
  2    Thread 0x7f2ebffff640 (LWP 6180)  futex_wait
  3    Thread 0x7f2ebf7fe640 (LWP 6181)  futex_wait

As you can see, two threads seem to be waiting for a lock. Let’s dig into the call stacks of the specific threads. You can do it by running the following:

(gdb) thread 2
(gdb) bt
#0  futex_wait  
#1  __GI___lll_lock_wait
#2  0x00007f2ec0098082 in lll_mutex_lock_optimized 
#3  ___pthread_mutex_lock 
#4  0x0000559f706ea5d5 in __gthread_mutex_lock
#5  0x0000559f706ea74c in std::mutex::lock
#6  0x0000559f706ea7c6 in std::lock_guard<std::mutex>::lock_guard 
#7  0x0000559f706ea371 in Alice () at deadlock.cpp:15
#8  0x0000559f706eb24d in std::__invoke_impl<void, void (*)()> 
#9  0x0000559f706eb1f9 in std::__invoke<void (*)()> 
...
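By the way, you can dump the call stacks of all the threads at once with a single command:

(gdb) thread apply all bt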

If you walk carefully through the stack trace, you will notice that the problem occurs in line 15 of deadlock.cpp; that’s the place where Alice is blocked:

std::lock_guard<std::mutex> lock2{m2};

Now, if you repeat the same steps for Bob, you will realize that this thread is blocked too, in line 22:

std::lock_guard<std::mutex> lock1{m1};

It’s relatively simple to explain the issue: the Bob thread tries to acquire the mutex m1, which has already been grabbed by Alice, and vice versa. So how can you fix the problem? In this particular case, you need to make sure that the resources are locked in the same order by Alice and Bob. You can look at Example 2.

void Alice()
{
    std::lock_guard<std::mutex> lock1{m1};
    std::this_thread::sleep_for(1s);
    std::lock_guard<std::mutex> lock2{m2};
}

void Bob()
{
    std::lock_guard<std::mutex> lock1{m1};
    std::this_thread::sleep_for(1s);
    std::lock_guard<std::mutex> lock2{m2};
}
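By the way, since C++17 the standard library also offers std::scoped_lock, which acquires multiple mutexes using a built-in deadlock-avoidance algorithm, so you don’t have to maintain the locking order manually. A minimal sketch of this variant:

void Alice()
{
    std::scoped_lock lock{m1, m2}; // locks both mutexes without risking a deadlock
}

void Bob()
{
    std::scoped_lock lock{m1, m2}; // the argument order does not matter here
}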

So simple, isn’t it? Let’s move on.

Tracking race conditions with capture and replicate technique

Now things are getting much more interesting, as this time you are going to face a race condition :). In the previous example it was quite easy to track the problem down, as the program hung and you could simply take a snapshot of the process’s memory. In the case of a race condition it is, to put it mildly, a little bit more challenging.

In general, this anomaly is a situation where two or more operations are performed simultaneously, while a specific order of those operations is required to make the program work properly. In a nutshell, a race condition results in undefined behaviour of a program. Ok, that’s enough in terms of theory. Let’s have a look at Example 3:

#include <iostream>
#include <thread>
#include <vector>


static auto counter = 0u;
constexpr auto MAX_ITER_NUMBER = 20000000u;


void task()
{
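	// Non-atomic read-modify-write: another thread can update counter between these steps.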
	auto tmp = counter;
	++tmp;
	counter = tmp;
}

void worker()
{
    for(auto idx = 0u; idx < MAX_ITER_NUMBER; ++idx)
    {
	    task();
    }
}


int main()
{
	const auto threadNumber = std::thread::hardware_concurrency();
	std::vector<std::thread> threads;
	threads.reserve(threadNumber);
	for(auto idx = 0u; idx < threadNumber; ++idx)
	{
		threads.emplace_back(worker);
	}
	for(auto& th : threads)
	{
		th.join();
	}
	std::cout << "Expected calls: " << threadNumber*MAX_ITER_NUMBER << " Real calls: " << counter << std::endl;
    	return 0;
}

This time the program creates multiple workers, depending on the number of so-called hardware threads supported by your processor. Each thread just increments a counter, and at the end of the program the number of increments is written to stdout. In theory, you expect the value of counter to be predictable and equal to the number of threads multiplied by the number of iterations. Let’s compile the code and run it.

$ g++ -o race -ggdb race.cpp -pthread
$ ./race
Expected calls: 160000000 Real calls: 25869183

As you can see, the displayed values are different and, what is more, if you run the program multiple times you will realize that the final counter value is completely random; that’s a pure race condition. It’s worth emphasizing that in this example you get a wrong result almost every time you fire the code, which is actually the best possible case. Usually, you witness the failure quite rarely, and thus race conditions are irritating and hard to detect (especially in complex systems). Anyway, let’s discuss one of the techniques which is quite flexible and effective in dealing with this kind of bugs.

The method is called capture and replicate (some people call it time travel debugging :)) and it can be described with the following steps:

  1. Run the program repeatedly with recording until you reproduce the issue.
  2. Analyse the recording.
  3. Replay the recording, running the program under a debugger, until the bug occurs.
  4. Analyse the program’s state at that point.

Unfortunately, this time a standard debugger is not sufficient; you need support from an external tool. In fact, the most mysterious part of this approach is the recording functionality. There are multiple (commercial and non-commercial) tools that offer this feature, and in this specific case you will use rr, which is a free and lightweight recording tool (PinPlay created by Intel and UDB are also interesting options).

It is also worth mentioning that gdb itself has a recording feature too. Unfortunately, it does not support some system calls crucial for simulating concurrent behaviour; support is expected to be added in the future.
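For reference, gdb’s built-in recording (the process record mode) is started like this once the program is running:

(gdb) start
(gdb) record full

Ok, now you know the theory, so it’s high time to put it into practice. Let’s check what rr can do (the -h switch enables its chaos mode, which randomizes scheduling decisions to make races easier to reproduce):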

$ rr record -h ./race
(rr): Saving execution to trace directory /root/.local/share/rr/race-20.
Expected calls: 160000000 Real calls: 25757037

As you can see, we’ve succeeded in recording the failure; that’s half the battle. The cool thing about the recording is that now you can deterministically repeat the sequence of events that leads to the race condition (time travel). The second super feature of rr is the possibility of running a gdbserver with our recording and simply attaching to it with the GNU debugger.

$ rr replay race-20
For help, type "help".
Type "apropos\ word" to search for commands related to "word"...
Reading symbols from /home/michal/race...
Really redefine built-in command "restart"? (y or n)
Remote debugging using 127.0.0.1:7599
Reading symbols from /lib64/ld-linux-x86-64.so.2...
(rr)

Looks good, doesn’t it? You have a gdb session fired up, which allows an advanced analysis of the misbehaviour. To begin with, you can set a conditional breakpoint in order to stop the program at some point and then check what kind of operations are performed by the specific threads.

(rr) break 14 if counter>15757037
Breakpoint 1 at 0x5626f30a73be: file race.cpp, line 14.
(rr) c
Continuing.
[New Thread 21868.21872]
[New Thread 21868.21871]
[New Thread 21868.21870]
[New Thread 21868.21869]
[Switching to Thread 21868.21871]
Thread 3 hit Breakpoint 1, task () at race.cpp:14
(rr) info thread
  Id   Target Id                 Frame 
  1    Thread 21868.21868 (race) __futex_abstimed_wait_common64
  2    Thread 21868.21872 (race) task () at race.cpp:14
* 3    Thread 21868.21871 (race) task () at race.cpp:14
  4    Thread 21868.21870 (race) task () at race.cpp:14
  5    Thread 21868.21869 (race) task () at race.cpp:15
(rr) thread 2
[Switching to thread 2 (Thread 21868.21872)]
#0  task () at race.cpp:14
14      counter = tmp;
(rr) print(tmp)
$8 = 15355793
(rr) thread 3
[Switching to thread 3 (Thread 21868.21871)]
#0  task () at race.cpp:14
14      counter = tmp;
(rr) print(tmp)
$9 = 15757044
(rr) thread 4
[Switching to thread 4 (Thread 21868.21870)]
#0  task () at race.cpp:14
14      counter = tmp;
(rr) print(tmp)
$10 = 13297385

As you can see, 3 threads out of 5 try to write completely different values to counter at the same time! Now everything becomes clear: the source of the problem is the lack of atomicity of the read and write operations on the counter variable. You can simply fix it by adding some synchronization mechanism.

It’s worth remembering that adding synchronization always affects the performance of a program, so be careful. In this case the best option is an atomic variable with relaxed memory ordering (memory barriers and memory ordering are actually another interesting topic in terms of concurrency, but I will not touch them in this paper). You can check the full solution in Example 4:

#include <iostream>
#include <thread>
#include <vector>
#include <atomic>


static std::atomic_uint counter{0u};
constexpr auto MAX_ITER_NUMBER = 20000000u;


void task()
{
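	// A single atomic read-modify-write; relaxed ordering is enough because no other data is synchronized through counter.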
	counter.fetch_add(1, std::memory_order_relaxed);
}

Finally, let’s make sure that everything works fine.

$ g++ -o fix_race -ggdb fix_race.cpp -pthread
$ ./fix_race
Expected calls: 160000000 Real calls: 160000000

In the end, it is worth mentioning the limitations of rr, which apply to other recording tools as well:

  1. rr works mainly with Intel processors and requires a fairly recent Linux kernel.
  2. Despite being quite effective in reproducing a concurrent environment, it is still a single-core simulation, which might behave differently than expected.
  3. The great majority of system calls are supported (recorded), but not all of them.
  4. Debugging the recording is a slow process; moving from one breakpoint to another takes some time, especially when multiple threads must be simulated.

Anyway, time travel tools really make your life easier in terms of finding bugs (not only in multi-threaded applications), so I encourage you to play with them a little bit.

Investigating performance issues with performance counters

So far you have been investigating issues affecting the correctness of programs. This time, let’s dig into another class of concurrency problems, connected with performance. Let’s start with some theory first…

In modern multi-core processor architectures, each core possesses its own internal memory (cache). The reason for that is obviously… performance. If data stored in a specific memory region is accessed frequently, it is worth caching it to make the access faster. The data is transferred between main memory and the cache in fixed-size blocks called cache lines.

And here comes the question: what if multiple threads share data from the same cache line? At first glance it looks like a potential race condition. However, you do not have to care about that: the processor cores use a cache coherency protocol which synchronizes the data between the cores. Unfortunately, this synchronization has a negative impact on performance. This effect is well known as false sharing. Let’s have a look at the piece of code in Example 5:

#include <random>
#include <algorithm>
#include <array>

#include <omp.h>

constexpr auto THREAD_NUMBER = 8;
constexpr auto N = 10000000;

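// Each thread will write to its own element of sums, but neighbouring elements sit on the same cache line.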
std::array<unsigned, THREAD_NUMBER> sums;
std::array<unsigned, N> values;


void calculateSums()
{
	auto threadId = 0;
#ifdef _OPENMP
	omp_set_num_threads(THREAD_NUMBER);
	#pragma omp parallel private(threadId)
	{
		threadId = omp_get_thread_num();
#else
	for(threadId=0; threadId<THREAD_NUMBER; ++threadId)
	{
#endif
		for(auto idx=0; idx<N; ++idx)
		{
			sums[threadId] += values[idx] >> threadId;
		}
	}
}


int main() 
{
	std::random_device rDevice;
	std::mt19937 engine {rDevice()};
	std::uniform_int_distribution<unsigned> dist {1, N};
	std::generate(values.begin(), values.end(), [&dist, &engine]() {
		return dist(engine);
	});
	calculateSums();	
	return 0;
}

The program calculates 8 sums, where the k-th sum adds up all the values shifted right by k bits (i.e., divided by 2^k), in two modes: sequentially and in parallel using the OpenMP library. In theory, you should expect the second approach to be much faster; after all, each sum is calculated independently using 8 threads. Let’s find out how fast it works with OpenMP support.

$ g++ -o false_sharing_par -ggdb -fopenmp false_sharing.cpp
$ time ./false_sharing_par
real  0m1,097s
user  0m6,061s
sys   0m0,004s

As you see it takes about 1.097s to perform 8 sums. Now let’s compile the code without OpenMP support and repeat the experiment.

$ g++ -o false_sharing_seq false_sharing.cpp
$ time ./false_sharing_seq
real  0m0,928s
user  0m0,907s
sys   0m0,020s

This time it took about 0.928s, so the same task performed sequentially is faster… Are you surprised? What you are witnessing is false sharing at work. All 8 threads modify the same buffer while calculating the sums. As a result, the CPU cores keep triggering the coherency protocol and you observe performance degradation.

Now let’s try to dig into the problem and track down the bottleneck. In order to debug this, you will tap into the CPU’s performance counters. The counters track events connected with… the CPU’s performance, for instance the number of executed instructions, CPU cycles, etc. There are obviously multiple tools for analyzing performance counters, but as long as you use Linux, the perf tool is sufficient for your needs.

In order to investigate the issue, you will mainly focus on the LLC-loads counter, which counts loads that reach the last-level cache, i.e. loads that missed the faster, core-local cache levels. In fact, when false sharing occurs this number dramatically increases, because the coherency protocol keeps invalidating the cache line in the cores’ local caches. Let’s compare the counter for the sequential and parallel versions of the code.

$ perf stat -e LLC-loads ./false_sharing_seq 
  Performance counter stats for './false_sharing_seq':
            339 028      LLC-loads
$ perf stat -e LLC-loads ./false_sharing_par
  Performance counter stats for './false_sharing_par':
            2 343 694      LLC-loads

As you can see, the difference is significant: the number of such loads is approximately 7 times higher for the parallel code. Now you need to locate the source of the problem (to be more specific, the place where false sharing affects the code). In order to do that, you can use the perf tool again. The following command records the event which is interesting from your perspective:

$ perf record -e LLC-loads ./false_sharing_par

Finally, you can analyze the results of the recording with the following command:

$ perf annotate
Percent | Source code & Disassembly of false_sharing_par for LLC-loads 
-------------------------------------------
         :
         :
         :
         : 3    Disassembly of section .text:
         :
         : 5    000000000000263f <calculateSums()
         : 6    calculateSums() [clone ._omp_fn.0]:
         : 19   void calculateSums()
         : 19   void calculateSums()
         : 20   {
         : 19   void calculateSums()
         : 20   {
         : 21   auto threadId = 0;
         : 22   #ifdef _OPENMP
         : 23   omp_set_num_threads(THREAD_NUMBER);
         : 24   #pragma omp parallel private(threadId)
    0.00 :   263f:   endbr64
    0.00 :   2643:   push   %rbp
    0.00 :   2644:   mov    %rsp,%rbp
    0.00 :   2647:   push   %rbx
    0.00 :   2648:   sub    $0x28,%rsp
    0.00 :   264c:   mov    %rdi,-0x28(%rbp)
         : 21   {
         : 22   threadId = omp_get_thread_num();
    0.00 :   2650:   call   2310 <omp_get_thread_num@plt>
    0.00 :   2655:   mov    %eax,-0x14(%rbp)
         : 26   #else
         : 27   for(threadId=0; threadId<THREAD_NUMBER; ++threadId)
         : 28   {
         : 29   #endif
         : 30   for(auto idx=0; idx<N; ++idx)
    0.00 :   2658:   movl   $0x0,-0x18(%rbp)
    0.00 :   265f:   nop
    0.00 :   2660:   cmpl   $0x98967f,-0x18(%rbp)
    0.02 :   2667:   jg     26ae <calculateSums()
         : 28   {
         : 29   sums[threadId] += values[idx] >> threadId;
    0.02 :   2669:   mov    -0x18(%rbp),%eax
    0.00 :   266c:   cltq
    0.00 :   266e:   mov    %rax,%rsi
    0.00 :   2671:   lea    0x39e8(%rip),%rax
    0.02 :   2678:   mov    %rax,%rdi
    0.00 :   267b:   call   2996 <std::array<>::operator[](unsigned long)>
    0.00 :   2680:   mov    (%rax),%edx
    0.29 :   2682:   mov    -0x14(%rbp),%eax
    0.00 :   2685:   mov    %edx,%ebx
    0.59 :   2687:   mov    %eax,%ecx
    0.00 :   2689:   shr    %cl,%ebx
    2.92 :   268b:   mov    -0x14(%rbp),%eax
    0.00 :   268e:   cltq
    0.00 :   2690:   mov    %rax,%rsi
   32.50 :   2693:   lea    0x39a6(%rip),%rax
   49.12 :   269a:   mov    %rax,%rdi
    0.01 :   269d:   call   296c <std::array<unsigned int, 8ul>::operator[](unsigned long)>
    0.00 :   26a2:   mov    (%rax),%edx
   13.48 :   26a4:   add    %ebx,%edx
    0.97 :   26a6:   mov    %edx,(%rax)
         : 26   for(auto idx=0; idx<N; ++idx)
    0.04 :   26a8:   addl   $0x1,-0x18(%rbp)
    0.03 :   26ac:   jmp    2660 <calculateSums()
         : 19   #pragma omp parallel private(threadId)
    0.00 :   26ae:   nop
    0.00 :   26af:   mov    -0x8(%rbp),%rbx
    0.00 :   26b3:   leave

The command produces low-level AT&T-syntax assembly annotated with the share of LLC-loads events generated by each instruction. Thanks to the debugging symbols the output is more human-readable, and you can easily notice that almost all the events occur while executing line 28:

sums[threadId] += values[idx] >> threadId;

So, the problem lies in writing data to the shared sums buffer inside the loop. Fixing the issue requires limiting the number of writes to sums. You can simply do that by adding a stack-based variable which stores a temporary sum and is assigned to the buffer only once, after all the elements of values have been summed. Check the code in Example 6:

void calculateSums()
{
	auto threadId = 0;
#ifdef _OPENMP
	omp_set_num_threads(THREAD_NUMBER);
	#pragma omp parallel private(threadId)
	{
		threadId = omp_get_thread_num();
#else
	for(threadId=0; threadId<THREAD_NUMBER; ++threadId)
	{
#endif
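		// Accumulate into a thread-local variable; the shared array is written only once per thread.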
		auto tmp = 0u;
		for(auto idx=0; idx<N; ++idx)
		{
			tmp += values[idx] >> threadId; 
		}
		sums[threadId] = tmp;
	}
}

Let’s find out how this cosmetic change affects the performance of the program.

$ g++ -o false_sharing_par -ggdb -fopenmp false_sharing.cpp
$ time ./false_sharing_par
real  0m0,387s
user  0m0,908s
sys   0m0,000s

Looks much better! This time it takes 0.387s, which is quite satisfying.
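It is worth adding that another common remedy for false sharing is padding: making sure that each thread’s element occupies its own cache line. A minimal sketch of that variant (assuming a 64-byte cache line, which is typical for x86; since C++17 the <new> header also provides std::hardware_destructive_interference_size for this purpose):

struct alignas(64) PaddedSum // one cache line per element
{
	unsigned value{};
};

std::array<PaddedSum, THREAD_NUMBER> sums; // the loop then updates sums[threadId].value

The price is wasted memory, so the local accumulator from Example 6 is usually preferable when applicable.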

As you can see, performance counter statistics allowed you to track down an issue connected with memory access. It is worth mentioning that perf is quite a powerful tool and can be used to investigate many other performance problems.

Summary

Finally, you got to the end 🙂 As you can see, analysing concurrent code is not a piece of cake. Even for the simple examples presented in this article, it is not so easy to find the source of a problem (for instance, in the case of false sharing).

I hope that the techniques described in the article will help you improve your debugging skills and track down concurrency issues in your applications faster. Using low-level thread or process mechanisms requires a good understanding of OS and processor internals, so think twice about the design before implementing it. If you expect to spend a lot of time dealing with new multi-threaded code, you should definitely consider some high-level solutions.

References

  • D. Spinellis, “Effective Debugging: 66 Specific Ways to Debug Software and Systems,” 2017
  • R. Stallman, “Debugging with GDB: The GNU Source-Level Debugger,” 2018
  • https://en.cppreference.com/ [online, accessed 01-August-2023]
  • https://rr-project.org/ [online, accessed 01-August-2023]
Author
Michał Mazurek

Graduate in Computer Science from Poznan University of Technology with over 9 years of commercial experience as a software engineer. Currently working as a C++ programmer for a client in the electrical engineering industry. He specializes in object-oriented programming languages and operating systems. He enjoys playing tennis and practicing agility training with his Border Collie dog in his spare time.
