{"id":23291,"date":"2023-08-07T05:00:00","date_gmt":"2023-08-07T03:00:00","guid":{"rendered":"https:\/\/sii.pl\/blog\/?p=23291"},"modified":"2023-08-03T15:00:16","modified_gmt":"2023-08-03T13:00:16","slug":"effective-multi-threaded-code-debugging","status":"publish","type":"post","link":"https:\/\/sii.pl\/blog\/en\/effective-multi-threaded-code-debugging\/","title":{"rendered":"Effective multi-threaded code debugging"},"content":{"rendered":"\n<p>Concurrency issues are one of the hardest to investigate and fix. It is caused mainly by the nondeterministic nature of threads and processes. Every time you run a concurrent code it may execute in a slightly different order which certainly affects the correctness and performance of a program. As a result, the most popular techniques used in investigating sequential code like log analysis or examinating code with <em>debugger<\/em> are insufficient.<\/p>\n\n\n\n<p>You could obviously say that there are multiple high-level solutions like Node.js based frameworks, Java applications server frameworks or simply databases which implement whole concurrency stuff internally and you don\u2019t need to care about that at all. That sounds fair for enterprise applications or web services but not necessarily for embedded systems where you often reach out for low-level stuff to deliver expected efficiency.<\/p>\n\n\n\n<p>This article aims at presenting methods and tools which might be quite helpful in dealing with nondeterminism and typical problems of multi-threaded applications.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Prerequisite<\/strong><\/h2>\n\n\n\n<p>In the article I mainly use tools available on Linux platform with kernel release 5.19.0. All the examples have been written in C++ programming language and compiled with <em>GCC 12.0 on Intel(R)Core(TM)i7-7700CPU@3.60GHz<\/em>&nbsp;processor. However, the presented techniques are universal so you can find analogous tools for different platforms and programming languages.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Analyzing deadlocks with &#8220;post-mortem&#8221; debugging<\/strong><\/h2>\n\n\n\n<p>Let\u2019s start with the first typical anomaly that occurs in multi-threaded environment called <em>deadlock<\/em>. <em>Deadlock<\/em>&nbsp;is a situation where two or more threads try to acquire the same resources and wait for others in a way that puts them in a deadly embrace. As a result, none of threads can proceed and do useful work.<\/p>\n\n\n\n<p>To illustrate the issue better it\u2019s good to evoke <em>dining philosophers&#8217; problem<\/em>. In a nutshell you have a finite number of philosophers that sit at the same table and each one has a plate. There is only one fork between each plate and each philosopher can do two actions: eat or think. <\/p>\n\n\n\n<p>Certainly, a philosopher needs two forks to eat (left and right). Now you can imagine the situation where each philosopher wants to eat at the same time and grabs one of available forks. As a result, all the forks are occupied and none of philosophers can eat because they block each other. That\u2019s exactly what you understand by <em>deadlock<\/em>.<\/p>\n\n\n\n<p>Surprisingly, this class of problems is quite simple to investigate and fix. All you need is actually a <em>debugger<\/em>&nbsp;\ud83d\ude42 You already know the theory so you can jump into the Example 1.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n#include &amp;lt;iostream&gt;\n#include &amp;lt;chrono&gt;\n#include &amp;lt;thread&gt;\n#include &amp;lt;mutex&gt;\n\n\nusing namespace std::chrono_literals;\n\nstatic std::mutex m1, m2;\n\nvoid Alice()\n{\n    std::lock_guard&amp;lt;std::mutex&gt; lock1{m1};\n    std::this_thread::sleep_for(1s);\n    std::lock_guard&amp;lt;std::mutex&gt; lock2{m2};\n}\n\nvoid Bob()\n{\n    std::lock_guard&amp;lt;std::mutex&gt; lock2{m2};\n    std::this_thread::sleep_for(1s);\n    std::lock_guard&amp;lt;std::mutex&gt; lock1{m1};\n}\n\n\nint main()\n{\n    std::thread aliceThread(Alice);\n    std::thread bobThread(Bob);\n    \n    aliceThread.join();\n    bobThread.join();\n    return 0;\n}\n<\/pre><\/div>\n\n\n<p>You can compile the Example 1 and run it as follows:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ g++ -ggdb -o deadlock -lpthread deadlock.cpp\n$ .\/deadlock\n<\/pre>\n\n\n\n<p>The code is compiled with a flag adding particular debugging symbols to the binary for <em>gdb<\/em>&nbsp;debugger only \u2014 that\u2019s the tool that will be used in the analysis. After running the above program multiple times, you can observe it freezes \u2014 you have classical <em>deadlock<\/em>. The very first step in dealing with such case is reading a PID of a hanged process and sending SIGABRT signal to it \u2014 this forces Linux kernel to generate core dump file which is crucial for further analysis (make sure that core file size is not set to 0 for your Linux OS. You can check it ulimit -c command). The following commands do the job:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ ps x | grep \"deadlock\"\n   5767 pts\/0    Sl+    0:00 .\/deadlock\n   5785 pts\/1    S+     0:00 grep --color=auto deadlock\n$ kill -SIGABRT 5767\n<\/pre>\n\n\n\n<p>On most Linux distribution core dump files are stored in the current working directory but some of them define a specific global path for all processes \u2014 you need to check how it works for your platform (It\u2019s possible to edit a default core file path in <em>\/proc\/sys\/kernel\/core_pattern<\/em>&nbsp;file). Having a core dump file you can run <em>gdb<\/em>&nbsp;on it.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ gdb .\/deadlock core_dump<\/pre>\n\n\n\n<p>Now you can display information about threads:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">(gdb) info threads\n Id   Target Id                        Frame \n  * 1  Thread 0x7f2ec07043c0 (LWP 6179) __futex_abstimed_wait_common64\n  2    Thread 0x7f2ebffff640 (LWP 6180) futex_wait\n  3    Thread 0x7f2ebf7fe640 (LWP 6181) futex_wait\n<\/pre>\n\n\n\n<p>As you see it seems that two threads are waiting for a lock. Let\u2019s dig into call stack of specific threads. You can simply do it by running the following:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">(gdb) thread 2\n(gdb) bt\n#0  futex_wait  \n#1  __GI___lll_lock_wait\n#2  0x00007f2ec0098082 in lll_mutex_lock_optimized \n#3  ___pthread_mutex_lock \n#4  0x0000559f706ea5d5 in __gthread_mutex_lock\n#5  0x0000559f706ea74c in std::mutex::lock\n#6  0x0000559f706ea7c6 in std::lock_guard&lt;std::mutex&gt;::lock_guard \n#7  0x0000559f706ea371 in Alice () at deadlock.cpp:15\n#8  0x0000559f706eb24d in std::__invoke_impl&lt;void, void (*)()&gt; \n#9  0x0000559f706eb1f9 in std::__invoke&lt;void (*)()&gt; \n...\n<\/pre>\n\n\n\n<p>If you go carefully via stack trace you notice that the problem occurs in line 15 \u2014 that\u2019s the place where <em>Alice<\/em>&nbsp;is blocked:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">std::lock_guard&lt;std::mutex&gt; lock2{m2};<\/pre>\n\n\n\n<p>Now if you repeat the same steps for <em>Bob<\/em>,&nbsp;you realize that this thread is also blocked in line 22:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">std::lock_guard&lt;std::mutex&gt; lock1{m1};<\/pre>\n\n\n\n<p>It\u2019s relatively simple to explain the issue \u2014 a thread <em>Bob<\/em>&nbsp;tries to obtain a resource <em>m1<\/em>&nbsp;which has been grabbed by <em>Alice<\/em>&nbsp;and vice versa. So how can you fix the problem? In this particular case you need to make sure that the resources are locked in the same order by <em>Alice<\/em>&nbsp;and <em>Bob<\/em>. You can look at the Example 2.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nvoid Alice()\n{\n    std::lock_guard&amp;lt;std::mutex&gt; lock1{m1};\n    std::this_thread::sleep_for(1s);\n    std::lock_guard&amp;lt;std::mutex&gt; lock2{m2};\n}\n\nvoid Bob()\n{\n\n    std::lock_guard&amp;lt;std::mutex&gt; lock1{m1};\n    std::this_thread::sleep_for(1s);\n    std::lock_guard&amp;lt;std::mutex&gt; lock2{m2};\n}\n<\/pre><\/div>\n\n\n<p>So simple isn\u2019t it? Let\u2019s move on.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Tracking race conditions with capture and replicate technique<\/strong><\/h2>\n\n\n\n<p>Now things are getting much more interesting as this time you are going to face <em>race condition<\/em>&nbsp;:). In the previous example it was quite easy to track the problem as the program suspended and you could simply take a snapshot of process\u2019s memory. In case of <em>race condition<\/em>&nbsp;it is, to put it mildly, a little bit more challenging.<\/p>\n\n\n\n<p>In general, this anomaly is a situation where two or more operations are performed simultaneously and a specific order of those operations is required to make the program work properly. In a nutshell the race condition results in undefined behaviour of a program. Ok, that\u2019s enough in term of theory. Let\u2019s have a look at the Example 3:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n#include &amp;lt;iostream&gt;\n#include &amp;lt;thread&gt;\n#include &amp;lt;vector&gt;\n\n\nstatic auto counter = 0u;\nconstexpr auto MAX_ITER_NUMBER = 20000000u;\n\n\nvoid task()\n{\n\tauto tmp = counter;\n\t++tmp;\n\tcounter = tmp;\n}\n\nvoid worker()\n{\n    for(auto idx=0; idx&amp;lt;MAX_ITER_NUMBER; ++idx)\n    {\n\t    task();\n    }\n}\n\n\nint main()\n{\n\tconst auto threadNumber = std::thread::hardware_concurrency();\n\tstd::vector&amp;lt;std::thread&gt; threads;\n\tthreads.reserve(threadNumber);\n\tfor(auto idx=0;idx&amp;lt;threadNumber; ++idx)\n\t{\n\t\tthreads.emplace_back(worker);\n\t}\n\tfor(auto&amp;amp; th : threads)\n\t{\n\t\tth.join();\n\t}\n\tstd::cout &amp;lt;&amp;lt; &quot;Expected calls: &quot; &amp;lt;&amp;lt; threadNumber*MAX_ITER_NUMBER &amp;lt;&amp;lt; &quot; Real calls: &quot; &amp;lt;&amp;lt; counter &amp;lt;&amp;lt; std::endl;\n    \treturn 0;\n}\n<\/pre><\/div>\n\n\n<p>This time the program creates multiple workers depending on the number of, so called, hardware threads supporting by your processor. Each thread just increments a <em>counter<\/em>&nbsp;and in the end of the program the number of increments is written to <em>stdout<\/em>. In theory, you expect that value of <em>counter<\/em>&nbsp;is predictable and equal to a number of thread multiple by a number of iterations. Let\u2019s compile the code and run it.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ g++ -o race -ggdb -lpthread race.cpp\n$ .\/race\n$ Expected calls: 160000000 Real calls: 25869183\n<\/pre>\n\n\n\n<p>As you see the displayed values are different and what is more if you run the program multiple times you realize that the final <em>counter<\/em>&nbsp;value is completely random \u2014 that\u2019s a pure <em>race condition<\/em>. It\u2019s worth emphasizing that in the example you get a wrong result almost each time you fire the code \u2014 that\u2019s the best possible case. Usually, you witness the failure quite rare and thus <em>race conditions<\/em>&nbsp;are irritating and hard to detect (especially in complex systems). Anyway, let\u2019s discuss one of techniques which is quite flexible and effective in dealing with this kind of bugs.<\/p>\n\n\n\n<p>The method is called <em>capture and replicate<\/em>&nbsp;(some people say time travel debugging :)) which actually can be described with the following steps:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\">\n<li>Run the program repeatedly with recording until you reproduce an issue.<\/li>\n\n\n\n<li>Analyse the recording.<\/li>\n\n\n\n<li>Replay the recording running the program under a debugger until a bug occurred.<\/li>\n\n\n\n<li>Analyse a program\u2019s state at that point.<\/li>\n<\/ol>\n\n\n\n<p>Unfortunately, this time a standard debugger is not sufficient \u2014 you need support from some external tool. In fact, the most mysterious thing in terms of this approach is the recording functionality. There are multiple (commercial and non-commercial) tools that offer this feature and in this specific case you will use which is a free and lightweight recording tool (<em>Pin$PLay<\/em>&nbsp;tool created by Intel and <em>UDB<\/em>&nbsp;are also interesting options).<\/p>\n\n\n\n<p>It is also worth mentioning that <em>gdb<\/em>\u00a0itself also has the recording feature. Unfortunately, it does not support some crucial system calls for simulating concurrency behaviour \u2014 it is expected to be added in the future. Ok, now you know the theory so it\u2019s high time to put it into practise. Let\u2019s check what tool can do:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ rr record -h .\/race\r\n(rr): Saving execution to trace directory \/root\/.local\/share\/rr\/race-20.\r\nExpected calls: 60000000 Real calls: 25757037\r<\/pre>\n\n\n\n<p>As you see, we\u2019ve succeeded in recording the failure \u2014 that\u2019s half the battle. The cool thing about the recording is the fact that now you can repeat the sequence of events that leads to <em>race condition<\/em>\u00a0deterministically (time travel). The second super feature of <em>rr<\/em>\u00a0is a possibility of running <em>gdbserver<\/em>\u00a0with our recording and simply attaching to it with <em>GNU<\/em>\u00a0debugger.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ rr replay race-20\r\nFor help, type \"help\".\r\nType \"apropos\\ word\" to search for commands related to \"word\"...\r\nReading symbols from \/home\/michal\/race...\r\nReally redefine built-in command \"restart\"? (y or n)\r\nRemote debugging using 127.0.0.1:7599\r\nReading symbols from \/lib64\/ld-linux-x86-64.so.2...\r\n(rr)\r<\/pre>\n\n\n\n<p>Looks good, isn\u2019t it? You have a <em>gdb<\/em>\u00a0session fired which allows an advanced analysis of the misbehaviour. In the beginning you can set a conditional breakpoint in order to stop a program at some point and then check what kind of operations are performed by specific threads.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">(rr) break 14 if counter>15757037\r\nBreakpoint 1 at 0x5626f30a73be: file race.cpp, line 14.\r\n(rr) c\r\nContinuing.\r\n[New Thread 21868.21872]\r\n[New Thread 21868.21871]\r\n[New Thread 21868.21870]\r\n[New Thread 21868.21869]\r\n[Switching to Thread 21868.21871]\r\nThread 3 hit Breakpoint 1, task () at race.cpp:14\r\n(rr) info thread\r\n  Id   Target Id                 Frame \r\n  1    Thread 21868.21868 (race) __futex_abstimed_wait_common64\r\n  2    Thread 21868.21872 (race) task () at race.cpp:14\r\n* 3    Thread 21868.21871 (race) task () at race.cpp:14\r\n  4    Thread 21868.21870 (race) task () at race.cpp:14\r\n  5    Thread 21868.21869 (race) task () at race.cpp:15\r\n(rr) thread 2\r\n[Switching to thread 2 (Thread 21868.21872)]\r\n#0  task () at race.cpp:14\r\n14      counter = tmp;\r\n(rr) print(tmp)\r\n$8 = 15355793\r\n(rr) thread 3\r\n[Switching to thread 3 (Thread 21868.21871)]\r\n#0  task () at race.cpp:14\r\n14      counter = tmp;\r\n(rr) print(tmp)\r\n$9 = 15757044\r\n(rr) thread 4\r\n[Switching to thread 4 (Thread 21868.21870)]\r\n#0  task () at race.cpp:14\r\n14      counter = tmp;\r\n(rr) print(tmp)\r\n$10 = 13297385\r<\/pre>\n\n\n\n<p>As you see 3 threads out of 5 try to write completely different values to <em>counter<\/em>\u00a0at the same time! Now everything becomes clear \u2014 the source of the problem is lack of atomicity in case of read and write operations on <em>counter<\/em>\u00a0variable. You can simply fix it by adding some synchronization mechanism.<\/p>\n\n\n\n<p>It\u2019s worth to remember that adding synchronization always affect the performance of a program so be careful. In this case the best option is an atomic variable with relaxed memory ordering (actually memory barriers and memory ordering are other interesting topic in terms of concurrency, but I will not touch them in this paper). You can check the full solution in Example 4:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n#include &amp;lt;iostream&gt;\n#include &amp;lt;thread&gt;\n#include &amp;lt;vector&gt;\n#include &amp;lt;atomic&gt;\n\n\nstatic std::atomic_int counter = 0u;\nconstexpr auto MAX_ITER_NUMBER = 20000000u;\n\n\nvoid task()\n{\n\tcounter.fetch_add(1, std::memory_order_relaxed);\n}\n<\/pre><\/div>\n\n\n<p>Finally let\u2019s make sure that everything works fine.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ g++ -o fix_race -ggdb -lpthread fix_race.cpp\r\n$ .\/fix_race\r\n$ Expected calls: 160000000 Real calls: 160000000\r<\/pre>\n\n\n\n<p>In the end it is significant to mention about limitations of <em>rr\u00a0<\/em>which touch different recording tools as well:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\">\n<li><em>rr<\/em>\u00a0works only with intel processors and requires a brand-new linux kernel.<\/li>\n\n\n\n<li>Despite being quite effective in reproducing concurrency environment, it is still single core simulation which might behave different than expected.<\/li>\n\n\n\n<li>A great majority of system calls is supported (recorded) but not all of them.<\/li>\n\n\n\n<li>Debugging the recording is a slow process \u2014 moving from one breakpoint to another take some time, especially when multiple threads must be simulated.<\/li>\n<\/ol>\n\n\n\n<p>Anyway, time travel tools really make your life easier in terms of finding bugs (not necessarily always in mutli-threaded applications) so I encourage you to play with them a little bit.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Investigating performance issues with performance counters<\/strong><\/h2>\n\n\n\n<p>So far you have been investigating issues affecting correctness of programs. This time let\u2019s try to dig into another class of concurrency problems connected with performance. Let\u2019s start with some portion of theory at first\u2026<\/p>\n\n\n\n<p>In modern multi-core processor architectures, each core possesses his own internal memory (cache). The reason for that is obviously \u2026performance. If some data stored in a specific memory region is accessed frequently it is just worth caching it to make the access faster. The data are transferred between main memory and cache using fix size blocks called <em>cachelines<\/em>.<\/p>\n\n\n\n<p>And here comes the question: what if multiple threads share data from the same cache line? At first glance it looks like a potential race condition. However, you do not have to care of that \u2014 the processor cores use cache coherency protocol which synchronizes data between threads. Unfortunately, this synchronization primitive has a negative impact on performance. This effect is well-known as <em>false sharing<\/em>. Let\u2019s have a look at the following piece of code in Example 5:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n#include &amp;lt;random&gt;\n#include &amp;lt;algorithm&gt;\n#include &amp;lt;array&gt;\n\n#include &amp;lt;omp.h&gt;\n\nconstexpr auto THREAD_NUMBER = 8;\nconstexpr auto N = 10000000;\n\nstd::array&amp;lt;unsigned, THREAD_NUMBER&gt; sums;\nstd::array&amp;lt;unsigned, N&gt; values;\n\n\nvoid calculateSums()\n{\n\tauto threadId = 0;\n#ifdef _OPENMP\n\tomp_set_num_threads(THREAD_NUMBER);\n\t#pragma omp parallel private(threadId)\n\t{\n\t\tthreadId = omp_get_thread_num();\n#else\n\tfor(threadId=0; threadId&amp;lt;THREAD_NUMBER; ++threadId)\n\t{\n#endif\n\t\tfor(auto idx=0; idx&amp;lt;N; ++idx)\n\t\t{\n\t\t\tsums&#x5B;threadId] += values&#x5B;idx] &gt;&gt; threadId;\n\t\t}\n\t}\n}\n\n\nint main() \n{\n\tstd::random_device rDevice;\n\tstd::mt19937 engine {rDevice()};\n\tstd::uniform_int_distribution&amp;lt;unsigned&gt; dist {1, N};\n\tstd::generate(values.begin(), values.end(), &#x5B;&amp;amp;dist, &amp;amp;engine]() {\n\t\treturn dist(engine);\n\t});\n\tcalculateSums();\t\n\treturn 0;\n}\n<\/pre><\/div>\n\n\n<p>The program calculates 8 sums divided by a power of 2 in two modes: sequentially and in parallel using <em>OpenMP<\/em>\u00a0library. In theory you should expect the second approach to be much faster \u2014 eventually each sum is calculated independently using 8 threads. Let\u2019s find out how fast it works with <em>OpenMP<\/em>\u00a0support.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ g++ -o false_sharing_par -fopenmp false_sharing.cpp\r\n$ time .\/false_sharing\r\n$ real  0m1,097s\r\n$ user  0m6,061s\r\n$ sys   0m0,004s\r<\/pre>\n\n\n\n<p>As you see it takes about 1.097s to perform 8 sums. Now let\u2019s compile the code without <em>OpenMP<\/em>\u00a0support and repeat the experiment.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ g++ -o false_sharing_seq false_sharing.cpp\r\n$ time .\/false_sharing_seq\r\n$ real  0m0,928s\r\n$ user  0m0,907s\r\n$ sys   0m0,020s\r<\/pre>\n\n\n\n<p>This time it took about 0.928s so the time of performing the same task sequentially decreased\u2026 Are you surprised? What you\u2019re currently witnessing is certainly <em>false_sharing<\/em>. All 8 threads modify the same buffer while calculating the sums. As a result, CPU cores fire synchronization protocol and you observe performance degradation.<\/p>\n\n\n\n<p>Now let\u2019s try to dig into a problem and track a bottleneck. In order to debug this, you will tap into CPU\u2019s performance counters. The counters track event connected with \u2026CPU\u2019s performance for instance the number of executed instructions or CPU cycles etc. There are obviously multiple tools to analyze performance counters but as long as you use Linux, <em>perf<\/em>\u00a0tool is sufficient for your needs.<\/p>\n\n\n\n<p>In order to investigate the issue, you will mainly focus on <em>LLC &#8211; loads<\/em>\u00a0counter which stands for last-level cache misses. In fact, when <em>falsesharing<\/em>\u00a0occurs the number of cache misses (lack of specific data in CPU core\u2019s cache memory) dramatically increases because the coherency protocol kicks into action. Let\u2019s compare the counter for sequential and parallel versions of the code.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ perf stat -e LLC-loads .\/false_sharing_seq \r\n  Performance counter stats for '.\/false_sharing_seq':\r\n            339 028      LLC-loads\r\n$ perf stat -e LLC-loads .\/false_sharing_par\r\n  Performance counter stats for '.\/false_sharing_seq':\r\n            2 343 694      LLC-loads\r<\/pre>\n\n\n\n<p>As you see the difference is meaningful \u2014 the number of misses is approximately 7 times higher for parallel code. Now you need to locate a source of the problem (to be more specific a place where <em>false_sharing<\/em>\u00a0affects the code). In order to do that you can use <em>perf<\/em>\u00a0tool again. The following command will record the event which is interesting from your perspective.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ perf record -e LLC-loads .\/false_sharing_seq <\/pre>\n\n\n\n<p>Finally, you can analyze the results of recording with the following command.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ perf record -e LLC-loads .\/false_sharing_seq\r\nPercent | Source code &amp; Disassembly of false_sharing_par for LLC-loads \r\n-------------------------------------------\r\n         :\r\n         :\r\n         :\r\n         : 3    Disassembly of section .text:\r\n         :\r\n         : 5    000000000000263f &lt;calculateSums()\r\n         : 6    calculateSums() [clone ._omp_fn.0]:\r\n         : 19   void calculateSums()\r\n         : 19   void calculateSums()\r\n         : 20   {\r\n         : 19   void calculateSums()\r\n         : 20   {\r\n         : 21   auto threadId = 0;\r\n         : 22   #ifdef _OPENMP\r\n         : 23   omp_set_num_threads(THREAD_NUMBER);\r\n         : 24   #pragma omp parallel private(threadId)\r\n    0.00 :   263f:   endbr64\r\n    0.00 :   2643:   push   %rbp\r\n    0.00 :   2644:   mov    %rsp,%rbp\r\n    0.00 :   2647:   push   %rbx\r\n    0.00 :   2648:   sub    $0x28,%rsp\r\n    0.00 :   264c:   mov    %rdi,-0x28(%rbp)\r\n         : 21   {\r\n         : 22   threadId = omp_get_thread_num();\r\n    0.00 :   2650:   call   2310 &lt;omp_get_thread_num@plt>\r\n    0.00 :   2655:   mov    %eax,-0x14(%rbp)\r\n         : 26   #else\r\n         : 27   for(threadId=0; threadId&lt;THREAD_NUMBER; ++threadId)\r\n         : 28   {\r\n         : 29   #endif\r\n         : 30   for(auto idx=0; idx&lt;N; ++idx)\r\n    0.00 :   2658:   movl   $0x0,-0x18(%rbp)\r\n    0.00 :   265f:   nop\r\n    0.00 :   2660:   cmpl   $0x98967f,-0x18(%rbp)\r\n    0.02 :   2667:   jg     26ae &lt;calculateSums()\r\n         : 28   {\r\n         : 29   sums[threadId] += values[idx] >> threadId;\r\n    0.02 :   2669:   mov    -0x18(%rbp),%eax\r\n    0.00 :   266c:   cltq\r\n    0.00 :   266e:   mov    %rax,%rsi\r\n    0.00 :   2671:   lea    0x39e8(%rip),%rax\r\n    0.02 :   2678:   mov    %rax,%rdi\r\n    0.00 :   267b:   call   2996 &lt;std::array&lt;>::operator[](unsigned long)>\r\n    0.00 :   2680:   mov    (%rax),%edx\r\n    0.29 :   2682:   mov    -0x14(%rbp),%eax\r\n    0.00 :   2685:   mov    %edx,%ebx\r\n    0.59 :   2687:   mov    %eax,%ecx\r\n    0.00 :   2689:   shr    %cl,%ebx\r\n    2.92 :   268b:   mov    -0x14(%rbp),%eax\r\n    0.00 :   268e:   cltq\r\n    0.00 :   2690:   mov    %rax,%rsi\r\n   32.50 :   2693:   lea    0x39a6(%rip),%rax\r\n   49.12 :   269a:   mov    %rax,%rdi\r\n    0.01 :   269d:   call   296c &lt;std::array&lt;unsigned int, 8ul>::operator[](unsigned long)>\r\n    0.00 :   26a2:   mov    (%rax),%edx\r\n   13.48 :   26a4:   add    %ebx,%edx\r\n    0.97 :   26a6:   mov    %edx,(%rax)\r\n         : 26   for(auto idx=0; idx&lt;N; ++idx)\r\n    0.04 :   26a8:   addl   $0x1,-0x18(%rbp)\r\n    0.03 :   26ac:   jmp    2660 &lt;calculateSums()\r\n         : 19   #pragma omp parallel private(threadId)\r\n    0.00 :   26ae:   nop\r\n    0.00 :   26af:   mov    -0x8(%rbp),%rbx\r\n    0.00 :   26b3:   leave\r<\/pre>\n\n\n\n<p>The command produces low-level AT&amp;T assembler code with information about cache misses generated by each instruction. Thanks to debugging symbols the output is more human-readable and you can simply notice that almost all the misses occur after executing in line 28.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sums[threadId] += values[idx] >> threadId;<\/pre>\n\n\n\n<p>So, the problem lies in writing data to the shared <em>sums<\/em>\u00a0buffer in the loop. Fixing the issue requires to limit the number of writes to the <em>sums<\/em>. You can simply do that by adding a stack-base variable which stores a temporary sum which will be finally assigned to the buffer after summing all elements from values. Check the code in Example 6:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nvoid calculateSums()\n{\n\tauto threadId = 0;\n#ifdef _OPENMP\n\tomp_set_num_threads(THREAD_NUMBER);\n\t#pragma omp parallel private(threadId)\n\t{\n\t\tthreadId = omp_get_thread_num();\n#else\n\tfor(threadId=0; threadId&amp;lt;THREAD_NUMBER; ++threadId)\n\t{\n#endif\n\t\tauto tmp = 0;\n\t\tfor(auto idx=0; idx&amp;lt;N; ++idx)\n\t\t{\n\t\t\ttmp += values&#x5B;idx] &gt;&gt; threadId; \n\t\t}\n\t\tsums&#x5B;threadId] = tmp;\n\t}\n}\n<\/pre><\/div>\n\n\n<p>Let\u2019s find out how this cosmetic change affects the performance of the program.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ g++ -o false_sharing_par -fopenmp false_sharing.cpp\r\n$ time .\/false_sharing\r\n$ real  0m0,387s\r\n$ user  0m0,908s\r\n$ sys   0m0,000s\r<\/pre>\n\n\n\n<p>Looks much better! This time it takes 0.387s which is quite satisfying.<\/p>\n\n\n\n<p>As you see performance counter statistics allowed to track the issue connected with memory. It is worth mentioning that <em>perf\u00a0<\/em>is quite a powerful tool and can be used to investigate different problems connected with performance.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Summary<\/strong><\/h2>\n\n\n\n<p>Finally, you got to the end \ud83d\ude42 As you see analysing the concurrent code is not a piece of cake. Even for simple examples presenting in this article it\u2019s not so easy to find a source of a problem (for instance in case of <em>false_sharing<\/em>).<\/p>\n\n\n\n<p>I hope that the techniques described in the article will help you improve your debugging skills and track some concurrency issues in your application faster. Using low-level threads or processes stuff requires good understanding of OS and processors\u2019 internals, so think twice about the design before implementing it. If you spend a lot of time dealing with new multi-threaded code you should definitely consider some high-level solutions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>References<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>D. Spinellis, \u201cEffective Debugging: 66 Specific Ways to Debug Software and Systems,\u201d 2017<\/li>\n\n\n\n<li>R. Stallman, \u201cDebugging with Gdb: The Gnu Source-Level Debugger,\u201d 2018<\/li>\n\n\n\n<li><a href=\"https:\/\/en.cppreference.com\/\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\" rel=\"nofollow\" >https:\/\/en.cppreference.com\/<\/a> [online, accessed 01-August-2023]<\/li>\n\n\n\n<li><a href=\"https:\/\/rr-project.org\/\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\" rel=\"nofollow\" >https:\/\/rr-project.org\/<\/a> [online, accessed 01-August-2023]<\/li>\n<\/ul>\n\n\n<div class=\"kk-star-ratings kksr-auto kksr-align-left kksr-valign-bottom\"\n    data-payload='{&quot;align&quot;:&quot;left&quot;,&quot;id&quot;:&quot;23291&quot;,&quot;slug&quot;:&quot;default&quot;,&quot;valign&quot;:&quot;bottom&quot;,&quot;ignore&quot;:&quot;&quot;,&quot;reference&quot;:&quot;auto&quot;,&quot;class&quot;:&quot;&quot;,&quot;count&quot;:&quot;5&quot;,&quot;legendonly&quot;:&quot;&quot;,&quot;readonly&quot;:&quot;&quot;,&quot;score&quot;:&quot;5&quot;,&quot;starsonly&quot;:&quot;&quot;,&quot;best&quot;:&quot;5&quot;,&quot;gap&quot;:&quot;11&quot;,&quot;greet&quot;:&quot;&quot;,&quot;legend&quot;:&quot;5\\\/5 ( votes: 5)&quot;,&quot;size&quot;:&quot;18&quot;,&quot;title&quot;:&quot;Effective multi-threaded code debugging&quot;,&quot;width&quot;:&quot;139.5&quot;,&quot;_legend&quot;:&quot;{score}\\\/{best} ( {votes}: {count})&quot;,&quot;font_factor&quot;:&quot;1.25&quot;}'>\n            \n<div class=\"kksr-stars\">\n    \n<div class=\"kksr-stars-inactive\">\n            <div class=\"kksr-star\" data-star=\"1\" style=\"padding-right: 11px\">\n            \n\n<div class=\"kksr-icon\" style=\"width: 18px; height: 18px;\"><\/div>\n        <\/div>\n            <div class=\"kksr-star\" data-star=\"2\" style=\"padding-right: 11px\">\n            \n\n<div class=\"kksr-icon\" style=\"width: 18px; height: 18px;\"><\/div>\n        <\/div>\n            <div class=\"kksr-star\" data-star=\"3\" style=\"padding-right: 11px\">\n            \n\n<div class=\"kksr-icon\" style=\"width: 18px; height: 18px;\"><\/div>\n        <\/div>\n            <div class=\"kksr-star\" data-star=\"4\" style=\"padding-right: 11px\">\n            \n\n<div class=\"kksr-icon\" style=\"width: 18px; height: 18px;\"><\/div>\n        <\/div>\n            <div class=\"kksr-star\" data-star=\"5\" style=\"padding-right: 11px\">\n            \n\n<div class=\"kksr-icon\" style=\"width: 18px; height: 18px;\"><\/div>\n        <\/div>\n    <\/div>\n    \n<div class=\"kksr-stars-active\" style=\"width: 139.5px;\">\n            <div class=\"kksr-star\" style=\"padding-right: 11px\">\n            \n\n<div class=\"kksr-icon\" style=\"width: 18px; height: 18px;\"><\/div>\n        <\/div>\n            <div class=\"kksr-star\" style=\"padding-right: 11px\">\n            \n\n<div class=\"kksr-icon\" style=\"width: 18px; height: 18px;\"><\/div>\n        <\/div>\n            <div class=\"kksr-star\" style=\"padding-right: 11px\">\n            \n\n<div class=\"kksr-icon\" style=\"width: 18px; height: 18px;\"><\/div>\n        <\/div>\n            <div class=\"kksr-star\" style=\"padding-right: 11px\">\n            \n\n<div class=\"kksr-icon\" style=\"width: 18px; height: 18px;\"><\/div>\n        <\/div>\n            <div class=\"kksr-star\" style=\"padding-right: 11px\">\n            \n\n<div class=\"kksr-icon\" style=\"width: 18px; height: 18px;\"><\/div>\n        <\/div>\n    <\/div>\n<\/div>\n                \n\n<div class=\"kksr-legend\" style=\"font-size: 14.4px;\">\n            5\/5 ( votes: 5)    <\/div>\n    <\/div>\n","protected":false},"excerpt":{"rendered":"<p>Concurrency issues are one of the hardest to investigate and fix. It is caused mainly by the nondeterministic nature of &hellip; <a class=\"continued-btn\" href=\"https:\/\/sii.pl\/blog\/en\/effective-multi-threaded-code-debugging\/\">Continued<\/a><\/p>\n","protected":false},"author":550,"featured_media":23324,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_editorskit_title_hidden":false,"_editorskit_reading_time":0,"_editorskit_is_block_options_detached":false,"_editorskit_block_options_position":"{}","inline_featured_image":false,"footnotes":""},"categories":[1320],"tags":[1770,1769,1342],"class_list":["post-23291","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-hard-development","tag-debugging","tag-code-2","tag-embedded-en"],"acf":[],"aioseo_notices":[],"republish_history":[],"featured_media_url":"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2023\/08\/Effective-multi-threaded-code-debugging.jpg","category_names":["Hard development"],"_links":{"self":[{"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/posts\/23291"}],"collection":[{"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/users\/550"}],"replies":[{"embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/comments?post=23291"}],"version-history":[{"count":3,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/posts\/23291\/revisions"}],"predecessor-version":[{"id":23323,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/posts\/23291\/revisions\/23323"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/media\/23324"}],"wp:attachment":[{"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/media?parent=23291"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/categories?post=23291"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/tags?post=23291"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}