What Are Data Races and How to Avoid Them During Software Development

Data races are a common problem in multithreaded programming. Data races occur when multiple tasks or threads access a shared resource without sufficient protections, leading to undefined or unpredictable behavior.

When you author software to simultaneously handle multiple tasks, you may use multithreaded programming, that is, programs with constructs such as multiple entry points, interleaving of threads, and asynchronous interrupts. However, multithreaded programming can be highly complex and introduce subtle defects such as data races and deadlocks. When such a defect occurs, it can take a long time to reproduce the issue and even longer to identify the root cause and fix the defect.

Example of a Data Race

Let us start with the simplest example of a data race. In Figure 1, Task1 and Task2 write values to the shared resources sharedVar1 and sharedVar2. The tasks later read the values of the shared resources through the functions do_sth_with_shared_resources1() and do_sth_with_shared_resources2(). Initially, no protection mechanisms are established for these operations.

Figure 1. Simultaneous access to shared resources by two tasks without specific protection

You may ask: what value of sharedVar1 does the function do_sth_with_shared_resources1() read? You may expect the value to be 11, since this value was written in Task1 immediately before the function call. However, without any protection mechanisms, the value read may be 21 or, in some situations, even a corrupt random value. Because Task1 and Task2 execute concurrently, the shared resource sharedVar1 may be rewritten in Task2 before being read again in Task1.

In other words, both sequences can happen:

  • Sequence 1:
    • Task1: sharedVar1 = 11;
    • Task1: do_sth_with_shared_resources1();
  • Sequence 2:
    • Task1: sharedVar1 = 11;
    • Task2: sharedVar1 = 21;
    • Task1: do_sth_with_shared_resources1();

Without protection mechanisms, any code you write in do_sth_with_shared_resources1() cannot rely on a particular sequence occurring and, therefore, on a particular value of sharedVar1. If your code relies on a particular value of sharedVar1, then the data race becomes a bug.
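The two sequences can be replayed deterministically in a single thread to see why the value read depends on the interleaving. This is only a minimal sketch: the body of do_sth_with_shared_resources1() is a hypothetical stand-in that returns the current value of sharedVar1, and the replay_sequence1() and replay_sequence2() helpers are names invented here for illustration.

```c
/* Shared resource from Figure 1. */
static int sharedVar1;

/* Hypothetical stand-in for do_sth_with_shared_resources1():
   it returns whatever value sharedVar1 holds at the time of the call. */
int do_sth_with_shared_resources1(void)
{
    return sharedVar1;
}

/* Sequence 1: nothing from Task2 runs between Task1's write and read. */
int replay_sequence1(void)
{
    sharedVar1 = 11;                        /* Task1 writes */
    return do_sth_with_shared_resources1(); /* Task1 reads  */
}

/* Sequence 2: Task2's write lands between Task1's write and read. */
int replay_sequence2(void)
{
    sharedVar1 = 11;                        /* Task1 writes */
    sharedVar1 = 21;                        /* Task2 writes */
    return do_sth_with_shared_resources1(); /* Task1 reads  */
}
```

In this replay, replay_sequence1() returns 11 and replay_sequence2() returns 21. In a real multithreaded program, the scheduler decides which of the two orders occurs.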

Data races occur when a shared resource is unpredictably accessed by multiple tasks. Data races may not be easy to understand because instructions from different tasks can interleave, so the order of execution does not follow the order in which the instructions are written. Also, the result can change in each test run, making a data race difficult to reproduce and fix.

How to Prevent Data Races with Mutual Exclusion Locks (Mutexes)

A common mechanism to avoid data races is to enforce mutual exclusion. In the previous example, you can enforce sequence 1 by:

  • Locking a mutex before Task1: sharedVar1 = 11;
  • Unlocking the mutex after Task1: do_sth_with_shared_resources1();
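The two steps above can be sketched with the POSIX pthread_ functions as follows. This is an illustrative sketch rather than the code from the original example: the names task1, task2, var_mutex, and run_protected_tasks are assumptions, and the read inside task1 stands in for the call to do_sth_with_shared_resources1().

```c
#include <pthread.h>
#include <stddef.h>

static int sharedVar1;
static pthread_mutex_t var_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Task1: the write and the later read form one critical section. */
static void *task1(void *arg)
{
    int *seen = arg;

    pthread_mutex_lock(&var_mutex);
    sharedVar1 = 11;
    *seen = sharedVar1;   /* stands in for do_sth_with_shared_resources1() */
    pthread_mutex_unlock(&var_mutex);
    return NULL;
}

/* Task2: must wait for the mutex before it can write sharedVar1. */
static void *task2(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&var_mutex);
    sharedVar1 = 21;
    pthread_mutex_unlock(&var_mutex);
    return NULL;
}

/* Runs both tasks concurrently and returns the value that Task1 read,
   or -1 if a thread could not be created. */
int run_protected_tasks(void)
{
    pthread_t t1, t2;
    int seen = 0;

    if (pthread_create(&t1, NULL, task1, &seen) != 0)
        return -1;
    if (pthread_create(&t2, NULL, task2, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return seen;
}
```

Because Task2 cannot write between Task1's write and read, run_protected_tasks() returns 11 regardless of how the threads are scheduled. The final value of sharedVar1 can still be either 11 or 21, depending on which task acquires the mutex first.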

Other tasks, such as Task2, have to wait for the mutex to be unlocked before accessing sharedVar1; however, the placement of mutex locks and unlocks is not as simple as it sounds. Here is a C code example that implements the tasks shown in Figure 1 with the POSIX-based pthread_ family of functions. The example attempts to protect against data races by using functions such as pthread_mutex_lock and pthread_mutex_unlock to lock and unlock a mutex.

You can see this full code example to review the details.

The code starts two threads, each with its own temporary variable tmp. The temporary variable reads the value of a shared resource (sharedVar1 or sharedVar2) immediately after the resource is written. The write and subsequent read operations are protected using mutexes. As a result, the values of the temporary variable and the shared resource are expected to be the same. If the values do not agree, the threads print a message such as thread:1, sharedVar2 = 22 and tmp = 12 differ.

Running this code in a real environment produces the following results.

Figure 2. Data race seen in program output

You can see that the message for unintended values, thread:1, sharedVar2 = 22 and tmp = 12 differ, appears several times. Despite the placement of mutexes, the data race continues to occur.

Debugging such a data race in a real application can take several hours because of the non-deterministic nature of the issue. As you can see in Figure 2, the message for unintended values appears only sporadically. Also, once reproduced, the issue can be difficult to fix. It is not sufficient to simply use mutexes: their placement in the code is also critical.

How to Detect and Fix Data Races

A static analysis tool that automatically detects data races and suggests possible fixes can save a lot of debugging effort.

To understand why the data race continues to occur in the above example despite the use of mutexes, we used the data race checkers of a static analysis tool, Polyspace Bug Finder™. This tool can detect the data race that we saw earlier through the program output.

Figure 3. Data race on shared resource sharedVar2 from two tasks

Figure 4. Checking the program flow with Access Graph

In Figure 4, you can see the program control flow that leads to each operation. The circles marked with ‘t’ show the beginning of two different tasks, task_main::thread1() and task_main::thread2(). The subsequent circles show how the control flow goes through functions, thread1_main and thread2_main, and eventually to the write operations. A shield icon on the write operation in the second task indicates that some protection mechanisms are used on this operation. The absence of a similar icon in the first task confirms the earlier suggestion that write operations on sharedVar2 are not protected in this task.

From this suggestion, you can check the function thread1_main and see that the mutex in this function is unlocked prematurely, before all shared resources are accessed. You can move the unlock so that it occurs after sharedVar2 is accessed, which fixes the data race.
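The corrected placement might look like the following sketch. It is not the original program: only the names thread1_main, thread2_main, sharedVar1, sharedVar2, and tmp come from the text above, while the function bodies, the values written by the second thread, the iteration count, and the count_mismatches helper are assumptions made for illustration. A comment marks where the premature unlock would have been.

```c
#include <pthread.h>
#include <stddef.h>

#define ITERATIONS 100000   /* assumed loop count, chosen for illustration */

static int sharedVar1, sharedVar2;
static int mismatches;      /* only modified while mtx is held */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static void *thread1_main(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; ++i) {
        int tmp;

        pthread_mutex_lock(&mtx);
        sharedVar1 = 11;
        tmp = sharedVar1;
        if (tmp != sharedVar1)
            ++mismatches;
        /* The premature unlock would sit here, leaving the
           accesses to sharedVar2 below unprotected. */
        sharedVar2 = 12;
        tmp = sharedVar2;
        if (tmp != sharedVar2)
            ++mismatches;
        pthread_mutex_unlock(&mtx);  /* unlock after the last shared access */
    }
    return NULL;
}

static void *thread2_main(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; ++i) {
        int tmp;

        pthread_mutex_lock(&mtx);
        sharedVar1 = 21;
        tmp = sharedVar1;
        if (tmp != sharedVar1)
            ++mismatches;
        sharedVar2 = 22;
        tmp = sharedVar2;
        if (tmp != sharedVar2)
            ++mismatches;
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

/* Runs both threads to completion and returns the number of times
   tmp and the shared resource disagreed, or -1 on thread creation failure. */
int count_mismatches(void)
{
    pthread_t t1, t2;

    mismatches = 0;
    if (pthread_create(&t1, NULL, thread1_main, NULL) != 0)
        return -1;
    if (pthread_create(&t2, NULL, thread2_main, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return mismatches;
}
```

With every access to both shared resources inside the critical section, count_mismatches() returns 0. Moving the unlock in thread1_main back to the marked position would reintroduce the unprotected write to sharedVar2.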

Figure 5. Resolving the issue by changing the timing of unlocking mutex

Summary

In the example from the section above, you can spot the data race during a visual inspection, but in real applications of hundreds of files and thousands of lines of code, data races can be difficult to detect because:

  • Problems occur sporadically and can be hard to reproduce.
  • Results can differ for each run. Even setting a breakpoint with a debugger can influence the result.
  • Incorrect placement of mutexes may not fix the root cause or may introduce other problems such as deadlocks or double locks.
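As a small illustration of the double-lock problem in the last item, POSIX provides an error-checking mutex type, PTHREAD_MUTEX_ERRORCHECK, that reports a second lock attempt by the owning thread instead of hanging. The sketch below assumes a POSIX.1-2008 environment, and the helper name second_lock_result is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Locks an error-checking mutex twice from the same thread and returns
   the result of the second lock attempt, or -1 if setup fails. */
int second_lock_result(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK) != 0 ||
        pthread_mutex_init(&m, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);       /* first lock succeeds */
    rc = pthread_mutex_lock(&m);  /* double lock: reported as an error */
    pthread_mutex_unlock(&m);     /* release the single lock actually held */
    pthread_mutex_destroy(&m);
    return rc;
}
```

For an error-checking mutex, the second call returns EDEADLK. With a default mutex, the same double lock could block the thread forever, so a check like this is a debugging aid rather than a substitute for correct lock placement.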

It is important to run a static analysis tool at a regular cadence to identify data races as soon as possible. A static analysis tool creates an abstraction of the concurrency model used in your program, and it can easily detect whether the established protections are sufficient to prevent data races.

Polyspace Bug Finder offers several features to identify concurrency issues such as data races and deadlocks, along with features that ease their review, such as the above textual and graphical representation of conflicting operations. These features help you identify the root cause of a data race more easily.

In the next post, we’ll look at another common concurrency issue known as deadlock.

Written by Yoo Yong-chul and Anirban Gangopadhyay.

Yoo Yong-chul works as an application engineer at MathWorks Korea and is responsible for code verification products.

Anirban Gangopadhyay works as a documentation writer at MathWorks US. He oversees technical documentation of Polyspace® products.

Original post: Naver blog post

MathWorks Korea 2021.4.17