OpenMP is an API for multi-platform, shared-memory parallel programming in C, C++, and Fortran. In C and C++, OpenMP directives take the form of pragmas (in Fortran, they are special comments). Wherever a #pragma omp parallel directive appears, a team of threads is spawned to execute the block that follows.

// There's a race condition here; we'll talk about it below
#pragma omp parallel for
for (long i{0}; i < 20'000'000L; ++i) {
    counter += 1;
}

If a compiler doesn’t understand OpenMP, it ignores the unrecognized pragma, so this code still compiles and runs on a single thread. If it does understand OpenMP, the parallel directive causes threads to be spawned for the block that follows. This allows serial and parallel executables to be generated from the same source with very little effort.
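One way to see this in action is the standard _OPENMP macro, which OpenMP-aware compilers define when the feature is enabled. The minimal sketch below (our own illustration, not part of the tutorial) reports which mode it was built in:

#include <iostream>

int main() {
#ifdef _OPENMP
    // _OPENMP expands to the release date (yyyymm) of the OpenMP
    // specification the compiler supports
    std::cout << "Compiled with OpenMP " << _OPENMP << "\n";
#else
    std::cout << "Compiled without OpenMP; running serially\n";
#endif
    return 0;
}

Built with OpenMP enabled (see the next section for flags), it prints the specification date; built without, it falls back to the serial message.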

HPC Tutorials volume 1 has an OpenMP section that you can reference for further information if you’d like.

Compiling with OpenMP

g++ -fopenmp ... will compile with OpenMP; the equivalent flag for other compilers is usually easy to find in the man page or online. A CMakeLists.txt needs two things: finding OpenMP and linking it to a target:

# At the top, under the `project` call:
find_package(OpenMP REQUIRED)
# With other compilation calls
add_executable(blah blah.cpp)
target_link_libraries(blah PRIVATE OpenMP::OpenMP_CXX)

Running with OpenMP

export OMP_NUM_THREADS=N (where N is some integer) will set the number of threads that OpenMP programs run with. Otherwise, OpenMP programs are executed just as any other would be.
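As a quick way to confirm the variable is being honored, a minimal sketch like this one (our own illustration) prints the size of the thread team:

#include <iostream>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        // single: only one thread prints, so output isn't interleaved
        #pragma omp single
        std::cout << omp_get_num_threads() << " threads in this team\n";
    }
    return 0;
}

With OMP_NUM_THREADS=4, this should report a team of 4 threads.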

Directives

The most useful directives (which come after #pragma omp) are listed below; a short sketch combining several of them follows the list:

  • parallel: create a team of OpenMP threads that execute a region
  • for: create a parallel for loop
  • barrier: create an explicit barrier; there are implied barriers at the end of for, parallel, and single blocks, which can be removed with nowait
  • atomic: ensure that a specific storage location is accessed atomically
  • critical: restrict execution of a block to a single thread at a time
  • single: specify that a block is executed by only one thread
  • simd: indicate that a loop can be vectorized
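Here is a minimal sketch (our own illustration) that uses several of these directives in one parallel region:

#include <iostream>
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> hits;
    #pragma omp parallel
    {
        // critical: one thread at a time may push_back, because
        // std::vector is not safe for concurrent modification
        #pragma omp critical
        hits.push_back(omp_get_thread_num());

        // barrier: wait until every thread has checked in
        #pragma omp barrier

        // single: exactly one thread reports the total
        #pragma omp single
        std::cout << hits.size() << " threads checked in\n";
    }
    return 0;
}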

Reductions

Reductions exploit the associativity and commutativity of an operation to parallelize it by performing the partial operations out of order. For example, the following two sums are equivalent:

\[(1 + 2 + 3) + (4 + 5 + 6) = 21\] \[(4 + 1) + (3 + 2) + (6 + 5) = 21\]

Because of that, thread A can sum 1, 2, and 3 while thread B sums 4, 5, and 6, and the two partial results are then added together. Only that last addition, where the partial sums from threads A and B are combined, needs to be synchronized.
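That division of labor is roughly what OpenMP's reduction clause automates. A hand-rolled sketch of the same idea (our own illustration) looks like this:

#include <iostream>

int main() {
    long total = 0;
    #pragma omp parallel
    {
        long partial = 0;  // private to each thread; no synchronization needed
        #pragma omp for
        for (long i = 1; i <= 6; ++i) {
            partial += i;
        }
        // Only the final combine has to be synchronized
        #pragma omp critical
        total += partial;
    }
    std::cout << total << "\n";  // prints 21
    return 0;
}

The reduction clause shown below generates essentially this pattern automatically.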

OpenMP comes with these built-in reduction operators in C and C++:

           
Category          Operators
Arithmetic        +  *  -  min  max
Logical/bitwise   &  &&  |  ||  ^

The race condition in the example at the top of this page, which results from multiple threads trying to simultaneously modify counter, can be fixed with a sum reduction:

#include <cstddef>
#include <iostream>

int main() {
    size_t counter = 0;
    // Each thread accumulates its own private copy of counter; the
    // private copies are summed into the shared counter at the end
    #pragma omp parallel for reduction(+:counter)
    for (size_t i = 0; i < 20'000'000; ++i) {
        counter += 1;
    }
    std::cout << counter << std::endl;
    return 0;
}