OpenMP Threading
OpenMP is an API for multi-platform, shared-memory parallel programming in C, C++, and Fortran. OpenMP commands take the form of special compiler directives called pragmas. Threads are spawned wherever an instance of #pragma omp parallel appears.
// There's a race condition here, we'll talk about it below
long counter{0};
#pragma omp parallel for
for (long i{0}; i < 20'000'000l; ++i) {
    counter += 1;
}
If a compiler doesn’t understand OpenMP’s pragmas, this code will still compile and will run on a single thread. If it does understand OpenMP, it will spawn a team of threads where the omp parallel directive appears and execute the following block in parallel. This allows serial and parallel code to be generated from the same program with very little effort.
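You can observe which way a program was built with the standard _OPENMP macro, which compilers define only when OpenMP support is enabled. A minimal check:
#include <iostream>

int main() {
#ifdef _OPENMP
    // Defined by the compiler when OpenMP support is enabled
    std::cout << "Compiled with OpenMP\n";
#else
    std::cout << "Compiled without OpenMP; directives were ignored\n";
#endif
    return 0;
}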
HPC Tutorials volume 1 has an OpenMP section that you can reference for further information if you’d like.
Compiling with OpenMP
g++ -fopenmp ...
will compile with OpenMP; the required flag for any compiler is usually easy to find on its man page or the internet. A CMakeLists.txt requires two things: finding OpenMP and linking it to a target:
# At the top, under the `project` call:
find_package(OpenMP REQUIRED)
# With other compilation calls
add_executable(blah blah.cpp)
target_link_libraries(blah PRIVATE OpenMP::OpenMP_CXX)
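With those two lines in place, the project configures and builds as usual, for example:
cmake -B build
cmake --build build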
Running with OpenMP
export OMP_NUM_THREADS=N
(where N is some integer) will set the number of threads that OpenMP programs run with. Otherwise, OpenMP programs are executed just as any other would be.
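As a concrete example, using the blah executable from the CMake snippet above:
export OMP_NUM_THREADS=4
./blah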
Directives
The most useful directives (which come after #pragma omp) are listed below; several of them are combined in the sketch that follows the list.
- parallel: create a team of OpenMP threads that execute a region
- for: create a parallel for loop
- barrier: create an explicit barrier; there are implied barriers at the end of for, parallel, and single blocks, which can be removed with nowait
- atomic: ensure that a specific storage location is accessed atomically
- critical: restrict execution of a block to a single thread at a time
- single: specify that a block is executed by only one thread
- simd: indicate that a loop can be vectorized
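For instance, here is a minimal sketch (not from the tutorial) combining parallel, single, critical, and barrier; omp_get_num_threads and omp_get_thread_num are standard OpenMP runtime calls from <omp.h>:
#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        // Executed by exactly one thread; implied barrier at the end
        #pragma omp single
        std::printf("running with %d threads\n", omp_get_num_threads());

        // One thread at a time may execute this block
        #pragma omp critical
        std::printf("hello from thread %d\n", omp_get_thread_num());

        // All threads wait here before continuing
        #pragma omp barrier

        // Loop iterations are divided among the team
        #pragma omp for
        for (int i = 0; i < 8; ++i) {
            // ... work on iteration i ...
        }
    }
    return 0;
}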
Reductions
Reductions exploit the associative and commutative properties of an operation to parallelize it by performing the individual operations out of order. For example, the following two sums are equivalent:
\[(1 + 2 + 3) + (4 + 5 + 6) = 21\]
\[(4 + 1) + (3 + 2) + (6 + 5) = 21\]
Because of that, we could have thread A compute the sum of 1, 2, and 3 and thread B compute the sum of 4, 5, and 6, then sum the results of both threads. Only that last sum, the one where the results of threads A and B are added together, will need to be synchronized.
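To make that concrete, here is a hand-written sketch of what a sum reduction does conceptually (an illustration, not how OpenMP actually implements it): each thread accumulates a private partial sum with no synchronization, and only the final combine is protected:
#include <iostream>

int main() {
    long total = 0;
    #pragma omp parallel
    {
        long partial = 0;      // private to each thread
        #pragma omp for nowait // no barrier needed after the loop
        for (long i = 1; i <= 6; ++i) {
            partial += i;      // unsynchronized: no other thread touches it
        }
        #pragma omp critical   // only this combine step is synchronized
        total += partial;
    }
    std::cout << total << std::endl; // prints 21
    return 0;
}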
OpenMP comes with these built-in reduction operators in C and C++:
Arithmetic: +, *, -, min, max
Logical: &, &&, |, ||, ^
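Any of these can be used in a reduction clause in place of +. For example, a small sketch of a min reduction (the array and variable names here are made up for illustration):
#include <iostream>

int main() {
    const int data[] = {7, 3, 9, 1, 4};
    int lowest = data[0];
    // Each thread tracks its own minimum; the smallest wins at the end
    #pragma omp parallel for reduction(min:lowest)
    for (int i = 1; i < 5; ++i) {
        if (data[i] < lowest) {
            lowest = data[i];
        }
    }
    std::cout << lowest << std::endl; // prints 1
    return 0;
}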
The race condition in the example at the top of this page, which results from multiple threads trying to modify counter simultaneously, can be fixed with a sum reduction:
#include <cstddef>
#include <iostream>

int main() {
    std::size_t counter = 0;
    // Each thread increments a private copy of counter;
    // the copies are summed once at the end of the loop
    #pragma omp parallel for reduction(+:counter)
    for (std::size_t i = 0; i < 20'000'000; ++i) {
        counter += 1;
    }
    std::cout << counter << std::endl;
    return 0;
}
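Compiled with g++ -fopenmp and run with any number of threads, this now prints 20000000 every time; for example (the file name here is assumed):
g++ -fopenmp reduction.cpp -o reduction
OMP_NUM_THREADS=4 ./reduction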