This is a guided assignment that should give you an idea of what to look for when optimizing C++ code.

Download optimize.cpp on the supercomputer. Compile it with optimization and see how fast it runs:

g++ -std=c++20 -O3 -o optimize optimize.cpp # the "O" in "-O3" is the letter
time ./optimize

You’ll submit an informal “essay” detailing the changes you made and the associated performance gains, so write down answers to the bolded questions that follow as you get to them. Since you’ll be submitting with git, it’s recommended that you do all your work in your scicomp repository.

Throughout the assignment, the result needs to remain correct–if the output simulation time changes you’ll need to fix it before proceeding.

Eliminating Unnecessary Work

Many of the functions in optimize.cpp take copies of large variables. To see how huge an effect this can have, change the laplacian function to take x by const reference rather than by value. Re-compile and time it again after doing so. How much of a speedup do you get? Why do you think it’s so dramatic?

Set rows to 800, where it will stay for the rest of the assignment, and compile and time once more for comparison to subsequent changes.

Fix any other functions (along with lambdas, range-based for loops, etc.) that should take arguments by const reference rather than by value. When you’re done and have recompiled, the system field from time’s output should be a very small portion of the real time. How much more of a speedup do these changes yield? Why isn’t it as significant?

Cache Efficiency

Some of the for loops in optimize.cpp iterate over columns first, then rows. Change the order of iteration on these loops, iterating over i then j rather than j then i. How much faster does optimize run with this change? Why do you think that is?

Vectorization

To see the results of this change, you’ll need to coax the compiler to vectorize; in this case, you can do so by adding -Ofast to your compile flags. Do so, then time optimize again for a point of comparison.

Remove the if (...) continue; statements in the inner loops of energy and step, instead iterating over the correct bounds (you’ll have to modify each for loop to do so). How much of a speedup results, and why?

Threading

Now that most of the program’s inefficiencies are eliminated, it’s worth turning to threading. Add OpenMP threading to your program, parallelizing each outer loop in energy and step. Add -fopenmp to your compiler flags, and use export OMP_NUM_THREADS=2 (or 4, or 8, etc.) to set the number of threads. You’ll need a reduction clause on each loop in energy to correctly sum E. You can look to the #pragma openmp ... lines in MountainRange.hpp for an example; there’s no need to overthink it, OpenMP threading is very simple. What sort of speedup do you get when you run with 2 threads? 4? 8? 16? Why do you think the diminishment of returns is so severe?

Other Optimizations

Since you’ll probably be applying some of the changes you made above to your own code anyway, feel free to use it as the basis for any further optimizations. I only care that the optimize binary that your CMakeLists.txt outputs is created, behaves correctly, and is fast.

Make any other optimizations you think are prudent. Some things to consider:

  • Does the problem’s symmetry lend itself to algorithmic optimization?
  • Does the energy calculation need to continue once the floor has been exceeded?
  • Given the predictable exponential decay of energy, does it need to be calculated every iteration?
  • Does the potential energy calculation really require two separate loops over u?
  • Is iterating over separate, rather than contiguous, vectors the most efficient?
  • Would it be more efficient to interleave u and v?
  • Could an iterative method (that yields exactly mathematically equivalent results, of course) speed things up?

How much more of a speedup were you able to get? What seemed to help the most, and why do you think that’s the case? Did you find out anything interesting or unexpected during the course of this optimization?

Submission

Update your CMakeLists.txt to create two more binaries, optimize and wavesolve_openmp; make sure to compile with OpenMP. You can default to -Ofast (rather than -O3) for release builds by putting the following line before the call to project in CMakeLists.txt:

set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -DNDEBUG" CACHE STRING "Flags used by the CXX compiler during RELEASE builds")

Make sure to run CMake with cmake -DCMAKE_BUILD_TYPE=release to take advantage of this setting.

optimize should give the same result as the binary that results from compiling the original optimize.cpp, but needs to be much faster: on a full m9 node (which you can request with salloc -p m9 -N 1 -n 28 ...) with 8 OpenMP threads (export OMP_NUM_THREADS=8) it should run in less than a second. Feel free to fine-tune compilation flags in your CMakeLists.txt to get an efficient binary.

wavesolve_openmp will be much like wavesolve_serial, but will add OpenMP threading and has speed requirements. It must run on 2d-medium-in.wo from wavefiles.tar.gz with 8 OpenMP threads on a full m9 node in 20 seconds.

Add an informal plain text “essay” titled cpp-optimization.txt (or cpp-optimization.md if you want markdown formatting) that contains answers to the bolded questions above.

Develop in a branch named phase3 or tag the commit you’d like me to grade from phase3 and push it.

Grading

This phase is worth 20 points. The following deductions, up to 20 points total, will apply for a program that doesn’t work as laid out by the spec:

Defect Deduction
Failure to compile 2 points per binary
Failure of wavesolve_openmp to work on each of 3 test files 2 points each
Failure to run successfully (e.g. due to a segmentation fault or hang) 2 points per binary
Failure of wavesolve_openmp to checkpoint correctly 2 points
Failure of wavesolve_openmp to run on 2d-medium-in.wo on m9 with 8 threads in 20 seconds 4 points
…in 60 seconds 4 points
Failure of optimize to output “5.39”, and only “5.39”, to stdout 4 points
Failure of optimize to run within a second on m9 with 8 threads 4 points
optimize isn’t relevant to the assignment, or just prints “5.39” or similar 8 points
The informal essay doesn’t answer the bolded questions above, or the answers are nonsensical 1-8 points

An extra credit point will be awarded if optimize runs correctly in less than 0.1 seconds with 8 threads on m9.