The OpenMP ARB’s web page [1] offers an easy-to-understand OpenMP programming example: a small program solving the Helmholtz equation with a simple Jacobi algorithm as its hot spot. Though this may not be the most sophisticated numerical method for the problem, it is nevertheless quite useful for demonstrating and experimenting with different OpenMP programming strategies.
Inside the iteration loop there are two loop nests which can be parallelized automatically. The 2D array “uold” stores the result of the previous iteration and the 2D array “u” the result of the current iteration. In the first loop nest “u” is copied to “uold”; in the second loop nest the sweep operation is executed, including the summation of the squared residuals, which is used for the error estimate and for the termination condition of the surrounding iteration loop (see jacobi_omp_1.f90).
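
For reference, the structure of this hot spot is sketched below. This is a condensed sketch rather than the original source: the argument list is invented for illustration, and ax, ay and b stand for the coefficients of the usual five-point finite-difference discretization of the Helmholtz operator.

   subroutine jacobi_serial(n, m, ax, ay, b, omega, tol, maxit, u, f)
      implicit none
      integer :: n, m, maxit, i, j, k
      real*8  :: ax, ay, b, omega, tol, resid, error
      real*8  :: u(n,m), f(n,m), uold(n,m)

      k     = 1
      error = huge(1.0d0)              ! any value larger than the tolerance

      do while (k <= maxit .and. error > tol)

         ! first loop nest: keep the previous iterate
         do j = 1, m
            do i = 1, n
               uold(i,j) = u(i,j)
            end do
         end do

         ! second loop nest: Jacobi sweep plus accumulation of the squared residuals
         error = 0.0d0
         do j = 2, m-1
            do i = 2, n-1
               resid  = (ax*(uold(i-1,j)+uold(i+1,j)) + ay*(uold(i,j-1)+uold(i,j+1)) &
                         + b*uold(i,j) - f(i,j)) / b
               u(i,j) = uold(i,j) - omega*resid
               error  = error + resid*resid
            end do
         end do
         error = sqrt(error)/dble(n*m)
         k     = k + 1
      end do
   end subroutine jacobi_serial
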
The program as presented on the above-mentioned web site shows how the OpenMP parallelization overhead can be reduced by placing a single parallel region around both parallelized loops, whereas today’s auto-parallelizing compilers would create one parallel region for each of these loops (see jacobi_omp_2.f90).
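
With one parallel region per iteration of the outer loop, the structure becomes the following (again only an illustrative sketch of the pattern used in jacobi_omp_2.f90, with the same assumed names as above). The nowait clause on the second loop relies on the implicit barrier at the end of the parallel region.

   subroutine jacobi_one_region(n, m, ax, ay, b, omega, tol, maxit, u, f)
      implicit none
      integer :: n, m, maxit, i, j, k
      real*8  :: ax, ay, b, omega, tol, resid, error
      real*8  :: u(n,m), f(n,m), uold(n,m)

      k     = 1
      error = huge(1.0d0)
      do while (k <= maxit .and. error > tol)
         error = 0.0d0
   !$omp parallel private(i, j, resid)
   !$omp do
         do j = 1, m
            do i = 1, n
               uold(i,j) = u(i,j)
            end do
         end do
   !$omp end do
   !$omp do reduction(+:error)
         do j = 2, m-1
            do i = 2, n-1
               resid  = (ax*(uold(i-1,j)+uold(i+1,j)) + ay*(uold(i,j-1)+uold(i,j+1)) &
                         + b*uold(i,j) - f(i,j)) / b
               u(i,j) = uold(i,j) - omega*resid
               error  = error + resid*resid
            end do
         end do
   !$omp end do nowait                 ! the implicit barrier at the end of the
   !$omp end parallel                  ! parallel region is sufficient here
         error = sqrt(error)/dble(n*m)
         k     = k + 1
      end do
   end subroutine jacobi_one_region
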
At EWOMP ’02 [2] it was demonstrated how the parallel region can be extended further to contain the whole iteration loop, ending up with three barriers per iteration (see jacobi_omp_4.f90).
At first sight it seems that we cannot do with fewer than three barriers inside the iteration loop. The reuse of the reduction variable “error” in each iteration creates a false dependency, and omitting any of these barriers would clearly lead to a data race: “error” has to be initialized beforehand and protected against the summation in the second parallel loop, so a barrier is needed to guarantee that the initialization is finished before the first thread updates this variable. The result of the reduction becomes available only after the next barrier; then it can be used in the termination condition of the iteration loop. Because the iteration loop is inside the parallel region, all threads have to take the same decision, otherwise the program will hang. Therefore “error” cannot be initialized for the next iteration before all threads have decided to keep iterating. As a side effect, the same barriers separate the two parallelized loops from each other: copying the array “u” to “uold” has to be kept apart from the calculation of the new approximation of the solution “u”.
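
One possible arrangement of the three barriers, with the whole iteration loop inside a single parallel region, is sketched below; the comments mark which of the constraints just described each barrier serves. This is an illustrative sketch, not a copy of jacobi_omp_4.f90.

   subroutine jacobi_three_barriers(n, m, ax, ay, b, omega, tol, maxit, u, f)
      implicit none
      integer :: n, m, maxit, i, j, k
      real*8  :: ax, ay, b, omega, tol, resid, error
      real*8  :: u(n,m), f(n,m), uold(n,m)

      k     = 1
      error = huge(1.0d0)                 ! any value that makes the first test true

   !$omp parallel private(i, j, resid)
      do while (k <= maxit .and. sqrt(error)/dble(n*m) > tol)   ! identical decision
                                                                ! in every thread
   !$omp do
         do j = 1, m
            do i = 1, n
               uold(i,j) = u(i,j)
            end do
         end do
   !$omp end do                  ! barrier 1: copy finished, and every thread has
                                 !            already evaluated the loop condition
   !$omp single
         error = 0.0d0           ! safe only after barrier 1
         k     = k + 1
   !$omp end single              ! barrier 2: the reset is finished before any
                                 !            thread contributes to the reduction
   !$omp do reduction(+:error)
         do j = 2, m-1
            do i = 2, n-1
               resid  = (ax*(uold(i-1,j)+uold(i+1,j)) + ay*(uold(i,j-1)+uold(i,j+1)) &
                         + b*uold(i,j) - f(i,j)) / b
               u(i,j) = uold(i,j) - omega*resid
               error  = error + resid*resid
            end do
         end do
   !$omp end do                  ! barrier 3: the reduced error is valid for the
                                 !            next evaluation of the loop condition
      end do
   !$omp end parallel
   end subroutine jacobi_three_barriers
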

But it turns out that after manually unrolling the iteration loop twice and alternately using two different variables for the error estimation (“error1” and “error2”), a software pipelining technique can be applied: by placing the use of “error1” and the initialization of “error2” (and vice versa) in the same synchronization phase, one barrier per iteration can be eliminated (see jacobi_omp_8.f90).
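
The sketch below illustrates this twofold unrolled, software-pipelined variant (illustrative only; jacobi_omp_8.f90 may differ in details). The reduction variable of the next half-iteration is reset inside a single construct with a nowait clause, so the reset no longer needs a barrier of its own; two barriers per Jacobi sweep remain.

   subroutine jacobi_two_barriers(n, m, ax, ay, b, omega, tol, maxit, u, f)
      implicit none
      integer :: n, m, maxit, i, j, k
      real*8  :: ax, ay, b, omega, tol, resid, error1, error2
      real*8  :: u(n,m), f(n,m), uold(n,m)

      k      = 0
      error1 = 0.0d0                  ! reduction variable of the first sweep

   !$omp parallel private(i, j, resid)
      do
         ! ---- first half-iteration: accumulate into error1, prepare error2 ----
   !$omp do
         do j = 1, m
            do i = 1, n
               uold(i,j) = u(i,j)
            end do
         end do
   !$omp end do                       ! barrier 1
   !$omp single
         error2 = 0.0d0               ! reduction variable of the NEXT half-iteration;
         k      = k + 1               ! resetting it here saves a separate barrier
   !$omp end single nowait            ! (software pipelining)
   !$omp do reduction(+:error1)
         do j = 2, m-1
            do i = 2, n-1
               resid  = (ax*(uold(i-1,j)+uold(i+1,j)) + ay*(uold(i,j-1)+uold(i,j+1)) &
                         + b*uold(i,j) - f(i,j)) / b
               u(i,j) = uold(i,j) - omega*resid
               error1 = error1 + resid*resid
            end do
         end do
   !$omp end do                       ! barrier 2: error1 is complete
         if (k >= maxit .or. sqrt(error1)/dble(n*m) < tol) exit

         ! ---- second half-iteration: roles of error1 and error2 exchanged ----
   !$omp do
         do j = 1, m
            do i = 1, n
               uold(i,j) = u(i,j)
            end do
         end do
   !$omp end do                       ! barrier 1
   !$omp single
         error1 = 0.0d0
         k      = k + 1
   !$omp end single nowait
   !$omp do reduction(+:error2)
         do j = 2, m-1
            do i = 2, n-1
               resid  = (ax*(uold(i-1,j)+uold(i+1,j)) + ay*(uold(i,j-1)+uold(i,j+1)) &
                         + b*uold(i,j) - f(i,j)) / b
               u(i,j) = uold(i,j) - omega*resid
               error2 = error2 + resid*resid
            end do
         end do
   !$omp end do                       ! barrier 2: error2 is complete
         if (k >= maxit .or. sqrt(error2)/dble(n*m) < tol) exit
      end do
   !$omp end parallel
   end subroutine jacobi_two_barriers
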

So far the algorithm has not been changed, but obviously the copying of “u” to “uold” can be saved by alternately using the values of one array, say “u1”, to calculate the new values in a second array, say “u2”, and vice versa. As a consequence, only one barrier is needed to separate the sweep that reads “u1” and writes “u2” from the sweep that reads “u2” and writes “u1”. After fourfold unrolling of the iteration loop and using four different variables for the error estimation, the same software pipelining technique can be employed again to reduce the number of barriers per iteration to only one (see jacobi_omp_9.f90).
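
The resulting one-barrier-per-sweep structure might look as follows (an illustrative sketch, not the original jacobi_omp_9.f90; the choice of which error variable is reset in which stage is one possible placement). The iteration counter is now kept private in every thread, because there is no longer a barrier between the termination test and a convenient place to update a shared counter.

   subroutine jacobi_one_barrier(n, m, ax, ay, b, omega, tol, maxit, u1, u2, f)
      implicit none
      integer :: n, m, maxit, i, j, k
      real*8  :: ax, ay, b, omega, tol, resid
      real*8  :: error1, error2, error3, error4
      real*8  :: u1(n,m), u2(n,m), f(n,m)

      ! u1 and u2 both carry the boundary values; u1 also holds the initial guess.
      error1 = 0.0d0
      error2 = 0.0d0
      k      = 0

   !$omp parallel private(i, j, resid) firstprivate(k)
      do
         ! -- stage 1: sweep u1 -> u2, reduce into error1, reset error3 for stage 3 --
         k = k + 1                    ! k is private: every thread counts for itself
   !$omp single
         error3 = 0.0d0               ! pipelined reset, no barrier of its own needed
   !$omp end single nowait
   !$omp do reduction(+:error1)
         do j = 2, m-1
            do i = 2, n-1
               resid   = (ax*(u1(i-1,j)+u1(i+1,j)) + ay*(u1(i,j-1)+u1(i,j+1)) &
                          + b*u1(i,j) - f(i,j)) / b
               u2(i,j) = u1(i,j) - omega*resid
               error1  = error1 + resid*resid
            end do
         end do
   !$omp end do                       ! the only barrier of this stage; it also keeps
                                      ! stage 2 from overwriting u1 too early
         if (k >= maxit .or. sqrt(error1)/dble(n*m) < tol) exit

         ! -- stage 2: sweep u2 -> u1, reduce into error2, reset error4 --
         k = k + 1
   !$omp single
         error4 = 0.0d0
   !$omp end single nowait
   !$omp do reduction(+:error2)
         do j = 2, m-1
            do i = 2, n-1
               resid   = (ax*(u2(i-1,j)+u2(i+1,j)) + ay*(u2(i,j-1)+u2(i,j+1)) &
                          + b*u2(i,j) - f(i,j)) / b
               u1(i,j) = u2(i,j) - omega*resid
               error2  = error2 + resid*resid
            end do
         end do
   !$omp end do
         if (k >= maxit .or. sqrt(error2)/dble(n*m) < tol) exit

         ! -- stage 3: sweep u1 -> u2, reduce into error3, reset error1 --
         k = k + 1
   !$omp single
         error1 = 0.0d0
   !$omp end single nowait
   !$omp do reduction(+:error3)
         do j = 2, m-1
            do i = 2, n-1
               resid   = (ax*(u1(i-1,j)+u1(i+1,j)) + ay*(u1(i,j-1)+u1(i,j+1)) &
                          + b*u1(i,j) - f(i,j)) / b
               u2(i,j) = u1(i,j) - omega*resid
               error3  = error3 + resid*resid
            end do
         end do
   !$omp end do
         if (k >= maxit .or. sqrt(error3)/dble(n*m) < tol) exit

         ! -- stage 4: sweep u2 -> u1, reduce into error4, reset error2 --
         k = k + 1
   !$omp single
         error2 = 0.0d0
   !$omp end single nowait
   !$omp do reduction(+:error4)
         do j = 2, m-1
            do i = 2, n-1
               resid   = (ax*(u2(i-1,j)+u2(i+1,j)) + ay*(u2(i,j-1)+u2(i,j+1)) &
                          + b*u2(i,j) - f(i,j)) / b
               u1(i,j) = u2(i,j) - omega*resid
               error4  = error4 + resid*resid
            end do
         end do
   !$omp end do
         if (k >= maxit .or. sqrt(error4)/dble(n*m) < tol) exit
      end do
   !$omp end parallel
      ! depending on the stage in which the loop was left, the latest iterate is
      ! either in u2 (stages 1 and 3) or in u1 (stages 2 and 4)
   end subroutine jacobi_one_barrier
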

The code is getting a bit lengthy with this technique. It can be condensed by putting the reduction variables into a vector (here: “errorh(1:3)”). Unfortunately, the reduction variable is then an array element, which is not allowed in an OpenMP reduction clause, so the reduction has to be programmed explicitly with a critical region and an explicit barrier (see jacobi_omp_10.f90).
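
The manual replacement for the reduction clause can be sketched in isolation as follows. The routine is meant to be called by all threads of an enclosing parallel region (orphaned worksharing); the name sweep_reduce, the argument list and the bounds of errorh are invented for this sketch, and errorh(ie) is assumed to have been reset to zero in an earlier, barrier-separated phase.

   subroutine sweep_reduce(n, m, ax, ay, b, omega, f, uin, uout, errorh, ie)
      implicit none
      integer :: n, m, ie, i, j
      real*8  :: ax, ay, b, omega, resid, errl
      real*8  :: f(n,m), uin(n,m), uout(n,m), errorh(0:2)

      errl = 0.0d0                     ! private partial sum of the squared residuals
   !$omp do
      do j = 2, m-1
         do i = 2, n-1
            resid     = (ax*(uin(i-1,j)+uin(i+1,j)) + ay*(uin(i,j-1)+uin(i,j+1)) &
                         + b*uin(i,j) - f(i,j)) / b
            uout(i,j) = uin(i,j) - omega*resid
            errl      = errl + resid*resid
         end do
      end do
   !$omp end do nowait                 ! drop the implicit barrier here ...
   !$omp critical
      errorh(ie) = errorh(ie) + errl   ! ... an array element may not appear in a
   !$omp end critical                  !     reduction clause, so sum up manually ...
   !$omp barrier                       ! ... and synchronize explicitly: errorh(ie)
                                       !     is complete after this barrier
   end subroutine sweep_reduce
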

Also the two sweep operations between “u1” and “u2” can be combined by adding a third dimension to the u array (here: “real*8 uh(n,m,0:1)”) and toggling the index of the third dimension. But it turns out that this toggling between the planes of the array may hinder compiler optimization, because it is no longer trivial to verify that the accesses to the two planes are disjoint (see jacobi_omp_11.f90).
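
The toggling itself can be sketched like this (residual accumulation omitted for brevity; the routine name and argument list are again invented, and the routine is assumed to be called by all threads of an enclosing parallel region with the same value of k):

   subroutine sweep_toggle(n, m, ax, ay, b, omega, f, uh, k)
      implicit none
      integer :: n, m, k, i, j, iold, inew
      real*8  :: ax, ay, b, omega, resid
      real*8  :: f(n,m), uh(n,m,0:1)

      iold = mod(k, 2)                 ! plane holding the previous iterate
      inew = 1 - iold                  ! plane receiving the new iterate
      ! Because iold and inew are only known at run time, the compiler can no longer
      ! prove that the plane being read and the plane being written are disjoint,
      ! which is the suspected reason for the poor performance of this variant.
   !$omp do
      do j = 2, m-1
         do i = 2, n-1
            resid        = (ax*(uh(i-1,j,iold)+uh(i+1,j,iold)) &
                            + ay*(uh(i,j-1,iold)+uh(i,j+1,iold)) &
                            + b*uh(i,j,iold) - f(i,j)) / b
            uh(i,j,inew) = uh(i,j,iold) - omega*resid
         end do
      end do
   !$omp end do
   end subroutine sweep_toggle
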

The table below shows measurements in MFlop/s of six different versions of the Jacobi algorithm for a matrix size of 200x200, run on a Sun Fire 6800 with 24 UltraSPARC III Cu 900 MHz processors and compiled with the Sun ONE Studio 8 Fortran 95 compiler. The matrices fit into a single L2 cache, so the memory bandwidth is not a limiting factor. On the other hand, the problem is so small that it does not scale well beyond about 8 threads.

  • The serial program runs at 838 MFlop/s.
  • V1 is the original serial version, here compiled with autoparallelization turned on. It runs at the same speed as a straightforward OpenMP version with two parallel regions in the iteration loop.
  • V2 is the original OpenMP version from [1], with one parallel region in the iteration loop containing two parallel loops (see jacobi_omp_2.f90). V2 is clearly faster than V1.
  • V3 contains one parallel region around the whole iteration loop with three barriers (see jacobi_omp_4.f90). V3 is slightly faster than V2.
  • In V4 one barrier per iteration has been saved by twofold unrolling and software pipelining (see jacobi_omp_8.f90). V4 is faster than V3 for more than 8 threads.
  • V5 combines fourfold unrolling and software pipelining with the elimination of the copy operation by alternating between two solution arrays (see jacobi_omp_9.f90). This version contains only one barrier per iteration step and is by far the fastest.
  • The condensed version V6 is even slower than V1 because it could not be optimized efficiently by the compiler (see jacobi_omp_11.f90).
#threads      V1      V2      V3      V4      V5      V6
       1     812     815     816     803    1286     333
       2    1514    1534    1545    1546    2412     653
       4    2624    2717    2738    2717    4335    1284
       6    3314    3525    3598    3594    5717    1891
       8    3655    3985    4062    4080    6529    2395
      12    4078    4574    4757    5007    7443    3283
      16    3861    4365    4107    4357    6552    3859
Performance in MFlop/s of six different versions of the Jacobi algorithm

The code inside worksharing constructs is typically outlined into separate routines by an OpenMP compiler, which produces additional overhead. Also, splitting the loop range into chunks causes additional index calculations, which may slightly slow down the OpenMP version of a program.
The two workshared do loops inside the iteration loop of the Jacobi program can, however, be slightly modified so that they always have the same iteration space. Then the worksharing directives can easily be eliminated and the loop chunks precalculated in front of the iteration loop (see jacobi_omp_5.f90). Additional barriers have to be inserted where previously the end do directives synchronized the threads, and the reduction clause has to be replaced by a summation of private partial sums in a critical or atomic region.
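
A sketch of this manual worksharing is given below, under the assumption that both loops have been reshaped to run over the rows 2..m-1 (the boundary values of uold are then copied once before the iteration loop). The simple block distribution and all names are chosen for illustration and are not taken from jacobi_omp_5.f90.

   subroutine jacobi_manual(n, m, ax, ay, b, omega, tol, maxit, u, f)
      use omp_lib
      implicit none
      integer :: n, m, maxit, i, j, k, me, nthr, chunk, jlo, jhi
      real*8  :: ax, ay, b, omega, tol, resid, error, errl
      real*8  :: u(n,m), f(n,m), uold(n,m)

      uold  = u                        ! boundary values of uold are set once here
      k     = 1
      error = huge(1.0d0)

   !$omp parallel private(i, j, resid, errl, me, nthr, chunk, jlo, jhi)
      me    = omp_get_thread_num()
      nthr  = omp_get_num_threads()
      chunk = (m - 2 + nthr - 1) / nthr   ! block distribution of the rows 2..m-1
      jlo   = 2 + me*chunk
      jhi   = min(m-1, jlo + chunk - 1)   ! threads with jlo > m-1 get an empty range

      do while (k <= maxit .and. sqrt(error)/dble(n*m) > tol)

         do j = jlo, jhi               ! formerly a workshared copy loop
            do i = 2, n-1
               uold(i,j) = u(i,j)
            end do
         end do
   !$omp barrier                       ! replaces the implicit end-do barrier

   !$omp single
         error = 0.0d0
         k     = k + 1
   !$omp end single                    ! implicit barrier: reset done before summing

         errl = 0.0d0                  ! private partial sum
         do j = jlo, jhi               ! formerly a workshared reduction loop
            do i = 2, n-1
               resid  = (ax*(uold(i-1,j)+uold(i+1,j)) + ay*(uold(i,j-1)+uold(i,j+1)) &
                         + b*uold(i,j) - f(i,j)) / b
               u(i,j) = uold(i,j) - omega*resid
               errl   = errl + resid*resid
            end do
         end do
   !$omp atomic
         error = error + errl          ! replaces the reduction clause
   !$omp barrier                       ! error is complete after this barrier
      end do
   !$omp end parallel
   end subroutine jacobi_manual
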

Source Files and Makefiles


Fortran90 Versions
Makefile_f           Makefile for the Fortran90 versions; you may want to adapt the settings to your local programming environment. Type “gmake” to print a help text.
driver_ser.f90       Driver routines for the serial versions
driver_auto.f90      Driver routines for autoparallelization
driver_omp.f90       Driver routines for the OpenMP versions
driver_mpi.f90       Driver routines for the MPI versions
driver_hyb.f90       Driver routines for the hybrid (OpenMP+MPI) versions
jacobi_ser_1.f90     Jacobi solver - serial version / autoparallelizable version
jacobi_ser_2.f90     Jacobi solver - serial version / autoparallelizable version, without copying arrays
jacobi_ser_3.f90     Jacobi solver - serial version / autoparallelizable version, without copying arrays, using a 3D array uh(.,.,0:1), and thus losing performance
jacobi_omp_1.f90     Jacobi solver - OpenMP version, two parallel regions with one parallel loop each; the naive approach, basically equivalent to what autoparallelizing compilers currently do
jacobi_omp_2.f90     Jacobi solver - OpenMP version, original OpenMP version from www.openmp.org, two parallel loops in one parallel region (PR)
jacobi_omp_3.f90     Jacobi solver - OpenMP version, one PR outside the iteration loop, four barriers
jacobi_omp_4.f90     Jacobi solver - OpenMP version, one PR outside the iteration loop, three barriers
jacobi_omp_5.f90     Jacobi solver - OpenMP version, like jacobi_omp_4.f90, with precalculated loop limits
jacobi_omp_6.f90     Jacobi solver - OpenMP version, like jacobi_omp_5.f90, trying to optimize the reduction
jacobi_omp_8.f90     Jacobi solver - OpenMP version, one PR outside the iteration loop, twofold unrolling, software pipelining, two barriers per iteration
jacobi_omp_9.f90     Jacobi solver - OpenMP version, one PR outside the iteration loop, no copying, fourfold unrolling, software pipelining, one barrier per iteration
jacobi_omp_10.f90    Jacobi solver - OpenMP version, one PR, like jacobi_omp_9.f90, with errorh(0:2) to reduce the code length
jacobi_omp_11.f90    Jacobi solver - OpenMP version, one PR, like jacobi_omp_9.f90, with errorh(0:2) and uh(.,.,0:1) to reduce the code length - bad performance
jacobi_mpi_1.f90     Jacobi solver - MPI version, domain decomposition in one dimension
jacobi_mpi_2.f90     Jacobi solver - MPI version, domain decomposition in one dimension, simultaneous transfer and computation
jacobi_hyb_1.f90     Jacobi solver - hybrid (OpenMP+MPI) version
jacobi_omp_3err.f90  Jacobi solver - OpenMP version containing a data race, to be detected with KAI's Assure or the Intel ThreadChecker
jacobi_omp_4err.f90  Jacobi solver - OpenMP version containing a data race, to be detected with KAI's Assure or the Intel ThreadChecker
gethrtime.f90        Timing routine
realtime.f90         Timing routine
C Versions
Makefile_c           Makefile for the C versions; you may want to adapt the settings to your local programming environment. Type “gmake” to print a help text.
driver.c             Driver routine for the serial and OpenMP versions
jacobi.c             Jacobi solver - serial version
jacobi_99.c          Jacobi solver - serial version written in C99
jacobi_omp1.c        Jacobi solver - OpenMP version, two parallel regions with one parallel loop each; the naive approach
jacobi_omp2.c        Jacobi solver - OpenMP version, two parallel loops in one parallel region (PR)
jacobi_omp3.c        Jacobi solver - OpenMP version, one PR outside the iteration loop, four barriers
realtime.c           Timing routine
Input datasets
input.small          small dataset - for debugging purposes
input.medium         medium dataset - in L2 cache (about 1 MB working set)
input.large          large dataset - out of L2 cache (about 600 MB working set)
Archives
jacobi.shar          all source files, Makefiles and input datasets in a self-extracting shell archive
jacobi.tar           all source files, Makefiles and input datasets in a tar archive

References

  1. The OpenMP Architecture Review Board (ARB), http://www.openmp.org
  2. Dieter an Mey, Thomas Haarmann, Wolfgang Koschel, “Pushing Loop-Level Parallelization to the Limit”, EWOMP’02 , http://www.caspur.it/ewomp02
  3. Dieter an Mey, “Two OpenMP Programming Patterns”, EWOMP’03, http://www.rz.rwth-aachen.de/ewomp03