The OpenMP ARB’s web page [1] offers an easy-to-understand OpenMP programming example: a small program solving the Helmholtz equation with a simple Jacobi algorithm as the hot spot. Though it may not be the most sophisticated numerical method for this problem, it is nevertheless quite useful for demonstrating and experimenting with different OpenMP programming strategies.

Inside the iteration loop there are two loop nests which can be automatically parallelized. The 2D array “uold” stores the results of the previous iteration and the 2D array “u” stores the results of the current iteration. In the first loop nest “u” is copied to “uold”, and in the second loop nest the sweep operation is executed, including the summation of the squared residuals used for the error estimation and for the termination condition of the surrounding iteration loop (see jacobi_omp_1.f90).
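The structure of these two loop nests can be sketched in C as follows (the original code is Fortran; the function name `jacobi_iteration` and the stencil coefficients `ax`, `ay`, `b` are hypothetical names chosen for this sketch):

```c
/* One Jacobi iteration in the shape described above: the first loop nest
   copies u into uold, the second performs the 5-point stencil sweep and
   accumulates the squared residuals for the error estimate.
   Serial sketch with hypothetical names; arrays are stored row-wise. */
double jacobi_iteration(int n, int m, double *u, double *uold,
                        const double *f, double ax, double ay, double b)
{
    /* first loop nest: keep the previous iterate */
    for (int j = 0; j < m; j++)
        for (int i = 0; i < n; i++)
            uold[j * n + i] = u[j * n + i];

    /* second loop nest: sweep over the interior points,
       summing the squared residuals for the termination test */
    double error = 0.0;
    for (int j = 1; j < m - 1; j++)
        for (int i = 1; i < n - 1; i++) {
            double resid = (ax * (uold[j * n + i - 1] + uold[j * n + i + 1])
                          + ay * (uold[(j - 1) * n + i] + uold[(j + 1) * n + i])
                          + b * uold[j * n + i] - f[j * n + i]) / b;
            u[j * n + i] = uold[j * n + i] - resid;
            error += resid * resid;
        }
    return error;
}
```

The returned sum of squared residuals is exactly the quantity tested against the tolerance in the surrounding iteration loop.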

The program as presented on the above-mentioned web site shows how the OpenMP parallelization overhead can be reduced by extracting one parallel region around the two parallelized loops, whereas today’s auto-parallelizing compilers would create one parallel region for each of these loops (see jacobi_omp_2.f90).
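In C with OpenMP the idea looks roughly like this (a sketch with hypothetical names, not the original code): both worksharing loops live inside a single parallel region, so the team of threads is started only once per call:

```c
/* One parallel region spanning both worksharing loops, instead of two
   separate parallel regions.  Hypothetical names; sketch only. */
double copy_and_sweep(int n, int m, double *u, double *uold,
                      const double *f, double ax, double ay, double b)
{
    double error = 0.0;
    #pragma omp parallel
    {
        /* first worksharing loop: save the previous iterate
           (implicit barrier at the end of the loop) */
        #pragma omp for
        for (int j = 0; j < m; j++)
            for (int i = 0; i < n; i++)
                uold[j * n + i] = u[j * n + i];

        /* second worksharing loop: stencil sweep with a reduction;
           no second parallel region has to be started for it */
        #pragma omp for reduction(+:error)
        for (int j = 1; j < m - 1; j++)
            for (int i = 1; i < n - 1; i++) {
                double resid = (ax * (uold[j * n + i - 1] + uold[j * n + i + 1])
                              + ay * (uold[(j - 1) * n + i] + uold[(j + 1) * n + i])
                              + b * uold[j * n + i] - f[j * n + i]) / b;
                u[j * n + i] = uold[j * n + i] - resid;
                error += resid * resid;
            }
    }
    return error;
}
```

Compiled without OpenMP the pragmas are ignored and the function runs as correct serial code.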

In [2] it was demonstrated at EWOMP ’02 how the parallel region can be extended further to contain the whole iteration loop, finally containing three barriers (see jacobi_omp_4.f90).

But it seems that we cannot do with fewer than three barriers within the iteration loop. The reuse of the reduction variable “error” in each iteration generates a false dependency. Omitting one of these barriers would clearly lead to a data race: “error” has to be initialized beforehand and protected against the summation process of the second parallel loop. A barrier is needed to guarantee that the initialization is finished before the first thread updates this variable. The result of the reduction is available after the next barrier; only then can it be used in the termination condition of the iteration loop. Because the iteration loop is inside the parallel region, all threads have to take the same decision, otherwise the program will hang. So “error” cannot be re-initialized for the next iteration before all threads have taken the decision to go on iterating. As a side effect the same barriers separate the two parallelized loops from each other: copying the array “u” to “uold” has to be separated from the calculation of the new approximation of the solution “u”.
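The three-barrier structure can be made concrete in a C sketch (hypothetical names; the original jacobi_omp_4.f90 is Fortran). Each of the three barriers below is the implicit barrier of a worksharing or single construct:

```c
/* Whole iteration loop inside one parallel region, three barriers per
   iteration.  Sketch with hypothetical names. */
double jacobi_solve(int n, int m, double *u, double *uold, const double *f,
                    double ax, double ay, double b, double tol, int maxit)
{
    double error = 0.0;   /* shared reduction variable, reused every iteration */
    #pragma omp parallel
    {
        for (int it = 0; it < maxit; it++) {
            #pragma omp for
            for (int j = 0; j < m; j++)      /* barrier 1: copy complete */
                for (int i = 0; i < n; i++)
                    uold[j * n + i] = u[j * n + i];

            #pragma omp single
            error = 0.0;                     /* barrier 2: init complete
                                                before the summation starts */

            #pragma omp for reduction(+:error)
            for (int j = 1; j < m - 1; j++)  /* barrier 3: sum complete */
                for (int i = 1; i < n - 1; i++) {
                    double resid = (ax * (uold[j*n+i-1] + uold[j*n+i+1])
                                  + ay * (uold[(j-1)*n+i] + uold[(j+1)*n+i])
                                  + b * uold[j*n+i] - f[j*n+i]) / b;
                    u[j * n + i] = uold[j * n + i] - resid;
                    error += resid * resid;
                }

            if (error <= tol)   /* every thread takes the same decision */
                break;
        }
    }
    return error;
}
```

Note that the termination test reads “error” after barrier 3, and the next write to “error” happens only after barrier 1 of the following iteration, which is exactly the protection described above.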

But it turns out that after manually unrolling the iteration loop twice and alternately using two different variables for the error estimation (“error1” and “error2”), a software-pipelining technique can be applied: by placing the use of “error1” and the initialization of “error2” (and vice versa) in the same synchronization phase, one barrier per iteration can be eliminated (see jacobi_omp_8.f90).
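A C sketch of this pipelining (hypothetical names; the original jacobi_omp_8.f90 is Fortran): the initialization of the *next* accumulator is placed between the two remaining barriers of the *current* iteration, so no extra barrier is needed for it:

```c
/* Twofold unrolled iteration loop with two alternating reduction
   variables err1/err2; two barriers per iteration (the implicit
   barriers of the two "omp for" loops).  Sketch, hypothetical names. */
void jacobi_two_barriers(int n, int m, double *u, double *uold, const double *f,
                         double ax, double ay, double b, double tol, int maxit)
{
    double err1 = 0.0, err2 = 0.0;
    #pragma omp parallel
    {
        for (int it = 0; it < maxit; it += 2) {
            /* ---- iteration using err1 ---- */
            #pragma omp for                        /* barrier: copy complete */
            for (int j = 0; j < m; j++)
                for (int i = 0; i < n; i++)
                    uold[j * n + i] = u[j * n + i];

            #pragma omp single nowait
            err2 = 0.0;      /* pipelined: init the NEXT accumulator here;
                                the surrounding barriers protect it */

            #pragma omp for reduction(+:err1)      /* barrier: sum complete */
            for (int j = 1; j < m - 1; j++)
                for (int i = 1; i < n - 1; i++) {
                    double resid = (ax * (uold[j*n+i-1] + uold[j*n+i+1])
                                  + ay * (uold[(j-1)*n+i] + uold[(j+1)*n+i])
                                  + b * uold[j*n+i] - f[j*n+i]) / b;
                    u[j * n + i] = uold[j * n + i] - resid;
                    err1 += resid * resid;
                }
            if (err1 <= tol) break;

            /* ---- iteration using err2 (roles swapped) ---- */
            #pragma omp for
            for (int j = 0; j < m; j++)
                for (int i = 0; i < n; i++)
                    uold[j * n + i] = u[j * n + i];

            #pragma omp single nowait
            err1 = 0.0;

            #pragma omp for reduction(+:err2)
            for (int j = 1; j < m - 1; j++)
                for (int i = 1; i < n - 1; i++) {
                    double resid = (ax * (uold[j*n+i-1] + uold[j*n+i+1])
                                  + ay * (uold[(j-1)*n+i] + uold[(j+1)*n+i])
                                  + b * uold[j*n+i] - f[j*n+i]) / b;
                    u[j * n + i] = uold[j * n + i] - resid;
                    err2 += resid * resid;
                }
            if (err2 <= tol) break;
        }
    }
}
```

Each reset of an accumulator is separated by at least one barrier both from its last read (the termination test) and from its next summation, so no data race is introduced.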

So far the algorithm has not been changed, but obviously we can save the copying of the array “u” to “uold” by alternately using the values of one array, say “u1”, to calculate the new values, say “u2”, and vice versa. As a consequence we only need one barrier to protect the update of “u1” using “u2” against the update of “u2” using “u1”, and the other way round. After fourfold unrolling of the iteration loop and using four different variables for the error estimation, the same software-pipelining technique can be employed again to reduce the number of barriers per iteration to only one (see jacobi_omp_9.f90).
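The elimination of the copy alone can be sketched as follows (serial C sketch with hypothetical names; the fourfold unrolling and the pipelined error variables of jacobi_omp_9.f90 are omitted for clarity). Both arrays must carry the same boundary values, since only the interior of the destination array is rewritten each sweep:

```c
/* Ping-pong sweep: read the previous iterate from src, write the new
   one to dst, then swap the pointer roles.  Sketch, hypothetical names. */
double sweep(int n, int m, const double *src, double *dst, const double *f,
             double ax, double ay, double b)
{
    double error = 0.0;
    for (int j = 1; j < m - 1; j++)
        for (int i = 1; i < n - 1; i++) {
            double resid = (ax * (src[j*n+i-1] + src[j*n+i+1])
                          + ay * (src[(j-1)*n+i] + src[(j+1)*n+i])
                          + b * src[j*n+i] - f[j*n+i]) / b;
            dst[j * n + i] = src[j * n + i] - resid;
            error += resid * resid;
        }
    return error;
}

double jacobi_pingpong(int n, int m, double *u1, double *u2, const double *f,
                       double ax, double ay, double b, double tol, int maxit)
{
    double *src = u1, *dst = u2, error = 0.0;
    for (int it = 0; it < maxit; it++) {
        error = sweep(n, m, src, dst, f, ax, ay, b);
        double *tmp = src; src = dst; dst = tmp;   /* swap the roles */
        if (error <= tol) break;
    }
    return error;
}
```

In the parallel version, the single remaining barrier per iteration stands exactly where the swap happens, separating the two sweep directions from each other.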

The code gets a bit lengthy with this technique. It can be condensed by putting the reduction variables into a vector (here: “errorh(1:3)”). Unfortunately the reduction variable is then an array element, which is not allowed in the OpenMP reduction clause, so the reduction has to be programmed explicitly with a critical region and an explicit barrier (see jacobi_omp_10.f90).
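A hand-coded reduction into an array element looks like this in C (a sketch of the technique with hypothetical names, not the original Fortran code): each thread accumulates a private partial sum, combines it into the shared array element inside a critical region, and an explicit barrier makes the total visible to all threads:

```c
/* Reduction clause replaced by a hand-coded reduction into the array
   element errorh[k].  Sketch, hypothetical names. */
void reduce_into_element(int n, const double *x, double *errorh, int k)
{
    errorh[k] = 0.0;                  /* initialized before the team starts */
    #pragma omp parallel
    {
        double local = 0.0;           /* private partial sum */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            local += x[i] * x[i];

        #pragma omp critical
        errorh[k] += local;           /* hand-coded reduction step */

        #pragma omp barrier           /* errorh[k] is now final and may be
                                         read by every thread */
    }
}
```

Compiled without OpenMP this degenerates to a plain serial sum, so the sketch stays correct either way.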

Also, the two sweep operations between “u1” and “u2” can be combined by adding a third dimension to the u array (here: “real*8 uh(n,m,0:1)”) and alternating the index in this third dimension. But it turns out that toggling between the planes of this array may hinder compiler optimization, because it is no longer trivial to verify that the accesses to the two planes are disjoint (see jacobi_omp_11.f90).

The table below shows measurements in MFlop/s of six different versions of the Jacobi algorithm with a matrix size of 200x200, run on a Sun Fire 6800 with 24 UltraSPARC III Cu 900 MHz processors and compiled with the Sun ONE Studio 8 Fortran 95 compiler. The matrices used fit into a single L2 cache, so the memory bandwidth is not a limiting factor. On the other hand the problem is so small that it does not scale well beyond about 8 threads.

- The serial program runs at 838 MFlop/s.
- V1 is the original serial version compiled with autoparallelization turned on. It runs at the same speed as a straightforward OpenMP version with two parallel regions in the iteration loop.
- V2 is the original OpenMP version from [1] with one parallel region in the iteration loop containing two parallel loops (see jacobi_omp_2.f90). V2 is clearly faster than V1.
- V3 contains one parallel region around the whole iteration loop with three barriers (see jacobi_omp_4.f90). V3 is slightly faster than V2.
- In V4 one barrier has been saved by twofold unrolling and software pipelining (see jacobi_omp_8.f90). V4 is faster than V3 for more than 8 threads.
- V5 uses fourfold unrolling and software pipelining, plus it avoids the copy operation by toggling between two planes of a three-dimensional array (see jacobi_omp_10.f90). This version contains only one barrier per iteration step and is by far the fastest.
- The condensed version V6 is even slower than V1 because it could not be optimized efficiently by the compiler (see jacobi_omp_11.f90).

| #threads | V1 | V2 | V3 | V4 | V5 | V6 |
|---|---|---|---|---|---|---|
| 1 | 812 | 815 | 816 | 803 | 1286 | 333 |
| 2 | 1514 | 1534 | 1545 | 1546 | 2412 | 653 |
| 4 | 2624 | 2717 | 2738 | 2717 | 4335 | 1284 |
| 6 | 3314 | 3525 | 3598 | 3594 | 5717 | 1891 |
| 8 | 3655 | 3985 | 4062 | 4080 | 6529 | 2395 |
| 12 | 4078 | 4574 | 4757 | 5007 | 7443 | 3283 |
| 16 | 3861 | 4365 | 4107 | 4357 | 6552 | 3859 |

*Performance in MFlop/s of six different versions of the Jacobi algorithm*

The code inside worksharing constructs is typically outlined into separate routines by an OpenMP compiler, which produces additional overhead. The splitting of the loop range into chunks also causes additional index calculations, which may slightly slow down the OpenMP version of a program.

With some minor modifications, however, the two worksharing do loops inside the iteration loop of the Jacobi program can be arranged to always have the same iteration space.

The worksharing directives can then easily be eliminated and the loop chunks precalculated in front of the iteration loop (see jacobi_omp_5.f90). But explicit barriers have to be inserted where previously the implicit barriers of the end do directives synchronized the threads, and the reduction clause has to be replaced by a summation of private partial sums in a critical or atomic region.
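In C the chunk precalculation can be sketched as follows (hypothetical names; only one sweep is shown instead of the full iteration loop): each thread computes its row range once, and the implicit end-do barrier and the reduction clause are replaced by an explicit barrier and a critical region:

```c
/* Worksharing replaced by precalculated chunks.  Sketch, hypothetical
   names; the fallback stubs make the code correct without OpenMP. */
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

double sweep_manual(int n, int m, const double *src, double *dst,
                    const double *f, double ax, double ay, double b)
{
    double error = 0.0;
    #pragma omp parallel
    {
        /* chunk computed once; reusable for every loop with this range */
        int nt   = omp_get_num_threads(), tid = omp_get_thread_num();
        int rows = m - 2;                       /* interior rows 1..m-2 */
        int jlo  = 1 + (rows * tid) / nt;
        int jhi  = 1 + (rows * (tid + 1)) / nt;

        double local = 0.0;                     /* private partial sum */
        for (int j = jlo; j < jhi; j++)         /* plain loop, no "omp for" */
            for (int i = 1; i < n - 1; i++) {
                double resid = (ax * (src[j*n+i-1] + src[j*n+i+1])
                              + ay * (src[(j-1)*n+i] + src[(j+1)*n+i])
                              + b * src[j*n+i] - f[j*n+i]) / b;
                dst[j * n + i] = src[j * n + i] - resid;
                local += resid * resid;
            }

        #pragma omp critical
        error += local;           /* replaces the reduction clause */
        #pragma omp barrier       /* replaces the implicit end-do barrier */
    }
    return error;
}
```

Because the chunk bounds are loop-invariant, the compiler no longer has to outline the loop body or recompute the chunking in every iteration.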

### Source Files and Makefiles

| Fortran90 versions | |
|---|---|
| Makefile_f | Makefile for the Fortran90 versions; you may want to adapt the settings to your local programming environment. Type “gmake” to print a help text |
| driver_ser.f90 | Driver routines for serial versions |
| driver_auto.f90 | Driver routines for autoparallelization |
| driver_omp.f90 | Driver routines for OpenMP versions |
| driver_mpi.f90 | Driver routines for MPI versions |
| driver_hyb.f90 | Driver routines for hybrid (OpenMP+MPI) versions |
| jacobi_ser_1.f90 | Jacobi solver - serial version/autoparallelizable version |
| jacobi_ser_2.f90 | Jacobi solver - serial version/autoparallelizable version, without copying arrays |
| jacobi_ser_3.f90 | Jacobi solver - serial version/autoparallelizable version, without copying arrays, using a 3D array uh(.,.,0:1), and thus losing performance |
| jacobi_omp_1.f90 | Jacobi solver - OpenMP version, two parallel regions with one parallel loop each; the naive approach, basically equivalent to what autoparallelizing compilers will currently do |
| jacobi_omp_2.f90 | Jacobi solver - OpenMP version, original OpenMP version from www.openmp.org, two parallel loops in one parallel region (PR) |
| jacobi_omp_3.f90 | Jacobi solver - OpenMP version, one PR outside the iteration loop, four barriers |
| jacobi_omp_4.f90 | Jacobi solver - OpenMP version, one PR outside the iteration loop, three barriers |
| jacobi_omp_5.f90 | Jacobi solver - OpenMP version, like jacobi_omp_4.f90, precalculation of the loop limits |
| jacobi_omp_6.f90 | Jacobi solver - OpenMP version, like jacobi_omp_5.f90, trying to optimize the reduction |
| jacobi_omp_8.f90 | Jacobi solver - OpenMP version, one PR outside the iteration loop, twofold unrolling, software pipelining, two barriers per iteration |
| jacobi_omp_9.f90 | Jacobi solver - OpenMP version, one PR outside the iteration loop, no copying, fourfold unrolling, software pipelining, one barrier per iteration |
| jacobi_omp_10.f90 | Jacobi solver - OpenMP version, one PR like V9, with errorh(0:2) to reduce code length |
| jacobi_omp_11.f90 | Jacobi solver - OpenMP version, one PR like V9, with errorh(0:2) and uh(.,.,0:1) to reduce code length - bad performance |
| jacobi_mpi_1.f90 | Jacobi solver - MPI version, domain decomposition in one dimension |
| jacobi_mpi_2.f90 | Jacobi solver - MPI version, domain decomposition in one dimension, simultaneous transfer and computation |
| jacobi_hyb_1.f90 | Jacobi solver - hybrid (OpenMP+MPI) version |
| jacobi_omp_3err.f90 | Jacobi solver - OpenMP version containing a data race, to be detected with Assure or ThreadChecker |
| jacobi_omp_4err.f90 | Jacobi solver - OpenMP version containing a data race, to be detected with Assure or ThreadChecker |
| gethrtime.f90 | Timing routine |
| realtime.f90 | Timing routine |

| C versions | |
|---|---|
| Makefile_c | Makefile for the C versions; you may want to adapt the settings to your local programming environment. Type “gmake” to print a help text |
| driver.c | Driver routine for serial and OpenMP versions |
| jacobi.c | Jacobi solver - serial version |
| jacobi_99.c | Jacobi solver - serial version written in C99 |
| jacobi_omp1.c | Jacobi solver - OpenMP version, two parallel regions with one parallel loop each; the naive approach |
| jacobi_omp2.c | Jacobi solver - OpenMP version, two parallel loops in one parallel region (PR) |
| jacobi_omp3.c | Jacobi solver - OpenMP version, one PR outside the iteration loop, four barriers |
| realtime.c | Timing routine |

| Input datasets | |
|---|---|
| input.small | small dataset - for debugging purposes |
| input.medium | medium dataset - in L2 cache (about 1 MB working set) |
| input.large | large dataset - out of L2 cache (about 600 MB working set) |

| Archives | |
|---|---|
| jacobi.shar | all source files, Makefiles and input datasets in a self-extracting shell archive |
| jacobi.tar | all source files, Makefiles and input datasets in a tar archive |

### References

- [1] The OpenMP Architecture Review Board (ARB), http://www.openmp.org
- [2] Dieter an Mey, Thomas Haarmann, Wolfgang Koschel, “Pushing Loop-Level Parallelization to the Limit”, EWOMP ’02, http://www.caspur.it/ewomp02
- [3] Dieter an Mey, “Two OpenMP Programming Patterns”, EWOMP ’03, http://www.rz.rwth-aachen.de/ewomp03