Sparse Matrix Vector Multiplication
Many of the PDE solvers running on our machines use a CG-type linear equation solver at the heart of their computations. The most time-consuming part of such a solver is frequently a sparse matrix-vector multiplication (sMxV), which accesses main memory through index lists.
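Such a kernel typically stores the matrix in a compressed row format and gathers the input vector through a column-index array. A minimal sketch of this kind of indexed-access loop, assuming CRS (compressed row storage) layout; the names are illustrative and not taken from the DROPS sources:

```c
/* Hypothetical CRS sparse matrix-vector product y = A*x.
 * rowptr has nrows+1 entries; nonzeros of row i live in
 * positions rowptr[i] .. rowptr[i+1]-1 of colind/val. */
void spmv_crs(int nrows,
              const int *rowptr,   /* row offsets into colind/val */
              const int *colind,   /* column index per nonzero    */
              const double *val,   /* nonzero values              */
              const double *x,     /* input vector                */
              double *y)           /* output vector               */
{
    /* Rows are independent, so OpenMP can split them across threads. */
    #pragma omp parallel for
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colind[k]];   /* indexed gather from x */
        y[i] = sum;
    }
}
```

The indirect access `x[colind[k]]` is what makes this kernel memory-bound: it defeats hardware prefetching and streams the value and index arrays straight from main memory once the matrix no longer fits in cache.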
We extracted the sMxV kernel from the DROPS FEM Navier-Stokes solver, which has been parallelized with OpenMP, and took two stiffness matrices as test cases: the small one has a memory footprint of 18 MB, the larger one 317 MB.
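The quoted footprints follow directly from the matrix dimensions and nonzero count. A generic estimate, assuming 8-byte double values and 4-byte integer indices in a CRS layout (this formula is a common rule of thumb, not taken from the DROPS code):

```c
#include <stddef.h>

/* Rough CRS memory footprint in bytes, assuming 8-byte doubles and
 * 4-byte int indices. Illustrative helper, not the DROPS accounting. */
size_t crs_footprint(size_t nrows, size_t nnz)
{
    return nnz * (sizeof(double) + sizeof(int))   /* values + column indices */
         + (nrows + 1) * sizeof(int);             /* row pointer array       */
}
```

Each stored nonzero thus costs about 12 bytes, so the 317 MB matrix holds on the order of 25-30 million nonzeros, far beyond any cache on these machines.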
The program models two-phase flows by so-called level set functions in order to describe the interface between the two phases. The picture shows a silicon oil drop in D2O.
The first graph depicts the performance of the sMxV kernel for the small matrix on the different machines, measured in MFlop/s over the number of OpenMP threads.
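For a sparse matrix-vector product, each stored nonzero contributes one multiply and one add, so the flop rate is conventionally derived from 2*nnz operations per product. A small sketch of that bookkeeping (the helper name is made up; this is the usual convention, not necessarily the exact counting used for the graphs):

```c
#include <stddef.h>

/* MFlop/s for repeated CRS matrix-vector products: by convention each
 * stored nonzero costs one multiply plus one add, i.e. 2*nnz flops
 * per product. Illustrative helper, not part of the benchmark code. */
double spmv_mflops(size_t nnz, int repetitions, double seconds)
{
    double flops = 2.0 * (double)nnz * (double)repetitions;
    return flops / seconds / 1.0e6;
}
```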
The small, cache-friendly memory footprint leads to a rather high absolute performance of well over 2 GFlop/s on the DPE1905W with 4 threads, on the SFV40z with 8 threads, and on the SFE2900 with 16 threads. The ST5x20 scales well up to 32 threads but reaches only about 1.345 GFlop/s, because all threads have to share a single 4 MB L2 cache.
The next graph depicts the performance of the sMxV kernel for the large matrix, which is the more important case for production runs.
Here the caches of all machines are too small to hold a reasonable portion of the matrix, and memory bandwidth becomes the limiting factor. Now the SFV40z outperforms the DPE1905W thanks to its superior memory bandwidth and because the code takes care of proper data placement. Both machines clearly outperform the SFE2900, even when it employs more threads.
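On ccNUMA machines such as the SFV40z, "proper placement" usually relies on the operating system's first-touch page policy: a page is allocated on the memory of the node whose thread first writes it. Initializing the arrays with the same parallel loop schedule as the compute kernel therefore keeps each thread's data local. A generic sketch of this technique, assuming a first-touch policy (names are illustrative; this is not the literal DROPS code):

```c
#include <stdlib.h>

/* First-touch placement sketch for ccNUMA systems: malloc only reserves
 * virtual pages, so letting each OpenMP thread initialize the rows it
 * will later process (same static schedule as the sMxV loop) maps those
 * pages into that thread's local memory. Generic technique sketch. */
double *alloc_and_place(int nrows)
{
    double *y = malloc((size_t)nrows * sizeof *y);
    if (!y) return NULL;
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; i++)
        y[i] = 0.0;   /* first touch places the page near this thread */
    return y;
}
```

Without such an initialization, all pages end up on the node of the master thread and every other thread pulls its data across the interconnect, which throttles a bandwidth-bound kernel like sMxV.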
Here the single UltraSPARC T2 chip clearly wins. The program scales extremely well up to 64 threads, reaching 2.56 GFlop/s. Surprisingly, when overloading the machine with up to 112 threads, the performance increases even further, to 3.216 GFlop/s.