Throughput Performance
Our current compute cluster consists of a heterogeneous collection of machines that add up to roughly 5 TFlop/s of compute power, spread over about 100 nodes with about 2000 cores. We run hundreds or even thousands of batch jobs per day. On average, each of these jobs uses about 8 processor cores. Some employ message passing with MPI, others are parallelized with OpenMP and need shared memory, some are still serial, and some are hybrid jobs that combine both paradigms (MPI+OpenMP).
As a consequence, we are interested in optimizing the throughput for a wide variety of parallel codes. But how can we measure and compare the throughput of a machine? Our ultimate goal is to develop a framework that automatically runs a job mix reflecting our actual job load.
Here we take single (parallelized) applications and run them n times simultaneously. We vary the number of instances n to find the value that maximizes the number of runs completed per unit of time.
Throughput = #instances / average runtime
The absolute throughput number is not very meaningful. It just serves to compare different machines.
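The measurement procedure and the throughput formula can be sketched in Python as follows. This is a minimal illustration, not the harness used for the measurements reported here: the function names run_instances, throughput and best_instance_count are hypothetical, the command to run is a placeholder supplied by the caller, and runtimes are assumed to be wall-clock seconds.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def timed_run(cmd):
    """Run one instance of cmd and return its wall-clock runtime in seconds."""
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - t0


def run_instances(cmd, n):
    """Start n instances of cmd simultaneously and return their runtimes."""
    # One helper thread per instance, so that all n processes run concurrently.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(timed_run, [cmd] * n))


def throughput(runtimes):
    """Throughput = #instances / average runtime."""
    average_runtime = sum(runtimes) / len(runtimes)
    return len(runtimes) / average_runtime


def best_instance_count(measurements):
    """Given {n: [runtimes of the n instances]}, return the n with the highest throughput."""
    return max(measurements, key=lambda n: throughput(measurements[n]))
```

For example, best_instance_count({1: [10.0], 2: [12.0, 12.0]}) selects n = 2, because 2/12 is larger than 1/10. The unit of the result depends only on the unit chosen for the runtimes, which is one reason why the absolute number matters less than the comparison between machines.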
The following graphs compare the results of throughput measurements on the Sun Fire E6900 server (with 48 cores) and the UltraSPARC T2 based ST5x20 beta machine.
The first graph shows FIRE, an image recognition application that has been parallelized with OpenMP on two levels. We ran this code with 16 threads (4 on each level) per instance. Both machines perform best with 4 instances running simultaneously. The application is cache friendly, and thus the SFE6900 outperforms the ST51x20 by a factor of almost 3.
The second graph shows that the ST51x20 outperforms the SFE6900 on the Navier-Stokes solver FLOWER, which was run with 4 MPI processes and 4 threads per process for each instance. The ST51x20 still performs very well with 8 instances, although the machine is then already overloaded with 8*4*4 = 128 threads.
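The thread count quoted above is the product of instances, MPI processes per instance and threads per process. The sketch below reproduces this arithmetic. The figure of 64 hardware threads for the UltraSPARC T2 (8 cores with 8 hardware threads each) is an assumption taken from the processor's published specification rather than from this text, and the function names are only illustrative.

```python
def total_threads(instances, mpi_procs, threads_per_proc):
    """Number of software threads started by a hybrid MPI+OpenMP job mix."""
    return instances * mpi_procs * threads_per_proc


def load_factor(instances, mpi_procs, threads_per_proc, hw_threads):
    """Software threads per hardware thread; values above 1 mean oversubscription."""
    return total_threads(instances, mpi_procs, threads_per_proc) / hw_threads


# Assumption (not stated in the text): UltraSPARC T2 = 8 cores x 8 hardware threads.
T2_HW_THREADS = 8 * 8

# 8 FLOWER instances, each with 4 MPI processes and 4 threads per process.
print(total_threads(8, 4, 4))
print(load_factor(8, 4, 4, T2_HW_THREADS))
```

Under this assumption, 8 instances place two software threads on every hardware thread of the T2.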
The bevel gear simulation is again very cache friendly, and thus the SFE6900 clearly wins by a factor of over 2.
The Navier-Stokes solver TFS consumes considerable memory bandwidth. The hybrid benchmark version with 4 MPI processes and 4 threads per process shows that the SFE6900 and the ST51x20 perform similarly. We have not yet investigated the variation in the SFE6900 measurements.
Another version of the TFS code is parallelized with OpenMP only. When run with 8 threads, the ST51x20 slightly outperforms the SFE6900.
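The following table collects some of the above results and adds performance figures for the Opteron and Clovertown based machines. Each entry gives the throughput, followed in parentheses by the number of simultaneously running instances. As the Opteron and Clovertown machines have only 8 cores, it is not meaningful to start more than a single instance of the two applications that run with 8 threads, and the other three applications, which use 16 threads per instance, do not fit at all in that configuration.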
Application        SFE6900        Sun T5x20 beta  SFV40z         DPE19050C
                   24 x 2 cores   1 x 8 cores     4 x 2 cores    2 x 4 cores
TFS (omp8)         4.7 (7x)       5.1 (10x)       1.9 (1x)       ~2.5 (1x)
BevelGears (omp8)  6.8 (8x)       3.16 (11x)      5.0 (1x)       6.5 (1x)
FIRE (omp4x4)      1.06 (4x)      0.43 (3x)
FLOWER (hyb4x4)    0.53 (2/8x)    0.72 (8x)
TFS (hyb4x4)       5.24 (2x)      4.47 (6x)       2.91 (1x)*     2.5 (1x)**
*) with 4 MPI processes and 2 threads each only.
**) with 4 MPI processes and 1 thread each only on the Woodcrest based DPE1950W
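To compare the two large machines directly, one can divide the table entries pairwise. The short sketch below does this for the SFE6900 and the T5x20 columns. The values are copied from the table, while the dictionary layout and the function name are only illustrative.

```python
# Throughput values copied from the table: (SFE6900, Sun T5x20 beta).
RESULTS = {
    "TFS (omp8)": (4.7, 5.1),
    "BevelGears (omp8)": (6.8, 3.16),
    "FIRE (omp4x4)": (1.06, 0.43),
    "FLOWER (hyb4x4)": (0.53, 0.72),
    "TFS (hyb4x4)": (5.24, 4.47),
}


def sfe_to_t5_ratio(results):
    """Return the SFE6900 throughput divided by the T5x20 throughput per application."""
    return {app: sfe / t5 for app, (sfe, t5) in results.items()}


for app, ratio in sfe_to_t5_ratio(RESULTS).items():
    # A ratio above 1 means the SFE6900 achieves the higher throughput.
    winner = "SFE6900" if ratio > 1 else "T5x20"
    print(f"{app:18s} {ratio:5.2f}  ({winner} ahead)")
```

Ratios above 1 favor the SFE6900 and ratios below 1 favor the T5x20. Because each entry was obtained at that machine's own best instance count, the ratio compares best-case throughput, not runs with identical settings.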