Taking a Look at Memory Performance
A critical performance criterion for many HPC applications is memory bandwidth. We used a very simple program kernel and ran it with multiple OpenMP threads, each working on its own private data.
Explicit processor binding is employed to control the placement of the threads (Linux: the taskset command; Solaris: the SUNW_MP_PROCBIND environment variable). The memory footprint is chosen large enough not to fit into the machines' caches.
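The binding setup can be sketched as follows (the program name membench and the processor IDs are placeholders, not taken from the original measurements):

```shell
# Linux: restrict the process (and its OpenMP threads) to processors 0 and 2
taskset -c 0,2 ./membench

# Solaris: let the OpenMP runtime bind the threads to processors 0 and 2
SUNW_MP_PROCBIND="0 2" ./membench
```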
long long *x, *xstart, *xend, mask;

for (x = xstart; x < xend; x++)   /* read, modify, and write back each element */
    *x ^= mask;
The following table contains the results of our memory bandwidth measurements in gigabytes per second (GB/s).
3.100 - 3.137
4.335 - 8.210
4.737 - 7.890
3.998 - 6.871
2.395 - 2.470
4.333 - 5.295
3.977 - 16.020
4.660 - 8.009
4.152 - 4.935
6.056 - 6.402
3.98* - 18.470
7.389 - 9.688
14.587 - 15.584
13.365 - 13.486
*) This value is estimated.
In many cases, the obtainable bandwidth depends heavily on the placement of threads and data.
This is most critical on the SFV40z because of its ccNUMA architecture: if data is not located close to the thread accessing it, the HyperTransport links can easily become a severe bottleneck, a fact that takes many OpenMP programmers by surprise.
In contrast, the SFE2900 has a flat memory architecture, so the bandwidth hardly depends on data placement; the total bandwidth, however, is limited by the snooping mechanism.
The DPE1950W and DPE1950C are both sensitive to the placement of the threads: if threads share paths to memory, the bandwidth suffers noticeably.
On the ST5x20, thread placement matters very little. The bandwidth is best with 16 threads and falls off only slightly when running up to 64 threads.
The single-thread memory bandwidth is best on the DPE1950W (4.583 GB/s) and almost as high on the SFV40z and DPE1950C (almost 4 GB/s). The SFE2900 and ST5x20 reach only 1.609 and 1.237 GB/s, respectively.
With 15.584 GB/s, the per-chip bandwidth is by far the highest on the UltraSPARC T2, compared to 3.1 GB/s on the UltraSPARC IV and 3.998-4.737 GB/s on the other processors in this comparison.
Considering the total bandwidth of the machines, only the four Opteron processors together (18.47 GB/s) outperform a single UltraSPARC T2 chip, whereas all the other machines are limited to 8.009 - 8.710 GB/s in this test.
This is only a simple toy application, but there is an important lesson to learn: the UltraSPARC T2 processor offers amazing memory bandwidth if multiple threads can be employed. When parallelizing with OpenMP, the placement of threads and data is not critical on this machine, and Solaris already does a superb job in this respect, whereas Linux on the Xeon- and Opteron-based systems requires user attention.