Comparing performance versus power consumption on four different compute servers
Rising clock frequencies have led to a sharp increase in the power consumption of compute servers over the last years and have been stretching the limits of the power supply and cooling equipment of many a large computing center. As a consequence, the growth of processor clock frequencies has suddenly come to an end, and in some cases manufacturers decided to take a step back (see for example IBM's Blue Gene computer). The architects of the UltraSPARC T2 made a similar choice.
We measured the power consumption of four different compute servers in our center:
- A one-rack Sun Fire E6900 system equipped with 96 GB of memory and 24 dual-core UltraSPARC IV processors (130 nm technology) running at 1.2 GHz, installed in 2001 and upgraded in 2004 (SFE6900),
- A 3-rack-unit Sun Fire V40z system equipped with 16 GB of memory and 4 dual-core Opteron 875 processors (90 nm technology) running at 1.2 GHz, installed in 2005 (SFV40z),
- A 1-rack-unit Dell PowerEdge 1950 equipped with 16 GB of memory and 2 quad-core Xeon 5355 processors (65 nm technology) running at 2.66 GHz, installed in 2007 (DPE1950C), and
- A 2-rack-unit alpha version of the upcoming Niagara 2 based Sun SPARC Enterprise T5x20 server, which has 32 GB of memory and one 8-core processor (65 nm technology) running at 1.4 GHz (ST5x20). Each core is able to execute 8 threads at a time, turning this chip into a 64-way SMP from the user's perspective.
While measuring the power input inductively, we ran a few compute- and/or memory-intensive kernel programs (stream, linpack) to find out to what extent the load has an impact on power consumption. The effect of memory use on power consumption was negligible for all four systems, whereas the CPU load had quite an impact on all systems except the older Sun Fire E6900 machine. The newer systems consume considerably less power when idling, whereas the Sun Fire E6900 only shows a difference of some 5 percent between idle and full load.
The table below summarizes our findings.
[Table: power consumption when idling, at 50% CPU load, and at 100% CPU load (all in W), together with the percentage of maximal power consumption at 50% CPU load and when idling (in %), for the SFE6900 (1 full cabinet, 24 x US IV), the SFV40z (4 x Opteron 875), the DPE1950C (2 x Xeon 5355 quad-core, 2.66 GHz), and the ST5x20 (1 x US T2).]
Now it is of course interesting to relate the power consumption to the performance of these systems. As they differ quite a bit in their performance characteristics, we considered two extreme cases which are of interest for HPC: the stream benchmark measures the memory bandwidth, which is very critical for many PDE solvers at our center. The linpack benchmark, on the other hand, comes close to the peak floating-point performance of the processors, as it can be nicely tuned to profit from cache locality (here we employed the multi-threaded Sun Performance Library and the Intel Math Kernel Library, respectively).
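To illustrate what the stream benchmark measures, a triad-style bandwidth estimate can be sketched in a few lines of Python with NumPy. This is only a simplified stand-in for the actual stream benchmark, and the array size and repeat count are arbitrary choices:

```python
import time
import numpy as np

def triad_bandwidth(n=20_000_000, repeats=5):
    """Estimate memory bandwidth with a stream-style triad: a = b + s * c.

    The byte count below follows the stream convention of counting the
    three logical arrays (read b, read c, write a); NumPy's two-pass
    evaluation actually moves somewhat more data, so this understates
    the hardware bandwidth a little.
    """
    b = np.random.rand(n)
    c = np.random.rand(n)
    a = np.empty_like(b)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.multiply(c, 3.0, out=a)  # a = 3.0 * c
        np.add(a, b, out=a)         # a = b + 3.0 * c
        best = min(best, time.perf_counter() - t0)
    bytes_moved = 3 * n * 8         # 3 arrays of 8-byte doubles
    return bytes_moved / best / 1e9  # GB/s

print(f"triad bandwidth: {triad_bandwidth():.1f} GB/s")
```

As with the real stream benchmark, the best of several repetitions is reported to filter out one-off interference from other processes.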
Please note that the following measurements are still very preliminary!
The table contains our best efforts so far, obtained with a decent-sized memory footprint.
These results are not surprising. The DPE1950C with its two Intel Clovertown processors outperforms the others for cache-friendly, floating-point-intensive applications because of its high clock frequency and four floating-point results per cycle. Sun's UltraSPARC T2 outperforms the others for memory-bandwidth-bound applications thanks to its four on-chip memory controllers. The Opteron-based SFV40z ranks second with respect to both criteria.
Of course the newer machines clearly outperform the older Sun Fire E6900 with respect to performance per watt. For real application codes the truth will be somewhere in between, depending on their characteristics. So for some codes the Opteron might be a good choice, if the ccNUMA characteristics are properly taken into account.
Now let us take a look at performance per power consumption for some application codes.
On the one hand we have the TFS Navier-Stokes solver, which is quite memory-hungry, and on the other hand the program FVA346 for the simulation of bevel gears, which is very cache-friendly.
The metric which we use is the number of (test) program runs per kWh. (Production runs typically take hours, days or weeks ...)
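This metric follows directly from a run's wall-clock time and the machine's average power draw during the run. A minimal sketch, with made-up example numbers rather than our measurements:

```python
def runs_per_kwh(runtime_seconds: float, avg_power_watts: float) -> float:
    """Number of program runs that one kWh of energy pays for."""
    # 1 kWh = 3.6e6 watt-seconds (joules)
    energy_kwh = avg_power_watts * runtime_seconds / 3.6e6
    return 1.0 / energy_kwh

# Hypothetical example: a 10-minute test run on a machine drawing 500 W
# uses 500 * 600 / 3.6e6 = 1/12 kWh, i.e. about 12 runs per kWh.
print(runs_per_kwh(600, 500))
```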
[Table: program runs per kWh for TFS and FVA346 on the SFE6900 (24 x 2 cores), the ST5x20 (1 x 8 cores), the SFV40z (4 x 2 cores), and the DPE1950C (2 x 4 cores).]
These numbers display the same trend: the UltraSPARC T2 clearly performs best for TFS, and the Clovertown-based machine clearly performs best for FVA346.