The UltraSPARC T1 ("Niagara") based Sun Fire T2000 Server


Sun Microsystems UltraSPARC T1 Processor

  • Processor
UltraSPARC T1
  • Architecture
SPARC V9
  • Address space
48-bit virtual, 40-bit physical
  • Cores
(up to) 8 cores running 4 threads each
  • Pipelines

8 integer units with 6 stages,
4 threads running on a single core share one pipeline

  • Clock speed
1.0 GHz (or 1.2 GHz)
  • L1 Cache (per core)

16 KByte instruction cache,
8 KByte data cache (4-way set-associative)

  • L2 Cache
3 MByte on chip
12-way associative, 4 banks
  • Memory Controller
four 144-bit DDR2-533 SDRAM interfaces
4 DIMMs per controller - 16 DIMMs total
Optional: 2-channel operation mode
  • JBUS Interface
3.1 GByte/sec bandwidth (peak)
128 bit address/data bus
150 - 200 MHz
  • Technology
CMOS, 90 nm, 9-layer Cu metal
  • Power Consumption
72 Watt


Since November 2005, Sun Microsystems has been offering the new UltraSPARC T1 processor, codenamed "Niagara". The processor has 8 parallel cores, each of which is 4-way multi-threaded, i.e. each core can run 4 processes quasi-simultaneously. Each core has an integer pipeline (6 stages) which is shared between the 4 threads of the core in order to hide memory latency.

Since there is only one floating point unit (FPU) shared by all cores, the UltraSPARC T1 processor is suited for programs with few or no floating point operations, such as web servers or databases. Nevertheless, the UltraSPARC T1 is binary compatible with the UltraSPARC IV CPU.

Each of the 8 processor cores has its own level 1 instruction and data caches. All cores have access to a common level 2 cache of 3 MB and to the common main memory. Thus the processor behaves like a UMA (uniform memory architecture) system.


The Sun Fire T2000 Server

Since December 2005, a Sun Fire T2000 server with a 1 GHz UltraSPARC T1 "Niagara" processor from Sun Microsystems has been installed at the RWTH Aachen University Center for Computing and Communication (CCC). The system has 8 GByte of main memory and is running Solaris 10.

After the upgrade of all large Sun Fire servers of the Center with dual-core UltraSPARC IV chips in late 2004, and after the recent purchase of 4 new dual-core Opteron based V40z servers, the installation of the UltraSPARC T1 based Sun Fire T2000 system marks an important step towards the adoption of new types of microprocessors with chip multi-processing (CMP) and chip multi-threading (CMT) technologies, which will most likely dominate the market of HPC systems in the future.


An "integer-only" Machine in a Center of Excellence for Engineering Sciences and Computational Fluid Dynamics?

Why install a machine which is only capable of delivering some 100 MFlop/s in a compute environment dominated by technical applications?

At first sight, this does not seem to fit well. But we want to be prepared for future technologies. Future multi-threading processors will surely be capable of executing floating point operations at the same rate as the Niagara processor executes integer operations today. So we want to look at the question of how to use this kind of architecture properly. Will this kind of machine be able to suit our needs in the future?

The stagnation of single-processor performance growth is very bad news for the HPTC community. As a consequence, parallelization is becoming even more important. For many engineering and scientific applications parallelization is not at all trivial, and unfortunately in our environment we do not see many codes which are "embarrassingly parallel". Therefore we think that in the future parallelization has to happen on multiple levels. Hybrid parallelization using MPI plus OpenMP or autoparallelization, as well as nested parallelization with MPI and with OpenMP, will be needed to keep even more processors busy and to cut down the turnaround time of large simulation jobs.

So we want to investigate to what extent the well-known techniques of MPI and OpenMP programming work on this brand new chip. Therefore we are looking at several benchmarks and applications which are not dominated by floating point operations.


First Experiences using the Niagara Processor - A Word of Caution!

Of course we are very curious about first performance results using this brand new processor architecture. And most likely others are curious to see our first benchmark results, too. But please handle these results with care. Performance results can only be as good as the people who run the experiments, and we are new to this system. They can only be as good as the compiler's support for the architecture, and Sun Studio 11 is the first compiler version to support the Niagara processor. Several people were working on the systems at the same time; we try not to interfere with each other, but you never know. Also, time is always short - we might have overlooked something.

So take all these numbers as preliminary!

We have been using the compiler switches -fast [-xtarget=ultraT1] [-g] [-xarch=v9b] throughout the tests unless otherwise noted.

The picture shows the machines which have been used for many of the comparisons below. The Sun Fire T2000 is the 2U silver box on top of two blue Sun Fire E2900 boxes below.

Indeed, after receiving a hint from Sun Microsystems to change a system parameter, we found out that this parameter can heavily impact the performance of the Sun Fire T2000. So we basically have to repeat all our measurements! We are also awaiting another patch for the system software ...
So far we have added set consistent_coloring=2 to the /etc/system file.
Have a look at the altered performance curve of "John the Ripper" below.


A Look at the Memory Performance of the Sun Fire T2000

A tiny serial program measures the memory latency using pointer chasing, i.e. by a long sequence of identical instructions:

p = (char **)*p;

A look into the disassembly reveals that in fact one load instruction follows another:

... 
ld [%g5], %g2
ld [%g2], %g1
ld [%g1], %o7
ld [%o7], %o5
ld [%o5], %o4 ...

The content of the memory location just read from memory provides the address for the following load instruction. Thus these load instructions cannot be overlapped. The measured memory latency is 107 ns.
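Putting these pieces together, a complete latency measurement might look like the following minimal sketch. This is not the exact benchmark code used here: gethrtime() is Solaris-specific, the chain is built as a random cycle over the whole buffer, and the chase is done in a loop rather than by a long sequence of identical statements.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>                       /* gethrtime() on Solaris */

#define N (16*1024*1024/sizeof(char *))     /* 16 MB footprint, well above the 3 MB L2 */
#define STEPS (20*1000*1000)

int main(void)
{
    char **a = malloc(N * sizeof(char *));
    char **p, *t;
    size_t i, j;
    hrtime_t t0, t1;

    /* build one random cycle over all elements (Sattolo's algorithm), so that
       every load misses the caches and depends on the previous load */
    for (i = 0; i < N; i++) a[i] = (char *)&a[i];
    srand48(42);
    for (i = N - 1; i > 0; i--) {
        j = (size_t)(lrand48() % i);
        t = a[i]; a[i] = a[j]; a[j] = t;
    }

    p = a;                                  /* start of the chase */
    t0 = gethrtime();
    for (i = 0; i < STEPS; i++)
        p = (char **)*p;                    /* the dependent load from the text */
    t1 = gethrtime();

    printf("average load latency: %.1f ns (%p)\n",
           (double)(t1 - t0) / STEPS, (void *)p);
    return 0;
}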

Now how about memory bandwidth? The code segment which is timed is

long *x, *xstart, *xend, mask; 
...
for ( x = xstart; x < xend; x++ ) *x ^= mask;

So each loop iteration involves one load and one store of a variable of type long. The memory footprint of this loop is always much larger than the level 2 cache, so each load and store operation goes to main memory. The measured memory bandwidth depends on the size of the long type, which is 4 bytes when compiled for 32-bit addressing and 8 bytes when compiled for 64-bit addressing: 463 MB/s and 873 MB/s, respectively.
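A complete version of this bandwidth measurement could look like the following sketch; again this is not the exact code used, gethrtime() is Solaris-specific, and the 32 MB buffer size is an arbitrary choice well above the 3 MB L2 cache.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>                       /* gethrtime() on Solaris */

#define N (32*1024*1024/sizeof(long))       /* 32 MB footprint in both 32- and 64-bit mode */

int main(void)
{
    long *xstart = malloc(N * sizeof(long));
    long *xend   = xstart + N;
    long *x, mask = 0x55555555L;
    hrtime_t t0, t1;
    double mbytes;

    for (x = xstart; x < xend; x++) *x = 0;       /* touch all pages once */

    t0 = gethrtime();
    for (x = xstart; x < xend; x++) *x ^= mask;   /* one load and one store per iteration */
    t1 = gethrtime();

    mbytes = 2.0 * N * sizeof(long) / (1024.0 * 1024.0);
    printf("memory bandwidth: %.0f MB/s\n", mbytes / ((double)(t1 - t0) * 1.0e-9));
    return 0;
}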

Now, the idea of the multi-threading architecture, as explained in a presentation given by Partha Tirumalai and Ruud van der Pas during the SunHPC colloquium in Aachen in October 2004, is to bridge the growing gap between processor and memory speed by overlapping the stall time of one thread waiting for data from memory with the activity of other threads running on the same hardware, thus leading to a much better utilization of the silicon.

So an obvious experiment is to run the same kernels measuring memory latency and bandwidth several times in parallel, in order to find out to what extent multiple threads running on the same processor, or even on the same core, interfere with each other. For this purpose we took the same tiny program kernels, parallelized them using MPI, and used explicit processor binding to carefully place the processes onto the processor cores (see the sketch below). The given numbers are for 64-bit addressing mode.
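The following sketch shows how such an explicit placement can be done with the Solaris processor_bind() call inside an MPI program. The round-robin mapping of ranks to virtual CPU ids, with four consecutive ids per core on the T1, is an assumption of this sketch, and the measurement kernel itself is omitted.

#include <stdio.h>
#include <mpi.h>
#include <sys/types.h>
#include <sys/processor.h>                  /* processor_bind(), Solaris-specific */
#include <sys/procset.h>                    /* P_LWPID, P_MYID */

int main(int argc, char **argv)
{
    int rank;
    processorid_t cpu;
    const int cores = 8, ids_per_core = 4;  /* T1: 32 virtual CPUs, 4 per core (assumed) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* spread the ranks over the cores first: ranks 0..7 land on different cores,
       ranks 8..15 occupy the second hardware thread of each core, and so on */
    cpu = (rank % cores) * ids_per_core + (rank / cores) % ids_per_core;
    if (processor_bind(P_LWPID, P_MYID, cpu, NULL) != 0)
        perror("processor_bind");

    /* ... run the private latency or bandwidth kernel of this process here ... */

    MPI_Finalize();
    return 0;
}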

For comparison we include measurement results for the Sun Fire E2900, which is equipped with 12 UltraSPARC IV processors running at 1200 MHz.

| # MPI processes | Niagara: # cores used | Niagara: # threads per core used | Niagara: latency [ns] | Niagara: bandwidth per process [MB/s] | Niagara: total bandwidth [MB/s] | SF E2900: latency [ns] | SF E2900: bandwidth per process [MB/s] | SF E2900: total bandwidth [MB/s] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1  | 1 | 1 | 107 | 863 | 863   | ~232 | ~1813 | ~1813 |
| 2  | 1 | 2 | 107 | 825 | 1650  | ~232 | ~1601 | ~3202 |
| 2  | 2 | 1 | 107 | 861 | 1722  |      |       |       |
| 4  | 1 | 4 | 108 | 705 | 2820  | ~249 | ~1200 | ~4800 |
| 4  | 2 | 2 | 108 | 820 | 3280  |      |       |       |
| 4  | 4 | 1 | 108 | 859 | 3436  |      |       |       |
| 8  | 2 | 4 | 110 | 698 | 5584  | ~250 | ~857  | ~6856 |
| 8  | 4 | 2 | 109 | 802 | 6416  |      |       |       |
| 8  | 8 | 1 | 109 | 847 | 6776  |      |       |       |
| 16 | 4 | 4 | 113 | 669 | 10706 | ~262 | ~446  | ~7136 |
| 16 | 8 | 2 | 113 | 426 | 6816  |      |       |       |
| 24 |   |   |     |     |       | ~314 | ~350  | ~8400 |
| 32 | 8 | 4 | 129 | 144 | 4608  |      |       |       |

measuring memory latency and bandwidth with a parallel kernel program

The experiment nicely shows that the memory performance scales quite well. The memory latency increases only up to 129 ns when running 32 processes. For up to 8 processes it is profitable to distribute them across all eight cores instead of filling some of the cores with processes and leaving others empty. The only surprising exception is the 16-process case, where it seems to be more profitable to run 4 threads on each of 4 cores, leaving the other 4 cores empty.
The kernel program challenges the memory bandwidth considerably and reveals that in such a case the bandwidth might become a limiting factor for performance. The maximum total bandwidth which could be measured is about 12.4 times higher than the bandwidth available to a single process. It seems that the bandwidth is sufficiently scalable for up to eight processes, but may become a limiting factor for more processes in such extreme cases with no data locality.

 

The lat_mem_rd benchmark, which is part of the LMbench suite, can be used to look a bit closer into memory latency and the memory hierarchy. The pointer chasing mechanism works as described above, but the stride between successive memory accesses and the memory footprint are varied, as sketched below.
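The following small function sketches how such an access chain can be set up for a given footprint and stride; this only illustrates the idea and is not the LMbench source. The stride is assumed to be a multiple of sizeof(char *).

#include <stdlib.h>

/* build a closed pointer chain over a buffer of `footprint` bytes,
   touching one element every `stride` bytes */
char **make_chain(size_t footprint, size_t stride)
{
    char **buf  = malloc(footprint);
    size_t step = stride / sizeof(char *);
    size_t n    = footprint / sizeof(char *);
    size_t i, last = 0;

    for (i = 0; i + step < n; i += step) {
        buf[i] = (char *)&buf[i + step];
        last   = i + step;
    }
    buf[last] = (char *)&buf[0];            /* close the ring */
    return buf;                             /* chase it with p = (char **)*p; */
}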

 

The figure shows the memory performance as measured by the serial lat_mem_rd program for various strides

At first we ran the original serial version, varying the memory footprint between 1 KB and 8 MB and choosing strides of 8, 16, 32, 64 and 128 bytes. As long as the memory footprint is below 8 KB, all accesses can be satisfied from the L1 cache and the "memory" latency (the average latency of each load instruction) is 3 ns. When the memory footprint is between 8 KB and 3 MB, the accesses are all satisfied from the L2 cache and the "memory" latency is about 22 ns, with the only exception being a stride of 8 bytes. In that case every second load instruction hits the cache line which has previously been fetched into the L1 cache, as the L1 cache line is 16 bytes long. With a stride of 16 bytes or larger, each load instruction misses the L1 cache. If the memory footprint is larger than 3 MB, fetching a 64-byte cache line into the L2 cache takes 107 ns, and if the stride is less than 64 bytes, this cache line is reused, leading to lower average latencies.

This kernel program was also parallelized using MPI and run on the Niagara processor with a varying number of processes and with strides of 8, 64 and 8192 bytes. All MPI processes execute the same pointer chasing loop simultaneously, each of course on its private piece of memory. We plot the average latencies over all processes for each measurement.

The figure shows the memory performance as measured by the MPI version of the lat_mem_rd program for a stride of 8 bytes and various numbers of MPI processes.

This was measured after we changed the system parameters.

With a stride of 8 bytes there is of course a lot of cache line reuse. But the most striking information given by the above figure is that for a large memory footprint the latency does not really get worse when the number of MPI processes increases. The L2 cache, which is shared by all cores, leads to a shift of the slope when the number of processes running simultaneously increases, which is to be expected.

The latency for a small memory footprint rises for 16 or more threads. This might be a consequence of the sharing of the L1 cache between all threads of a single core. Further investigations would be necessary to understand this effect. This is also the case for larger strides, as can be seen below.

The figure shows the memory performance as measured by the MPI version of the lat_mem_rd program for a stride of 64 bytes and various numbers of MPI processes.

This was measured before we changed the system parameters - note the difference!

The figure shows the memory performance as measured by the MPI version of the lat_mem_rd program for a stride of 64 bytes and various numbers of MPI processes.

This was measured after we changed the system parameters - note the difference!

The same is true for a stride of 64 bytes, which leads to a cache miss for each load operation. The latency is well below 140 ns in all cases. Again the effect of sharing a common L2 cache is clearly visible.

 

The figure shows the memory performance as measured by the MPI version of the lat_mem_rd program for a stride of 8192 bytes and various numbers of MPI processes.

This was measured after we changed the system parameters, which did not make a big difference in this case.

A stride of 8 KB may lead to all kinds of nasty effects which need further investigation. It can be expected that TLB misses played an important role in slowing down the latency in this series of measurements.


The EPCC OpenMP Micro Benchmark

The EPCC OpenMP Micro Benchmark carefully measures the performance of all major OpenMP primitives. The first test focuses on the OpenMP directives and the second test takes a closer look at the performance of parallel loops using the various OpenMP schedule kinds. The Sun Fire T2000 shows a very similar behaviour to the Sun Fire E2900 system in all aspects, except that the T2000 is clearly faster, by a factor of roughly two! This correlates nicely with the difference in memory latency.
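As an illustration of how such a directive overhead is obtained, the simplified sketch below (not the EPCC source) compares the time per repetition of a delay loop executed inside a freshly opened parallel region with the time for the same per-thread work executed serially; the difference is attributed to the PARALLEL construct. The delay length and repetition counts are arbitrary choices.

#include <stdio.h>
#include <omp.h>

/* burn a little CPU time; `volatile` keeps the compiler from removing the loop */
void delay(int n)
{
    volatile double a = 0.0;
    int i;
    for (i = 0; i < n; i++) a += i;
}

int main(void)
{
    const int reps = 1000, innerreps = 100, d = 500;
    double t0, tref, tpar;
    int k, j;

    t0 = omp_get_wtime();                   /* reference: the delay loops alone */
    for (k = 0; k < reps; k++)
        for (j = 0; j < innerreps; j++) delay(d);
    tref = (omp_get_wtime() - t0) / reps;

    t0 = omp_get_wtime();                   /* the same per-thread work inside a parallel region */
    for (k = 0; k < reps; k++) {
        #pragma omp parallel private(j)
        {
            for (j = 0; j < innerreps; j++) delay(d);
        }
    }
    tpar = (omp_get_wtime() - t0) / reps;

    printf("PARALLEL overhead: %.3f microseconds per region\n", (tpar - tref) * 1.0e6);
    return 0;
}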

 


NAS Parallel Benchmark NPB 3.2 OpenMP - Integer Sort

The NAS Parallel Benchmark Suite is very well-known in the HPC community. It is available in serial, OpenMP and MPI versions and contains one program focusing on integer performance: is. We ran this program for 4 different test cases (W, A, B, C), which work on datasets of increasing size.

Comparing the performance of the Sun Fire T2000 system with the Sun Fire E2900 system, it can clearly be seen that the large caches of the UltraSPARC IV processors are profitable for the smaller test cases W, A, and B, whereas for the largest test case C a single Niagara processor running 16 threads outperforms 12 UltraSPARC IV processors.

This was measured after we changed the system parameters, which did not make a big difference in this case.


Integer Stream Benchmark

The OpenMP Stream Benchmark was modified to perform integer instead of floating point operations. Both a Fortran and a C++ version were used for the measurements; a sketch of such an integer kernel is shown below the results.

Overall the UltraSPARC T1 system scales well, in some cases even up to 32 threads.
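A minimal sketch of the kind of integer kernel involved is shown below; the actual benchmark measures the usual copy, scale, add and triad kernels in Fortran and C++, whereas this only illustrates a triad-style loop with OpenMP, with an arbitrary array size well above the L2 cache.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (20*1000*1000)                    /* array length, chosen well above the 3 MB L2 */

int main(void)
{
    long long *a = malloc(N * sizeof(long long));
    long long *b = malloc(N * sizeof(long long));
    long long *c = malloc(N * sizeof(long long));
    long long scalar = 3;
    double t0, t;
    int i;

    #pragma omp parallel for
    for (i = 0; i < N; i++) { a[i] = 0; b[i] = i; c[i] = 2 * i; }

    t0 = omp_get_wtime();
    #pragma omp parallel for                /* integer "triad": a = b + scalar*c */
    for (i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
    t = omp_get_wtime() - t0;

    printf("integer triad: %.0f MB/s\n",
           3.0 * N * sizeof(long long) / (1024.0 * 1024.0) / t);
    return 0;
}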


Parallel Partitioning of Graphs with ParMETIS

The ParMETIS program package is frequently used for partitioning unstructured graphs in order to minimize the communication of large-scale numerical simulations. We selected a test case (mtest) which is not dominated by floating point operations. It computes a k-way partitioning of a mesh containing elements of different types on p processors.

The measurements cannot be used to judge scalability, as the partitioning depends on the number of processors used. Here we compare the performance of a 24-way Sun Fire E2900 system, equipped with 12 dual-core UltraSPARC IV processors running at 1200 MHz, with the 1 GHz single-socket Sun Fire T2000 system:

| # MPI processes | UltraSPARC IV total MFlop/s | UltraSPARC IV seconds | UltraSPARC T1 seconds | factor UltraSPARC T1 : UltraSPARC IV |
| --- | --- | --- | --- | --- |
| 1  | 1  | 1.4 | 3.1 | 2.2 |
| 2  | 15 | 1.5 | 3.3 | 2.2 |
| 4  | 22 | 1.0 | 2.1 | 2.1 |
| 8  | 29 | 0.9 | 1.6 | 2.0 |
| 16 | 35 | 0.8 | 1.8 | 2.3 |
| 24 | 36 | 0.8 | 2.4 | 2.8 |
| 32 |    |     | 3.3 |     |

When using the same number of MPI processes, the UltraSPARC IV based machine solves these problems about two times faster than the Niagara-based system. Remember, we are comparing a 12-socket SMP machine with a single-socket system.


Password Cracking with "John the Ripper"

"John the Ripper" is a popular password crack program used by system administrators to search for weak user passwords.
Hardware Counter analysis reveals that it does not use many floating point instructions and thus it might be a good candidate for the Sun Fire T2000 as a parallel version is publicly available too.
We employed mpich2 to compile and run this parallel version, as the test mode of "John" uses the SIGALARM signal which is suppressed by Sun's MPI implementation (HPC ClusterTools V6).

The performance measure is checks per second, and so far we have only looked at traditional DES encryption.

The figure shows that the MPI version of John scales very well on the Sun Fire E2900, whereas on the Sun Fire T2000 it scales up to 8 MPI processes and then drops down again. This was measured before we changed the system parameters ...

Looking at the absolute performance, the Sun Fire E2900 clearly outperforms the Sun Fire T2000. This was measured before we changed the system parameters ...

 

This figure depicts the scalability of "John" on the Sun Fire T2000 after we changed the system parameters...

Why does "John the Ripper" not scale on the Niagara processor?
Now, if the US IV chip does not suffer from stalls because a program displays a very good data locality such that almost always data can be kept in the caches, than multiple US IV processors will of course outperform a single Niagara processor. And indeed ...



... hardware counter measurements reveal that instruction and data (level 1) cache misses badly hurt the Niagara processor. The number of data cache misses increases when more processes are running per core, as they have to share a common L1 cache of 8 KB, whereas a single core of the UltraSPARC IV has a data cache of 64 KB.
Plotted are the numbers of misses per MPI process for the whole test run.
Please note that the scales differ by an order of magnitude between the Sun Fire T2000 (left y-axis) and the Sun Fire E2900 (right y-axis)!

This was measured before we changed the system parameters...

The UltraSPARC IV (US IV) and the Niagara processor are both able to issue 8 instructions per cycle, the US IV being a dual-core 4-issue superscalar processor and the Niagara issuing 1 instruction per cycle on each of its eight cores.
Comparing the number of checks per processor chip, the 1 GHz Niagara chip even performs a little better than the 1.2 GHz US IV: 1283599 checks/sec versus 1187115 checks/sec.


What if the Niagara could count using floating point numbers ...

In many of the PDE solvers running on our machines, a CG-type linear equation solver lies at the heart of the computation. The most time consuming part of such a solver frequently is a sparse matrix vector multiplication, which accesses main memory using index lists. Now, how would a processor like the Niagara perform if it had one floating point unit per core, like it has integer units today? In order to find out, we simply changed the data type in one of our sparse matrix vector multiplication codes from double to long long int (a sketch of this kernel is shown after the results below), knowing that the results would of course not be very meaningful. Well, just an experiment. This is what we get:

Sparse matrix vector multiplication on the SF E2900 and the SF T2000 when using 64-bit floating point or integer numbers.
Whereas the Niagara only reaches about 100 MFlop/s with floating point numbers, it matches the speed of 12 UltraSPARC IV processors when using integers.
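For illustration, the kernel in question is essentially a compressed row storage (CRS) sparse matrix vector multiplication like the sketch below (not the exact code used); the integer experiment simply replaces the double value type by long long.

/* y = A*x for a sparse matrix in compressed row storage (CRS);
   val_t is double in the original code and long long for the integer experiment */
typedef long long val_t;

void spmv(int nrows, const int *rowptr, const int *colind,
          const val_t *val, const val_t *x, val_t *y)
{
    int i, j;
    for (i = 0; i < nrows; i++) {
        val_t sum = 0;
        for (j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += val[j] * x[colind[j]];   /* indexed access to x via the column list */
        y[i] = sum;
    }
}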

Summary

For many of the programs which we have looked at so far, and which do not execute many floating point operations, the single-thread performance of the UltraSPARC IV processor running at 1.2 GHz is about 2-3 times higher than the single-thread performance of the 1 GHz Niagara chip.
But the low memory latency, even for a high number of threads, and the high memory bandwidth lead to a good performance for multi-threaded programs. This is particularly impressive for the class C integer sort benchmark of the NPB OpenMP collection.
The ParMETIS results are also impressive, taking into account that the performance of a single Niagara chip is compared to up to 12 UltraSPARC IV chips.


 

 


You want to get Hands-on Experience with the Niagara Processor?

Users having accounts on the machines of the CCC can seamlessly use the Sun Fire T2000 system upon request. The system is fully compatible with the other UltraSPARC IV based systems. It is running Solaris 10 and the Sun Studio compilers. Nevertheless, we recommend that you load the latest Sun Studio 11 compiler into your environment by

module switch studio studio/11

and then recompile your application using the compiler flag

-xtarget=ultraT1

which will be expanded to

-xarch=v8plus -xcache=8/16/4/4:3072/64/12/32 -xchip=ultraT1

by the compiler. (Keep in mind that the rightmost compiler option dominates if you, for example, want to add the -fast or the -xarch=v9b flag!)
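For example, a possible compile line for a 64-bit build could look like this (myprog.c is just a placeholder); placing -xarch=v9b to the right of -xtarget=ultraT1 overrides the -xarch=v8plus setting implied by the target option:

cc -fast -xtarget=ultraT1 -xarch=v9b -o myprog myprog.c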

The Solaris 10 operating system treats the machine like a full-blown 32-way shared memory machine. Keep in mind that floating point intensive applications will run very slowly! The machine is not built for such applications!

 
