Bull Nehalem/Westmere Cluster

Since Monday, 2011-10-17, the new Bull HPC cluster has been accessible to all users.

It offers a new cluster environment which provides, among others, the following new features:

  • 300 TFlop/s compute power

  • a powerful GPU cluster

  • Linux upgrade to Scientific Linux

  • a new batch system, Platform LSF

  • Open MPI as the default MPI implementation

  • a new powerful Work-Filesystem ($HPCWORK)

The Windows partition will not be changed for the time being. The new hardware will be mainly used to run Linux.

Sun Nehalem Cluster

October 2009

200 nodes of the new Nehalem Cluster are put into operation.


Fujitsu-Siemens Harpertown Cluster

June 2008

  • The HPC cluster from Fujitsu-Siemens, installed in early 2008, is listed among the 100 fastest systems worldwide in the June 2008 release of the TOP500 list. Its 256 nodes with two processors each, 512 Intel Xeon quad-core processors with a 3 GHz clock speed in total, achieved 18.81 TFlop/s on the Linpack benchmark. The theoretical peak performance of the cluster is 24.58 TFlop/s.
  • The benchmark run was carried out by Computing Center and Microsoft staff using the Windows HPC 2008 operating system. Windows HPC 2008 has proven its efficiency compared to the well-established Unix operating systems in the HPC arena. Three systems using Windows are listed among the top 100 sites.
  • The Computing Center is a designated Windows High Performance Cluster Competence Center (WinHP3C). The aim of the cooperation between RWTH and Microsoft is to promote Windows as an HPC platform in engineering.
  • The cluster is part of the integrative hosting concept of the Computing Center. The majority of the systems are owned by RWTH institutes but are hosted and operated by the Computing Center.
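As a rough cross-check of the figures above, the theoretical peak and the resulting Linpack efficiency can be recomputed. The sketch below assumes 4 double-precision floating point operations per cycle and core, which is not stated in the text; all variable names are illustrative.

```python
# Figures quoted in the text above.
processors = 512            # 256 nodes x 2 processors
cores_per_processor = 4     # quad-core Xeon
clock_ghz = 3.0
linpack_tflops = 18.81

# Assumption (not stated in the text): 4 double-precision flops per cycle and core.
flops_per_cycle = 4

# Peak = processors x cores x clock x flops per cycle, converted from GFlop/s to TFlop/s.
peak_tflops = processors * cores_per_processor * clock_ghz * flops_per_cycle / 1000.0
efficiency = linpack_tflops / peak_tflops

print(f"theoretical peak: {peak_tflops:.2f} TFlop/s")
print(f"Linpack efficiency: {efficiency:.1%}")
```

Under this assumption the computed peak agrees with the 24.58 TFlop/s quoted above, and the Linpack run corresponds to roughly three quarters of peak.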



Reorganization of the Compute Cluster

November 2007

Due to power and cooling problems, we are forced to replace the majority of the UltraSPARC-based computers with more energy-efficient systems in mid-December.

Systems to be installed during the next 3 weeks:

20 UltraSPARC T2-based systems (64 parallel processes, 32 GB RAM).
The operating system will be Solaris 10. The systems are binary compatible with the current UltraSPARC IV systems, so codes will run without modification.

40 Xeon-based systems (dual socket, 4-core Xeon processors, 16/32 GB RAM).
All systems are connected via an InfiniBand network. The operating system will be Scientific Linux or Windows Server 2008 (HPC Edition).

The number of UltraSPARC IV-based systems will be reduced to two Sun Fire E25Ks. The Opteron-based Linux systems are not affected and will remain in operation. The overall peak performance of the cluster will increase to about six TFlop/s.

Codes that depend on the UltraSPARC architecture can run unmodified on the Sun Fire E25Ks and the 20 new UltraSPARC T2-based systems. To reduce the overall Solaris workload, we will move commercial application software to the Linux systems where possible. This should be problem-free for all new calculations. If you have to convert data (e.g. binary restart files) because of the difference between big-endian and little-endian byte order, please contact us.
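To illustrate the byte-order issue: UltraSPARC systems store multi-byte values in big-endian order, whereas the Xeon and Opteron systems use little-endian order. The following is a minimal sketch of a conversion, assuming the data is a plain sequence of 64-bit floating point values. The function name is hypothetical, and real restart files often contain headers or record markers that need separate handling.

```python
import struct


def big_to_little_f64(raw: bytes) -> bytes:
    """Re-encode a buffer of big-endian 64-bit floats as little-endian.

    Assumes `raw` holds nothing but IEEE 754 doubles (8 bytes each).
    """
    if len(raw) % 8 != 0:
        raise ValueError("buffer length is not a multiple of 8 bytes")
    count = len(raw) // 8
    values = struct.unpack(f">{count}d", raw)   # '>' = big-endian
    return struct.pack(f"<{count}d", *values)   # '<' = little-endian


# Example: three values as they would be written on a big-endian machine.
big_endian_data = struct.pack(">3d", 1.0, 2.5, -3.0)
little_endian_data = big_to_little_f64(big_endian_data)
print(struct.unpack("<3d", little_endian_data))
```

Decoding and re-encoding through `struct` keeps the element size explicit; for mixed record layouts, each field would need its own format code.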





UltraSPARC T2 ("Niagara2") CPU

In August 2007 Sun Microsystems announced the successor of the UltraSPARC T1 processor, the UltraSPARC T2, code-named "Niagara 2". The processor contains up to eight processor cores, each of which can execute 8 threads simultaneously. Thus, within a single processor chip 64 processes can operate on eight 8-stage integer pipelines and eight 12-stage floating point pipelines. The objective of this design is to overlap computation with waiting for memory, or to have multiple threads wait for memory simultaneously. Whereas traditional processors only ran at some 5 percent of their peak performance when executing memory-hungry programs, the UltraSPARC T2 processor promises to operate at a higher percentage of its theoretical peak performance.

Each of the processor cores has separate instruction and data caches and accesses a shared L2 cache and the shared main memory via an internal crossbar. Thus, from the programmer's perspective, the UltraSPARC T2 processor is a shared memory machine on a single chip with a flat memory (UMA = uniform memory access).

A single process achieves up to 1.4 GFlop/s, because one core can only execute one floating point operation per cycle. The peak performance of the whole chip is therefore quite moderate: 11.2 GFlop/s. The high potential of the Niagara 2 shows when many threads are active and the high memory bandwidth of some 60 GB/s can be exploited; memory bandwidth is a frequent bottleneck of standard architectures. Furthermore, the UltraSPARC T2 processor contains two 10/1 Gbit Ethernet ports (up to 3.125 Gb/s) and one PCI Express x8 1.0A port (2.5 Gb/s) on chip.
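The chip-level figures follow directly from the per-core numbers. A small sketch of the arithmetic, assuming a clock rate of 1.4 GHz (inferred here from 1.4 GFlop/s per core at one floating point operation per cycle, not stated explicitly in the text):

```python
# Figures quoted in the text above.
cores = 8
threads_per_core = 8
flops_per_cycle_per_core = 1   # one floating point operation per cycle and core

# Assumption: 1.4 GHz clock, inferred from 1.4 GFlop/s per core.
clock_ghz = 1.4

hardware_threads = cores * threads_per_core
peak_gflops = cores * clock_ghz * flops_per_cycle_per_core

print(f"hardware threads per chip: {hardware_threads}")
print(f"chip peak performance: {peak_gflops:.1f} GFlop/s")
```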

UltraSPARC T1 ("Niagara") based Sun Fire T2000 installed

November 2005. Future processors will be able to run more and more threads simultaneously. Sun has now released a new machine built around a very innovative new chip. The UltraSPARC T1 processor chip (code-named "Niagara") has eight cores, each of which can execute 4 threads simultaneously, so the operating system sees 32 threads all running on a single chip. The UltraSPARC T1 processor targets a specific workload that does not contain many floating point operations, but it might be extremely cost-effective for integer-intensive workloads.

The Sun Fire T2000 system serves as an important technology preview to explore how well the current programming and parallelization paradigms work on such an innovative processor architecture.

Have a look at our first experiments with the Niagara chip here.


Four Dual-Core Quad-Opteron Sun Fire V40z installed

October 2005. Following the success of the Opteron-based Sun Fire cluster, we installed 4 Sun Fire V40z systems, each equipped with 4 brand-new dual-core Opteron processors. Running at 2.2 GHz with access to 16 GB of main memory, the machines are operated with Solaris 10 and Linux. They will be attached to a new InfiniBand network as a testbed for future networking technologies.



Over 2 TeraFlop/s Linpack Performance

April 2005. The upgrade from UltraSPARC III to UltraSPARC IV, including an increase of the main memory capacity, more than doubled our Linpack performance!

On the occasion of a major check of our uninterruptible power supply system we took the opportunity to run the Linpack benchmark on the 20 biggest of our UltraSPARC IV-based compute servers.

We were amazed that these throughput-oriented servers, which run thousands of user jobs in their daily routine, were able to deliver 67% of the theoretical peak performance for a single application.

A linear system with 499,200 unknowns was solved in 11:12:48.8 hours at an average speed of 2054.4 billion floating point operations per second (GFlop/s). The program had a total memory footprint of 2 terabytes. The 20 compute nodes are equipped with 672 dual-core UltraSPARC IV processors running at 1050 or 1200 MHz clock speed. 1276 processor cores were kept busy with 82,930,000,000 million floating point operations, leaving 68 cores free for networking and system tasks.
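These figures can be cross-checked with a few lines of arithmetic. The sketch below assumes the leading-order operation count of an LU-based Linpack solve, (2/3) n^3, and neglects lower-order terms; the per-core peak values (2.4 GFlop/s at 1.2 GHz, 2.1 GFlop/s at 1.05 GHz) and the node counts are taken from elsewhere on this page.

```python
# Figures quoted in the text above.
n = 499_200                                # number of unknowns
seconds = 11 * 3600 + 12 * 60 + 48.8       # run time 11:12:48.8

# Assumption: leading-order Linpack operation count (2/3) * n^3,
# lower-order n^2 terms are neglected.
flops = 2 * n**3 / 3
rate_gflops = flops / seconds / 1e9

# Peak of the 20 nodes: nodes x processors x 2 cores x GFlop/s per core.
peak_gflops = 16 * 24 * 2 * 2.4 + 4 * 72 * 2 * 2.1
efficiency = rate_gflops / peak_gflops

print(f"operations: {flops / 1e6:.4g} million")
print(f"average rate: {rate_gflops:.1f} GFlop/s")
print(f"share of peak: {efficiency:.1%}")
```

The computed rate and the share of peak agree with the 2054.4 GFlop/s and 67% quoted above.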

The following components contributed to this unexpectedly good result:

  • The Sun Performance Library, a highly tuned mathematical library, was employed to squeeze out every machine cycle when multiplying matrices.
  • The extremely fast Sun Fire Link network, together with the fast and thread-safe implementation of the message passing interface (MPI) that is part of Sun HPC ClusterTools, facilitated a very smart hybrid (MPI+OpenMP) implementation of the linear equation solver by Eugene Loh (Sun).
  • The different clock speeds of the available UltraSPARC IV processors were compensated for with a simple thread balancing technique (see below).

In Aachen four Sun Fire E25K nodes and two groups of 8 Sun Fire E6900 nodes each are connected with the extremely fast Sun Fire Link network. Gigabit Ethernet is used to connect these three Fire Link groups.
Another 8 Sun Fire E2900 nodes could not contribute to the total performance as they are connected only through Gigabit Ethernet.

The Algorithm

The cluster Linpack implementation used macro dataflow techniques to maximize concurrency. Such techniques are used today by the Sun Performance Library to deliver optimal scalability for dense linear algebra routines on shared-memory systems.

Thread Balancing

To bridge the performance gap between the slower 72 dual-core processors of each Sun Fire E25K node (1050 MHz) and the faster 24 dual-core processors of each Sun Fire E6900 node (1200 MHz), we used a thread balancing technique. We started 2 MPI processes with 23 threads each on each of the 16 Sun Fire E6900 nodes and 5 MPI processes with 27 threads each on each of the four Sun Fire E25K nodes.

As most of the compute time is spent multiplying matrices and most of the communication time is hidden behind computation, we used a very simple approach and estimated the Linpack performance of each MPI process as proportional to the peak performance of its processors.

If we had started 6 MPI processes with 23 threads each on each of the Sun Fire E25K nodes, their slower processors would have dominated the total performance. Therefore we decided to sacrifice one MPI process in favour of four additional threads for each of the remaining 5 MPI processes on each of the big nodes. (We have successfully used a dynamic version of this technique in another context in order to alleviate load imbalances in a hybrid application.)
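The effect of this choice can be illustrated with a short calculation. The sketch below models the speed of one MPI process as threads times clock rate, following the proportionality assumption described above; the function and variable names are illustrative and not taken from the original benchmark code.

```python
CLOCK_MHZ = {"E6900": 1200, "E25K": 1050}
CORES_PER_NODE = {"E6900": 24 * 2, "E25K": 72 * 2}   # dual-core processors
NODES = {"E6900": 16, "E25K": 4}


def process_speed(node_type, threads):
    """Relative speed of one MPI process: threads x clock rate (MHz)."""
    return threads * CLOCK_MHZ[node_type]


def balance(layout):
    """Slowest / fastest MPI process speed; 1.0 means perfectly balanced."""
    speeds = [process_speed(t, threads) for t, (_, threads) in layout.items()]
    return min(speeds) / max(speeds)


def busy_cores(layout):
    """Cores occupied by compute threads (one thread per core)."""
    return sum(NODES[t] * procs * threads for t, (procs, threads) in layout.items())


# layout: node type -> (MPI processes per node, threads per MPI process)
chosen = {"E6900": (2, 23), "E25K": (5, 27)}
uniform = {"E6900": (2, 23), "E25K": (6, 23)}

total_cores = sum(NODES[t] * CORES_PER_NODE[t] for t in NODES)
for name, layout in (("chosen", chosen), ("uniform", uniform)):
    busy = busy_cores(layout)
    print(f"{name}: balance={balance(layout):.3f} busy={busy} idle={total_cores - busy}")
```

In this model the chosen layout keeps the per-process speeds within about 3% of each other, whereas 23 threads everywhere would leave the Sun Fire E25K processes about 12% slower than the Sun Fire E6900 processes. The busy and idle core counts of the chosen layout reproduce the 1276 and 68 cores reported above.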

Sun Fire Opteron-Cluster in Operation

The new Opteron cluster, consisting of 64 Sun Fire V40z nodes with 4 AMD Opteron 848 processors each, has been in operation since October 2004. Currently 56 nodes are running a 64-bit Linux operating system, delivering a total of about 1 TFlop/s of Linux power. Jobs are prepared on a front-end machine, and the back-end nodes are controlled by the Sun Grid Engine.

Four nodes are already running Solaris 10 and the Sun Studio 10 compiler suite, which facilitates the migration from UltraSPARC to Opteron. Four other nodes are running Windows 2003 and they will be open for the first "friendly users" soon.


Sun Fire SPARC-Cluster upgraded to UltraSPARC IV

September 2004. All Sun Fire 15K and Sun Fire 6800 nodes have been upgraded to UltraSPARC IV this month. With the new processors they receive new model names: Sun Fire E25K and Sun Fire E6900. Together with 8 Sun Fire E2900 nodes, which were installed in July, we are now running 768 UltraSPARC IV processors in the cluster. As these processors contain two CPU cores each, the programmer has the impression of having 1536 CPUs available.
The following table gives an overview of the current configuration including the new Opteron-Cluster.

System                          Clock      Processors                       Peak performance (GFlop/s)
16 Sun Fire E6900               1.2 GHz    16 x 24 (dual core)   16 x 96    16 x 24 x 2 x 2.4 = 1843.2
4 Sun Fire E25K                 1.05 GHz   4 x 72 (dual core)    4 x 288    4 x 72 x 2 x 2.1 = 1209.6
8 Sun Fire E2900                1.2 GHz    8 x 12 (dual core)    8 x 48     8 x 12 x 2 x 2.4 = 460.8
64 Sun Fire V40z (Opteron 848)  2.2 GHz    64 x 4                64 x 8     64 x 4 x 4.4 = 1126.4





The Sun Fire SMP-Cluster

In May 2001 we started to install the first Sun Fire 6800 servers in Aachen. Since then we have continually upgraded the system to 16 Sun Fire 6800 and 4 Sun Fire 15K servers equipped with UltraSPARC/Cu 900 MHz chips. By 2002 the cluster had 672 processors with a total peak performance of 1.2 TFlop/s and an aggregate main memory capacity of 1 TByte.

In 2003 the low-latency, high-bandwidth Sun Fire Link networks were installed to tightly link 2 groups of 8 Sun Fire 6800 systems each and one group consisting of the 4 Sun Fire 15K systems. With this configuration we reached rank 151 on the November 2003 TOP500 list of the fastest computers, which is based on solving large systems of linear equations. A system of over 200,000 unknowns was solved at 891.4 GFlop/s (billion floating point operations per second).

A major source of information for the users of the systems is the Sun Fire Primer which has been prepared in cooperation with and with the kind support of two Sun application performance specialists.

There is much more information about high performance computing at Aachen University on our web site (in German).

Once a year, a one-week workshop on High Performance Computing on the Sun Fire SMP-Cluster, sponsored by Sun Microsystems, takes place in Aachen. The lectures are recorded and provided as streaming videos on the web. Please find a list of upcoming and past HPC events here. A database contains all the course material that is available online.

We are actively involved in the Sun HPC user community, which meets twice a year in the Sun HPC Consortium Meetings.

Furthermore Aachen is taking part in Sun Microsystems' strategic Centers of Excellence (CoE) program as the "Sun CoE for Engineering Sciences and Computational Fluid Dynamics".
