Application Performance

 

To evaluate the suitability of the UltraSPARC T2 processor for high performance computing, we selected a set of application codes that reflects the variety of programs typically executed on our compute cluster, and we also ran the SPEC OMP benchmark suite.

 

Contact Analysis of Bevel Gears with FVA346

At the Laboratory for Machine Tools and Production Engineering of RWTH Aachen University, the contact of bevel gears is simulated and analyzed, for example to understand the deterioration of differential gears as they are used in car gearboxes. These simulations usually run for a few days with the original serial code.

The program was parallelized using OpenMP, and it turned out that it scales quite well on multicore architectures, as it is very cache friendly. The parallel code version consists of some 90,000 lines of Fortran 90 code containing 5 parallel OpenMP regions and 70 OpenMP directives.
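This pattern of a few large parallel regions enclosing many work-sharing directives can be sketched as follows. This is only a hedged C++/OpenMP illustration: FVA346 itself is Fortran 90, and the routine and array names below are invented.

```cpp
// Hedged sketch of the parallelization pattern described above: one of a few
// large parallel regions enclosing several work-sharing loops. FVA346 itself
// is Fortran 90; the names here are invented for illustration.
#include <omp.h>
#include <cstddef>
#include <vector>

void contact_step(std::vector<double>& load, std::vector<double>& stress) {
    // assumes load.size() == stress.size()
    #pragma omp parallel              // one parallel region ...
    {
        #pragma omp for               // ... containing several directives
        for (std::size_t i = 0; i < load.size(); ++i)
            load[i] *= 1.01;          // placeholder for the real contact update

        // The implicit barrier of the loop above guarantees load[] is ready.
        #pragma omp for nowait
        for (std::size_t i = 0; i < load.size(); ++i)
            stress[i] = 2.0 * load[i];
    }
}
```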

Although the parallel speedup on the UltraSPARC T2-based Sun T5120 is over 15, it cannot catch up with the other machines.

Simulation of the Landing of a Space Glider with FLOWer

In a project sponsored by the German Research Foundation (DFG), scientists of the Laboratory of Mechanics of RWTH Aachen University simulated PHOENIX, a small-scale prototype of the Space Hopper, a space launch vehicle designed to take off horizontally and glide back to Earth after placing its cargo in orbit. The corresponding Navier-Stokes equations are solved on a block-structured grid with FLOWer, a flow solver developed at the German Aerospace Center (DLR).

FLOWer is parallelized with MPI. In addition, many loop nests can be parallelized automatically by the Fortran compiler. Thus, on each platform the question arises which combination of MPI processes and threads per process is optimal. The following table compares the runtime of 10 iterations on the SFE2900 and the ST5x20 ("Niagara 2") for various combinations of process and thread counts.

With the optimal combination, the ST5x20 outperforms the 24-core SFE2900 by a factor of 1.27.
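To make the process/thread trade-off concrete, here is a minimal hybrid MPI + OpenMP sketch in C++. It only illustrates the structure (each MPI process owns a set of grid blocks, while OpenMP threads parallelize the loops inside a block); FLOWer itself is Fortran and relies on compiler autoparallelization for the thread level, and all names below are invented.

```cpp
// Minimal hybrid MPI + OpenMP sketch. Total concurrency is
// (number of MPI processes) x (OMP_NUM_THREADS); the benchmark above varies
// exactly this combination. Not FLOWer code; names are invented.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

static void relax_block(std::vector<double>& block) {
    // Thread-level parallelism inside one grid block.
    #pragma omp parallel for
    for (std::size_t i = 0; i < block.size(); ++i)
        block[i] *= 0.5;                  // placeholder for the real stencil
}

int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Process-level parallelism: each rank owns a subset of the grid blocks.
    std::vector<std::vector<double>> my_blocks(4, std::vector<double>(1 << 16, 1.0));
    for (auto& b : my_blocks)
        relax_block(b);

    if (rank == 0)
        std::printf("%d MPI processes x %d threads per process\n",
                    nprocs, omp_get_max_threads());
    MPI_Finalize();
    return 0;
}
```

The same number of hardware threads can be filled with many single-threaded processes or with few processes running many threads each; which split is fastest has to be measured per platform, which is exactly what the comparison above does.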

 

Critical Point Detection in Simulation Output Data for Virtual Reality with NestedCP

In order to interactively analyze the results of large-scale flow simulations in a virtual environment, different features are extracted from the raw output data and visualized. One feature that helps describe the flow topology is the set of critical points, where the velocity is zero. NestedCP is written in C++ and computes critical points in multi-block CFD datasets using a highly adaptive algorithm that profits from the flexibility of OpenMP to adjust the thread count on all three parallel levels and to specify loop schedules on each of these levels.
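A minimal sketch of such nested OpenMP parallelism, with a thread count and a loop schedule chosen per level, is shown below. It is an illustration under assumed data structures only; the real NestedCP exploits three nested levels rather than the two shown here.

```cpp
// Sketch of nested OpenMP parallelism with per-level thread counts and loop
// schedules. The Block structure and the zero-velocity test are placeholders,
// not the actual NestedCP implementation.
#include <omp.h>
#include <cstdio>
#include <vector>

struct Block { int ncells; };

int main() {
    omp_set_nested(1);                 // allow a second active parallel level
    std::vector<Block> blocks(32, Block{100000});
    int found = 0;

    // Outer level: distribute the grid blocks; a dynamic schedule absorbs the
    // strongly varying amount of work per block.
    #pragma omp parallel for num_threads(4) schedule(dynamic) reduction(+:found)
    for (int b = 0; b < static_cast<int>(blocks.size()); ++b) {
        // Inner level: scan the cells of one block for critical points.
        #pragma omp parallel for num_threads(4) reduction(+:found)
        for (int c = 0; c < blocks[b].ncells; ++c) {
            bool velocity_vanishes = false;   // placeholder for the real
                                              // interpolation-based test
            if (velocity_vanishes)
                ++found;
        }
    }
    std::printf("critical points found: %d\n", found);
    return 0;
}
```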

The code scales very well on the SFE6900 and the SFV40z. On the ST5x20 the speedup levels off at 16 and more threads.

 

Simulation of the Air Flow through the Human Nose with TFS

The Navier-Stokes solver TFS, developed by the Institute of Aerodynamics of RWTH Aachen University, is currently used in a multidisciplinary project to simulate the air flow through the human nose. TFS uses a multi-block structured grid with general curvilinear coordinates. OpenMP is employed on the block level and also on the loop level. The application puts a high load on the memory system and is therefore quite sensitive to ccNUMA effects.
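A common way to mitigate such ccNUMA effects, and what the "first-touch placement" remark in the table below refers to, is to initialize the data in parallel with the same loop schedule that later traverses it, so that each memory page ends up on the NUMA node of the thread that uses it. The following is a minimal sketch in plain C++/OpenMP, not TFS code.

```cpp
// First-touch placement sketch for a ccNUMA machine: pages are physically
// allocated on the node of the thread that first writes them, so the
// initialization loop uses the same static schedule as the compute loop.
// Illustration only, not TFS code.
#include <omp.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n = 1u << 26;
    double* a = new double[n];         // deliberately not initialized here:
    double* b = new double[n];         // the pages must be touched in parallel

    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) {   // first touch places the pages
        a[i] = 0.0;
        b[i] = 1.0;
    }

    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)     // same schedule: mostly local access
        a[i] += 0.5 * b[i];

    std::printf("a[0] = %f\n", a[0]);
    delete[] a;
    delete[] b;
    return 0;
}
```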

The optimal number of threads on each of the parallelization levels and the optimal strategy for distributing the work to the threads differ between the platforms. The following table contains the best results on several machines. As the code was originally developed for vector computers, it still performs quite well on the NEC SX-8 (thanks to the HLRS for granting access to this machine).

 

 

 

 

Machine                      | Serial runtime [s] | Total threads | Threads (block level) | Threads (loop level) | Parallel runtime [s] | Speed-up | Efficiency [%] | Remark
-----------------------------|--------------------|---------------|-----------------------|----------------------|----------------------|----------|----------------|--------------------------------------
SFE25K (72 US IV)            | 342                | 32            | 8                     | 4                    | 20                   | 17       | 53             | sorted blocks, first-touch placement
SFE25K (72 US IV)            | 342                | 64            | 8                     | 8                    | 18                   | 20       | 31             | sorted blocks, random placement
SFE25K (72 US IV)            | 342                | 128           | 16                    | balanced (2–11)      | 14                   | 25       | 39             | thread affinity, thread balancing
SFE25K (72 US IV)            | 342                | 128           | 16                    | 8                    | 13                   | 27       | 21             | thread affinity, sorted blocks
SFE6900 (24 US IV)           | 312                | 48            | 8                     | 6                    | 19                   | 16       | 33             | sorted blocks
SFV40z (4 dual-core Opteron) | 148                | 8             | 8                     | 1                    | 26                   | 5.6      | 70             | block groups, binding, migration
NEC SX-8 (8 vector CPUs)     | 15.7               | 8             | 8                     | vector               | 5.8                  | 2.7      | 34             | dynamic schedule

As the program is quite memory hungry, it also performs well on the UltraSPARC T2 processor. When the thread counts are adjusted properly, the performance is only a factor of 1.55 away from a single NEC SX-8 vector processor. From a price/performance perspective this is a highly interesting result.

Total threads | Threads (block level) | Threads (loop level) | Runtime on UltraSPARC T2 [s]
--------------|-----------------------|----------------------|-----------------------------
1             | -                     | -                    | 475.7
8             | 2                     | 4                    | 62.0
8             | 4                     | 2                    | 61.9
8             | 8                     | 1                    | 63.8
16            | 4                     | 4                    | 34.7
16            | 8                     | 2                    | 35.2
32            | 4                     | 8                    | 26.2
32            | 8                     | 4                    | 26.7
64            | 4                     | 16                   | 24.4
64            | 8                     | 8                    | 25.0
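The block-level and loop-level thread counts varied in the table could be selected at run time as in the following hedged sketch (C++/OpenMP; TFS is Fortran and configures its thread counts differently, and the command-line handling here is purely illustrative):

```cpp
// Sketch: choose block-level and loop-level thread counts at run time,
// mirroring the combinations in the table above. Not TFS code; the names
// and the command-line interface are made up for illustration.
#include <omp.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    // e.g. "./nested_sketch 8 8" requests 8 x 8 = 64 threads in total
    const int block_threads = (argc > 1) ? std::atoi(argv[1]) : 4;
    const int loop_threads  = (argc > 2) ? std::atoi(argv[2]) : 4;
    omp_set_nested(1);                 // enable the second parallel level

    const int nblocks = 64, ncells = 100000;
    double sum = 0.0;
    #pragma omp parallel for num_threads(block_threads) reduction(+:sum)
    for (int b = 0; b < nblocks; ++b) {
        #pragma omp parallel for num_threads(loop_threads) reduction(+:sum)
        for (int c = 0; c < ncells; ++c)
            sum += 1e-6 * b;           // placeholder for the per-cell work
    }
    std::printf("%d x %d threads, checksum %g\n",
                block_threads, loop_threads, sum);
    return 0;
}
```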

 

For benchmarking we also use a different version of the TFS code, which is parallelized with MPI on the block level and with OpenMP on the loop level. When only the loop-level parallelization is activated, the code scales only to a modest number of threads, as can be seen in the next graph. Thus, the UltraSPARC T2 cannot catch up with the other machines when a high number of threads is employed.

When MPI and OpenMP are both employed there is enough scalability for the UltraSPARC T2 to catch up. Only the SFE6900 performs better with 16 or more threads.

With a larger dataset the UltraSPARC T2 turns out to perform better than the SFE2900 in both cases: if only the loop-level parallelization is activated, the factor is 1.04; if MPI and OpenMP are employed, the factor is 1.36.

 

The SPEC OpenMP Benchmark OMP2001base

The SPEC benchmarks are very popular for comparing machines. The manufacturers put an extremely high effort into presenting optimal results for their machines. We took the OMP2001base benchmark suite and tried to behave just like a normal user would: turn on a reasonable set of compiler flags and then let it run - no profile-feedback optimization, no experiments with all kinds of well-hidden compiler options, no special setting of system tunables. On the SFE2900 the difference between our performance results and the manufacturer's is considerable: we are a factor of 1.6 away from the optimum!

Still, we think that it is a reasonable approach to compare machines with a "standard setting". Comparing the SFE2900 with an ST5x20 beta machine, the latter wins by a factor of 1.1.

Threads | Sun Fire E2900 | UltraSPARC T2
--------|----------------|--------------
64      | -              | 9605
32      | -              | 8783
24      | 8675           | -
16      | 7874           | 6261
12      | 7748           | -
8       | 6005           | 4063
4       | 3501           | 2174
2       | 1924           | -
1       | 1048           | -

 

Blood Pump Simulation with XNS

XNS is a finite element flow solver used for simulating the blood damage and clotting caused by blood pumps. The code is developed by the Chair for Computational Analysis of Technical Systems (CATS). It has been very well parallelized with MPI and scales to thousands of processors on the IBM Blue Gene/L for large datasets. (Thanks to the Research Center Jülich for granting access to their BG/L machine.)

Here we only look at a small test case. The UltraSPARC T2 scales well up to 16 processes, and then scalability levels off. The Blue Gene/L processor performs similarly to an UltraSPARC IV processor but scales a little better; presumably the SFE6900 runs into a memory bandwidth shortage with many processes. 16 processes on one UltraSPARC T2 chip perform like 8 Blue Gene/L processes on 4 chips. It would be nice to try this code on a number of UltraSPARC T2 machines connected by a fast network ...
