Writing Efficient Programs in C++ - Tutorial and Workshop
|Time:||Tutorial: Mon, Sept 23, 14:00 - 17:30, and Tue, Sept 24 - Thu, Sept 26, 9:00 - 17:30|
Tuning Workshop: Fri, Sept 27, 9:00 - 12:30 (limited number of participants!)
|Speakers:||Ruud van der Pas (Sun Microsystems, Application Performance Specialist HPC)|
Jörg Striegnitz (Research Center Jülich, ZAM)
- Tuning Workshop
- Getting to Aachen
- Further Information
- Additional Course Material
This tutorial intends to make C++ programmers aware of the strengths and weaknesses of the C++ programming language in the field of scientific computing.
Based on last year's good experience, we are extending this event to 3 1/2 days and appending a half-day tuning workshop at the end.
It will begin with a short introduction to basic performance tuning aspects and the Sun programming environment (compilers and performance analyzer) and will then focus on C++ specific programming techniques.
Classes, inheritance, operator overloading, and polymorphism are suitable and powerful tools for programming at a high level of abstraction. Unfortunately, the use of these concepts often runs contrary to the expectation of high performance. Recently, new techniques have been developed that help to reconcile a high level of abstraction with high performance. Among them are template metaprogramming, expression templates, traits classes, and partial and lazy evaluation.
During this course the cost of C++ abstractions will be investigated and thoroughly explained; several solutions to overcome the abstraction penalty will be presented and applied during the exercises.
Issues specific to the Sun C++ compiler will be covered in more detail.
The tutorial is open to Sun customers, partners and employees.
Attendees should already have some experience with C++.
In this tuning workshop we are particularly interested in helping users of our Sun Fire SMP Cluster to improve the efficiency of their applications.
You will have the opportunity to ask the experts for advice and help on tuning your application. As we plan to accommodate several representatives from the user community, we will have to time slice between the participants. Where needed, we will give advice and then ask you to try it, while we work with other attendees.
Performance tuning is still often a matter of some experimentation, but we can give you advice on a best effort basis. Hopefully this will lead to a noticeable performance improvement, but guarantees cannot be given.
When parallelizing an application, it is important to have tuned for single processor ("serial") performance first. Otherwise, one can more quickly run into scalability problems. Therefore most of the focus will be on serial performance, but we will also consider shared memory parallelization with OpenMP where relevant and desired.
To maximize the efficiency of the workshop, we would like to ask you to prepare a test case that reflects a typical production run, but does not take too long to execute. In the ideal case, a run should not take more than 5 to 10 minutes to finish.
It is also important to have an easy way of verifying that the results of this test run are correct.
Use of a makefile to (re)build the application is highly recommended. If you need help with this setup, please contact us.
The seminar is organized in cooperation between the Aachen University of Technology (RWTH), the Research Center Jülich, and Sun Microsystems. There is no seminar fee. All other costs (e.g. travel, hotel, and meals) are at your own expense.
Registration for the tutorial is mandatory and must be completed by Sept 15.
We allocated additional places for the labs, so that we were able to increase the number of participants.
Please note as a remark on the registration form if you rely on the talks being given in English.
Please fill out the registration form carefully, as we will generate certificates of attendance automatically from these data.
There is no open registration for the Tuning Workshop. Participation is after personal consultation only.
|Monday, Sept 23||14:00 - 17:30||Tutorial part I||Ruud van der Pas|
|Tuesday, Sept 24||09:00 - 12:30||Tutorial part II||Jörg Striegnitz|
|14:00 - 17:30||Lab exercises part I||Jörg Striegnitz, Ruud van der Pas|
|Wednesday, Sept 25||09:00 - 12:30||Tutorial part III||Jörg Striegnitz|
|14:00 - 17:30||Lab exercises part II||Jörg Striegnitz|
|Thursday, Sept 26||09:00 - 12:30||Tutorial part IV||Jörg Striegnitz|
|14:00 - 17:30||Lab exercises part III||Jörg Striegnitz|
|Friday, Sept 27||09:00 - 12:30||Tuning Workshop||Jörg Striegnitz, Ruud van der Pas, Dieter an Mey|
Accommodation and general visitor information for Aachen:
Getting to Aachen:
The web pages of the Aachen Tourist Service explain how to get to Aachen.
A detailed description of the location of the Computing Center is also available,
as is a picture that shows how to get to the Computing Center by car.
You may also download a sketch of the city with some points of interest marked.
- Compile-Time and Run-Time Performance with the SunC++ Compiler by Lawrence Crowl (pdf)
- C++ information
- Object oriented numerics
- Abstractions and their cost - Part 1
- Abstractions and their cost - Part 2
- Abstractions and their cost - Part 3
- Template Metaprogramming - Part 1
- Template Metaprogramming - Part 2
- Advanced Techniques - Part 1
- Advanced Techniques - Part 2
- Advanced Techniques - Part 3
- Advanced Techniques - Part 4
Additional Course Material, Solutions to the Lab Exercises:
- Ruud's slides (2 slides per page)
- Jörg's slides (1 slide per page)
- Jörg's slides (6 slides per page)
- Exercise 1 a-d
- Exercise 2
- Exercise 3a, 3b, 3c
- Exercise 4
Exercise 1 focuses on the hidden use of temporaries when overloaded operators are employed. An array class with a type and a size parameter is defined, and the following operation is timed:
    Array<double,SIZE> res, a, b, c, d;
    res = a * b - c + d;
Various methods for avoiding temporaries were discussed during the workshop, and we compared the timings of some of the program versions developed during the lab sessions. We also used different C++ compilers installed on the Sun Fire 6800 system and included C and Fortran codes for comparison.
The following table contains the number of generated temporaries and the runtime in machine cycles per loop step. The array size was set to 500.
|compiler:||CC (Sun)||KCC (KAI)||g++ (GNU)||g++ (GNU)||g++ (GNU)|
The following program versions were measured:
|1b||template array class with operator overloading and local temporary|
|1b_gnu_return||template array class with GNU NRV (named return value) optimization|
|1c||template array class with computational constructors|
|1d||template array class with reuse of arithmetic assignment operators|
|1d_2||explicit usage of arithmetic assignment operators|
|pete||expression templates using the PETE toolset|
Measurements of similar C, Fortran77 and Fortran90 programs are included for comparison. It can be seen that with a good C++ compiler (here KCC in combination with the native Sun C compiler) and the use of expression templates (here via the PETE toolset), the same performance as with C or Fortran can be achieved. Whereas in C and in Fortran77 the array operations have to be coded with loops, Fortran90 offers array syntax as an intrinsic language element. Each loop step needs four loads (a, b, c, d) and one store (res); because the UltraSPARC-III processor can issue at most one memory operation per cycle, the minimum number of cycles per loop step is 5 in this example. So we are close to the optimum. If the data is not in the L1 cache, the number of cycles per loop step will rise.
In some cases KCC seems to offer the best high-level optimizations. The Sun compiler generates much better code than the freely available g++ compiler. In particular, the new g++ version 3.0.1 performs worse than the older version 2.95.2.
|compiler:||cc (Sun)||f77 (Sun)||f90 (Sun)||g++ (GNU)|
The following compiler options have been used.
|CC (Sun)||7.0||-fast -xarch=v8plusb -xchip=ultra3|
|KCC (KAI)||4.0||+K3 --backend -fast --backend -xarch=v8plusb --backend -xchip=ultra3|
|g++ (GNU)||2.95.2||-O6 -mv8|
|g++ (GNU)||3.0.1||-O6 -mv8plus|
See this shar file for all the timing examples and the makefile.