Current Topics in High-Performance Computing (HPC)

Content

High-performance computing (HPC) is applied to speed up long-running scientific applications, for instance the simulation of computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but come in different facets, ranging from clusters and (large) shared-memory systems to accelerators (e.g. GPUs). To leverage these systems, parallel programming with, e.g., MPI, OpenMP or CUDA must be applied.
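
As a small illustration of such programming models, the following minimal sketch parallelizes a simple loop with OpenMP (illustrative only; an MPI or CUDA version of the same loop would distribute the work across processes or offload it to a GPU):

    // Minimal OpenMP example: each loop iteration is independent, so the
    // iterations can be distributed across the threads of a shared-memory node.
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1000000;
        const double a = 2.0;
        std::vector<double> x(n, 1.0), y(n, 0.0);

        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];

        std::printf("y[0] = %f\n", y[0]);
        return 0;
    }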

This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics may cover, e.g., parallel computer architectures (multicore systems, Xeon Phis, GPUs, etc.), parallel programming models, performance analysis & correctness checking of parallel programs, or performance modeling.

Schedule

The topics are assigned at the beginning of the lecture period (October 20th, 2015, 9-10.30am). The students then work out their topics over the course of the semester. The corresponding presentations take place as a block course (one or two days) at the end of the lecture period or at the beginning of the exam period. Attendance is compulsory.
More information can be found in L²P: https://www3.elearning.rwth-aachen.de/ws15/15ws-29794

Requirements

The goals of a seminar series are described in the corresponding Bachelor and Master modules.
In addition to the seminar thesis and its presentation, Master students will have to chair one session (roughly 1-3 presentations). A session chair makes sure that the session runs smoothly. This includes introducing the title of each presentation and its author, keeping track of the speaking time, and leading a short discussion after the presentation. Further instructions will be given during the seminar.
Prerequisites
Attendance of the lecture "Introduction to High-Performance Computing" (Müller/Bientinesi) is helpful, but not required.
Language
We prefer and encourage students to write the seminar thesis and give the presentation in English. However, German is also possible.
Topics
Some topics are already described below to give an idea of the range of topics. A comprehensive description of all topics is coming soon.

Legion - Task-parallel Distributed Memory Parallelism

Modern parallel architectures have both heterogeneous processors and deep, complex memory hierarchies. This requires special consideration of the programming model. Legion is a recent programming model and runtime system organized around logical regions, which express both locality and independence of program data, and tasks, functions that perform computations on regions.
This seminar article and talk are expected to provide a general overview of Legion and to briefly describe an application example.
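
The actual Legion C++ API is considerably more involved; the following conceptual sketch uses invented names (Region, scale_task) and is not Legion code. It only illustrates the idea that tasks declare the regions of data they operate on, so that a runtime can reason about locality and independence:

    // Conceptual sketch of the region/task idea (invented names, not the Legion API).
    #include <vector>

    // A "logical region" names a collection of data; here simply a slice of a vector.
    struct Region {
        std::vector<double>& data;
        std::size_t begin, end;
    };

    // A "task" is a function that performs a computation on the regions it was
    // given. Because the regions it touches are explicit, a runtime could
    // schedule independent tasks in parallel and place data close to them.
    void scale_task(Region r, double factor) {
        for (std::size_t i = r.begin; i < r.end; ++i)
            r.data[i] *= factor;
    }

    int main() {
        std::vector<double> field(1024, 1.0);
        // Two disjoint regions: a runtime can infer that the tasks are independent.
        Region lower{field, 0, 512}, upper{field, 512, 1024};
        scale_task(lower, 2.0);   // in Legion these would be launched as tasks
        scale_task(upper, 0.5);   // and could run concurrently on different processors
        return 0;
    }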

Supervisor: Christian Terboven

An Overview of Implementation and Performance of Coupled Simulations by the Example of TerrSysMP

In some areas of research, coupling multiple models can lead to better scientific results. For example, climate simulations combine separate models for the atmosphere, the land surface, and the ocean in a coupled simulation. Often these models are implemented independently and combined by a coupler framework which starts different MPI processes for the different models. These kinds of programs are called MPMD (multiple-program multiple-data) applications. Coupling several models is not easy at all, and getting good performance is even harder. In this seminar thesis, an overview of coupled simulations shall be given and the general challenges regarding performance optimization shall be discussed. As an example application, the TerrSysMP code can be used.
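
The coupling setup itself is application-specific, but the MPMD structure can be illustrated with plain MPI. The following generic sketch (not TerrSysMP code; the split into two models is a made-up example) shows how the processes of a coupled run might be partitioned into one communicator per model:

    // Generic MPMD sketch: each model claims its own communicator so that
    // model-internal communication stays separated from inter-model exchanges.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        // Hypothetical partitioning: the first half of the ranks run the
        // atmosphere model, the second half the land-surface model.
        const int color = (world_rank < world_size / 2) ? 0 : 1;
        MPI_Comm model_comm;
        MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &model_comm);

        int model_rank;
        MPI_Comm_rank(model_comm, &model_rank);
        std::printf("world rank %d is rank %d in model %d\n",
                    world_rank, model_rank, color);

        // Coupling data would be exchanged via MPI_COMM_WORLD (or an
        // inter-communicator) at defined coupling intervals.
        MPI_Comm_free(&model_comm);
        MPI_Finalize();
        return 0;
    }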

Supervisor: Dirk Schmidl

Using Notified Access for Remote Memory Access Programming Models for Producer-Consumer Synchronization

In parallel computing, producer-consumer communication is a typical pattern across process boundaries. Although Remote Memory Access (RMA) programming enables direct access to low-level hardware features, the design of RMA programming schemes focuses on memory access without taking process synchronization into account. As a consequence, new schemes for distributed programming paradigms like the Message Passing Interface (MPI) have to be considered in order to use RMA efficiently.
This seminar article and talk are expected to examine this issue in detail and present the idea of Notified Access in the context of existing cluster and network technology.
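
To illustrate the problem Notified Access addresses, the following minimal sketch (assuming at least two ranks, error handling omitted) shows a conventional MPI-3 RMA producer-consumer exchange: the data itself is transferred with MPI_Put, but the consumer still needs a separate two-sided message just to learn that the data has arrived.

    // Conventional RMA producer-consumer: data via MPI_Put, notification via an
    // extra message. Notified Access proposes to combine both into one operation.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double* buf;
        MPI_Win win;
        // Each process exposes one double through an RMA window.
        MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &buf, &win);
        *buf = 0.0;
        MPI_Win_lock_all(0, win);

        if (rank == 0) {                       // producer
            double value = 42.0;
            MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
            MPI_Win_flush(1, win);             // complete the put at the target
            int flag = 1;                      // extra message, only for synchronization
            MPI_Send(&flag, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {                // consumer
            int flag;
            MPI_Recv(&flag, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Win_sync(win);                 // make the remotely written value visible
            std::printf("received %f\n", *buf);
        }

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }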

Reference: Roberto Belli, Torsten Hoefler: Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization. In: Proceedings of 29th IEEE International Parallel & Distributed Processing Symposium, pp. 871-881. May, 2015.

Supervisor: Tim Cramer

An overview of Fenix - Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales

Application resilience is one of the key challenges in realizing the exascale (>= 10^18 FLOP/s) vision. High-performance systems today already consist of millions of cores, and future systems are likely to be even more complex. With vastly increasing core counts, the MTTF (mean time to failure) decreases rapidly. While today's systems have MTTFs on the order of days, future exascale systems will most likely have MTTFs on the order of minutes. This requires scientific applications to be designed to deal with node failures. Fenix is a framework designed to enable recovery from such failures in an online (i.e., without disrupting the job) and transparent manner.
This seminar thesis should give an overview of the resilience techniques implemented by Fenix and briefly describe the integration into an application example.

Reference: Marc Gamell, Daniel S. Katz, Hemanth Kolla, Jacqueline Chen, Scott Klasky, Manish Parashar: Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales. In: SC '14 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 895-906. Nov 2014.

Supervisor: Felix Münchhalfen

Efficient MPI Implementation for Shared-Memory Architectures

Despite its portability and ability to run on a broad set of parallel architectures, MPI remains a suboptimal choice of programming paradigm when it comes to the efficient use of shared-memory multi-core compute nodes. The OS-level process separation forces MPI to unnecessarily copy message data even though the processes share the same physical memory. MPI-3.0 comes with provisions for the portable use of shared memory, but it requires the use of the somewhat obscure API for one-sided operations and forces (sometimes big) changes to the existing code base.
This seminar article and talk are expected to provide an overview of the research on alternative MPI implementations that specifically target shared-memory architectures, like thread-based MPI and shared process heaps. The technicalities of each implementation should be presented and the most important pros and cons should be discussed.
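
For context, the MPI-3.0 shared-memory provisions mentioned above look roughly as follows: ranks on the same node allocate a shared window and can then access each other's segments with plain loads and stores (minimal sketch, error handling omitted):

    // MPI-3.0 shared-memory window: node-local ranks share one allocation and
    // read each other's data directly instead of exchanging messages.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        // Group the ranks that share physical memory (i.e., one compute node).
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);

        // Every rank contributes one double to a node-local shared window.
        double* my_segment;
        MPI_Win win;
        MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                                node_comm, &my_segment, &win);
        *my_segment = 100.0 + node_rank;
        MPI_Win_fence(0, win);   // simple synchronization before reading neighbors

        // Query a pointer to rank 0's segment and read it directly.
        MPI_Aint size;
        int disp_unit;
        double* rank0_segment;
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &rank0_segment);
        std::printf("rank %d sees rank 0's value %f\n", node_rank, *rank0_segment);

        MPI_Win_fence(0, win);
        MPI_Win_free(&win);
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }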

Reference: A. Friedley, G. Bronevetsky, A. Lumsdaine, T. Hoefler, Hybrid MPI: Efficient Message Passing for Multi-core Systems. In: IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC13), pp. 18:1-18:11, ISBN: 978-1-4503-2378-9, Nov. 2013.

Supervisor: Hristo Iliev

Debugging Interface for Parallel Runtime Libraries

In HPC we use parallel programming paradigms like OpenMP or MPI. Some of these paradigms extend the base programming language; others just provide and implement an API. Commonly, the implementation comes with a runtime library that might be supplied by a hardware vendor. When it comes to debugging, debuggers have problems understanding the internals of such a runtime library. To be helpful, the debugger needs support from the runtime library implementor. Recently, there have been efforts to standardize debugging interfaces for OpenMP as well as for MPI. The debugging interface defined for POSIX Threads might serve as a kind of template for these efforts.
This seminar article and talk are expected to describe the architecture and model of such a debugging interface and how a debugger would use the interface in some common use cases.

Reference: K. Pouget, M. Pérache, P. Carribault, and H. Jourdren: User level DB: a debugging API for user-level thread libraries. In: Proc. IPDPS Workshops, 2010, pp.1-7.

Supervisor: Joachim Protze

Are ARM Processors ready for HPC?

Moving towards the exascale era, reducing the power consumption of HPC systems becomes more and more important. Recent investigations into improving energy efficiency cover the use of clusters of low-power ARM processors in the domain of HPC.
In this seminar thesis, it shall be examined whether ARM processors are ready for HPC. The thesis shall look at recent developments in research projects and technologies. Further, it shall give an overview of different aspects such as the tradeoffs between performance, energy, cost and usability, and give a critical outlook on what must be improved in the future.

Supervisor: Sandra Wienke

An overview of a parallel eigensolver by the example of the ELPA library

Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in many areas of computational science. Since the computational effort scales as O(N^3) with the matrix dimension N, the development of efficient parallel algorithms is highly desirable.
The seminar talk is expected to review some current developments regarding dense eigenvalue solvers and then focus on the Eigenvalue soLvers for Petascale Applications (ELPA) library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices on parallel computer platforms.

Reference: A Marek, V Blum, R Johanni, V Havu, B Lang, T Auckenthaler, A Heinecke, H-J Bungartz and H Lederer: The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science. In: J. Phys.: Condens. Matter 26 (2014) 213201

Supervisor: Uliana Alekseeva

Energy-Efficient Data Movement and Network Usage in High Performance Computing

Distributed HPC programming models, such as MPI, cause large amounts of data movement over the network, which consumes a lot of energy. In the light of green computing and efforts to reduce energy bills, a need arises for optimized data movement strategies and an energy-efficient utilization of the network. Achieving both minimal energy consumption and best system performance at the same time is impossible, so one can only search for the best tradeoff.
The challenge is to find an efficient data movement scheme or an optimal network configuration that reduces the energy consumption as far as possible without severely affecting the performance.

Reference: S. Jana, O. Hernandez, S. Poole and B. Chapman: Power Consumption Due to Data Movement in Distributed Programming Models. In: Euro-Par 2014 Parallel Processing, Springer International Publishing, 2014, pp. 366-378.

Supervisor: Bo Wang

DI-MMAP - a scalable memory-map runtime for out-of-core data-intensive applications

Persistent memory technologies, for example direct I/O-bus-attached non-volatile RAM such as flash arrays today, and STT-RAM, PCM, or memristors in the future, provide new opportunities for existing system software solutions. DI-MMAP is a high-performance runtime that memory-maps large external data sets into an application's address space to increase performance.
The seminar participant is expected to present DI-MMAP, its basic approach, applications, and performance potential. Another task is to give an outlook on emerging NVRAM technologies and their role in HPC.
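
Applications use DI-MMAP through the familiar memory-mapping interface. The following sketch shows the standard POSIX mmap pattern it builds on (plain mmap, not DI-MMAP's own runtime; the file name dataset.bin is made up):

    // Memory-mapping idea: a large data file is mapped into the address space
    // and accessed like an ordinary array; pages are loaded from storage on
    // demand, so the data set may exceed physical memory.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        const char* path = "dataset.bin";   // hypothetical out-of-core data set
        int fd = open(path, O_RDONLY);
        if (fd < 0) { std::perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        // Map the whole file; it may be much larger than physical memory.
        void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { std::perror("mmap"); return 1; }

        const double* data = static_cast<const double*>(base);
        double sum = 0.0;
        for (std::size_t i = 0; i < st.st_size / sizeof(double); ++i)
            sum += data[i];                 // page faults trigger I/O on demand
        std::printf("sum = %f\n", sum);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }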

Reference: Brian Van Essen, Henry Hsieh, Sasha Ames, Roger Pearce, Maya Gokhale, DI-MMAP - a scalable memory-map runtime for out-of-core data-intensive applications, Cluster Computing, March 2015, Volume 18, Issue 1, pp 15-28


Supervisor: Pablo Reble

Kokkos: Enabling many-core performance portability

Kokkos implements a programming model in C++ for writing performance-portable applications targeting all major HPC platforms. It provides abstractions for the parallel execution of code and for data management.
The seminar thesis should cover a presentation of Kokkos.
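
To give an impression of the programming model, a minimal Kokkos-style sketch is shown below (based on the API described in the reference; details such as overloads and defaults may differ between Kokkos versions):

    // Minimal Kokkos sketch: the same parallel_for runs on whichever backend
    // Kokkos was configured for (OpenMP threads, CUDA, ...), and Views place
    // the data in the matching memory space.
    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
        Kokkos::initialize(argc, argv);
        {
            const int n = 1000000;
            Kokkos::View<double*> x("x", n), y("y", n);  // data managed by Kokkos

            // The parallel loop is expressed once and mapped to the active
            // execution space at compile/run time.
            Kokkos::parallel_for(n, KOKKOS_LAMBDA(const int i) {
                x(i) = 1.0;
                y(i) = 2.0 * x(i);
            });
        }
        Kokkos::finalize();
        return 0;
    }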

Reference: H. Carter Edwards, Christian R. Trott, Daniel Sunderland: Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing Volume 74, Issue 12, December 2014, Pages 3202-3216

Supervisor: Pablo Reble

The RAJA Portability Layer

RAJA is a programming approach developed at LLNL to encapsulate platform-specific concerns related to both hardware and parallel programming models. It provides abstractions for the parallel execution of code and for data management.
In this seminar thesis, RAJA should be presented.
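
A rough sketch of the RAJA style is given below (not taken from the report; policy and segment names vary between RAJA versions, so treat the details as assumptions):

    // RAJA-style loop: the body is written once, and the execution policy
    // template parameter selects how it runs (sequentially, OpenMP, CUDA, ...).
    #include <RAJA/RAJA.hpp>
    #include <vector>

    int main() {
        const int n = 1000000;
        std::vector<double> x(n, 1.0), y(n, 0.0);
        double* xp = x.data();
        double* yp = y.data();

        // Swapping RAJA::seq_exec for an OpenMP or CUDA policy changes how the
        // loop executes without touching the loop body itself.
        RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, n), [=](int i) {
            yp[i] += 2.0 * xp[i];
        });

        return 0;
    }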

Reference: R. D. Hornung, J. A. Keasler (2014). The RAJA Portability Layer: Overview and Status (No. LLNL-TR-661403). Lawrence Livermore National Laboratory (LLNL), Livermore, CA.

Supervisor: Christian Terboven

Instructors

Prof. Matthias S. Müller
Uliana Alekseeva
Tim Cramer
Hristo Iliev
Felix Münchhalfen
Joachim Protze
Pablo Reble
Dirk Schmidl
Christian Terboven
Bo Wang
Sandra Wienke
