Current Topics in High-Performance Computing (HPC)

Content

High-performance computing (HPC) is applied to speed up long-running scientific applications, for instance the simulation of computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but come in different facets: from clusters and (large) shared-memory systems to accelerators (e.g. GPUs). Leveraging these systems requires parallel computing with e.g. MPI, OpenMP or CUDA.
This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics may cover e.g. parallel computer architectures (multicore systems, Xeon Phis, GPUs etc.), parallel programming models, performance analysis & correctness checking of parallel programs, or performance modeling.
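As a purely illustrative sketch (not part of the seminar material), the hybrid style mentioned above might look as follows: each MPI rank spawns OpenMP threads and reports where it runs.

    /* Minimal hybrid MPI + OpenMP example (illustrative only).
     * Compile e.g. with: mpicc -fopenmp hello.c -o hello */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, size;
        /* Request thread support because OpenMP threads live inside each rank. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        #pragma omp parallel
        {
            printf("rank %d/%d, thread %d/%d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }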

Schedule

The topics are assigned at the beginning of the lecture period (25th April 2017, 10pm - 11:30pm). The students then work on their topics over the course of the semester. The corresponding presentations take place as a block course on one day at the end of the lecture period or at the beginning of the exam period. Attendance is compulsory. More information is available in L²P: https://www3.elearning.rwth-aachen.de/ss17/17ss-53769

Requisites

The goals of a seminar series are described in the corresponding Bachelor and Master modules. In addition to the seminar thesis and its presentation, Master students will have to lead one set of presentations (roughly 1-3 presentations) as session chair. A session chair makes sure that the session runs smoothly. This includes introducing the title of the presentation and its authors, keeping track of the speaking time and leading a short discussion after the presentation. Further instructions will be given during the seminar.
Prerequisites
Attendance of the lecture "Introduction to High-Performance Computing" (Müller) is helpful, but not required.
Language
We prefer and encourage students to write the report and give the presentation in English; however, German is also possible.
Topics

An automated interactive infrastructure and database for scientific high-throughput simulations

Computer simulations are nowadays an essential part of research in science and industry. In many areas there is a demand for performing thousands of routine simulations, which produce a huge amount of data that must be managed, stored and analyzed. The throughput capacity of current HPC architectures makes this possible, but it calls for the development of concepts and tools to organize calculations and data. This is quite challenging: the managing infrastructure must be flexible yet easy to use. Besides, it has to ensure reproducibility of the results and allow for sharing of workflows within the community. The seminar paper and talk are expected to discuss the general requirements that any such infrastructure should meet to create, manage, analyze and share data and simulations, and how these have been addressed in AiiDA, the automated interactive infrastructure and database for computational science.
Supervisor: Uliana Alekseeva

Standard-compliant Performance Analysis for OpenMP on Heterogeneous Architectures

The demand for ever more compute resources in high-performance computing (HPC) has led to a trend towards heterogeneous systems in current and future supercomputers. Furthermore, the number of features of common programming paradigms like OpenMP keeps increasing in order to improve the programmability of these systems. To improve performance analysis in a portable and vendor-independent fashion, the OpenMP Architecture Review Board (ARB) published a Technical Report (TR4) which integrates a standard-compliant tools interface (OMPT) into the OpenMP specification. This includes the monitoring of the execution of an application on a target device as well as the transfer of the collected performance information. Recent research shows that this approach is applicable not only to GPGPUs or the Intel Xeon Phi coprocessor, but also to FPGAs.
This seminar article and talk are expected to provide a general overview of OMPT with regard to the support for accelerator devices. Furthermore, a detailed evaluation of an implementation for an FPGA has to be discussed.
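As a rough illustration of what an OMPT-based tool looks like, the following sketch registers a callback for target regions. It follows the omp-tools.h interface of OpenMP 5.0, into which TR4 eventually evolved; names and details may differ slightly from the TR4 draft discussed in the paper.

    /* Sketch of an OMPT tool skeleton (OpenMP 5.0 omp-tools.h interface,
     * which evolved from TR4; TR4 itself may differ in details). */
    #include <omp-tools.h>
    #include <stdio.h>

    /* Callback invoked when a target region begins or ends on a device. */
    static void on_target(ompt_target_t kind, ompt_scope_endpoint_t endpoint,
                          int device_num, ompt_data_t *task_data,
                          ompt_id_t target_id, const void *codeptr_ra) {
        printf("target region %s on device %d\n",
               endpoint == ompt_scope_begin ? "begin" : "end", device_num);
    }

    static int tool_initialize(ompt_function_lookup_t lookup,
                               int initial_device_num, ompt_data_t *tool_data) {
        ompt_set_callback_t set_callback =
            (ompt_set_callback_t)lookup("ompt_set_callback");
        set_callback(ompt_callback_target, (ompt_callback_t)on_target);
        return 1; /* non-zero: keep the tool active */
    }

    static void tool_finalize(ompt_data_t *tool_data) {}

    /* The OpenMP runtime looks for this symbol at startup to activate the tool. */
    ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                              const char *runtime_version) {
        static ompt_start_tool_result_t result = {&tool_initialize,
                                                  &tool_finalize, {0}};
        return &result;
    }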
Supervisor: Tim Cramer

Elimination of Unnecessary Data Transfers by Translating OpenMP Target Constructs into OpenCL

The introduction of heterogeneous programming models such as NVIDIA CUDA or OpenCL is driven by the current development of heterogeneous hardware. Since the application of these programming paradigms is typically difficult and complicated, other approaches such as OpenMP 4.x or OpenACC have been introduced in order to lower the programming complexity. Unfortunately, for none of these approaches does an implementation exist which supports all kinds of accelerators. Furthermore, the expressiveness of less complex paradigms might limit the performance optimization opportunities. In particular, it might be non-trivial to avoid unnecessary data transfers between a host and an accelerator. In order to overcome these issues, solutions for the source-to-source translation from one paradigm into another have emerged.
This seminar article and talk are expected to give a detailed overview of and discussion on such a framework translating OpenMP target constructs into OpenCL. Furthermore, a discussion of the automatic performance optimization potential is expected.
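For illustration, the following C/OpenMP 4.x sketch (our own example, not taken from the paper) shows the kind of unnecessary transfer such a framework tries to eliminate: an enclosing target data region keeps the array resident on the device, so the two kernels do not each trigger a full host-device round trip.

    #include <stdio.h>
    #define N 1000000

    int main(void) {
        static double a[N];
        for (int i = 0; i < N; ++i) a[i] = (double)i;

        /* Map the array once for both kernels instead of per target region. */
        #pragma omp target data map(tofrom: a[0:N])
        {
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < N; ++i) a[i] *= 2.0;

            #pragma omp target teams distribute parallel for
            for (int i = 0; i < N; ++i) a[i] += 1.0;
        }   /* data is copied back to the host only here */

        printf("a[42] = %f\n", a[42]);
        return 0;
    }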
Supervisor: Tim Cramer

Techniques and Approaches for Job Placement on Supercomputers to Improve Performance and Reduce Runtime Variability

With the increasing usage of today's large supercomputers, job placement plays a significant role for application performance. Especially the performance of big jobs that require a large portion of the cluster's nodes/resources might suffer from being split into multiple spatially separated fragments. Knowledge about the network topology of the cluster and the communication pattern of the application can also be crucial to ensure well-placed jobs without impacting the utilization of the cluster. This seminar topic focuses on describing existing approaches to improve performance and to reduce network latency and runtime variability.
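While the topic concentrates on scheduler-side placement, the application side can expose its communication pattern as well. A minimal, purely illustrative MPI sketch (not tied to any particular scheduler or paper) describes a 2D nearest-neighbour pattern and lets the library reorder ranks to match the network:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int size, dims[2] = {0, 0}, periods[2] = {0, 0};
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Dims_create(size, 2, dims);           /* factor ranks into a 2D grid */

        MPI_Comm cart;
        /* reorder = 1: the library may renumber ranks to fit the topology */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

        int rank, coords[2];
        MPI_Comm_rank(cart, &rank);
        MPI_Cart_coords(cart, rank, 2, coords);
        printf("rank %d -> grid position (%d,%d)\n", rank, coords[0], coords[1]);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }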
Supervisor: Jannis Klinkenberg

Identifying Performance Issues in Parallel Applications using Machine Learning Techniques

Nowadays, commodity computer hardware is not just used in everyday life but is also combined to build state-of-the-art supercomputers. These computers typically have multiple cores. To exploit the full potential and performance of these multicore systems, parallel programming becomes more and more important. Due to the increasing complexity of computer architectures, networks and programming languages for parallel execution, multiple issues can arise that may vastly impede performance, such as false sharing, contention, remote memory accesses on NUMA systems or bad memory access patterns. Although such programs still produce correct results, it is hard to identify whether parallel code has these performance problems, and current performance analysis tools only provide limited information about them. Thus, new ideas have come up, e.g. how to use integrated performance counters and machine learning techniques to identify such performance issues. The focus of this seminar topic is to give an overview of the techniques that are currently investigated and how they work.
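As a small illustration of one of the issues named above, the following C/OpenMP sketch (our own example, not taken from the surveyed papers) provokes false sharing and shows the usual fix of padding per-thread data to separate cache lines (assuming 64-byte lines):

    #include <omp.h>
    #include <stdio.h>

    #define NTHREADS 8
    #define ITERS    50000000L

    volatile long counters[NTHREADS];                 /* adjacent: false sharing */
    struct { volatile long value; char pad[64 - sizeof(long)]; } padded[NTHREADS];

    int main(void) {
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; ++i) counters[id]++;      /* slow   */
        }
        double t1 = omp_get_wtime();
        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; ++i) padded[id].value++;  /* faster */
        }
        double t2 = omp_get_wtime();
        /* Both variants compute the same counts; only the layout differs. */
        printf("shared line: %.2fs, padded: %.2fs\n", t1 - t0, t2 - t1);
        return 0;
    }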
Supervisor: Jannis Klinkenberg

Chances of thread-safe languages for HPC

Data races are a common threat in parallel computing. Only in recent years have thread-safe languages like Rust evolved. While these languages conceptually prevent data races, using such a language also means a shift in the programming paradigm.
This seminar work will give an overview of the techniques leading to thread-safety. Furthermore, the work will show whether these techniques are applicable to HPC programming and explain what is missing in the language to implement HPC applications in Rust.
We encourage a comparison of C/C++ and Rust code and doing your own measurements.
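For orientation, here is a minimal C example (our own, assuming POSIX threads) of the kind of data race that Rust's ownership and Send/Sync rules reject at compile time:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                 /* shared, unsynchronized */

    static void *work(void *arg) {
        for (int i = 0; i < 1000000; ++i)
            counter++;                       /* racy read-modify-write */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Typically prints less than 2000000 because increments get lost;
         * the exact outcome of this data race is undefined. */
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }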
Supervisor: Joachim Protze

An Efficient Algorithm for On-the-Fly Data Race Detection Using an Epoch-Based Technique

Data races represent the most notorious class of concurrency bugs in multithreaded programs. To detect data races precisely and efficiently during the execution of multithreaded programs, the epoch-based FastTrack technique has been employed. However, FastTrack has time and space complexities that depend on the maximum parallelism of the program to partially maintain expensive data structures, such as vector clocks. This paper presents an efficient algorithm, called iFT, that uses only the epochs of the access histories. Unlike FastTrack, the algorithm requires O(1) operations to maintain an access history and locate data races, without any switching between epochs and vector clocks. We implement this algorithm on top of the Pin binary instrumentation framework and compare it with other on-the-fly detection algorithms, including FastTrack, which uses a state-of-the-art happens-before analysis algorithm. Empirical results using the PARSEC benchmarks show that iFT reduces the average runtime and memory overhead to 84% and 37%, respectively, of those of FastTrack.
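As a conceptual illustration only (not the iFT implementation), an epoch is a single (thread, clock) pair, so the frequent happens-before check costs O(1) instead of a full vector-clock comparison:

    #include <stdbool.h>

    #define MAX_THREADS 64

    typedef struct { int tid; unsigned clock; } epoch_t;        /* one recorded access */
    typedef struct { unsigned c[MAX_THREADS]; } vector_clock_t; /* per-thread view     */

    /* Does the recorded access 'e' happen before the current thread's view 'vc'? */
    static bool happens_before(epoch_t e, const vector_clock_t *vc) {
        return e.clock <= vc->c[e.tid];          /* O(1) epoch comparison */
    }

    /* A previous write races with the current access unless it happens-before it. */
    static bool is_race(epoch_t last_write, const vector_clock_t *current) {
        return !happens_before(last_write, current);
    }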
Supervisor: Joachim Protze

A comparative survey of the HPC and Big Data Paradigms

Most scientific applications are compute-intensive applications that run on HPC clusters and make use of paradigms like MPI and OpenMP for parallel performance. Big Data is an emerging paradigm in which applications deal with large amounts of data on commodity hardware and use technologies like Spark and Hadoop for parallel performance. However, with the march towards exascale, there is an increasing overlap between the characteristics and requirements of the HPC and Big Data paradigms. HPC applications have to deal with large amounts of I/O and heterogeneous hardware, and need to be resilient. Therefore, there is a need to look at the concepts and technologies of the two paradigms to tackle these impending challenges. This seminar topic will look at the comparative studies that have been done on the topic and provide an assessment of the technologies.
Supervisor: Aamer Shah

PIPES: A Language and Compiler for Task-based Programming on Distributed-Memory Clusters

Most HPC applications are programmed in MPI and OpenMP, following an SPMD model. An alternative way of programming parallel applications is to divide them into tasks and to specify their data and control flows. Such an approach can increase productivity but faces difficulties in the form of task mapping and scheduling, impeding high-performance execution. The PIPES language and compiler infrastructure, based on the Intel Concurrent Collections (CnC) runtime, tackles these problems by allowing the user to define virtual topologies, task and data mappings, and scheduling priorities. The framework also offers optimisations based on task coarsening. This seminar topic will look at the PIPES framework, providing an overview and eliciting its strengths and weaknesses by comparing it with conventional approaches.
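PIPES itself targets distributed-memory clusters on top of Intel CnC; purely as a shared-memory illustration of the task/data-flow idea described above, OpenMP tasks with dependences let the programmer state the data flow and leave scheduling to the runtime:

    #include <stdio.h>

    int main(void) {
        int a = 0, b = 0, c = 0;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: a)
            a = 1;                                /* produces a            */
            #pragma omp task depend(out: b)
            b = 2;                                /* produces b            */
            #pragma omp task depend(in: a, b) depend(out: c)
            c = a + b;                            /* runs after both tasks */
            #pragma omp taskwait
            printf("c = %d\n", c);
        }
        return 0;
    }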
Supervisor: Aamer Shah

Power tuning based upon performance counters

Power tuning of individual applications is key to energy-efficient computing. Since applications differ from each other with respect to performance and power consumption, it is very important to find a relationship between the two, as we want to save power while keeping performance as high as possible. One idea to clarify this relationship is based upon performance counters, which are integrated into modern hardware and are widely used for tracing the performance of applications at runtime at a fine granularity. There are lots of such counters; however, none of them gives a direct mapping between performance and power. In this work, you should investigate these counters and select and combine them in a way that a strategy for power tuning can be derived.
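As a starting point, hardware counters can be read with a library such as PAPI (one possible tool; the topic text does not prescribe it). The following sketch measures instructions and cycles around a region of interest, the kind of raw input from which a power-tuning strategy could be derived:

    #include <papi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI init failed\n");
            return EXIT_FAILURE;
        }

        int eventset = PAPI_NULL;
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_TOT_INS);   /* retired instructions */
        PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles         */

        long long values[2];
        PAPI_start(eventset);

        /* Region of interest: some compute kernel would run here. */
        volatile double x = 0.0;
        for (long i = 0; i < 10000000L; ++i) x += 1e-9 * i;

        PAPI_stop(eventset, values);
        printf("instructions: %lld, cycles: %lld, IPC: %.2f\n",
               values[0], values[1], (double)values[0] / (double)values[1]);
        return 0;
    }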
Supervisor: Bo Wang

Job scheduling under a power cap

Besides the computing capability of a cluster, its power consumption is becoming another limitation, as this consumption is rising more rapidly than the infrastructure can be upgraded. Clusters' power consumption will be capped in the future. Under this assumption, two issues must be solved: how to cap the power and how to maximize the throughput of the cluster. As the central resource manager, the job scheduler of a cluster must be extended to take care of both issues. Currently, new power features are being added to a range of job schedulers, such as Slurm and LSF. In this work, you should investigate the power features of job schedulers and understand the rationale behind them in a scientific way.
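As a toy illustration of the two issues (not Slurm or LSF code; all names and numbers are made up), a scheduler could greedily start queued jobs only while their estimated power draw stays below a global cap:

    #include <stdio.h>

    typedef struct { const char *name; double est_power_w; int running; } job_t;

    /* Start as many waiting jobs as fit under the cap; return the power in use. */
    static double schedule_under_cap(job_t *queue, int njobs, double cap_w) {
        double used_w = 0.0;
        for (int i = 0; i < njobs; ++i)
            if (queue[i].running) used_w += queue[i].est_power_w;

        for (int i = 0; i < njobs; ++i) {
            if (!queue[i].running && used_w + queue[i].est_power_w <= cap_w) {
                queue[i].running = 1;
                used_w += queue[i].est_power_w;
                printf("starting %s (%.0f W)\n", queue[i].name, queue[i].est_power_w);
            }
        }
        return used_w;
    }

    int main(void) {
        job_t queue[] = { {"cfd", 4000.0, 0},
                          {"md",  2500.0, 0},
                          {"fft", 3000.0, 0} };
        double used = schedule_under_cap(queue, 3, 7000.0);
        printf("power in use: %.0f W of 7000 W cap\n", used);
        return 0;
    }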
Supervisor: Bo Wang

Instructors

Prof. Matthias S. Müller
Uliana Alekseeva
Tim Cramer
Jannis Klinkenberg
Joachim Protze
Dirk Schmidl
Aamer Shah
Bo Wang
Sandra Wienke

Contact: contact@hpc.rwth-aachen.de