Current Topics in High-Performance Computing (HPC)

Bachelor/ Master

 

Students in Bachelor and Master programs attend the (compulsory) seminar events together. This enables mutual learning where the seminar presentations offer insights into a wide range of HPC-related topics. Bachelor and Master students have to follow slightly different rules (see below). All students are individually supervised.

Content

High-performance computing (HPC) is applied to speed up long-running scientific applications, for instance simulations in computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but come in different forms, ranging from clusters and (large) shared-memory systems to accelerators (e.g., GPUs). To leverage these systems, parallel programming with, e.g., MPI, OpenMP or CUDA must be applied, while meeting constraints on power consumption.
This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics might cover, e.g., parallel computer architectures (multicore systems, GPUs, etc.), parallel programming models, performance analysis & correctness checking of parallel programs, performance modeling or energy efficiency of HPC systems.

Schedule

This seminar belongs to the area of applied computer science. The topics are assigned in the introductory event at the beginning of the lecture period (April 3rd, 2019, 9.30am - 11.30am, Seminarraum 003, IT Center, Kopernikusstr. 6). The students then work on their topics over the course of the semester. The corresponding presentations take place as a block course on one day (or two days) at the end of the lecture period or in the exam period. Attendance is compulsory for the introductory event and the presentation block. Furthermore, students will have to attend an additional 1.5-hour course on "Scientific Writing in Computer Science". This course extends the official introductory course of the computer science department with practical tips and guidelines on how to write a scientific thesis. The time of the course will be announced shortly. Participation is compulsory for Bachelor students and strongly recommended for Master students.

More information is available in RWTHmoodle.

Registration/ Application

Seats for this seminar are distributed by the global registration process of the computer science department only. We appreciate it if you state your interest in HPC, and also your prior knowledge of HPC (e.g., relevant lectures, software labs, and seminars that you have passed), in the corresponding section during the registration process.

Requirements

The goals of a seminar series are described in the corresponding Bachelor and Master modules. In addition to the seminar thesis and its presentation, Master students will have to lead one set of presentations (roughly 3 presentations) as session chair. A session chair makes sure that the session runs smoothly. This includes introducing the title of the presentation and its authors, keeping track of the speaker time and leading a short discussion after the presentation. Further instructions will be given during the seminar.

Prerequisites

Attendance of the lecture "Introduction to High-Performance Computing" (Prof. Müller) is helpful, but not required.

Language

We prefer and encourage students to write the report and give the presentation in English, but German is also possible.

Topics

Hardware-accelerated data race detection

Typical dynamic data race detection tools have a high runtime overhead of 2-100x. A new approach makes use of hardware transactional memory to detect a data race and then performs further analysis to verify the actual race. This can significantly reduce the runtime overhead of data race analysis.

The focus of this seminar thesis is an analysis of the drawbacks of this approach as well as its applicability to typical workloads in scientific computing.

Supervisor: Joachim Protze


Data race detection for OpenMP tasks, a qualitative comparison

Classical data race detection tools do not respect the concurrency of task-parallel programs. Last year three approaches to detect data races in OpenMP task-parallel programs were presented: TaskSanitizer, ROMP and Archer with TLC-support.

The focus of this seminar thesis is a comparison of the different analysis approaches, highlighting the different techniques used for the analysis. Based on the different analysis approaches, do you expect corner cases which are not covered by an approach?

Supervisor: Joachim Protze

 

Data race detection for OpenMP tasks, a quantitative comparison

Classical data race detection tools do not respect the concurrency of task-parallel programs. Last year three approaches to detect data races in OpenMP task-parallel programs were presented: TaskSanitizer, ROMP and Archer with TLC-support.

The focus of this seminar thesis is a comparison of the results for the different approaches. What is the analysis cost for the different tools (this can be measured on the cluster)? Are there corner cases which are understood by one tool but not by another? I.e., does one approach provide false positive/negative results where others don't?

Supervisor: Joachim Protze


Analysis of Task-Parallel Execution Behavior and Performance Anomalies

Due to the increasing computational demand of scientific and industrial applications, parallel programming receives more and more attention.
By distributing the workload across the processing units of the multi- and many-core architectures that are commonly used in state-of-the-art supercomputers, the time to solution can be reduced significantly. In contrast to classical concepts, task-based programming paradigms allow the parallelization of irregular and recursive applications and offer higher flexibility and better opportunities for load balancing. However, identifying suboptimal execution behavior or performance anomalies becomes a challenging task. Whereas hardware performance measurements can shed some light on hardware behavior, it is not always clear how that data correlates to task performance.

This seminar topic focuses on techniques to analyze the behavior of task-parallel applications and to find hardware behavior or events that slow down execution. These insights can further be used to recommend optimizations or modifications to increase performance.

Supervisor: Jannis Klinkenberg


Runtime-Assisted Cache Coherence Optimizations for Task Parallel Programs

Core counts of shared-memory architectures used in Cloud and High Performance Computing setups tend to increase over time.
To speed up accesses to data (load and store), these cores are usually equipped with a hierarchy of caches to avoid slower accesses to main memory wherever possible. To keep shared data consistent, directory-based cache coherence solutions keep track of data that is currently cached at the core or socket level. With higher core counts, the size of these cache directories and the work for managing and searching them increase as well, and so does the power consumption. In contrast, data that is guaranteed to be used by a single core only does not necessarily need to be covered by cache coherence. However, hardware built-in heuristics and classification methods to detect those data entries often suffer from inaccuracy and require complex hardware support.

This thesis addresses solutions that improve the classification of privately used data in order to decrease the size and complexity of cache directories, which can lead to power savings and faster directory accesses.

Supervisor: Jannis Klinkenberg

 

The Impact of Taskyield on the Inter-Node Task Communication

Traditionally, parallelizing work across multiple processing units can roughly be divided into the following two categories. On the outer level, work is distributed between multiple compute nodes, usually with explicit message and data transfers (inter-node). In order to exploit the full potential of each shared-memory node, work is then distributed across the available sockets and cores, usually by using threading approaches (intra-node). Hybrid approaches combine these two aspects of inter- and intra-node parallelization. The shared-memory parallelization API OpenMP allows task-based parallelization, which is especially useful for irregular problems or, in this case, for hybrid applications whose inter-process communication can be executed by tasks. To hide communication latencies, OpenMP provides a taskyield construct that potentially allows a thread to execute other tasks while the communication completes. However, the standard provides only few guarantees about the characteristics of the taskyield construct.

This thesis focuses on the taskyield behavior of different OpenMP implementations and the impact it has on the design of communication-heavy hybrid applications.

Supervisor: Jannis Klinkenberg


Code Metrics for HPC Applications

Modern computer architectures are not only becoming increasingly parallel, but the trend is also towards the use of various kinds of accelerators for specialized computations. Such heterogeneous environments force applications to combine several programming models, probably implemented non-uniformly as language extensions, compiler directives, libraries, and others. Ensuring maintainability and portability of those codes is a key challenge for the domain of high-performance computing (HPC).

This seminar should briefly introduce code metrics, their origin in the field of software engineering (SE) and the general conflict of deriving meaning from a numerical metric. As part of this first task, a collection of typical code metrics shall be explained and evaluated with respect to their applicability in the HPC domain. The second part and the main focus of the seminar shall be the derivation of specific code metrics for parallel code. In particular, those metrics could measure the parallel fraction, the use of programming models, and the frequency and dominance of parallel patterns. If possible, a complexity measure for parallel code could be sketched and its application briefly explained (with respect to the motivation).
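As a purely illustrative starting point, the use of a programming model can be approximated by a simple count. The sketch below computes the share of non-blank source lines that are OpenMP directives. The function name `openmp_directive_density` and the metric itself are hypothetical and not taken from any established metric suite; the check is deliberately naive (it only recognizes lines starting with `#pragma omp`).

```python
def openmp_directive_density(source: str) -> float:
    """Return the share of non-blank lines that are OpenMP directives.

    Hypothetical metric for illustration only: a line counts as a
    directive if, after stripping whitespace, it starts with '#pragma omp'.
    """
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    directives = [line for line in lines if line.startswith("#pragma omp")]
    return len(directives) / len(lines)


# Small C snippet used as input: one directive among four non-blank lines.
example = """
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
}
"""

print(openmp_directive_density(example))
```

Such a count says nothing about how much runtime the annotated region covers, which is one instance of the general difficulty of deriving meaning from a numerical metric.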

Supervisor: Julian Miller


Taxonomy of Parallel Patterns

Software design patterns describe commonly occurring problems in the design and development of software and provide a general and reusable solution to them. Analogously to the software engineering domain, there exists a multitude of parallel patterns with varying definitions and problem solutions.

In this seminar thesis, the different patterns and definitions shall be grouped and uniformly described. In a second step, the patterns shall be classified by the type of parallelism they describe and their scaling behavior shall be analyzed. If possible, architectural differences influencing the scaling behavior could be integrated, which could be verified through experiments on, e.g., the different architectures available at the RWTH compute cluster.

Supervisor: Julian Miller


Source-to-source Compilers for HPC Applications

While the lifetime of typical scientific codes can reach decades, multi-core and many-core processors evolve very quickly. This requires large efforts in restructuring and optimizing codes for modern computer architectures. One possibility to reduce these efforts is the use of source-to-source compilers which could then transform, optimize and parallelize code for specific architectures.

In this seminar thesis, the commonly used techniques of compilers to transform code shall be investigated. A focus on transforming algorithmic classes and parallel patterns shall be given. If possible, different source-to-source compilers can be compared.

Supervisor: Julian Miller


The Future of Vector Computing: Performance Evaluation of New Approaches

In High Performance Computing (HPC), vector architectures have quite a long tradition. Although the emergence of x86 commodity clusters interrupted this trend for a while, we see a renaissance of SIMD capabilities in all modern architectures. Not only the success of general GPU computation during the last decade, but also the trend to longer SIMD registers in common CPUs contributes to this development. New kinds of accelerators like the vector engine SX-Aurora TSUBASA underline the importance of vector computing for modern processor designs.

This seminar article and talk are expected to give a detailed overview of the principles of vector computing and how it is implemented in the SX-Aurora TSUBASA, including the hardware and the execution model. Furthermore, they are expected to discuss existing performance evaluations and compare the results with other architectures with vector capabilities (e.g., Intel Xeon or Nvidia GPUs). With respect to metrics like the bytes/flop ratio of the given codes, the seminar candidate should give an overview of which code characteristics are promising for such a new architecture.
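To make the bytes/flop metric concrete, the following minimal sketch computes it for a triad kernel and derives a simple upper bound on performance. The bound uses a roofline-style estimate, which is an assumption introduced here for illustration and not something prescribed by the topic. The machine parameters `peak_gflops` and `bandwidth_gbs` are hypothetical values and do not describe the SX-Aurora TSUBASA or any other real system.

```python
def bytes_per_flop(bytes_moved: float, flops: float) -> float:
    """Ratio of memory traffic to floating-point operations of a kernel."""
    return bytes_moved / flops


def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      code_bytes_per_flop: float) -> float:
    """Roofline-style bound: limited by compute peak or by memory bandwidth."""
    return min(peak_gflops, bandwidth_gbs / code_bytes_per_flop)


# Triad kernel a[i] = b[i] + s * c[i] in double precision:
# two loads and one store of 8 bytes each, one multiply and one add.
# (Write-allocate traffic is ignored in this simplified count.)
triad = bytes_per_flop(bytes_moved=3 * 8, flops=2)

# Hypothetical machine: 2000 GFlop/s peak, 1000 GB/s memory bandwidth.
peak_gflops = 2000.0
bandwidth_gbs = 1000.0
bound = attainable_gflops(peak_gflops, bandwidth_gbs, triad)

print(f"triad: {triad} bytes/flop, bound: {bound:.1f} GFlop/s")
```

Under these assumed numbers the triad is limited by memory bandwidth, which is why codes with a high bytes/flop ratio tend to benefit from architectures with high memory bandwidth.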

Supervisor: Tim Cramer


Runtime Data Management on Non-Volatile Memory-based Heterogeneous Memory for HPC Programs

Emerging non-volatile memory (NVM) technologies promise scalable and power-efficient storage for high performance computing (HPC) systems. With increasing bandwidth and decreasing latency, they have the potential to replace DRAM as main memory. However, as long as a performance gap to DRAM still exists, a heterogeneous memory system (HMS) can be used in order to increase the memory capacity for typical HPC workloads.

This seminar article and talk are expected to present a lightweight runtime solution called "Unimem", which automatically and transparently manages data placement on HMS. This includes a discussion of the underlying performance model used for the online profiling. Furthermore, they are expected to discuss the performance results of this solution for typical HPC benchmarks.

Supervisor: Tim Cramer


Parallel Graph-based What-If Analyses for OpenMP Programs

OpenMP is the de-facto standard for parallel shared-memory programming. In order to reach good performance and scalability, performance tools can be used to analyze the runtime behavior. The tools interface introduced in OpenMP 5.0 enables developers to implement tools in a standard-compliant way. Furthermore, it is possible to profile the runtime phases and represent these phases as a parallel graph. A tool can perform what-if analyses based on this parallel graph in order to estimate improvements in parallelism.

This seminar article and talk are expected to give an overview of what-if analysis and a parallel graph-based implementation. This especially includes a discussion of an appropriate performance model and the assessment of a given performance evaluation.
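The idea of a what-if analysis on a parallel graph can be sketched as follows. The sketch assumes a simplified model in which the graph is a DAG of phases with known durations and the runtime is the length of the critical path; this model, the function names and the example graph are assumptions for illustration and do not reproduce the method of any specific tool. It requires Python 3.9 or newer for `graphlib`.

```python
from graphlib import TopologicalSorter


def critical_path_length(durations, predecessors):
    """Length of the longest path through a DAG of phases.

    durations:    dict mapping each node to its duration
    predecessors: dict mapping a node to the nodes it depends on
    """
    graph = {node: predecessors.get(node, ()) for node in durations}
    finish = {}
    for node in TopologicalSorter(graph).static_order():
        # A node starts when its latest predecessor has finished.
        start = max((finish[p] for p in graph[node]), default=0.0)
        finish[node] = start + durations[node]
    return max(finish.values())


def what_if(durations, predecessors, node, factor):
    """Critical path length if `node` took `factor` times its duration."""
    changed = dict(durations)
    changed[node] = durations[node] * factor
    return critical_path_length(changed, predecessors)


# Hypothetical graph: A precedes B and C, both of which precede D.
durations = {"A": 2.0, "B": 5.0, "C": 3.0, "D": 1.0}
predecessors = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}

baseline = critical_path_length(durations, predecessors)
halved_b = what_if(durations, predecessors, "B", 0.5)
print(baseline, halved_b)
```

In this example, halving the duration of B shortens the critical path by less than the time saved in B, because the path through C becomes the new critical path. Exposing such shifts is what a what-if analysis is meant to do.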

Supervisor: Tim Cramer


Instructors

Tim Cramer
Jannis Klinkenberg
Julian Miller

Joachim Protze
Sandra Wienke

Contact: contact@hpc.rwth-aachen.de