The IT Center operates high-performance computers to support the university's institutions and employees in education and research.
All machines are integrated into one “RWTH Compute Cluster” running under the Linux operating system.
This area describes the general usage of the RWTH Compute Cluster, whereas information about programming the high-performance computers can be found in RWTH Compute Cluster - Parallel Programming.
All members of RWTH Aachen University have free access to the RWTH Compute Cluster, but the amount of resources they can use is limited. Above a certain threshold, an application for more resources has to be submitted, which is then reviewed. This application process is also open to external German scientists at institutions related to education and research. Please find related information here.
Please find information about how to get access to the system here.
You can get information about using and programming the RWTH Compute Cluster online on this website, in our Primer, which you can download as a single PDF file for printing, or during our HPC-related events. For many of these events, particularly tutorials, we also collect related material on our website - see here. In addition, there are regular lectures, exercises and software labs of the Chair for HPC covering related topics.
For several months, the team operating the cluster experienced poor Lustre performance. In a number of cases, the bandwidth dropped to kilobytes per second where gigabytes per second are expected. Several attempts to resolve the issue in cooperation with the involved suppliers failed, especially as the problems were difficult to reproduce. As time went by, the pressure to resolve the issue grew, and the supplier requested that a number of OmniPath network components be updated to the latest versions. On October 17th, access to $HPCWORK stalled, and even though measures were applied the following day, $HPCWORK could not be brought back into service. A closer look revealed problems in 6 out of 32 Object Storage Targets (OSTs) due to inconsistencies in the underlying file system (ZFS) layer.
Although the number of defective OSTs was small, the data at risk amounted to 50%, because files in the parallel file system are striped across multiple OSTs. In the following days, ZFS experts were contacted and advised the team working on the recovery process. In the end, the provider succeeded by importing, scrubbing and exporting the OSTs using the original (v0.6.5.11) and the latest (v0.8) version of ZFS. The recovery process took five days and reduced the data at risk to 22% after repairing four OSTs, and to approximately 1% after recovering all six OSTs. The recovery of the last two OSTs in particular took several days, as the data could not be repaired in place but had to be transferred during the process. Random samples of 10% of the complete system showed no damaged files. A final file system check will be run to validate this result. During this check-up, the performance of the Lustre file system will be degraded.
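The gap between "6 of 32 OSTs defective" and "50% of data at risk" comes from striping: a file is at risk as soon as any one of its stripes lands on a defective OST. The following sketch illustrates this effect; the stripe counts shown are illustrative assumptions, not the actual configuration of $HPCWORK.

```python
from math import comb

def fraction_at_risk(total_osts: int, defective_osts: int, stripe_count: int) -> float:
    """Fraction of files at risk if each file is striped over `stripe_count`
    distinct, randomly chosen OSTs: a file is safe only if every one of its
    stripes lands on a healthy OST."""
    healthy = total_osts - defective_osts
    p_safe = comb(healthy, stripe_count) / comb(total_osts, stripe_count)
    return 1.0 - p_safe

# 6 defective OSTs out of 32, as in the incident described above.
# Example stripe counts (assumed values for illustration only):
for stripes in (1, 2, 4, 8):
    print(f"stripe count {stripes}: {fraction_at_risk(32, 6, stripes):.1%} of files at risk")
```

With a stripe count of 1 only 6/32 ≈ 19% of files would be affected, but already at a stripe count of 3 to 4 the at-risk fraction approaches the reported 50%, which is why wide striping trades fault isolation for bandwidth.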
As a whole, the Lustre file system in combination with other cluster components forms a complex system that makes it possible to offer large amounts of storage at high bandwidth. This offering is deliberately combined with the decision not to back up the contents of $HPCWORK.