A Top500 (LINPACK) run on Windows
April 4, 2008
Over 18 TFlop/s Linpack performance achieved running the latest pre-release of Windows HPC Server 2008
The BIOS update of all nodes of our recently installed Xeon cluster gave us a nice opportunity to sanity-check the whole cluster by running the Linpack benchmark. We were amazed that 2048 cores can deliver over 75% of the theoretical peak performance for a single application. We employed the unmodified HPL code to solve a linear system with 681,984 unknowns in a little over 3 hours at an average speed of more than 18,000,000,000,000 floating point operations per second (18 TFlop/s). The program had a total memory footprint of almost 4 terabytes. The whole cluster consists of 256 compute nodes, each equipped with 2 brand new quad-core Intel Xeon E5450 processors running at a clock speed of 3 GHz. All nodes are connected to each other by a fast Infiniband fabric with a 288-port switch from Cisco. The cluster nodes were built by Fujitsu-Siemens and delivered by Unicorner.
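The figures above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the benchmark setup itself: it assumes that the dense matrix is stored in 8-byte double precision and that the operation count is dominated by the leading term (2/3)·N³ of an LU factorization, with lower-order terms neglected. The function names are made up for illustration.

```python
# Rough plausibility check for the HPL run described above.
# Assumptions (not taken from the original post): 8-byte doubles,
# flop count approximated by the leading term (2/3) * N**3.

N = 681_984            # number of unknowns, from the text
RATE_FLOPS = 18e12     # "more than 18 TFlop/s", from the text


def matrix_bytes(n: int, bytes_per_element: int = 8) -> int:
    """Memory needed to hold a dense n x n matrix."""
    return n * n * bytes_per_element


def lu_flops(n: int) -> float:
    """Leading-order operation count of an LU-based solve."""
    return (2.0 / 3.0) * n ** 3


def runtime_hours(n: int, rate: float) -> float:
    """Estimated wall-clock time in hours at a sustained flop rate."""
    return lu_flops(n) / rate / 3600.0


if __name__ == "__main__":
    print(f"matrix size : {matrix_bytes(N) / 1e12:.2f} TB")
    print(f"operations  : {lu_flops(N):.3e} flop")
    print(f"run time    : {runtime_hours(N, RATE_FLOPS):.2f} h at 18 TFlop/s")
```

Under these assumptions the matrix alone occupies roughly 3.7 TB and the solve takes a bit more than 3 hours, which is consistent with the numbers quoted above.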
The following components contributed to this excellent first result:
- the Microsoft Visual Studio 2005 C++ compiler used to compile the HPL code,
- the Intel Math Kernel Library (MKL 10.0.2), a highly tuned mathematical library employed to squeeze every machine cycle out of the fast Xeon processors when multiplying matrices,
- the fast Infiniband network supplied by Cisco together with
- the fast implementation of the message passing interface (MPI) based on the Network Direct API, which is part of the brand new Windows HPC Server 2008 release,
- and last but not least the kind and competent support of Xavier Pillons, a Microsoft performance expert, who carefully put all the pieces together and adjusted a myriad of parameters, encouraged by the staff members of the Center for Computing and Communication.
Round About Midnight ...
... when pizza and beer provided the right inspiration to the benchmarking team, the machines began to speed up, getting closer and closer to their peak performance. Cranking up the problem size a little further this morning delivered the best results so far. There may still be some headroom, but a good benchmarker needs to know when to stop and return the cluster to production.
The benchmarking team in the picture (right to left): Xavier Pillons (Microsoft), Christian Terboven (RZ), Dominik Friedrich (RZ), Michael Wirtz (RZ).
We were really impressed by the job startup performance on this Windows cluster, as launching a large MPI application was significantly faster than what we were used to from our Linux systems (we are working on improving the situation). We created a short video showing how an MPI job with 2048 processes starts on a cluster with 256 nodes in about 10 (!!) seconds: Video: Linpack startup on Windows HPC Server 2008 (beta).
We are using an Excel-based application (written by Xavier Pillons from Microsoft) to submit and control the LINPACK jobs with different parameters. So in the first part of the video, you will see how the input parameters are typed into the Excel spreadsheet. Then we click "Submit Job" and "Check Status" and see that the Excel application has noticed that the job status reported by the scheduler has changed to running. We then switch to the Cluster Health Monitor application, which shows the status of all nodes. A white box means the machine is idle. A brownish color indicates that a hard fault has occurred. The greener the box gets, the more the CPU is loaded. After we have switched from Excel to the status application, you see the tail end of a "swoosh": that was the job scheduler preparing the machines for execution of the job. It then takes some time for the job to actually start (the boxes are white), then the first boxes turn green, until in a final "swoosh" all boxes are green as the CPUs reach 100% load (the job is running)!
Final result: Rank 100 in Top500
Our final LINPACK run achieved a performance of 18.81 TFlop/s, out of 24.58 TFlop/s theoretical peak performance (= 256 nodes * 8 cores per node * 3.0 GHz * 4 results per cycle). The efficiency of "only" 77% exposed some issues with the Infiniband network fabric (hardware!), which still have to be fully resolved. Nevertheless, this performance has led to rank 100 in the current Top500 list of June 2008!
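For readers who want to redo the arithmetic, here is a small sketch of the peak and efficiency calculation using only the numbers quoted above. The function and variable names are chosen for illustration.

```python
# Theoretical peak and HPL efficiency, from the figures in the text.

NODES = 256
CORES_PER_NODE = 8          # 2 quad-core processors per node
CLOCK_HZ = 3.0e9            # 3.0 GHz
FLOPS_PER_CYCLE = 4         # "4 results per cycle" per core
MEASURED_FLOPS = 18.81e12   # final LINPACK result


def peak_flops(nodes: int, cores: int, clock_hz: float, per_cycle: int) -> float:
    """Theoretical peak: every core retires per_cycle results each cycle."""
    return nodes * cores * clock_hz * per_cycle


def efficiency(measured: float, peak: float) -> float:
    """Fraction of the theoretical peak that was actually sustained."""
    return measured / peak


if __name__ == "__main__":
    peak = peak_flops(NODES, CORES_PER_NODE, CLOCK_HZ, FLOPS_PER_CYCLE)
    print(f"peak       : {peak / 1e12:.2f} TFlop/s")
    print(f"efficiency : {efficiency(MEASURED_FLOPS, peak) * 100:.1f} %")
```

The peak comes out at about 24.58 TFlop/s and the efficiency at roughly 76.5%, which rounds to the 77% stated above.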