5–6 Nov 2019
IT4Innovations
Europe/Prague timezone

Optimization of Computationally and I/O Intense Patterns in Electronic Structure and Machine Learning Algorithms.

Not scheduled
3h
atrium (IT4Innovations)

Studentská 1B, 708 33 Ostrava-Poruba
Poster · Poster session · Conference Dinner & Poster Session

Speaker

Dr Michal Pitoňák (Computing Center, Centre of Operations of the Slovak Academy of Sciences)

Description

Development of scalable High-Performance Computing (HPC) applications is a challenging task already in the pre-Exascale era. Utilizing the full potential of (near-)future supercomputers will most likely require mastery of massively parallel heterogeneous architectures with multi-tier persistence systems, ideally operated in a fault-tolerant mode. With the change in hardware architectures, HPC applications are also widening their scope to 'Big Data' processing and analytics using machine learning algorithms and neural networks.

In this work we summarize our experience with the PGAS programming model GASPI for building highly scalable, parallel and fault-tolerant HPC applications. We have implemented GASPI and MPI versions of selected algorithms relevant both in the HPC realm (e.g. computational quantum chemistry) and in the 'Big Data' world, which has lately made its way to supercomputers. We utilized GASPI's asynchronous one-sided operations to overlap computation and communication, its interoperability with MPI to use MPI-IO for efficient I/O on the Lustre parallel file system, and its time-out mechanism to detect and recover from node failures. All these GASPI features are essential for a programming tool with the ambition to utilize the full potential of future Exascale systems.
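
The overlap and time-out mechanisms mentioned above can be illustrated with a minimal GPI-2 (GASPI) sketch. This is not the code used in this work; the segment layout, buffer sizes, notification ids and the neighbour rank are placeholder assumptions. The one-sided gaspi_write_notify returns immediately, so local computation proceeds while the data is in flight, and a finite timeout on gaspi_notify_waitsome turns a silent hang on a failed partner into a recoverable return code.

```c
/* Minimal sketch of GASPI overlap + timeout handling (GPI-2 C API).
 * Segment id, offsets, sizes and the neighbour rank are illustrative only. */
#include <GASPI.h>
#include <stdio.h>
#include <stdlib.h>

#define SEG_ID     0
#define BLOCK_SIZE (1 << 20)          /* 1 MiB payload, placeholder */

static void die(const char *msg) { fprintf(stderr, "%s\n", msg); exit(EXIT_FAILURE); }

int main(void)
{
  gaspi_rank_t rank, nprocs;

  if (gaspi_proc_init(GASPI_BLOCK) != GASPI_SUCCESS) die("gaspi_proc_init");
  gaspi_proc_rank(&rank);
  gaspi_proc_num(&nprocs);

  /* One segment holds both the send half (offset 0) and the receive half. */
  if (gaspi_segment_create(SEG_ID, 2 * BLOCK_SIZE, GASPI_GROUP_ALL,
                           GASPI_BLOCK, GASPI_MEM_INITIALIZED) != GASPI_SUCCESS)
    die("gaspi_segment_create");

  const gaspi_rank_t neighbour = (rank + 1) % nprocs;

  /* Asynchronous one-sided write + notification: returns immediately,
   * so the compute kernel below overlaps with the transfer. */
  if (gaspi_write_notify(SEG_ID, 0, neighbour,
                         SEG_ID, BLOCK_SIZE, BLOCK_SIZE,
                         /* notification id    */ rank,
                         /* notification value */ 1,
                         /* queue              */ 0,
                         GASPI_BLOCK) != GASPI_SUCCESS)
    die("gaspi_write_notify");

  /* ... local computation here, overlapped with the communication ... */

  /* Wait for the neighbour's data with a finite timeout instead of
   * GASPI_BLOCK; a GASPI_TIMEOUT return is the hook for failure
   * detection and recovery. */
  gaspi_notification_id_t got;
  gaspi_return_t ret = gaspi_notify_waitsome(SEG_ID, 0, nprocs, &got,
                                             /* timeout in ms */ 10000);
  if (ret == GASPI_TIMEOUT) {
    /* neighbour presumed failed: trigger recovery instead of hanging */
    fprintf(stderr, "rank %d: timeout waiting for notification\n", rank);
  } else if (ret == GASPI_SUCCESS) {
    gaspi_notification_t val;
    gaspi_notify_reset(SEG_ID, got, &val);   /* consume the notification */
  }

  gaspi_wait(0, GASPI_BLOCK);                /* flush the queue before exit */
  gaspi_proc_term(GASPI_BLOCK);
  return 0;
}
```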

The GASPI implementation of parallel matrix multiplication (or tensor contraction) is competitive with the state-of-the-art parallel DGEMM (PDGEMM) routine distributed with the Intel MKL package. Its performance with a large number of parallel processes even appears to exceed that of PDGEMM, but more precise and extensive benchmarking is required: wall-clock timings may be sensitive to the particular assignment of ranks to physical nodes, the topology of the underlying network, and its utilization during the tests. The GASPI program itself is remarkably simple and uses collective or blocking operations only marginally.
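
The basic structure of such a GASPI multiplication can be sketched as a double-buffered block loop. The row-block distribution of A, B and C, the segment layout (each rank's resident B block at offset 0) and the function name below are illustrative assumptions, not the authors' implementation: remote B blocks are pulled with one-sided gaspi_read calls into a spare buffer while cblas_dgemm (Intel MKL) works on the current one, so no collective and almost no blocking communication is needed.

```c
/* Schematic double-buffered distributed multiply (illustrative sketch).
 * Each of P ranks owns an (n/P) x n row block of A, B and C; the B block
 * needed next is pulled from its owner while the current one is multiplied.
 * Segment layout per rank: [ myB | buf0 | buf1 ], each nb*n doubles. */
#include <GASPI.h>
#include <mkl_cblas.h>      /* cblas_dgemm (any CBLAS header would do) */

void block_dgemm(gaspi_segment_id_t seg, const double *A, double *C,
                 const double *myB, double *buf[2], gaspi_offset_t buf_off[2],
                 int nb, int n, gaspi_rank_t rank, gaspi_rank_t nprocs)
{
  const gaspi_size_t blk_bytes = (gaspi_size_t)nb * n * sizeof(double);

  for (int step = 0; step < (int)nprocs; ++step) {
    int owner = ((int)rank - step + (int)nprocs) % (int)nprocs;
    int cur   = step % 2;

    /* Prefetch the block needed in the next step from its owner's resident
     * copy (remote offset 0); this overlaps with the dgemm below.  The
     * resident copies are never modified, so no synchronization is needed. */
    if (step + 1 < (int)nprocs) {
      int next = (owner - 1 + (int)nprocs) % (int)nprocs;
      gaspi_read(seg, buf_off[1 - cur], (gaspi_rank_t)next,
                 seg, 0, blk_bytes, 0, GASPI_BLOCK);
    }

    /* C += A(:, columns of the owner's block rows) * B_owner               */
    const double *Bcur = (step == 0) ? myB : buf[cur];
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                nb, n, nb, 1.0, A + owner * nb, n, Bcur, n, 1.0, C, n);

    if (step + 1 < (int)nprocs)
      gaspi_wait(0, GASPI_BLOCK);   /* the prefetched block is now local     */
  }
}
```

Pulling from read-only resident copies is one way to avoid any handshake on the receive buffers; only the final gaspi_wait per step is blocking, and no collective operations appear in the inner loop.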

Two selected machine learning algorithms, K-means and Terasort, were implemented using MPI and GASPI. In the case of K-means, we also implemented a fault-tolerant version with a non-shrinking recovery strategy and checkpointing, made simple by the GPI CP library. Similarly to the parallel matrix multiplication, the timings obtained for the GASPI code indicate better parallel scaling, especially for larger numbers of ranks. We have addressed two of the three outstanding features of JVM Big Data frameworks (such as Apache Spark) that are not readily provided by MPI: fault tolerance and the use of parallel (fault-tolerant) file systems (Lustre instead of HDFS). The last, but certainly not least, advantage of Apache Spark remains unchallenged: ease of use. Although GASPI provides a truly straightforward (minimum-boilerplate) approach to asynchronous PGAS HPC programming, its performance benefits may not be enough to motivate developers to sacrifice high-level abstractions such as RDDs. Retaining high developer productivity while keeping the performance benefits of MPI/GASPI is not only desirable but also appears to be possible, as shown by hybrid approaches such as GPI-Space or Spark+MPI.
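
For reference, the communication pattern of a distributed K-means iteration is compact enough to sketch. The baseline below is the generic MPI formulation (local assignment plus one MPI_Allreduce over per-cluster sums and counts), not the authors' GASPI or fault-tolerant code; the reduced centroids are the small global state that a checkpoint, e.g. via GPI CP, would save between iterations.

```c
/* Generic MPI K-means iteration (baseline sketch, not the authors' code).
 * Each rank owns a slice of the points; partial per-cluster sums and counts
 * are combined with a single MPI_Allreduce. */
#include <mpi.h>
#include <float.h>
#include <stdlib.h>

/* points: n_local x dim (row-major); centroids: k x dim, updated in place */
void kmeans_step(const double *points, int n_local, int dim,
                 double *centroids, int k)
{
  double *sums   = calloc((size_t)k * dim, sizeof *sums);
  long   *counts = calloc((size_t)k, sizeof *counts);

  /* 1. local assignment: accumulate each point into its nearest centroid */
  for (int i = 0; i < n_local; ++i) {
    int best = 0; double best_d = DBL_MAX;
    for (int c = 0; c < k; ++c) {
      double d = 0.0;
      for (int j = 0; j < dim; ++j) {
        double diff = points[i * dim + j] - centroids[c * dim + j];
        d += diff * diff;
      }
      if (d < best_d) { best_d = d; best = c; }
    }
    for (int j = 0; j < dim; ++j) sums[best * dim + j] += points[i * dim + j];
    counts[best]++;
  }

  /* 2. one global reduction combines all ranks' partial sums and counts   */
  MPI_Allreduce(MPI_IN_PLACE, sums,   k * dim, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  MPI_Allreduce(MPI_IN_PLACE, counts, k,       MPI_LONG,   MPI_SUM, MPI_COMM_WORLD);

  /* 3. new centroids; this small state is what a checkpoint would save    */
  for (int c = 0; c < k; ++c)
    if (counts[c] > 0)
      for (int j = 0; j < dim; ++j)
        centroids[c * dim + j] = sums[c * dim + j] / counts[c];

  free(sums); free(counts);
}
```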

Primary author

Dr Michal Pitoňák (Computing Center, Centre of Operations of the Slovak Academy of Sciences)

Co-author

Dr Marián Gall (Computing Center, Centre of Operations of the Slovak Academy of Sciences)

Presentation materials

There are no materials yet.