9 November 2021
IT4Innovations
Europe/Prague timezone

HyperQueue: Simplifying Usage of PBS/SLURM Clusters

9 Nov 2021, 14:00
30m
Online (IT4Innovations)

Poster (Poster session)

Speaker

Jakub Beránek (IT4Innovations)

Description

In recent years, HPC workloads and communities have undergone substantial paradigm shifts. A growing number of users want to leverage HPC clusters to execute many simple, embarrassingly parallel tasks as easily as possible. However, due to the limitations of traditional HPC job managers, these users must often resort to manually aggregating tasks into a smaller number of jobs to reduce job manager overhead. This approach is both labour-intensive and inefficient, as it lacks the dynamic load balancing required to fully utilize computational nodes with tens or hundreds of cores. We introduce HyperQueue, a task scheduling runtime that can execute a large number of tasks on top of an HPC job manager by automatically aggregating tasks into jobs and dynamically load balancing them across all allocated nodes and CPU cores. HyperQueue is an open-source tool designed for ease of use and deployment.
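
To make the load-balancing argument concrete, here is a minimal Python sketch comparing manual aggregation into fixed chunks with dynamic assignment of each task to the next free worker. It is only an illustration under assumed conditions and does not reflect HyperQueue's actual scheduler or API: the function names, the task durations, and the worker count are all hypothetical.

```python
import heapq

def static_makespan(durations, workers):
    """Manual aggregation (illustrative): split the task list into fixed,
    contiguous chunks, one chunk per worker, and return the longest chunk."""
    chunk = -(-len(durations) // workers)  # ceiling division
    return max(sum(durations[i:i + chunk])
               for i in range(0, len(durations), chunk))

def dynamic_makespan(durations, workers):
    """Dynamic balancing (illustrative): each task is handed to whichever
    worker becomes free first, tracked with a min-heap of finish times."""
    free_at = [0] * workers
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest available worker
        heapq.heappush(free_at, start + d)  # worker is busy until start + d
    return max(free_at)

# Hypothetical workload: four long tasks followed by eight short ones.
durations = [8, 8, 8, 8, 1, 1, 1, 1, 1, 1, 1, 1]

print(static_makespan(durations, 4))   # 24: one chunk receives three long tasks
print(dynamic_makespan(durations, 4))  # 10: long tasks are spread across workers
```

With these assumed durations, the fixed chunks leave three workers mostly idle while one processes three long tasks, whereas the dynamic assignment keeps all four workers busy until the end.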

Primary authors

Stanislav Böhm (IT4Innovations), Jakub Beránek (IT4Innovations), Vojtech Cima, Roman Machacek, Vyomkesh Jha, Alfred Koci, Jan Martinovic, Branislav Jansik
