[ONSITE] GPU Programming - CUDA

Europe/Prague

Room 207, IT4Innovations National Supercomputing Center
Studentská 6231/1b, 708 00 Ostrava-Poruba
Description

Annotation

The main goal of the course is to introduce participants to programming GPU-accelerated applications with CUDA.

We will describe the main principles of heterogeneous (accelerated) computing, including a short description of the hardware of GPU-accelerated supercomputers, needed for a proper understanding of how to design CUDA code.

The course is designed for beginners in GPU programming with CUDA. Using basic examples, it explains how parallelisation is done, how data transfers between CPU and GPU memory are managed, what types of memory a GPU offers and how to use them, and how parallel threads are executed; finally, we cover several key parallel computing patterns in CUDA.
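
To give a concrete flavour of the hands-on parts, the following is a minimal sketch (written for this announcement, not taken from the course material) of a first CUDA program: element-wise vector addition with explicit host-to-device and device-to-host transfers and a one-thread-per-element kernel launch.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes one element of the output vector.
__global__ void vector_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard threads past the end of the array
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Allocate and initialise host buffers.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device buffers and copy the inputs host -> device.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch one thread per element, rounded up to whole blocks.
    const int block = 256;
    const int grid = (n + block - 1) / block;
    vector_add<<<grid, block>>>(d_a, d_b, d_c, n);

    // Copy the result device -> host and check one value.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}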

As the course will use the Karolina supercomputer, we will also demonstrate how to write single- and multi-GPU applications.
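
For the multi-GPU part, a hedged sketch of the simplest single-node pattern is shown below: the host loops over the available devices, selects each one with cudaSetDevice, and gives it its own buffer and kernel launch. The kernel, names, and sizes are illustrative assumptions for this announcement, not course code.

#include <cuda_runtime.h>

// Trivial kernel used only to illustrate per-device launches.
__global__ void scale(float *x, int n, float alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= alpha;
}

int main()
{
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Each GPU works on its own, independent chunk of data.
    for (int d = 0; d < device_count; ++d) {
        cudaSetDevice(d);                        // subsequent runtime calls target device d
        float *x;
        cudaMalloc((void **)&x, bytes);
        cudaMemset(x, 0, bytes);
        scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);
        cudaDeviceSynchronize();                 // wait for this device to finish
        cudaFree(x);
    }
    return 0;
}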

Level

Intermediate in parallel programming, beginner in CUDA

Language

English

Prerequisites

Participants should be familiar with programming in either C/C++ or Fortran. Furthermore, they should be able to work on the Linux command line. Basic experience with the development of scientific code is preferable, and basic knowledge of parallel programming is an advantage.

About the tutor

Lubomír Říha, Ph.D. is the Head of the Infrastructure Research Lab at IT4Innovations National Supercomputing Center. Previously, he was a senior researcher in the Parallel Algorithms Research Lab at IT4Innovations, and a research scientist in the High Performance Computing Lab at George Washington University, ECE Department. He received his Ph.D. and M.Sc. degrees in Electrical Engineering from the Czech Technical University in Prague, the Czech Republic, in 2011, and his Ph.D. degree in Computer Science from Bowie State University, USA. Currently, he is a local principal investigator of two EuroHPC projects: SCALABLE and EUPEX (which designs a prototype of the European exascale machine). Previously, he was a local principal investigator of the H2020 Center of Excellence POP2 and H2020-FET HPC READEX projects and an investigator of the FP7 EXA2CT project and the Intel Parallel Computing Center. He is also a co-founding developer of the ESPRESO finite element library, which includes a parallel sparse solver designed for supercomputers with tens or hundreds of thousands of cores, with support for GPUs, Intel Xeon Phi, and other modern accelerators. His research interests are optimization of HPC applications, energy-efficient computing, acceleration of scientific and engineering applications using GPUs and many-core accelerators, development of scalable linear solvers, parallel and distributed rendering on new HPC architectures, and signal and image processing.

Milan Jaroš, Ph.D. is a researcher at IT4Innovations National Supercomputing Center. He has 10 years of experience in professional programming (C++, C#, Java, etc.) and has developed several commercial software products, including mobile applications. In recent years, he has focused on research in HPC (including support for GPUs and the Intel Xeon Phi coprocessor), medical image processing, and scientific data visualization (virtual reality, rendering, CFD postprocessing, etc.). He participates in the development of plugins for various software packages (Blender, COVISE/OpenCOVER, Unity, Monado, etc.).

Jakub Homola is a Ph.D. student of Computational Science and a Research Assistant at IT4Innovations, VSB-TUO. He graduated in Computational and Applied Mathematics at VSB - Technical University of Ostrava, specializing in Computational Methods and HPC. His research and professional interests are FEM-based numerical methods and GPU programming in CUDA and HIP. He is partially involved in the EUPEX project, where he works with the ESPRESO library to optimize it for future ARM CPUs.

Kristian Kadlubiak is a researcher in the INFRA Lab of IT4Innovations National Supercomputing Center, where he is responsible for designing and developing various acceleration and optimization techniques in the flagship application ESPRESO. He specializes in parallel and vector processing, accelerator offloading, and performance tuning in general. He holds a master's degree in embedded and computer systems from the Brno University of Technology, where he is also involved as a Ph.D. student. In his studies, he is developing modifications of the Local Fourier Basis (LFB) method to adapt it for efficient use on HPC systems.

 

Acknowledgements

This course is supported by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ project (ID: 90140).

    • 08:45 – 09:00
      Arrival of participants, check-in
    • 09:00 – 10:30
      Heterogeneous Parallel Computing and CUDA Parallelism Model
      1. Heterogeneous Parallel Computing
      2. GPU Architecture
      3. Hands-on: Accessing GPU-accelerated nodes
      4. Hands-on: Benchmark HW properties
      5. CUDA Programming
    • 10:30 – 10:45
      Coffee Break 15m
    • 10:45 – 12:00
      CUDA programming – Memory and Data Locality I
      1. Hands-on: Hello World in CUDA
      2. CUDA Programming cont.
      3. Hands-on: Vector Addition (single GPU, two versions)
    • 12:00 – 13:00
      Lunch Break 1h
    • 13:00 – 14:15
      CUDA programming – Memory and Data Locality II
      1. Multi-GPU programming
      2. Hands-on: Vector Addition (multi-GPU, two versions)
      3. Multi-Dimensional Grids
      4. Hands-on: Image Blur
      5. Thread Execution
      6. CUDA Memories
    • 14:15 – 14:30
      Coffee Break 15m
    • 14:30 – 15:30
      CUDA programming – Parallel Computation Patterns I
      1. Global Memory
      2. Matrix Sum (live demo)
      3. Shared Memory
      4. Memory and Data Locality: Tiling Technique
      5. Parallel Computation Patterns: Stencil
      6. Hands-on: Stencil
    • 15:30 – 15:45
      Coffee Break 15m
    • 15:45 – 16:45
      CUDA programming – Parallel Computation Patterns II
      1. Parallel Computation Patterns: Reduction
      2. Hands-on: Reduction
      3. Parallel Computation Patterns: Histogram
      4. Histogram (live demo)
      5. Efficient Host-Device Data Transfer and CUDA Streams
    • 16:45 – 17:00
      Closing remarks, Q&A