[HYBRID] Data Parallelism: How to Train Deep Learning Models on Multiple GPUs (EuroCC)

Start: 2023-10-04T08:45:00+02:00
End: 2023-10-04T16:00:00+02:00
Location: 207

Wednesday 4 Oct 2023, 08:45 → 16:00 Europe/Prague


Description

Annotation

Modern deep learning workloads rely on increasingly large datasets and increasingly complex models, so training them effectively and efficiently requires significant computational power. Learning to distribute data across multiple GPUs during training opens up a wide range of new deep learning applications.

Using multi-GPU systems effectively also reduces training time, which speeds up application development and iteration cycles. Teams that can train on multiple GPUs have an edge: they can train models on more data in less time and with greater engineering productivity.

This workshop teaches you techniques for data-parallel deep learning training on multiple GPUs to shorten the training time required for data-intensive applications. Working with deep learning tools, frameworks, and workflows to perform neural network training, you’ll learn how to decrease model training time by distributing data to multiple GPUs while retaining the accuracy of training on a single GPU.
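The idea of retaining single-GPU accuracy can be illustrated without any GPU at all. The following is a minimal pure-Python sketch, not workshop material: it assumes a toy one-parameter model `y_hat = w * x` with a mean-squared-error loss, and simulates the workers as list shards. The helper names (`mse_gradient`, `shard`, `data_parallel_gradient`) are hypothetical and chosen only for this illustration. It shows that averaging per-worker gradients over equally sized shards gives the same gradient as one pass over the full batch.

```python
# Pure-Python sketch of data-parallel gradient averaging (no GPUs needed).
# Assumed toy model: y_hat = w * x, loss = mean squared error.


def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def shard(items, num_workers):
    """Split a list into num_workers contiguous, equally sized shards."""
    assert len(items) % num_workers == 0, "batch must divide evenly"
    size = len(items) // num_workers
    return [items[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_gradient(w, xs, ys, num_workers):
    """Each simulated worker computes a gradient on its own shard.

    The final average stands in for the all-reduce step that a real
    multi-GPU setup would perform across devices.
    """
    x_shards = shard(xs, num_workers)
    y_shards = shard(ys, num_workers)
    local_grads = [mse_gradient(w, sx, sy) for sx, sy in zip(x_shards, y_shards)]
    return sum(local_grads) / num_workers


# Toy data generated from y = 2 * x; the starting weight is deliberately wrong.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

single = mse_gradient(w, xs, ys)
parallel = data_parallel_gradient(w, xs, ys, num_workers=4)

# The two gradients agree up to floating-point rounding.
print(abs(single - parallel) < 1e-9)
```

The equivalence depends on the shards being equally sized. With uneven shards, a plain average of the local gradients would weight samples differently from the single-device computation.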

In this workshop, attendees will learn how to:

Perform data-parallel deep learning training with multiple GPUs
Maximize training throughput to make the best use of multiple GPUs
Distribute training to multiple GPUs using PyTorch Distributed Data Parallel (DDP)
Understand and utilize algorithmic considerations specific to multi-GPU training performance and accuracy

Tools, libraries, and frameworks: PyTorch, PyTorch Distributed Data Parallel, NVIDIA Collective Communications Library (NCCL)

Level

Advanced

Language

English

Prerequisites

An NVIDIA developer account is required before the event. Please see the section "Practicalities" below.

Hardware requirements: Desktop or laptop computer capable of running the latest version of Chrome or Firefox. Each participant will be provided with dedicated access to a fully configured, GPU-accelerated workstation in the cloud.

Tutor

Georg Zitzlsberger is a research specialist for Machine and Deep Learning at IT4Innovations. For over four years, he has been certified by NVIDIA as a University Ambassador of the NVIDIA Deep Learning Institute (DLI) program. This certification allows him to offer NVIDIA DLI courses to users of IT4Innovations' HPC services. In addition, in collaboration with Bayncore, he has been a trainer for Intel HPC and AI workshops and conferences held across Europe for audiences from industry and academia, contributing to these events for five years. Recently, he also received instructor certifications from Intel for oneAPI-related courses.

Acknowledgments

This project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 101101903. The JU receives support from the Digital Europe Programme and Germany, Bulgaria, Austria, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Finland, Greece, Hungary, Ireland, Italy, Lithuania, Latvia, Poland, Portugal, Romania, Slovenia, Spain, Sweden, France, Netherlands, Belgium, Luxembourg, Slovakia, Norway, Türkiye, Republic of North Macedonia, Iceland, Montenegro, Serbia. This project has received funding from the Ministry of Education, Youth and Sports of the Czech Republic.

This course was supported by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90254).

All presentations and educational materials of this course are provided under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Support

training@it4i.cz

- Time to join the meeting
- Introduction
- Stochastic Gradient Descent
- 10:00 Coffee Break
- Hands-On: Stochastic Gradient Descent
- Introduction to Distributed Training
- 12:00 Lunch Break
- Introduction to Distributed Training
- Algorithmic Challenges of Distributed SGD
- 14:30 Coffee Break
- Hands-On: Algorithmic Challenges of Distributed SGD
- Wrap up and Q&A
