The current wave of advances in Deep Learning (DL) has created many exciting challenges and opportunities for Computer Science and Artificial Intelligence researchers alike. Modern DL frameworks such as Caffe/Caffe2, TensorFlow, Cognitive Toolkit, Torch, and several others offer ease of use and flexibility to describe, train, and deploy various types of Deep Neural Networks (DNNs), including deep convolutional networks. In this tutorial, we will provide an overview of interesting trends in DL and of how cutting-edge hardware architectures are playing a key role in moving the field forward. We will also present an overview of DL frameworks from both an architectural and a performance standpoint. Most DL frameworks have used a single GPU to accelerate DNN training and inference. However, approaches to parallelizing training are also being actively explored, and the DL community has begun to adopt MPI-based parallel and distributed training. We will therefore highlight the new challenges that MPI runtimes face in efficiently supporting DNN training, and describe how we have designed efficient communication primitives in MVAPICH2 to support scalable DNN training. Finally, we will discuss how co-design of the OSU-Caffe framework and the MVAPICH2 runtime enables scale-out of DNN training to 160 GPUs.
Purpose of the course (benefits for the attendees)
1. Help newcomers to the field of distributed Deep Learning (DL) on modern high-performance computing clusters understand the design choices and implementations of several popular DL frameworks.
2. Guide Message Passing Interface (MPI) application researchers, designers, and developers in achieving optimal training performance with distributed DL frameworks like OSU-Caffe, CNTK, and ChainerMN on modern HPC clusters with high-performance interconnects (e.g., InfiniBand), NVIDIA GPUs, and multi-/many-core processors.
3. Demonstrate the impact of advanced optimizations and tuning of CUDA-Aware MPI libraries like MVAPICH2 on DNN training performance through case studies with representative benchmarks and applications.
About the tutor(s)
Dhabaleswar K. (DK) Panda is a Professor and University Distinguished Scholar of Computer Science at the Ohio State University. He obtained his Ph.D. in computer engineering from the University of Southern California. His research interests include parallel computer architecture, high-performance computing, communication protocols, file systems, network-based computing, Big Data, and Deep Learning. He has published over 400 papers in major journals and international conferences related to these research areas.
Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, Omni-Path, HSE and RDMA over Converged Enhanced Ethernet (RoCE). His research group is currently collaborating with National Laboratories and leading InfiniBand, Omni-Path, and Ethernet/iWARP companies on designing various subsystems of next-generation high-end systems. The MVAPICH2 (High Performance MPI over InfiniBand, Omni-Path, iWARP, and RoCE) open-source software package, developed by his research group, is currently being used by more than 2,800 organizations worldwide (in 85 countries). The package is available from http://mvapich.cse.ohio-state.edu. This software has enabled several InfiniBand clusters (including the 1st one) to get into the latest TOP500 ranking. It is also available with the Open Fabrics stack for network vendors (InfiniBand, Omni-Path and iWARP), server vendors and Linux distributors. The RDMA-enabled Apache Hadoop, Spark and Memcached packages, consisting of acceleration for HDFS, MapReduce, RPC and Memcached and support for clusters with Lustre file systems, are publicly available from http://hibd.cse.ohio-state.edu. These libraries are being used by more than 245 organizations in 31 countries.
The group has also been focusing on co-designing Deep Learning frameworks and MPI libraries. A high-performance and scalable version of the Caffe framework is available from the High-Performance Deep Learning (HiDL) Project site (http://hidl.cse.ohio-state.edu). Dr. Panda's research is supported by funding from the US National Science Foundation, the US Department of Energy, the US Department of Defense, and several industry sponsors including Intel, Cisco, SUN, Mellanox, QLogic, Microsoft, NVIDIA and NetApp. He is an IEEE Fellow and a member of ACM.
More details about Dr. Panda, including a comprehensive CV and publications, are available at: http://web.cse.ohio-state.edu/~panda.2/
Dr. Hari Subramoni has been a research scientist in the Department of Computer Science and Engineering at the Ohio State University, USA, since September 2015. His current research interests include high-performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology-aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data and cloud computing. He has published over 50 papers in international journals and conferences related to these research areas, and has been actively involved in professional activities for academic journals and conferences. Dr. Subramoni conducts research on the design and development of the MVAPICH2 (High-Performance MPI over InfiniBand, iWARP, and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC, and CAF)) software packages. He is a member of IEEE.
More details about Dr. Subramoni are available at: http://www.cse.ohio-state.edu/~subramon