Using R for HPC Data Science (IT4I training)

Europe/Prague
207 (VŠB - Technical University Ostrava, IT4Innovations building)

Studentská 6231/1B 708 33 Ostrava–Poruba Czech Republic
Description

Annotation

This two-day course covers topics that arise when using R on large data for data science computations on HPC systems. It is aimed at intermediate-level, current R users who want to speed up and scale up their code on larger platforms. The course begins with some basics but quickly moves on to strategies for making existing R code faster and to the use of parallel HPC platforms. There will be code demonstrations and hands-on exercises.

Purpose of the course (benefits for the attendees)

The course aims to demonstrate and teach how R can be used for truly large data on large systems. While R is a high-level scripting language where small missteps can be costly in execution time and memory use, more careful coding practices can lead to more efficient use of what is often compiled code underneath. The course will cover intermediate to advanced aspects of R and relevant packages for data science aimed at the utilization of high performance computing resources. Participants will learn strategies to make R code faster, methods to include new compiled code, methods to utilize multiple cores for parallel speedup, and methods to distribute data and compute on large distributed platforms with the pbdR package ecosystem.

Participants who are familiar with HPC resources but lack a substantial R background can also benefit. Advance preparation with introductory resources on R, which are widely available on the web and in many books, will be helpful. Consider e.g.

R and its packages are free and open source. R runs on Windows, Mac, and Unix platforms. It can make efficient use of multicore platforms as well as large distributed architectures with thousands of nodes, providing leading data science scalability among technical computing languages. A recent HPCwire story documents complex data science computations with pbdR running orders of magnitude faster than with Spark's MLlib.

Level

intermediate

Language

English

About the tutor(s)

George Ostrouchov is a Senior Scientist at the Oak Ridge National Laboratory and Joint Faculty Professor at the University of Tennessee. For many years his research has focused on the interaction of statistics and high performance computing. He currently leads the pbdR project, recently noted on HPCwire. He is a Fellow of the American Statistical Association.

Acknowledgements

The course is supported by The Ministry of Education, Youth and Sports from the programme Large Infrastructures for Research, Experimental Development and Innovations as part of the project „IT4Innovations National Supercomputing Center – LM2015070“.

  • Thursday, 6 October
    • 09:30 10:00
      Registration
    • 10:00 11:30
      Introduction and basics: RStudio, ggplot2, Markdown/knitr
    • 11:30 13:00
      lunch break 1h 30m
    • 13:00 14:30
      Faster R: Speeding up serial R code
    • 14:30 15:00
      coffee break 30m
    • 15:00 16:30
      Parallel programming paradigms and multicore parallel R
    • 16:30 17:00
      coffee break 30m
    • 17:00 18:00
      Distributed R: SPMD and pbdMPI
  • Friday, 7 October
    • 09:00 10:30
      Matrix computations with pbdDMAT
    • 10:30 11:00
      coffee break 30m
    • 11:00 12:45
      Distributed R: Reading from a Parallel File System
    • 12:45 14:00
      lunch break 1h 15m
    • 14:00 15:30
      TBD (R client/server SPMD and/or distributed+multicore)