[HYBRID] High Performance Data Analysis with R (EuroCC)

Europe/Prague
207 (ONLINE and onsite)

Description

Annotation

This course focuses on data analysis and modeling in the R statistical programming language. The first day introduces how to approach a new dataset to better understand the data and its features. Modeling with the modern set of packages collectively called TidyModels follows. These packages strive to make modeling in R as simple and as reproducible as possible.

The second day focuses on making computation more efficient by introducing Rcpp, which integrates C++ code seamlessly into R code. A simple example of using CUDA with Rcpp will be shown. The afternoon covers parallelizing code with future and/or MPI.

Benefits for the attendees, and what they will learn:

  • Taking the first steps to understand a new dataset
  • Preparing data for modeling
  • Building a standard modeling workflow with modern R packages
  • Speeding up code with C++
  • Parallelizing code and running it on a cluster

Level

intermediate

Language

English

Prerequisites

Some experience with programming in R is required; knowledge of dplyr is an advantage.

Tutor

Tomáš Martinovič obtained his Ph.D. in computational sciences at IT4Innovations, VSB - Technical University of Ostrava in 2018. From 2015 to 2018 he was part of a team focused on the analysis of complex dynamical systems, where he worked on scalable implementations of algorithms from the field of nonlinear time series analysis. Since the start of 2022, he has led a team focused on machine learning/AI and operations research, whose objective is research and knowledge transfer in cooperation with industry.

Acknowledgments

This project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 101101903. The JU receives support from the Digital Europe Programme and Germany, Bulgaria, Austria, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Finland, Greece, Hungary, Ireland, Italy, Lithuania, Latvia, Poland, Portugal, Romania, Slovenia, Spain, Sweden, France, Netherlands, Belgium, Luxembourg, Slovakia, Norway, Türkiye, Republic of North Macedonia, Iceland, Montenegro, Serbia. This project has received funding from the Ministry of Education, Youth and Sports of the Czech Republic.

This course was supported by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90254).

  • Wednesday, 26 April
    • 09:00 - 09:30
      Introduction
    • 09:30 - 10:30
      Exploratory data analysis I
    • 10:30 - 10:45
      Coffee Break (15m)
    • 10:45 - 12:00
      Exploratory data analysis II
    • 12:00 - 13:00
      Lunch (1h)
    • 13:00 - 14:45
      Introduction to modelling using TidyModels
    • 14:45 - 15:00
      Coffee Break (15m)
    • 15:00 - 16:30
      Modelling with TidyModels II
    • 16:30 - 17:00
      Q&A and closing
  • Thursday, 27 April
    • 09:00 - 10:30
      Introduction to Rcpp for speeding up the slow parts of code
    • 10:30 - 10:45
      Coffee Break (15m)
    • 10:45 - 12:00
      Simple example of using CUDA and Rcpp
    • 12:00 - 13:00
      Lunch Break (1h)
    • 13:00 - 14:30
      Using future for parallelization in R
    • 14:30 - 14:45
      Coffee Break (15m)
    • 14:45 - 16:00
      Using MPI and execution of code on HPC cluster
    • 16:00 - 16:30
      Q&A and closing