

### Three ingredients to success

DATACENTER

GATEWAY

EDGE

Vision

Vision

Vision

Optimised Frameworks







Tools





https://software.intel.com/en-us/parallel-studio-xe



https://www.intelnervana.com/



Intel® VTune™ Amplifier

## SPEED UP DEVELOPMENT



using open AI software







#### TOOLKITS App developers



Open source platform for building E2E Analytics & AI applications on Apache Spark\* with distributed TensorFlow\*, Keras\*, BigDL

#### OpenVINO

Deep learning inference deployment on CPU/GPU/FPGA/VPU for Caffe\*, TensorFlow\*, MXNet\*, ONNX\*, Kaldi\*

#### NAUTA

Open source, scalable, and extensible distributed deep learning platform built on Kubernetes (BETA)



**LIBRARIES** Data scientists

#### **Python**

- Scikit-learn
- Pandas
- NumPy

#### **R**

- Cart
- Random
- e1071

#### **Distributed**

- MLlib (on Spark)
- Mahout



#### **Intel-optimized Frameworks**

Caffe2





And more framework optimizations underway including PaddlePaddle\*, Chainer\*, CNTK\* & others









#### Intel® Distribution for Python\*

Intel distribution optimized for machine learning

#### Intel® Data Analytics **Acceleration Library** (DAAL)

High performance machine learning & data analytics library

#### Intel® Math Kernel **Library for Deep Neural Networks (MKL-DNN)**

Open source DNN functions for CPU / integrated graphics



Open source compiler for deep learning model computations optimized for multiple devices (CPU, GPU, NNP) from multiple frameworks (TF, MXNet, ONNX)





### DEPLOY AI ANYWHERE INTEL® AI HARDWARE



**DEVICE** 



#### **OPTIMIZED FRAMEWORKS & SOFTWARE**






















AI SPECIALIZATION

- Multi-Purpose Foundation for AI
- Data-Parallel Media, Graphics, HPC & AI
- Multi-Function & Real-time Deep Learning Inference
- Deep Learning Inference
- Deep Learning **Training**
- Media & Vision DL Inference at the Edge

Visit:

All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice. ¹ Unified software stack development in progress. DL = Deep Learning.


## 2ND GENERATION INTEL® XEON® SCALABLE PROCESSOR (formerly known as Cascade Lake)















#### Begin your AI journey efficiently, now with even more agility...

- ✓ IMT Intel® Infrastructure Management Technologies
- ✓ ADQ Application Device Queues
- ✓ SST Intel® Speed Select Technology

#### **Built-in Acceleration with** Intel® Deep Learning Boost...



deep learning inference throughput!¹

Throughput (img/s)



#### Hardware-Enhanced Security...

- ✓ Intel® Security Essentials
- ✓ Intel® SecL: Intel® Security Libraries for Data Center
- ✓ TDT Intel® Threat Detection Technology





Based on Intel internal testing: 1x, 5.7x, 14x, and 30x performance improvement based on Intel® Optimization for Caffe ResNet-50 inference throughput performance on Intel® Xeon® Scalable Processor. See Configuration Details 3. Results are based on testing as of 7/11/2017 (1x), 11/8/2018 (5.7x), 2/20/2019 (14x), and 2/26/2019 (30x) and may not reflect all publicly available security updates. No product can be absolutely secure.

## INTEL® FPGA FOR AI

## FIRST TO MARKET TO ACCELERATE EVOLVING AI WORKLOADS

- PRECISION
- LATENCY
- SPARSITY
- ADVERSARIAL NETWORKS
- REINFORCEMENT LEARNING
- NEUROMORPHIC COMPUTING




## DELIVERING AI+ FOR FLEXIBLE SYSTEM LEVEL FUNCTIONALITY

- AI+I/O INGEST
- AI+ NETWORKING
- AI+ SECURITY
- AI+ PRE/POST PROCESSING
- ...



- RNN
- LSTM
- SPEECH WL



Enabling real-time AI in a wide range of embedded, edge and cloud apps



## **PERFORMANCE - 'IT'S ALL ABOUT PARALLELISM'**


### Levels of Parallelism

- Node
- Socket
- Core / Thread-Level (Hyperthreading)
- **GPU-CPU**
- Instruction (by CPU internals)






- SSE2: 128 bit
- AVX: 256 bit
- AVX-512: 512 bit


#### What is vectorization?
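A minimal sketch of the idea, in Python with NumPy (an assumption, since the slides show no code): the scalar version handles one element per loop iteration, while the vectorized version expresses the same addition as a single whole-array operation. Whether NumPy actually maps that operation onto SSE2, AVX or AVX-512 instructions depends on how it was built and on the CPU it runs on. The function names are illustrative only.

```python
import numpy as np

def add_scalar(a, b):
    """Element-by-element addition: one result per loop iteration."""
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

def add_vectorized(a, b):
    """Whole-array addition: the library is free to process
    several elements per instruction where the hardware allows."""
    return a + b

a = np.arange(16, dtype=np.float32)
b = np.ones(16, dtype=np.float32)

# Both forms compute the same values; only the execution strategy differs.
same = np.array_equal(add_scalar(a, b), add_vectorized(a, b))
print("results match:", same)
```
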







## INTEL® DEEP LEARNING BOOST (DL BOOST)

FEATURING VECTOR NEURAL NETWORK INSTRUCTIONS (VNNI)





Current AVX-512 instructions to perform INT8 convolutions: vpmaddubsw, vpmaddwd, vpaddd



NEW AVX-512 (VNNI) instruction to accelerate INT8 convolutions: vpdpbusd
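The difference between the two sequences can be sketched by emulating their arithmetic in Python with NumPy. This is an illustration of the integer math only, not of real instruction encodings or performance; the function names are hypothetical. Each 32-bit accumulator lane receives the dot product of four unsigned 8-bit values with four signed 8-bit values.

```python
import numpy as np

def int8_dot_three_step(acc, a_u8, b_s8):
    """Emulate vpmaddubsw + vpmaddwd + vpaddd."""
    a = a_u8.astype(np.int64)
    b = b_s8.astype(np.int64)
    # vpmaddubsw: unsigned byte * signed byte, adjacent pairs summed,
    # with the sum saturated to the signed 16-bit range.
    pairs = (a * b).reshape(-1, 2).sum(axis=1)
    pairs16 = np.clip(pairs, -32768, 32767)
    # vpmaddwd against a vector of ones: adjacent 16-bit values are
    # summed into 32-bit lanes.
    quads = pairs16.reshape(-1, 2).sum(axis=1)
    # vpaddd: add into the 32-bit accumulator.
    return (acc.astype(np.int64) + quads).astype(np.int32)

def int8_dot_vnni(acc, a_u8, b_s8):
    """Emulate vpdpbusd: four products summed and accumulated in one
    step, with no intermediate 16-bit saturation."""
    a = a_u8.astype(np.int64)
    b = b_s8.astype(np.int64)
    quads = (a * b).reshape(-1, 4).sum(axis=1)
    return (acc.astype(np.int64) + quads).astype(np.int32)

a = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.uint8)
b = np.array([1, -1, 2, -2, 3, -3, 4, -4], dtype=np.int8)
acc = np.array([10, 20], dtype=np.int32)
print(int8_dot_three_step(acc, a, b), int8_dot_vnni(acc, a, b))

# Large inputs expose the 16-bit saturation of the three-step path.
big_a = np.array([255, 255, 0, 0], dtype=np.uint8)
big_b = np.array([127, 127, 0, 0], dtype=np.int8)
zero = np.zeros(1, dtype=np.int32)
print(int8_dot_three_step(zero, big_a, big_b), int8_dot_vnni(zero, big_a, big_b))
```

For typical inputs both paths agree; the fused instruction does the work of three and avoids the intermediate saturation step.
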



### Levels of Parallelism

- Node
- Socket
- Core / Thread-Level (Hyperthreading)
- **GPU-CPU**
- Instruction (by CPU internals)
- Data (Vectorisation)
















|            | Intel® Xeon®<br>processor<br>64-bit | Intel® Xeon®<br>processor<br>5100<br>series | Intel® Xeon®<br>processor<br>5500<br>series | Intel® Xeon®<br>processor<br>5600<br>series | Intel® Xeon® processor code-named Sandy Bridge EP | Intel® Xeon®<br>processor<br>code-named<br>Ivy Bridge<br>EP | Intel® Xeon® processor code-named Skylake EP | Intel® Xeon® processor code-named Cascade Lake Platinum 9200 |
|------------|-------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------------|-------------------------------------------------------------|----------------------------------------------|--------------------------------------------------------------|
| Core(s)    | 1                                   | 2                                           | 4                                           | 6                                           | 8                                                 | 12                                                          | 28                                           | 56                                                           |
| Threads    | 2                                   | 2                                           | 8                                           | 12                                          | 16                                                | 24                                                          | 56                                           | 112                                                          |
| SIMD Width | 128                                 | 128                                         | 128                                         | 128                                         | 256                                               | 256                                                         | 512                                          | 512                                                          |
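To see how core count and SIMD width multiply, the table can be turned into a rough count of 32-bit lanes available across all cores of one processor. This is a back-of-the-envelope sketch in Python: it takes the core and SIMD figures from the table as given, assumes 32-bit elements, and ignores clock frequency, the number of vector units per core, and memory limits.

```python
# (label, cores, SIMD width in bits), copied from the table above
XEON_GENERATIONS = [
    ("64-bit", 1, 128),
    ("5100 series", 2, 128),
    ("5500 series", 4, 128),
    ("5600 series", 6, 128),
    ("Sandy Bridge EP", 8, 256),
    ("Ivy Bridge EP", 12, 256),
    ("Skylake EP", 28, 512),
    ("Cascade Lake Platinum 9200", 56, 512),
]

def lanes_per_processor(cores, simd_bits, element_bits=32):
    """Number of elements one processor can touch per SIMD step
    if every core issues one full-width vector operation."""
    return cores * (simd_bits // element_bits)

for label, cores, simd_bits in XEON_GENERATIONS:
    print(f"{label:28s} {lanes_per_processor(cores, simd_bits):4d} lanes")
```

Code that uses neither threads nor vectors stays at one lane, regardless of the processor generation it runs on.
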





YET ANOTHER VIEW . . .

## **PERFORMANCE - 'IT'S ALL ABOUT MEMORY'**

## What does a two-socket system look like?

Figure: a motherboard with Processor 0 and Processor 1, each with its own processing unit and DRAM, connected by **QPI/UPI**.



### Memory Hierarchy





Latency estimates for different storage and memory devices

### Intel® Optane™ DC Persistent Memory



- non-volatile, high-capacity memory
- near-DRAM latency
- affordable
- physically and electrically compatible with DDR4 interfaces and slots

### Intel® Optane™ DC Persistent Memory



Legacy Workloads



**Optimized Workloads** 

## Performance: A summary

- Product of CPU Parallelism AND Memory
  - See Advisor Roofline Model which combines
    - Peak Flops
    - Peak Bandwidth

https://software.intel.com/en-us/advisor
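The basic roofline relation can be sketched in a few lines of Python: attainable performance is capped either by peak compute or by peak memory bandwidth multiplied by the kernel's arithmetic intensity (FLOPs per byte moved). The machine numbers below are hypothetical and chosen only for round results; this is the textbook formula, not a description of how Advisor measures it.

```python
def attainable_gflops(peak_gflops, peak_bw_gbs, arithmetic_intensity):
    """Roofline bound: min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_gflops, peak_bw_gbs * arithmetic_intensity)

def ridge_point(peak_gflops, peak_bw_gbs):
    """Arithmetic intensity at which a kernel stops being memory bound."""
    return peak_gflops / peak_bw_gbs

# Hypothetical machine: 1000 GFLOP/s peak compute, 100 GB/s peak bandwidth.
PEAK_GFLOPS = 1000.0
PEAK_BW_GBS = 100.0

for ai in (0.5, 2.0, 10.0, 50.0):
    bound = attainable_gflops(PEAK_GFLOPS, PEAK_BW_GBS, ai)
    kind = "memory bound" if ai < ridge_point(PEAK_GFLOPS, PEAK_BW_GBS) else "compute bound"
    print(f"AI = {ai:5.1f} FLOP/byte -> at most {bound:7.1f} GFLOP/s ({kind})")
```
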



- See Performance Optimisation and Productivity Project which combines
  - Global Efficiency,
  - Parallel Efficiency,
  - Computational Efficiency



https://software.intel.com/en-us/download/parallel-universe-magazine-issue-37-july-2019
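A sketch of how such efficiencies combine, in Python, using the multiplicative definitions commonly given for the POP metrics: parallel efficiency is load balance times communication efficiency, and global efficiency is parallel efficiency times computational efficiency. The function name, inputs and timings are hypothetical; the project's own documentation is the authority on exact definitions.

```python
def pop_efficiencies(useful_times, runtime, reference_total_useful):
    """Compute efficiency metrics from per-process useful compute
    times (seconds), the overall runtime, and the total useful compute
    time of a reference run used for scaling comparisons."""
    average = sum(useful_times) / len(useful_times)
    longest = max(useful_times)
    load_balance = average / longest
    communication = longest / runtime
    parallel = load_balance * communication
    # Below 1.0 when the parallel run spends more total time computing
    # than the reference run did for the same work.
    computational = reference_total_useful / sum(useful_times)
    return {
        "load_balance": load_balance,
        "communication": communication,
        "parallel": parallel,
        "computational": computational,
        "global": parallel * computational,
    }

# Hypothetical four-process run: 10 s wall time, uneven compute phases.
metrics = pop_efficiencies([8.0, 6.0, 8.0, 6.0], runtime=10.0,
                           reference_total_useful=28.0)
for name, value in metrics.items():
    print(f"{name:14s} {value:.3f}")
```
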



# Some factors in deciding 'What platform/architecture should I use?'

## **Factor**

Cost

Performance

Accuracy

Power

Ease of Programming

**Portability** 

## Summary



Intel CPU offers multiple levels of parallelism



To get the best performance, you need to use these levels in your applications



Intel Libraries and Optimised Frameworks provide these 'automatically'