# **Evolutionary Optimization of Neural Network Accelerators**

Vojtěch Mrázek

Faculty of Information Technology, Brno University of Technology, Czech Republic





#### Outline



- Neural networks for embedded systems
- Approximate computing
  - Approximate components
  - Advanced approximation of NN
  - Neural Architecture search
- Hardware accelerator optimization
  - Capsule neural networks
  - Optimization of the accelerator architectures
  - Memory organization
- Execution planning
  - Convolutional networks
  - Transformer networks
- Conclusions

# **Deep Neural Networks**



- The processing of NNs encompasses two primary phases: training and inference.
- This work will be mostly focused on the inference phase.



# Parameters of commonly used NNs





S. P. Samadhi and E. Izquierdo, "Deep-learned faces: a survey," *Eurasip Journal on Image and Video Processing*, vol. 2020, pp. 1–33, 12 2020.

# **Examples of platforms for DNNs**

IT4I Karolina – 800 kW

NVIDIA DGX BasePOD – 14 kW

NVIDIA Jetson – 5–10 W

Google Edge TPU coprocessor – 2 W



### **Neural Network Accelerator Comparison**





K. Guo, W. Li, K. Zhong, Z. Zhu, S. Zeng, T. Xie, S. Han, Y. Xie, P. Debacker, M. Verhelst, Y. Wang. "Neural Network Accelerator Comparison" [Online]. Available: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/

# Accelerators for inference processing

Performance: the number of inferences per second

Energy efficiency: the number of inferences per second per watt (equivalently, inferences per joule)

| Platform | Chip         | Freq. [MHz] | Precision | Perform. [infer./s] | Power [W] | Efficiency [infer./s/W] |
|----------|--------------|-------------|-----------|---------------------|-----------|-------------------------|
| ASIC     | Eyeriss      | 200         | FX16      | 34.7                | 0.3       | 124.8                   |
| FPGA     | Kintex KU115 | 235         | FX8       | 2252                | 22.9      | 98.3                    |
| FPGA     | Kintex KU115 | 235         | FX16      | 1126                | 22.9      | 49.2                    |
| FPGA     | Zynq XC7Z045 | 200         | FX8       | 340                 | 7.2       | 47.2                    |
| FPGA     | Zynq XC7Z045 | 200         | FX16      | 170                 | 7.2       | 23.6                    |
| GPU      | Jetson TX2   | 1300        | FP16      | 250                 | 10.7      | 23.3                    |
| GPU      | Titan X      | 1417        | FP32      | 5120                | 227.0     | 22.6                    |
| CPU      | Core-i7      | 3500        | FP32      | 162                 | 73.0      | 2.2                     |

(AlexNet on various platforms, according to [56], [68])
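As a worked example of the efficiency column, assuming that efficiency is simply the reported performance divided by the reported power, the Kintex KU115 (FX8) row gives:

$$\text{efficiency} = \frac{\text{performance}}{\text{power}} = \frac{2252\ \text{infer./s}}{22.9\ \text{W}} \approx 98.3\ \text{infer./s/W}$$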

Unconventional platforms: in-memory computing, stochastic computing, memristive, RRAM, ...



#### Tensor Processing Unit (TPU)

TPUv1: inference only. TPUv2 and TPUv3: inference and training.

| Feature                          | TPUv1        | TPUv2               | TPUv3                |
|----------------------------------|--------------|---------------------|----------------------|
| Peak TeraFLOPS/Chip              | 92 (8b int)  | 46 (16b)<br>3 (32b) | 123 (16b)<br>4 (32b) |
| Network links x Gbits/s/Chip     | -            | 4 x 496             | 4 x 656              |
| Max chips/supercomputer          | -            | 256                 | 1024                 |
| Peak PetaFLOPS/supercomputer     | -            | 11.8                | 126                  |
| Bisection Terabits/supercomputer | -            | 15.9                | 42.0                 |
| Clock Rate (MHz)                 | 700          | 700                 | 940                  |
| TDP (Watts)/Chip                 | 75           | 280                 | 450                  |
| TDP (Kwatts)/supercomputer       | -            | 124                 | 594                  |
| Die Size (mm²)                   | <331         | <611                | <648                 |
| Chip Technology                  | 28nm         | >12nm               | >12nm                |
| Memory size (on-/off-chip)       | 28MiB/8GiB   | 32MiB/16GiB         | 32MiB/32GiB          |
| Memory GB/s/Chip                 | 34           | 700                 | 900                  |
| MXUs/Core,<br>MXU Size (multipliers) | 1<br>256x256 | 1<br>128x128    | 2<br>128x128         |
| Cores/Chip                       | 1            | 2                   | 2                    |
| Chips/CPU Host                   | 4            | 4                   | 8                    |

MXU = Matrix Multiply Unit.

Jouppi et al., Communications of the ACM, 63(7), 2020

# Two types of CNN accelerators





#### Design principles:

- Maximize the level of parallelization
- Maximize the reuse
- Minimize the communication (especially with external memory).
- Apply approximate computations





# Architecture of NN accelerator: Temporal Architectures



CPU implementation



GPU implementation – many warps in parallel



OUTPUT

# Architecture of NN accelerator: Spatial architectures



Step 1: loading kernel



Step 2: loading data and then passing



OUTPUT

#### **Pros**

- fast
- reduced memory access
- low energy consumption

#### Cons

- not flexible (8-bit uint for MCUs and 16-bit float version for Cloud)
- high ASIC design cost
- complicated planning

# Example of an ASIC accelerator: Eyeriss (MIT, 2016)







#### Embedded neural networks


#### Reduced set of functions

FULLY\_CONNECTED, MAX\_POOL\_2D, SOFTMAX, LOGISTIC, SVDF, CONV\_2D, CONCATENATION, DEPTHWISE\_CONV\_2D, AVERAGE\_POOL\_2D, ABS, SIN, COS, LOG, SQRT, RSQRT, SQUARE, PRELU, FLOOR, MAXIMUM, MINIMUM, ARG\_MAX, ARG\_MIN, LOGICAL\_OR, LOGICAL\_AND, LOGICAL\_NOT, RESHAPE, EQUAL, NOT\_EQUAL, GREATER, GREATER\_EQUAL, LESS, LESS\_EQUAL, CEIL, ROUND, STRIDED\_SLICE, PACK, PAD, PADV2, SPLIT, UNPACK, NEG, ADD, MUL, QUANTIZE, DEQUANTIZE, RELU, RELU6, MEAN



- Do we really need to run inference at the edge?
  - Cost of communication
  - Privacy
  - Latency







High Throughput OR Low Latency

High Throughput AND Low Latency

# Challenges of embedded neural networks



- Reduced set of layer types
- Integer representation
  - Low dynamic range of integer operations
  - Can cause many problems in the training => post-training quantization
- Very limited resources
  - Memory and memory-bandwidth
  - Energy budget
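To illustrate the integer representation and its limited dynamic range, the following is a minimal sketch of per-tensor affine post-training quantization to 8 bits. The function names and the sample weights are illustrative assumptions, not the API of any particular framework.

```python
def quant_params(values, num_bits=8):
    """Per-tensor affine parameters (scale, zero_point) for unsigned integers."""
    qmax = 2 ** num_bits - 1
    lo = min(min(values), 0.0)   # the representable range must contain 0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map real values to integers in [0, 2^num_bits - 1] with clipping."""
    qmax = 2 ** num_bits - 1
    return [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]


def dequantize(q, scale, zero_point):
    """Map integers back to (approximate) real values."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]   # hypothetical trained weights
scale, zp = quant_params(weights)
q = quantize(weights, scale, zp)
restored = dequantize(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err)
```

In this sketch the rounding error of each value stays within one quantization step (`scale`), which is the price paid for the narrow integer range.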


#### **Motivation**





Performance

Modern technologies & design techniques

Introducing some computational errors: approximate computing



Image Courtesy: Institut für Technische Informatik - Universität Stuttgart

- Many computationally intensive applications feature an intrinsic property – the error resilience.
- Users are often willing to accept certain errors in some cases.

Approximate computing: a design paradigm for energy-efficient systems.

# Energy savings in neural network accelerators



• Where can the approximations be introduced?



# Hardware approaches for the approximate computing



#### Physical approximation

- voltage scaling
- near threshold computing
- inexact memories



#### Architectural approximation

- function skipping
- data precision reduction
- inexact models (regression)

#### **Functional approximation**

- modification of the Boolean function
- the rest of the tool flow (synthesis and implementation) stays the same
- two major tasks
  - design of approximate components, in particular adders and multipliers
  - high-level approximate synthesis of complex hardware accelerators.


# Manual functional approximation

 Elementary units such as full-adders or 2x2 multipliers were replaced by approximate implementation [Kulkarni 2011]





Power savings 37%

 The structure of circuits was modified [Mahdiani 2010]

The longest computational paths of the circuit were cut [Hanif 2017]



 Mathematical properties of circuits were exploited [Ansari 2019]





# Design methodology for functional approximation





#### **Automated CAD**

SALSA: Systematic logic synthesis using Quality Constraint Circuit [Venkataramani et al, DAC 2012],

SASIMI: Substitute-and-simplify [Venkataramani et al, DATE 2013],

Search-based (evolutionary) synthesis [Sekanina, Vasicek, ICES 2013, Mrazek et al., ICCAD'16, Ceska et al., ICCAD'17],

ABACUS: AST-based approach [Nepal et al., DATE 2014],

**ASLAN: [DATE 2014]** 

Approximation-aware Rewriting of AIGs [Chandrasekharan et al., ICCAD 2016]

BLASYS [DAC 2018]

# **Evolutionary algorithm**





# Cartesian Genetic Programming





- Example: CGP parameters
  - $n_r=3$  (# rows)
  - $n_c = 3$  (# columns)
  - $n_i = 3$  (# inputs)
  - $n_o = 2$  (# outputs)
  - $\Gamma = \{NAND^{(0)}, NOR^{(1)}, XOR^{(2)}, AND^{(3)}, OR^{(4)}, NOT^{(5)}\}$



#### Fitness calculation



- Fitness function evaluates candidate solutions.
- Better solutions obtain better scores.
- In our case:

$$f(C) = \begin{cases} \mathrm{size}(C) & \text{if } \mathrm{WCAE}(C) < \tau \\ \infty & \text{otherwise} \end{cases}$$

where $C$ is a candidate circuit, WCAE is the worst-case absolute error, and $\tau$ is the target error.

| Gate | Size   |
|------|--------|
| INV  | 1.4079 |
| AND  | 2.3465 |
| OR   | 2.3465 |
| XOR  | 4.693  |
| NAND | 1.8772 |
| NOR  | 2.3465 |
| XNOR | 4.693  |
| BUF  | 2.3465 |
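The fitness calculation can be sketched as follows. This is a minimal illustration, not the actual design tool: the genotype layout (one `(input1, input2, function)` triple per node), the full-adder reference, and both example circuits are assumptions made for this example. The function indices follow the CGP example above and the areas are taken from the gate table (NOT is costed as INV).

```python
import math
from itertools import product

# function index -> (name, Boolean function, area)
GATES = {
    0: ("NAND", lambda a, b: 1 - (a & b), 1.8772),
    1: ("NOR",  lambda a, b: 1 - (a | b), 2.3465),
    2: ("XOR",  lambda a, b: a ^ b,       4.693),
    3: ("AND",  lambda a, b: a & b,       2.3465),
    4: ("OR",   lambda a, b: a | b,       2.3465),
    5: ("NOT",  lambda a, b: 1 - a,       1.4079),
}


def active_nodes(nodes, outputs, n_inputs):
    """Indices of nodes that feed an output; inactive nodes cost no area."""
    active, stack = set(), [o for o in outputs if o >= n_inputs]
    while stack:
        idx = stack.pop()
        if idx in active:
            continue
        active.add(idx)
        in1, in2, fn = nodes[idx - n_inputs]
        sources = [in1] if GATES[fn][0] == "NOT" else [in1, in2]
        stack.extend(s for s in sources if s >= n_inputs)
    return active


def simulate(nodes, outputs, inputs):
    """Evaluate the circuit for one input vector (nodes are in feed-forward order)."""
    values = list(inputs)
    for in1, in2, fn in nodes:
        values.append(GATES[fn][1](values[in1], values[in2]))
    return [values[o] for o in outputs]


def fitness(nodes, outputs, n_inputs, reference, tau):
    """size(C) if WCAE(C) < tau, otherwise infinity (exhaustive simulation)."""
    wcae = 0
    for bits in product((0, 1), repeat=n_inputs):
        out = simulate(nodes, outputs, bits)
        value = sum(bit << i for i, bit in enumerate(out))  # outputs[0] is the LSB
        wcae = max(wcae, abs(value - reference(bits)))
    if wcae < tau:
        used = active_nodes(nodes, outputs, n_inputs)
        return sum(GATES[nodes[i - n_inputs][2]][2] for i in used)
    return math.inf


# Exact full adder: inputs 0..2, nodes 3..7, outputs (sum, carry) = (4, 7).
exact = [(0, 1, 2), (3, 2, 2), (0, 1, 3), (3, 2, 3), (5, 6, 4)]
# Hypothetical approximation that ignores the carry-in: outputs (3, 4).
approx = [(0, 1, 2), (0, 1, 3)]

print(fitness(exact, [4, 7], 3, sum, tau=1))
print(fitness(approx, [3, 4], 3, sum, tau=2))
print(fitness(approx, [3, 4], 3, sum, tau=1))
```

Exhaustive simulation is only feasible for small circuits; the scalability of evaluation is a separate question addressed later in the talk.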

Comparison of approximate 16-bit multipliers



Figure legend: M1: DAC 2015 (composition of 2x2 multipliers); M5: EvoApproxLib (composition of 8x8 multipliers); designed multipliers.





# Approximate computing: EvoApproxLib



POWER vs MAE plot (optimal for EvoApproxLib<sup>LITE</sup>)

Approximate circuits available in the component library are characterized by electrical parameters and various error metrics.

The circuits are Pareto-optimal in all metrics.

| Circuit    | Bit-width | # components |
|------------|-----------|--------------|
| multiplier | 8         | 29,911       |
|            | 12        | 3,495        |
|            | 16        | 35,406       |
|            | 32        | 349          |



https://github.com/ehw-fit/evoapproxlib

MRÁZEK, V.; SEKANINA, L.; VAŠÍČEK, Z. Libraries of Approximate Circuits: Automated Design and Application in CNN Accelerators. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2020, vol. 10, no. 4, p. 406-418. ISSN: 2156-3357.

# Research questions and possible solutions



- Time of design & non-deterministic approach
- Scalability of evaluation
  - Using advanced datasets
  - Employing formal verification techniques (BDDs for adders, SAT with limit for multipliers)
- Scalability of representation
  - Extracting small subcircuits and optimizing them separately => how to do it for approximate circuits?





KOCNOVÁ, Jitka. Evolutionary synthesis of complex digital circuits. Brno, 2023. PhD thesis. Brno University of Technology, Faculty of Information Technology


# Approximate components in neural networks



- The bit-width reduction can be generalized using approximate arithmetic components.
- Approximate arithmetic component = Boolean function with two n-bit operands and m-bit output; in our case realizing multiplication function.
- As current NN frameworks offer no support for approximate components, we implemented our own (convolutional) layers
  - Approximate multiplication is simulated using a lookup table, which is a speed bottleneck
- In the previous approaches, researchers performed additional retraining to adapt the NN to the approximation => time consuming and typically results in
  - approximation of significantly smaller networks (limited scalability)
  - limited set of considered approximate components.
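The lookup-table simulation mentioned above can be sketched as follows. The multiplier used here (truncation of the low product bits) is a hypothetical stand-in chosen for this example, not a circuit from EvoApproxLib, and the function names are illustrative.

```python
def truncated_mul(a, b, dropped_bits=4):
    """Hypothetical approximate 8x8-bit multiplier: clears the low product bits."""
    return ((a * b) >> dropped_bits) << dropped_bits


# The table is built once; afterwards every multiplication becomes a lookup.
LUT = [[truncated_mul(a, b) for b in range(256)] for a in range(256)]


def approx_dot(xs, ws):
    """Dot product of 8-bit unsigned operands using the approximate multiplier."""
    return sum(LUT[x][w] for x, w in zip(xs, ws))


xs = [3, 17, 200, 255]   # hypothetical quantized activations
ws = [5, 9, 31, 255]     # hypothetical quantized weights
exact = sum(x * w for x, w in zip(xs, ws))
approx = approx_dot(xs, ws)
print(exact, approx, exact - approx)
```

Because the table has only 256 x 256 entries, any 8-bit multiplier can be emulated this way regardless of its internal structure; the cost is one memory access per multiplication.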



# Approximate components in NN: challenges

The layers have different error resilience; however, the approximate NNs employ the same component in all layers (uniform structure).

We propose an algorithm assigning the approximate components to the layers.



How to obtain a good tradeoff between accuracy and energy consumption? A multi-objective optimization is performed.



Different architectures of accelerators: pipelined or power-gated

The proposed methodology can handle different accelerator architectures.



# NSGA-II as optimization algorithm



#### **Design space:**

Assigning one of 20 approximate multipliers to each of 50 layers: 10<sup>65</sup> combinations.
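The size of this design space follows from an independent choice among 20 multipliers in each of the 50 layers:

$$20^{50} = 10^{50 \log_{10} 20} \approx 10^{50 \cdot 1.301} \approx 10^{65}$$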

#### **Heuristic search-space exploration:**

Multi-objective NSGA-II

- Generate initial population $P$
- Generate new candidate set $Q$
  - crossover
  - mutation
- Evaluate $Q$ (energy, accuracy)
- $P_{t+1}$ = the $|P|$ best elements from $P \cup Q$
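The loop above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the per-multiplier energy and error values are invented, parents are picked at random (the standard NSGA-II uses binary tournament selection), and the two objectives are simple sums rather than a real accuracy evaluation.

```python
import random


def dominates(a, b):
    """a dominates b: no worse in every objective, better in at least one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated_fronts(objs):
    """Split indices into fronts by repeatedly peeling off non-dominated points."""
    remaining, fronts = set(range(len(objs))), []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(front, objs):
    """Larger distance = less crowded; boundary points get infinity."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[front[0]])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        span = objs[ordered[-1]][m] - objs[ordered[0]][m] or 1.0
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][m] - objs[prev][m]) / span
    return dist


def select(population, objs, size):
    """Keep `size` individuals: whole fronts first, then the least crowded ones."""
    chosen = []
    for front in nondominated_fronts(objs):
        if len(chosen) + len(front) > size:
            dist = crowding_distance(front, objs)
            front = sorted(front, key=lambda i: -dist[i])[: size - len(chosen)]
        chosen.extend(front)
        if len(chosen) == size:
            break
    return [population[i] for i in chosen]


ENERGY = [1.0, 0.8, 0.6]   # hypothetical relative energy of multipliers 0..2
ERROR = [0.0, 1.0, 3.0]    # hypothetical error contribution of multipliers 0..2
N_LAYERS = 6


def evaluate(ind):
    return (sum(ENERGY[g] for g in ind), sum(ERROR[g] for g in ind))


def offspring(p1, p2, rng):
    cut = rng.randrange(1, N_LAYERS)          # one-point crossover
    child = p1[:cut] + p2[cut:]
    child[rng.randrange(N_LAYERS)] = rng.randrange(len(ENERGY))  # mutation
    return child


def nsga2(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    P = [[rng.randrange(len(ENERGY)) for _ in range(N_LAYERS)] for _ in range(pop_size)]
    for _ in range(generations):
        Q = [offspring(*rng.sample(P, 2), rng) for _ in range(pop_size)]
        union = P + Q
        P = select(union, [evaluate(ind) for ind in union], pop_size)
    return P


final_population = nsga2()
```

Each individual is a list with one multiplier index per layer, so crossover and mutation never produce an invalid assignment.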



# Problem representation

The target accelerator, based on a systolic array, is composed of |T| **tiles**; all processing elements within a tile use the same approximate unit (multiplier).

The optimization algorithm is searching for **mappings**:

$map_{TM}: Tiles \rightarrow Multipliers$

$map_{LT}: Layers \rightarrow Tiles$

Two architectures are modelled

- Pipelined: |T|-tuples are distributed to the tiles
- Power-gated: arbitrary mapping of layers to tiles is allowed
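A minimal sketch of how the two mappings compose. The reading of the pipelined case (layer $i$ runs on tile $i \bmod |T|$) is an assumption made for this example, and the multiplier identifiers are hypothetical.

```python
def pipelined_map_lt(n_layers, n_tiles):
    """Assumed pipelined mapping: consecutive layers go to consecutive tiles, cyclically."""
    return [layer % n_tiles for layer in range(n_layers)]


def multiplier_per_layer(map_tm, map_lt):
    """Compose Layers -> Tiles -> Multipliers."""
    return [map_tm[tile] for tile in map_lt]


map_tm = [0, 7, 12]   # tile index -> multiplier id (hypothetical ids)

# Pipelined: the layer-to-tile mapping is fixed by the architecture.
pipelined = multiplier_per_layer(map_tm, pipelined_map_lt(7, len(map_tm)))

# Power-gated: the layer-to-tile mapping is arbitrary and part of the search.
power_gated = multiplier_per_layer(map_tm, [0, 0, 1, 2, 2, 2, 1])

print(pipelined)
print(power_gated)
```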



# **ALWANN** methodology overview





#### Structures of NNs: Do we need non-uniform structure?









# Experimental setup

#### ResNet networks trained for CIFAR-10 dataset

| ResNet instance | # conv.<br>layers | #<br>mults. | Accuracy<br>(floating-point) | Accuracy<br>(qint-8) |
|-----------------|-------------------|-------------|------------------------------|----------|
| ResNet-8        | 7                 | 21.1M       | 83.42%                       | 83.26%   |
| ResNet-14       | 13                | 35.3M       | 85.93%                       | 85.55%   |
| ResNet-50       | 49                | 120.3M      | 89.38%                       | 89.15%   |

New AxConv2D layers implemented in TensorFlow framework.

A set of 35 approximate multipliers and 1 accurate from EvoApproxLib<sup>LITE</sup> library.

**NSGA-II:** |P|=50, |Q|=50, in total 30 generations (1550 evaluations).

#### **Energy estimation:**

$$E(N) = \sum_{l \in \mathrm{layers}(N)} \#\mathrm{mults}(l) \cdot \frac{E(\mathrm{mult}(l))}{E(\mathrm{accurate})}$$
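The energy estimate can be computed directly from the per-layer multiplication counts. The sketch below uses invented layer sizes and multiplier energies (in arbitrary units); only the formula itself comes from the slide.

```python
def relative_energy(mults_per_layer, multiplier_per_layer, energy, accurate="accurate"):
    """E(N): per-layer multiplication count weighted by the energy of the assigned
    multiplier relative to the accurate one."""
    return sum(n * energy[m] / energy[accurate]
               for n, m in zip(mults_per_layer, multiplier_per_layer))


energy = {"accurate": 4.0, "ax_a": 3.0, "ax_b": 2.0}   # hypothetical energies
mults = [100, 300, 600]                                # hypothetical #mults per layer

baseline = relative_energy(mults, ["accurate"] * 3, energy)
approximate = relative_energy(mults, ["accurate", "ax_a", "ax_b"], energy)
saving = 1.0 - approximate / baseline
print(baseline, approximate, saving)
```

With the accurate multiplier in every layer, $E(N)$ equals the total number of multiplications, so the ratio to that baseline gives the relative energy saving in the multipliers.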





# Complexity analysis



| Approach                 | # Layers | # Ax. components <sup>*)</sup> | Retraining                          | Uniform struct. |
|--------------------------|----------|--------------------------------|-------------------------------------|-----------------|
| AxNN (ISLPED'14)         | 2-6      | -                              | yes                                 | no              |
| ApproxNN (DATE'15)       | 2-6      | 8                              | yes                                 | no              |
| Mrazek et al. (ICCAD'16) | 2-6      | 8 (from 420 …)                 | (2 hours for 10 steps, …)           | yes             |
| Sarwar et al. (JETC'18)  | 2-164    | 4                              | yes<br>(limited set of ax. comp. …) | yes             |
| ALWANN (proposed)        | 8-49     | 36                             | no<br>(fast weight-tuning alg.)     | no              |

| AxNN        | Searching (1k images):<br>one evaluation | Searching:<br>total | Final validation (10k images):<br>one evaluation | Final validation:<br>total | Total  |
|-------------|------------------------------------------|---------------------|--------------------------------------------------|----------------------------|--------|
| AxResNet-8  | 1.8 s                                    | 0.75 h              | 3.2 s                                            | 2.6 min                    | 0.79 h |
| AxResNet-14 | 2.1 s                                    | 0.87 h              | 4.9 s                                            | 4.1 min                    | 0.95 h |
| AxResNet-50 | 3.6 s                                    | 1.5 h               | 14.6 s                                           | 12.2 min                   | 1.70 h |

Vaverka, Mrazek, Vasicek, Sekanina. TFApprox: Towards a Fast Emulation of DNN Approximate Hardware Accelerators on GPU. DATE'20.


# Comparison with the state of the art – CIFAR-10



| Approach                       | Retraining | Energy | Accuracy | Network         |
|--------------------------------|------------|--------|----------|-----------------|
| Venkataramani AxNN [ISLPED'14] | Yes        | -22%   | -0.5%    |                 |
|                                |            | -26%   | -2.5%    |                 |
| Sarwar [JETC 2018]             | Yes        | -33%   | -1.8%    |                 |
| He ResNet [arXiv]              | Yes        | -12%   | -1.2%    | ResNet-50 -> 44 |
|                                |            | -71%   | -4.0%    | ResNet-50 -> 14 |
|                                |            | -48%   | -2.7%    | ResNet-14 -> 8  |
| Proposed methodology           | No         | -30%   | -0.6%    | AxResNet-50     |
|                                |            | -30%   | -0.9%    | AxResNet-14     |
|                                |            | -30%   | -1.7%    | AxResNet-8      |

- The power savings depend on the considered starting point.
- The proposed methodology saves the energy of large NNs more than SoA approaches.
- It is necessary to make a comparison with architecture scaling.

#### Comparison with NN scaling





Why should developers use the ALWANN algorithm instead of training a smaller network?

- No training data are available
- The architecture cannot be scaled
- The training time is longer than the ALWANN run time


#### Single-objective NAS

- The aim of NAS is to automate the process of finding the most suitable NN architecture for a given dataset. Single-objective NAS has one objective: maximizing the accuracy.
  - Neuro-evolution has been performed in the Evolutionary Algorithms community since the mid-1980s.
  - NAS has been connected with DNNs since 2016.
- Key components of NAS methods
  - Search space
  - Search algorithm
  - Performance estimation
- Target hardware: usually GPU







# Search Spaces and CNN encoding



Candidate CNN ~ string of integers

Search space ~ all feasible strings



(Node ID; Operation; Parameter; Source ID 1; Source ID 2). Set of operations: (1) convolution, (2) max. pooling, (3) average pooling, (4) identity, (5) add, (6) concatenation, (7) terminal node [87].
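A minimal sketch of decoding such an integer string. The operation numbering follows the list above; the convention that unused sources are set to 0 (the network input) and the feasibility rule are assumptions made for this example, not the exact rules of [87].

```python
from typing import NamedTuple

OPERATIONS = {1: "convolution", 2: "max pooling", 3: "average pooling",
              4: "identity", 5: "add", 6: "concatenation", 7: "terminal"}


class Node(NamedTuple):
    node_id: int
    operation: str
    parameter: int
    source1: int
    source2: int


def decode(genome):
    """Split a flat integer string into 5-integer node descriptions."""
    if len(genome) % 5:
        raise ValueError("genome length must be a multiple of 5")
    nodes = []
    for i in range(0, len(genome), 5):
        node_id, op, param, s1, s2 = genome[i:i + 5]
        if op not in OPERATIONS:
            raise ValueError(f"unknown operation code {op}")
        nodes.append(Node(node_id, OPERATIONS[op], param, s1, s2))
    return nodes


def is_feasible(nodes):
    """Assumed rule: each source refers to the input (0) or to an earlier node."""
    seen = {0}
    for node in nodes:
        if node.source1 not in seen or node.source2 not in seen:
            return False
        seen.add(node.node_id)
    return True


genome = [1, 1, 32, 0, 0,    # node 1: convolution (parameter 32) on the input
          2, 2, 2, 1, 0,     # node 2: max pooling of node 1
          3, 1, 64, 2, 0,    # node 3: convolution (parameter 64) on node 2
          4, 5, 0, 2, 3,     # node 4: add nodes 2 and 3
          5, 7, 0, 4, 0]     # node 5: terminal node fed by node 4
network = decode(genome)
print(network[3], is_feasible(network))
```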

# Macro search space

- The entire CNN is encoded.
- Some parts can be fixed by the



# Micro search space

 A subgraph (cell, block) or subgraphs is/are encoded and



# Hierarchical search space

- Recursive construction using a set of small graphs.

#### **Indirect encoding**

- A construction program is encoded.
- The program is executed to build a NN.

#### **Supernets**

 A large NN is pretrained and then pruned.

## Multi-objective NAS for a particular (fixed) hardware





Additional Objectives:

Latency

Area

Energy

RAM size

Flash size

#MAC

Reliability

etc.



Hardware-aware NAS is a NAS that takes into account the hardware on which the inference is executed.

Can be extended by approximate operators

Pinos: Evolutionary Approximation and Neural Architecture Search. Genetic programming and Evolvable Machines 2022

## NAS with approximate components





# Shortening the evaluation time: Accuracy



- Simplify the common approach
  - Employ a proxy data set
  - Reduce the number of training epochs
  - Extrapolate the learning curve
  - etc.
- Build a surrogate model (accuracy predictor)
  - NN
  - regression trees
  - Gaussian process (GP)
  - etc.



Results on NASBench-101 (CIFAR-10) by Wen W. et al., ECCV 2020

#### Shortening the evaluation time: Hardware metrics


Hardware metrics: Latency, Energy, Area, Memory etc.

Methods according to Benmeziane et al. 2021:

- **Baseline:** real-time measurements on the target hardware.
- **Lookup table models:** a lookup table is created beforehand and filled with the hardware metrics of each operator on the target hardware. Once the search starts, the overall cost is calculated from the lookup table.
- **Analytical estimation:** a rough estimate is computed using the processing time, the stall time, and the starting time.
- **Prediction model:** an ML model is built to predict the cost using architecture and dataset features.
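The lookup-table approach can be sketched as follows. The operator names and latency values are invented for this example; in practice the table would be filled by measurements on the target hardware before the search starts.

```python
# Hypothetical per-operator latencies in milliseconds, measured once beforehand.
LATENCY_MS = {
    "conv3x3": 1.5,
    "maxpool": 0.25,
    "dense": 2.0,
}


def estimate_latency(architecture, table):
    """Overall cost of a candidate = sum of the table entries of its operators."""
    return sum(table[op] for op in architecture)


candidate = ["conv3x3", "maxpool", "conv3x3", "maxpool", "dense"]
total_ms = estimate_latency(candidate, LATENCY_MS)
print(total_ms)
```

Summing per-operator entries ignores interactions between operators (for example, caching or operator fusion), which is one reason the other estimation methods exist.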



#MAC is not a good proxy for latency! Shown for various NN models on a Google Pixel phone.




#### Analysis of the error resilience



1. Where can we introduce the error?
2. How much energy savings can we achieve?
3. What is the impact on the overall accuracy?



S. Sabour, N. Frosst, G. E. Hinton. Dynamic Routing Between Capsules @ NeurIPS'17

Alberto Marchisio, Vojtech Mrazek, Muhammad Abdullah Hanif, Muhammad Shafique: ReD-CaNe: A Systematic Methodology for Resilience Analysis and Design of Capsule Networks under Approximations @ DATE'20. arXiv: 1912.00700

## Capsule Neural Networks

**Traditional DNNs** 



- Scalar values
- Weighted sum + nonlinear function
- Pooling layers
- Detect features

**Capsule networks**





- Vectors
- Complex vectorial function (squash)
- Dynamic routing
- Detect entities
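The squash function mentioned above keeps the direction of a capsule's vector and maps its length into the interval [0, 1). A minimal sketch following the definition in Sabour et al. (Dynamic Routing Between Capsules); the small `eps` term is an addition made here to avoid division by zero.

```python
import math


def squash(s, eps=1e-9):
    """v = (|s|^2 / (1 + |s|^2)) * s / |s|"""
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    factor = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [factor * x for x in s]


def length(v):
    return math.sqrt(sum(x * x for x in v))


short = squash([0.1, 0.0])   # short vectors shrink towards zero length
long = squash([3.0, 4.0])    # long vectors approach unit length
print(length(short), length(long))
```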

#### **Architectures of CapsNets**







# DeepCaps architecture






## CapsAcc accelerator







## FEECA methodology



- The goal of this methodology is to optimize parameters of accelerators to get Pareto optimal configurations
  - Input Parallelism: Number of input pairs  $n_{pe}$  with bit-width  $b_{in}$
  - Output Precision: Partial sum bit-width  $b_{out}$
  - Pipeline Depth: Number of pipeline stages  $n_{stg}$
  - Array Dimensions: PE array rows and columns (#ROWS, #COLS)





#### Searching algorithm





#### Comparison of Brute-force and NSGA-II search





## Individual CapsNet layers






## Memory architecture



Advanced architectures employ several types of memories – off-chip and on-chip



## Modification of the ports







## Power savings in memories



- Using shared buffers.
- Power-gating unused banks vs block size



#### Design space exploration





# 1500 design configurations





## Power gating advances – HY-PG





## Parameters of Pareto-Optimal solutions






## Methods for Modeling AI Accelerators



- Huge accelerator design space to explore (different systolic array configurations, memory hierarchies, number of ports, memory bus widths and more)
- Reliance on configurable, flexible, fast, yet highly accurate modeling approaches



However, existing analytical tools are not designed to handle the parallelism and irregular memory access patterns of Transformers

| Feature                    | Roofline | Analytical | RTL Simulation | Estimation | TransInferSim (ours) |
|----------------------------|----------|------------|----------------|------------|----------------------|
| Speed (Fast)               | ✓        | ✓          | ×              | ✓          | ✓                    |
| Accuracy (High)            | ×        | ×          | ✓              | ✓          | ✓                    |
| ASIC Design Modeling       | ×        | ×          | ✓              | ✓          | ✓                    |
| Memory Access Optimization | ×        | ×          | ✓              | -          | ✓                    |
| Executable Operation Plan  | ×        | ×          | ✓              | ×          | ✓                    |

#### Accelergy / Maestro



Estimates the power consumption, latency and number of accesses to the memories



#### Timeloop: results of mappers







## Demonstration of Schedule Execution During Simulation





#### Possible applications



- Advanced planning of the operations
- Design space exploration



#### Conclusions



- Deep neural networks are now important in embedded systems.
- The NNs exhibit the error resilience property.
- An approximation methodology for the computational path was introduced. For example, a 30% energy saving in multiplication leads to a 0.6% accuracy drop (AxResNet-50 on CIFAR-10).
- The proposed methods are available as open-source software.



https://github.com/ehw-fit/evoapproxlib

https://github.com/ehw-fit/tf-approximate

## Acknowledgements



#### **Brno University of Technology**

- prof. Lukáš Sekanina, assoc. prof. Zdeněk Vašíček
- Ing. Jan Klhůfek

#### **NYU Abu Dhabi**

- Muhammad Shafique, Muhammad Abdullah Hanif
- Bharath Srinivas Prabakaran, Alberto Marchisio
- Czech Science Foundation project GA24-10990S
- IT4I Innovation infrastructure

















# Thank you for your attention

mrazek@fit.vutbr.cz

