



EUROPEAN UNION European Structural and Investment Funds Operational Programme Research, Development and Education



VSB TECHNICAL | IT4INNOVATIONS |||| UNIVERSITY OF OSTRAVA | CENTER

## TECHNICAL FEATURES AND THE USE OF GPU ACCELERATED PARTITION



### Key properties

- 72 nodes, each with
  - 2x AMD EPYC<sup>™</sup> 7763, 64-core, 2.45 GHz processors
  - 1024 GB DDR4 3200MT/s of physical memory
  - 8x GPU accelerator NVIDIA A100
    - | 40GB HBM2 memory per GPU
    - | 320GB HBM2 memory in total
  - 4x 200 Gb/s Infiniband HDR links

### In PBS

- PBS queue: **qnvidia**
- name of nodes: acn[01-72]





### CPU part of the node

- 2x AMD EPYC<sup>™</sup> 7763,
  - 64-core per socket
  - 2.45 GHz clock frequency
- Memory | 1024 GB DDR4 3200MT/s
- 4x 200 Gb/s Infiniband HDR links





## AMD EPYC SERVER CPU ROADMAP





| CATEGORY                    | EPYC 7002 (Rome)            | EPYC 7003 (Milan)           |
|-----------------------------|-----------------------------|-----------------------------|
| Socket                      | SP3                         | SP3                         |
| Core / Process              | Zen2 / 7nm                  | Zen3 / 7nm                  |
| Max Core Count /<br>Threads | 64 / 128                    | 64 / 128                    |
| L3 Cache Size               | 256 MB                      | 256 MB                      |
| CCX Arch                    | 4 Cores + 16MB              | 8 Cores + 32MB              |
| Memory                      | 8 Ch DDR4-3200,<br>NVDIMM-N | 8 Ch DDR4-3200,<br>NVDIMM-N |
| PCle Tech & Lane<br>Count   | PCle Gen4,<br>128L/Socket   | PCle Gen4,<br>128L/Socket   |
| Security                    | SME, SEV                    | SME, SEV, SNP               |
| Chipset                     | NA                          | NA                          |
| Power                       | 120W - 280W                 | 120W - 280W                 |

 VSB
 TECHNICAL
 IT4INNOVATIONS

 UNIVERSITY
 NATIONAL SUPERCOMPUTING

 OF OSTRAVA
 CENTER

## "MILAN" BUILDS ON INFINITY ARCHITECTURE



VSB TECHNICAL UNIVERSITY OF OSTRAVA | IT4INNOVATIONS NATIONAL SUPERCOMPUTING CENTER

RAL

#### numactl -H



 0
 10
 12
 12
 12

 1
 12
 10
 12
 12

 2
 12
 12
 10
 12

 3
 12
 12
 12
 10



https://developer.amd.com/spack/stream-benchmark/



## DUAL-SOCKET CONFIGURATIONS (MILAN)

# Two EPYC 7003 Processors connect through 4 xGMI links



2 NUMA Distances 2 NUMA Domains



[lriha@cn103.barbora ~]\$ numactl -H available: 2 nodes (0-1)

node 0 cpus: 0 - 17 node 0 size: 95197 MB

node 1 cpus: 18 - 35 node 1 size: 96762 MB

| node distances: |     | 0    | 1     |    |
|-----------------|-----|------|-------|----|
| node 0 1        | 0   | 10   | 21    |    |
| 0: 10 21        | 1   | 21   | 10    |    |
| 1: 21 10        | Bai | rbor | ra no | de |

VSB TECHNICAL UNIVERSITY OF OSTRAVA

## **DUAL-SOCKET CONFIGURATIONS (MILAN)**

Two EPYC 7003 Processors connect Nu through 4 xGMI links



2 NUMA Distances 2 NUMA Domains

| uma_ | 0    | 1    | 2    | 3    | 4    | 5    | 6             | 7    |
|------|------|------|------|------|------|------|---------------|------|
| 0    | 39,5 | 39,0 | 38,4 | 38,0 | 21,8 | 21,8 | 21,8          | 20,6 |
| 1    | 39,0 | 39,5 | 38,0 | 38,4 | 21,8 | 21,0 | 20,6          | 21,8 |
| 2    | 38,3 | 38,0 | 39,5 | 39,0 | 20,4 | 20,8 | 21,5          | 21,5 |
| 3    | 38,0 | 38,3 | 38,9 | 39,5 | 20,8 | 20,9 | 21,7          | 21,7 |
| 4    | 21,7 | 21,7 | 21,7 | 20,8 | 39,5 | 39,0 | 38,4          | 38,1 |
| 5    | 21,5 | 21,5 | 21,5 | 20,4 | 38,9 | 39,5 | 38,0          | 38,4 |
| 6    | 20,7 | 21,0 | 21,7 | 21,7 | 38,4 | 38,0 | 39,5          | 39,0 |
| 7    | 21,8 | 21,0 | 20,7 | 21,7 | 38,0 | 38,4 | 39 <i>,</i> 0 | 39,5 |

Socket 2

#### Bandwidth [GB/s]

| Numa | 0     | 1     | 2     | 3     | 4     | 5     | 6     | 7     |
|------|-------|-------|-------|-------|-------|-------|-------|-------|
| 0    | 90    | 98,5  | 107,1 | 110,4 | 188,9 | 192,4 | 184,1 | 188,6 |
| 1    | 109,6 | 91,6  | 110,8 | 106   | 192,5 | 197,5 | 190,9 | 192,4 |
| 2    | 118,7 | 110,8 | 91,5  | 97,9  | 181,1 | 190,5 | 189,6 | 190,7 |
| 3    | 126,8 | 106,5 | 100,8 | 90,1  | 193,7 | 198,5 | 196,9 | 201,1 |
| 4    | 204,7 | 190,6 | 189   | 188,8 | 90    | 98,4  | 106,7 | 110,2 |
| 5    | 206,6 | 197,6 | 194,1 | 194,6 | 97,9  | 91,5  | 110,8 | 106   |
| 6    | 203,7 | 189,1 | 189,5 | 192,4 | 106   | 110,8 | 91,5  | 97,9  |
| 7    | 201,3 | 193,1 | 193,6 | 198,4 | 110,1 | 106,5 | 98,6  | 90,1  |

Latency [ns]

#### Measured by: Inte Memory Latency Checker - v3.9a

VSB TECHNICAL IIT4INNOVATIONS UNIVERSITY NATIONAL SUPERCOMPUTING OF OSTRAVA CENTER

## CHIPLET BASED VS MONOLITHIC CPU







Intel Xeon Platinum 8280

https://www.anandtech.com/show/16529/amd-epyc-milan-review/4



## MEMORY BANDWIDTH VS. CLOCK FREQ

#### $10^{3}$ 40 10<sup>2</sup> Performance [GFLOP/sec] 35 DP Vector FMA Peak 54 GFLOPS DP Vector Add Peak 27 GFLOPS 10<sup>1</sup> Scalar Add Peak 7 GFLOPS 10<sup>0</sup> $10^{-1}$ · 10 Memory bound Compute bound 5 $10^{-2}$ 10-3 $10^{-2}$ $10^{-1}$ $10^{0}$ 10<sup>2</sup> $10^{1}$ Arithmetic intensity [FLOP/byte]

LAMMPS, EMA

- 2 competing system-level choke points:
  - Bandwidth to main memory
  - Compute Bound (frequency)
- These are mutually exclusive to each other
- Perform roofline analysis to confirm where hot-routine lands (red circle)
- It has performed this analysis on a number of popular HPC codes across CFD, Weather, Quantum Chemistry, Molecular Dynamics: Codes are memory bound or borderline
- HPL (compute bound) is \*NOT\* a good proxy for scoping job throughput on realistic workloads.
- Use memory bound synthetics: HPCG or STREAM

NATIONAL SUPERCOMPUTING



#### Compiling STREAM benchmark for AMD CPUs

#### ml AOCC

clang -03 -fopenmp -mcmodel=large -DSTREAM\_TYPE=double mavx2 -DSTREAM\_ARRAY\_SIZE=250000000 -DNTIMES=10 -ffpcontract=fast -fnt-store stream.c -o stream\_c

#### **Running STREAM**

\$ export OMP\_SCHEDULE=static
\$ export OMP\_DYNAMIC=false
\$ export OMP\_THREAD\_LIMIT=256
\$ export OMP\_NESTED=FALSE
\$ export OMP\_STACKSIZE=256M

STREAM generally gives the better performance with 1 thread per CCD. Binding options for AMD EPYC 7742 and AMD EPYC 7763 Processor to bind 1 thread per CCD: - ---- export GOMP\_CPU\_AFFINITY=0-127:8 and - export OMP\_NUM\_THREADS=16

```
# Thread Binding Options for AMD EPYC 7742/7763 Processor
$ export GOMP_CPU_AFFINITY=0-127:8
$ export OMP_NUM_THREADS=16
```

```
$ echo "running for 1 thread per CCD"
$ stream_c.exe
```

#### **Basic Details of Flags used:**

| Mcmodel=large        | Generate code for the large model. This model makes no assumptions about addresses and sizes of sections.                       |
|----------------------|---------------------------------------------------------------------------------------------------------------------------------|
| STREAM_ARRAY_SIZE= " | <b>250000000</b> " Sets the Array size for the STREAM benchmark. General recommendation is that "STREAM_ARRAY_SIZE" must be     |
|                      | at least 4x the size of the sum of all the last-level caches in the system.                                                     |
| NTIMES=10            | STREAM runs each kernel "NTIMES" times.                                                                                         |
| ffp-contract=fast    | enables floating-point expression contraction such as forming of fused multiply-add operations if the target has native support |
|                      | for them.                                                                                                                       |
| fnt-store            | generate non-temporal store instruction for array accesses in a loop with large trip count.                                     |

|                           |                                             | 🟦 Iriha — Iriha@acn06:~ — ssh Irih      | a@karolina.it4i.cz — 127×46                        |                                    |
|---------------------------|---------------------------------------------|-----------------------------------------|----------------------------------------------------|------------------------------------|
| Function Best Rate MB/s # | 90 <b>.5%]</b> 33 [                         | <b>0.0%</b> 65                          | 0.0%]                                              | 97 [ 0.0%]                         |
| 2 E                       | 0.0% 34 [                                   | 0.0% 66                                 | 0.0%                                               | 98 <b>[ 0.0%]</b>                  |
| 3 <b>[</b>                | 0.0%] 35 [                                  | <b>0.0%</b> 67                          | 0.0%                                               | 99 [ 0.0%]                         |
| 4 <u></u>                 | 0.0%] 36 [                                  | 0.0% 68                                 | 0.0%                                               | 100 0.0%                           |
| 5                         | 0.0% 37                                     | 0.0% 69                                 | 0.0%                                               | 101[ 0.0%]                         |
| 6                         | 0.0% 38 [                                   | 0.0% 70                                 | 0.0%                                               | 102[ 0.0%]                         |
|                           | 0.0% 39 [                                   | 0.0% 71                                 | 0.0%                                               | 103[ 0.0%]                         |
|                           | 0.0% 40                                     | 0.0% 72                                 | _ 0.0%_                                            |                                    |
|                           |                                             | <b>U_U%_</b> 73                         | L U.U%                                             |                                    |
|                           | 4.5% <u>4</u> 2                             | <b>0.0%</b> 74                          | U_U%                                               |                                    |
|                           | 0.0% 43                                     | ຢ <b>ະ</b> ປ% ຢູ່ (ວ<br>ຊຸດທີ 70        | Ø_0%_                                              |                                    |
|                           | 0.0%] 44 L                                  | 0,000,00<br>0,000,00<br>77              | 0_0%_                                              | 100L Ø.Ø%]                         |
|                           | 0.0%] 43 L                                  | 0.00/01 70                              | 0_0%_                                              | 109L Ø.Ø%J                         |
|                           | 0.0%] 40 [<br>0.0%] 47 [                    | 0.0%] 70                                | L 0.0%]                                            | 111 <b>F</b> 0.097                 |
|                           | 0.0%] 4° [<br>0.0%] /8 [                    | 21 [640-0<br>0 [92] 20                  | L 0.0%]                                            | 112 <b>F</b> 0.092 <b>T</b>        |
| 17 <b>E</b>               | 0.0%] 40 [<br>0.0%] 49 [                    | 0.0%] 00<br>0.0%] 81                    | [ 0.0%]                                            | 113F 0.0%                          |
|                           | 0.0%] +2 [<br>0.0%] 50 [                    | 0.0% 1 82                               | [ 0.0%]                                            | 114 <b>F</b> 0.0%                  |
| 19 F                      | 0.0%] 51 F                                  | <b>0.0%</b> 83                          | Г 0.0%]                                            | 115 0.0%                           |
| 20                        | <b>0.0%</b> 52 <b>F</b>                     | <b>0.0%</b> 84                          | 0.0%                                               | 116 0.0%                           |
|                           | 0.0% 53 F                                   | 0.0% 85                                 | 0.0%                                               | 117 0.0%                           |
| 22                        | 0.0% 54                                     | 0.0% 86                                 | 0.0%                                               | 118 0.0%                           |
| 23                        | 0.0% 55                                     | 0.0% 87                                 | 0.0%                                               | 119 0.0%                           |
| 24                        | 0.0% 56 [                                   | <b>0.0%</b> 88                          | 0.0%                                               | 120 <b>0.0%</b>                    |
| 25 [                      | 0.0% 57 [                                   | 0.0% 89                                 | 0.0%                                               | 121 0.0%                           |
| 26 [                      | <b>4.5%]</b> 58 [                           | 0.0% 90                                 | 0.0%                                               | 122[ 0.0%]                         |
| 27 [                      | 0.0% 59 [                                   | 0.0% 91                                 | 0.0%                                               | 123[   <b>9.1%</b> ]               |
| 28 [                      | 0.0% 60 [                                   | <b>0.0%</b> 92                          | 0.0%                                               | 124[ 0.0%]                         |
| 29 [                      | 0.0% 61 [                                   | <b>0.0%</b> 93                          | 0.0%                                               | 125 0.0%                           |
| 30 L                      | 0.0% 62                                     | 0.0% 94                                 | 0.0%                                               |                                    |
|                           |                                             | <b>U.U%</b> 95                          | 0.0%                                               |                                    |
|                           | 0.0% 64                                     | 0.0% 96<br>27 56 400 767                | 0.0%                                               | 128 0.0%                           |
|                           |                                             |                                         | d avanada. 0 68 2 27 7 62                          |                                    |
| ⊃wpL                      |                                             | Unt                                     | ime <b>. 1 day. 08.19.59</b>                       |                                    |
|                           |                                             | ор с                                    | line. 1 uuy, 00.19.39                              |                                    |
| PID USER                  | PRI NI VIRT RES                             | SHR S CPU% MEM% TIME+ Com               | mmand                                              |                                    |
| 125100 lriha              | 35 15 131M 3524 1                           | 528 K 9.3 0.0 1:19.05 ht                | op –d Z                                            |                                    |
| 51694 root                | <u>20</u> 0 221M 11340 33                   | 108 S U.U U.U 3:22.56 /op               | pt/pbs/sbin/pbs_mom                                |                                    |
| 4256 FOOT                 | ZU UZZ61Z Z31Z S                            | 79250.0001:31.73/u:<br>216600000001.46/ | sr/spin/irqpalancetoreg                            |                                    |
| 4202 1'00t                | - 20 0 197M 1612 7                          | 524 5 0 0 0 0 0 0 0 0 46 00 /ol         | s <del>r/spin/NetworkMunuyer</del> ––n<br>bip/ipi+ |                                    |
| 5577 root                 | - 20 0 107M 4032 20<br>- 20 0 360M 45116 21 | 737 5 0 0 0 0 0 0 40 90 751             | nt/mellanov/sharn/hin/shar                         | nd                                 |
| 47271                     | 20 0 1256M 17896 3                          | 124 S                                   | sr/hin/nv-fahricmana <del>aer_</del>               | /usr/share/nvidia/nvswitch/fabricm |
| F1Help F2Setup            | F3SearchF4FilterF5Tree                      | F6SortBvF7Nice -F8Nice +F9              | Kill F10Ouit                                       |                                    |

- Maximum memory bandwidth can be reached with only 16 OMP threads / CPU cores if placed correctly
- More threads improve compute performance, but reduces memory bandwidth up to 12%
- STREAM generally gives the better performance with 1 thread per CCD



# of active cores

| GOMP_CPU_AFFINITY     | 0-127:32 | 0-127:16 | 0-127:8 | 0-127:4 | 0-127:2 | 0-127:1 |
|-----------------------|----------|----------|---------|---------|---------|---------|
| # of active CPU cores | 4        | 8        | 16      | 32      | 64      | 128     |
| Max bandwidth [GB/s]  | 153,1    | 307,3    | 338,6   | 326,9   | 310,8   | 297,9   |
| Efficiency            | 45,2%    | 90,8%    | 100,0%  | 96,6%   | 91,8%   | 88,0%   |

TECHNICAL | IT4INNOVATIONS UNIVERSITY NATIONAL SUPERCOMPUTING OF OSTRAVA CENTER



#### Maximum memory bandwidth

| GOMP_CPU_AFFINITY     | 0-127:32 | 0-127:16 | 0-127:8 | 0-127:4 | 0-127:2 | 0-127:1 |
|-----------------------|----------|----------|---------|---------|---------|---------|
| # of active CPU cores | 4        | 8        | 16      | 32      | 64      | 128     |
| Max bandwidth [GB/s]  | 153,1    | 307,3    | 338,6   | 326,9   | 310,8   | 297,9   |
| Efficiency            | 45,2%    | 90,8%    | 100,0%  | 96,6%   | 91,8%   | 88,0%   |

#### AOCC compiler:

```
clang -O3 -fopenmp -mcmodel=large -DSTREAM_TYPE=double
-DSTREAM_ARRAY_SIZE=250000000 -DNTIMES=10
```

-mavx2 -ffp-contract=fast -fnt-store
stream.c -o stream\_c

| GOMP_CPU_AFFINITY     | 0-127:32 | 0-127:16 | 0-127:8 | 0-127:4 | 0-127:2 | 0-127:1 |
|-----------------------|----------|----------|---------|---------|---------|---------|
| # of active CPU cores | 4        | 8        | 16      | 32      | 64      | 128     |
| Max bandwidth [GB/s]  | 107,2    | 212,7    | 248,2   | 239,9   | 231,8   | 227,2   |
| Efficiency            | 43,2%    | 85,7%    | 100,0%  | 96,7%   | 93,4%   | 91,5%   |

#### GCC compiler - settings from STREAM Makefile

gcc -O2 -fopenmp -mcmodel=large -DSTREAM\_TYPE=double -DSTREAM\_ARRAY\_SIZE=250000000 -DNTIMES=10

stream.c -o stream\_gcc



## SOFTWARE DEVELOPMENT ENVIRONMENT

### Use AMD tools for best performance and code efficiency on EPYC CPUs

- Compilers focus on delivering the best out-of-the-box code generation for C, C++, Fortran, Java
- Libraries support common kernels for core math, solvers and FFT
- Profiling tools enable developers to access the full capabilities of EPYC CPUs
  - | All tools are available at <u>https://developer.amd.com/</u> and of course as modules



VSB TECHNICAL | IT4INNOVATIONS ||||| UNIVERSITY | NATIONAL SUPERCOMPUTING OF OSTRAVA | CENTER



### NVIDIA HGX A100

8-GPUs connected with NVSwitch







|                                                                       | A100 PCle                           | 4-GPU                            | 8-GPU                           | 16-GPU                         |
|-----------------------------------------------------------------------|-------------------------------------|----------------------------------|---------------------------------|--------------------------------|
| GPUs                                                                  | 1x NVIDIA A100 PCIe                 | HGX A100 4-GPU                   | HGX A100 8-GPU                  | 2x HGX A100 8-GPU              |
| Form factor                                                           | PCIe                                | 4x NVIDIA A100 SXM               | 8x NVIDIA A100 SXM              | 16x NVIDIA A100 SXM            |
| HPC and AI compute<br>FP64<br>TF32*/FP16*<br>INT8*<br>* with sparsity | 19.5TF<br>312TF*/624TF*<br>1.2POPS* | 78TF<br>1.25PF*/2.5PF*<br>5POPS* | 156TF<br>2.5PF*/5PF*<br>10POPS* | 312TF<br>5PF*/10PF*<br>20POPS* |
| Memory                                                                | 40 or 80GB per GPU                  | Up to 320GB                      | Up to 640GB                     | Up to 1,280GB                  |
| NVLink                                                                | Third generation                    | Third generation                 | Third generation                | Third generation               |
| NVSwitch                                                              | N/A                                 | N/A                              | Second generation               | Second generation              |
| NVSwitch GPU-to-GPU<br>bandwidth                                      | N/A                                 | N/A                              | 600GB/s                         | 600GB/s                        |
| Total aggregate bandwidth                                             | 600GB/s                             | 2.4TB/s                          | 4.8TB/s                         | 9.6TB/s                        |

https://www.nvidia.com/en-us/data-center/hgx/





A 100 40CD CVA



| -P64                          | 9.7 TFLOPS               |
|-------------------------------|--------------------------|
| P64 Tensor Core               | 19.5 TFLOPS              |
| FP32                          | 19.5 TFLOPS              |
| Tensor Float 32 (TF32)        | 156 TFLOPS   312 TFLOPS* |
| 3FLOAT16 Tensor Core          | 312 TFLOPS   624 TFLOPS* |
| P16 Tensor Core               | 312 TFLOPS   624 TFLOPS* |
| NT8 Tensor Core               | 624 TOPS   1248 TOPS*    |
| GPU Memory                    | 40GB HBM2                |
| GPU Memory Bandwidth          | 1,555GB/s                |
| ax Thermal Design Power (TDP) | 400W                     |
| Multi-Instance GPU            | Up to 7 MIGs @ 5GB       |
| Form Factor                   | SXM                      |
| nterconnect                   | NVLink: 600GB/s          |

\* With sparsity

\*\* SXM4 GPUs via HGX A100 server boards; PCIe GPUs via NVLink Bridge for up to two GPUs

https://www.nvidia.com/en-us/data-center/a100/

VSB TECHNICAL IT4INNOVATIONS UNIVERSITY NATIONAL SUPERCOMPUTING OF OSTRAVA CENTER



## NVLINK GPU INTERCONNECT IN DGX-A100



Bandwidth and latency for accessing remote memory over NVLink 3.0 for all combinations of GPUs

Unidir - Bandwith [GB/s]

| GPU | 0            | 1    | 2    | 3    | 4    | 5    | 6    | 7    |  |
|-----|--------------|------|------|------|------|------|------|------|--|
| 0   | 1180         | 244  | 255  | 251  | 255  | 255  | 249  | 255  |  |
| 1   | 251          | 1202 | 256  | 245  | 256  | 256  | 252  | 257  |  |
| 2   | 248          | 256  | 1195 | 255  | 252  | 255  | 255  | 248  |  |
| 3   | 252          | 257  | 257  | 1198 | 253  | 255  | 255  | 249  |  |
| 4   | 244          | 255  | 256  | 249  | 1173 | 254  | 249  | 253  |  |
| 5   | 251          | 256  | 255  | 251  | 256  | 1198 | 255  | 252  |  |
| 6   | 256          | 251  | 255  | 255  | 253  | 254  | 1195 | 248  |  |
| 7   | 257          | 256  | 248  | 255  | 257  | 251  | 255  | 1206 |  |
| GPU | 0            | 1    | 2    | 3    | 4    | 5    | 6    | 7    |  |
| C   | 4.1          | 11.1 | 11.1 | 10.2 | 10.0 | 9.9  | 9.9  | 10.0 |  |
| 1   | 11.1         | 4.2  | 11.1 | 9.6  | 9.7  | 9.6  | 9.8  | 9.7  |  |
| 2   | 11.0         | 11.0 | 4.1  | 10.0 | 10.0 | 9.9  | 9.9  | 9.9  |  |
| 3   | 11.1         | 10.1 | 9.6  | 4.2  | 10.0 | 10.0 | 9.9  | 10.0 |  |
| 4   | 11.6         | 10.0 | 9.7  | 9.8  | 4.4  | 9.7  | 9.7  | 9.7  |  |
| 5   | 11.7         | 10.1 | 9.8  | 9.8  | 9.7  | 4.4  | 9.7  | 9.7  |  |
| 6   | 5 11.6       | 11.3 | 10.6 | 9.8  | 9.8  | 9.9  | 4.4  | 9.7  |  |
| 7   | 11.7         | 10.3 | 9.8  | 9.8  | 9.7  | 9.7  | 9.8  | 4.4  |  |
| L   | Latency [us] |      |      |      |      |      |      |      |  |

**OF OSTRAVA** 

CENTER



### **GPU NODE TOPOLOGY**

[lriha@acn07.karolina ~]\$ nvidia-smi topo -m

|       | GPU0  | GPU1 | GPU2   | GPU3 | GPU4 | GPU5 | GPU6 | GPU7 | mlx5_ | 0 mlx5_ | <u>1 mlx5</u> | 2 mlx5_3 | CPU Affinity | NUMA | Affinity |
|-------|-------|------|--------|------|------|------|------|------|-------|---------|---------------|----------|--------------|------|----------|
| GPU0  | X     | NV12 | 2 NV12 | NV12 | NV12 | NV12 | NV12 | NV12 | SYS   | PXB     | SYS           | SYS      | 48-63        | 3    |          |
| GPU1  | NV12  | X    | NV12   | NV12 | NV12 | NV12 | NV12 | NV12 | SYS   | PXB     | SYS           | SYS      | 48-63        | 3    |          |
| GPU2  | NV12  | NV12 | 2 X    | NV12 | NV12 | NV12 | NV12 | NV12 | PXB   | SYS     | SYS           | SYS      | 16-31        | 1    |          |
| GPU3  | NV12  | NV12 | 2 NV12 | X    | NV12 | NV12 | NV12 | NV12 | PXB   | SYS     | SYS           | SYS      | 16-31        | 1    |          |
| GPU4  | NV12  | NV12 | 2 NV12 | NV12 | X    | NV12 | NV12 | NV12 | SYS   | SYS     | SYS           | PXB      | 112-127      | 7    |          |
| GPU5  | NV12  | NV12 | 2 NV12 | NV12 | NV12 | X    | NV12 | NV12 | SYS   | SYS     | SYS           | PXB      | 112-127      | 7    |          |
| GPU6  | NV12  | NV12 | 2 NV12 | NV12 | NV12 | NV12 | x    | NV12 | SYS   | SYS     | PXB           | SYS      | 80-95        | 5    |          |
| GPU7  | NV12  | NV12 | 2 NV12 | NV12 | NV12 | NV12 | NV12 | x    | SYS   | SYS     | PXB           | SYS      | 80-95        | 5    |          |
|       |       |      |        |      |      |      |      |      |       |         |               |          |              |      |          |
| mlx5_ | OSYS  | SYS  | PXB    | PXB  | SYS  | SYS  | SYS  | SYS  | x     | SYS     | SYS           | SYS      |              |      |          |
| mlx5_ | 1 PXB | PXB  | SYS    | SYS  | SYS  | SYS  | SYS  | SYS  | SYS   | X       | SYS           | SYS      |              |      |          |
| mlx5_ | 2 SYS | SYS  | SYS    | SYS  | SYS  | SYS  | PXB  | PXB  | SYS   | SYS     | X             | SYS      |              |      |          |
| mlx5_ | 3 SYS | SYS  | SYS    | SYS  | PXB  | PXB  | SYS  | SYS  | SYS   | SYS     | SYS           | X        |              |      |          |
|       |       |      |        |      |      |      |      |      |       |         |               |          |              |      |          |

Legend:

| X<br>SYS<br>NODE  | Self<br>Connection<br>Connection       | traversing<br>traversing               | PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)<br>PCIe as well as the interconnect between PCIe Host Bridges within a NUMA       |
|-------------------|----------------------------------------|----------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| node              |                                        |                                        |                                                                                                                                                           |
|                   |                                        |                                        |                                                                                                                                                           |
| PHB               | Connection                             | traversing                             | PCIe as well as a PCIe Host Bridge (typically the CPU)                                                                                                    |
| PHB<br>PXB        | Connection<br>Connection               | traversing traversing                  | PCIe as well as a PCIe Host Bridge (typically the CPU)<br>multiple PCIe bridges (without traversing the PCIe Host Bridge)                                 |
| PHB<br>PXB<br>PIX | Connection<br>Connection<br>Connection | traversing<br>traversing<br>traversing | PCIe as well as a PCIe Host Bridge (typically the CPU)<br>multiple PCIe bridges (without traversing the PCIe Host Bridge)<br>at most a single PCIe bridge |

## RUNNING MPI ON ALL 4 IB LINKS



4x 200 Gb/s Infiniband HDR links should provide close to 80 GB/s

however, they are on different sockets – one needs 2 MPI processes to reach 80 GB/s



## RUNNING MPI ON ALL 4 IB LINKS



#### 4x 200 Gb/s Infiniband HDR links should provide close to 80 GB/s

however, they are on different sockets – one needs 2 MPI processes to reach 80 GB/s

```
#qsub -q qnvidia -A PROJ_ID -1 select=2:mpiprocs=2:ompthreads=64:ncpus=128 -I -1 walltime=02:00:00
```

```
ml OpenMPI/4.1.1-GCC-10.3.0
mpic++ -fopenmp -o mpi_test_ompi mpi_test.cpp
mpirun -np 4 \
```

```
-bind-to core -cpu-list 16,80 --report-bindings \
-x UCX_MAX_EAGER_RAILS=2 -x UCX_MAX_RNDV_RAILS=2 \
-x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 \
./mpi_test_ompi
```

```
mpirun -np 4 \
-bind-to core -cpu-list 16,80 \
./mpi_test_ompi
```

mpirun -np 4 \
-bind-to core -cpu-list 48,112 \
./mpi\_test\_ompi

#### **Unified Communication X – UCX**

- an open-source communication framework
- takes care of multi rail support
- loaded as module with MPI



## RUNNING MPI ON ALL 4 IB LINKS



#### 4x 200 Gb/s Infiniband HDR links should provide close to 80 GB/s

however, they are on different sockets – one needs 2 MPI processes to reach 80 GB/s

qsub -q qnvidia -A PROJ\_ID -1 select=2:mpiprocs=2:ompthreads=64:ncpus=128 -I -1 walltime=02:00:00



## SINGLE NODE MULTI-GPU WITH OPENMP



```
omp_set_num_threads( num_gpus); // create as many CPU threads as there are CUDA
devices
```

#pragma omp parallel

}

unsigned int cpu\_thread\_id = omp\_get\_thread\_num();

cudaSetDevice( cpu thread id);

```
GPUkernel<<<gpu_blocks, gpu_threads>>>( ... );
```

Source: https://github.com/NVIDIA/cuda-samples/tree/master/Samples/cudaOpenMP



## CUDA AWARE MPI



cudaMemcpy(s\_buf\_h,s\_buf\_d,size,cudaMemcpyDeviceToHost); MPI\_Send(s\_buf\_h,size,MPI\_BYTE,1,tag,MPI\_COMM\_WORLD);

MPI\_Recv(r\_buf\_h,size,MPI\_BYTE,0,tag,MPI\_COMM\_WORLD,&stat); cudaMemcpy(r\_buf\_d,r\_buf\_h,size,cudaMemcpyHostToDevice);

#### REGULAR MPI GPU TO REMOTE GPU

MPI\_Send(s\_buf\_d,size,MPI\_BYTE,1,tag,MPI\_COMM\_WORLD);

MPI\_Recv(r\_buf\_d,size,MPI\_BYTE,0,tag,MPI\_COMM\_WORLD,&stat);

#### MPI GPU TO REMOTE GPU

Source: Multi-GPU Programming with CUDA, GPUDirect, NCCL, NVSHMEM, and MPI; Akhil Langer, Senior Software Engineer, NVIDIA

VSB TECHNICAL | IT4INNOVATIONS ||||| UNIVERSITY OF OSTRAVA | CENTER



## NCCL

### NCCL : NVIDIA Collective Communication Library

Communication library running on GPUs, for GPU buffers.



Source: NCCL: ACCELERATED MULTI-GPU COLLECTIVE COMMUNICATIONS, Cliff Woolley, Sr. Manager, Developer Technology Software, NVIDIA

TECHNICAL IT4INNOVATIONS UNIVERSITY NATIONAL SUPERCOMPUTING OF OSTRAVA CENTER

#### ADVANCED METHOD BASED ON MEMORY ACCESS ANALYSIS IMPLEMENTATION USING CUDA UNIFIED MEMORY



#### Data structure memory allocation in CPU memory



cudaMemAdviseSetReadMostly

size t size = 4 \* 64 \* 1024 \* 1024; //size of data struct char \*data struct = NULL;

cudaMallocManaged(&data struct, size);

for (int gpu = 0; gpu < gpu count; gpu++)</pre> cudaMemAdvise((char \*)data struct, size, cudaMemAdviseSetAccessedBy, gpu);

size t csize = 64 \* 1024 \* 1024; // chunk size

//set chunk 1 to be replicated in memory of all GPUs cudaMemAdvise(data struct + 0\*csize, csize, cudaMemAdviseSetReadMostly, 0)); for (int gpu = 0; gpu < gpu count; gpu++) cudaMemPrefetchAsync(data struct + 0\*csize, csize, gpu);

//set chunk 2 to located on GPU0 only cudaMemAdvise(data struct + 1\*csize, csize, cudaMemAdviseSetPreferredLocation, 0)); cudaMemPrefetchAsync(data struct + 1\*csize, csize, 0);

//set chunk 3 to located on GPU1 only cudaMemAdvise(data struct + 2\*csize, csize, cudaMemAdviseSetPreferredLocation, 1)); cudaMemPrefetchAsync(data struct + 2\*csize, csize, 1);

#### //set chunk 4 to located on GPU3 only

cudaMemAdvise(data struct + 3\*csize, csize, cudaMemAdviseSetPreferredLocation, 3)); cudaMemPrefetchAsync(data struct + 3\*csize, csize, 3);

cudaMemAdvise ( const void\* devPtr, size t count, cudaMemoryAdvise advice, int device );

cudaMemPrefetchAsync ( const void\* devPtr, size t count, int dstDevice, cudaStream t stream = 0 );

**VSB** TECHNICAL **IT4INNOVATIONS** UNIVERSITY NATIONAL SUPERCOMPUTING OF OSTRAVA CENTER

### ADVANCED METHOD BASED ON MEMORY ACCESS ANALYSIS MEMORY ACCESS PATTERN



#### Method work on a memory management level

- it does not differentiate what content is stored in a data structure
- works with all data structures equally
- all allocations are divided into chunks of fixed sizes





VSB TECHNICAL

UNIVERSITY

OF OSTRAVA

**IT4INNOVATIONS** 

NATIONAL SUPERCOMPUTING

|  | data_a <sup>data</sup> data_c | data_d | data_e |
|--|-------------------------------|--------|--------|
|--|-------------------------------|--------|--------|

#### Division of all data structures into chunks

|  | dataa | data_c | datad | data_e |
|--|-------|--------|-------|--------|
|--|-------|--------|-------|--------|

#### Step 1: Memory access pattern analysis

• based on 1 sample per pixel pre pass, which can be executed on CPU or on fully distributed data in GPU memories

#### Memory access counters of all individual chunks

 9
 1
 8
 1
 4
 8
 9
 5
 7
 4
 3
 6
 2
 4
 7
 8
 5
 5
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 8
 9
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 7
 8
 5
 5
 7
 8
 5

#### Step 2: Identify chunks which will be replicated

- chunks with the highest amount of memory access are marked for replication on all GPUs
- number of chunks to be replicated is based on the scene size and total amount of GPU memory

#### Memory access counters of all individual chunks



Step 3: Distribution of the non-replicated chunks

- we have to assign chunk to the GPU that has the highest number of access to it
- based on scene partitioning each part will be assigned to one GPU
- we have to record memory access counters for each part of the scene independently
  - this again can be done on both CPU or multiple GPUs with fully distributed data in a round robin fashion
- chunks with no memory accesses are distributed in a round robin fashion







Step 3: Distribution of the non-replicated chunks

- we have to assign chunk to the GPU that has the highest number of access to it
- based on scene partitioning each part will be assigned to one GPU
- we have to record memory access counters for each part of the scene independently
  - this again can be done on both CPU or multiple GPUs with fully distributed data in a round robin fashion
- chunks with no memory accesses are distributed in a round robin fashion

| GPU 0                                     | Part of scene rendered by GPU 0 |
|-------------------------------------------|---------------------------------|
| GPU 1 GPU 1                               | Part of scene rendered by GPU 1 |
| GPU 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | Part of scene rendered by GPU 2 |
| GPU 3 GPU 3                               | Part of scene rendered by GPU 3 |
|                                           |                                 |
| Final data distribution per chunk         |                                 |

Final data distribution per chunk





# ADVANCED METHOD BASED ON MEMORY ACCESS ANALYSIS VERTICES – DATA DISTRIBUTION



**Replication 0%** 

Replication 10%



 VSB
 TECHNICAL
 IT4INNOVATIONS

 UNIVERSITY
 NATIONAL SUPERCOMPUTING
 NATIONAL SUPERCOMPUTING





TECHNICAL I IT4INNOVATIONS UNIVERSITY NATIONAL SUPERCOMPUTING OF OSTRAVA CENTER

### **RENDERING OF MASSIVE SCENES**



OF OSTRAVA



Source: Jaros M., Riha L., Strakos P., Spetko M.: GPU Accelerated Path Tracing of Massive Scenes, ACM Transactions on Graphics (TOG), 2021, DOI: http://dx.doi.org/10.1145/3447807

#### ADVANCED METHOD BASED ON MEMORY ACCESS ANALYSIS SCALABILITY ANALYSIS ON SMALL SCENE - DGX A100



Moana 27GB

Source: Jaros M., Riha L., Strakos P., Spetko M.: GPU Accelerated Path Tracing of Massive Scenes, ACM Transactions on Graphics (TOG), 2021, DOI: http://dx.doi.org/10.1145/3447807

VSB TECHNICAL | IT4INNOVATIONS |||| UNIVERSITY | NATIONAL SUPERCOMPUTING OF OSTRAVA | CENTER

### RENDERING OF MASSIVE SCENES



Moana 169GB



VSB TECHNICAL | IT4INNOVATIONS UNIVERSITY | NATIONAL SUPERCOMPUTING OF OSTRAVA | CENTER

Source: Jaros M., Riha L., Strakos P., Spetko M.: GPU Accelerated Path Tracing of Massive Scenes, ACM Transactions on Graphics (TOG), 2021, DOI: http://dx.doi.org/10.1145/3447807



Lubomir Riha lubomir.riha@vsb.cz VSB TECHNICAL | IT4INNOVATIONS |||| UNIVERSITY | NATIONAL SUPERCOMPUTING OF OSTRAVA | CENTER

IT4Innovations National Supercomputing Center VSB – Technical University of Ostrava Studentská 6231/1B 708 00 Ostrava-Poruba, Czech Republic <u>www.it4i.cz</u>

