Speaker
Dr Jiri Jaros (Brno University of Technology)
Description
**Introduction**
The simulation of ultrasound wave propagation through biological tissue has a wide range of practical applications, including the planning of therapeutic ultrasound treatments for brain disorders such as brain tumours, essential tremor, and Parkinson's disease. The major challenge is to ensure the ultrasound focus is accurately placed at the desired target within the brain, because the skull can significantly distort the beam. Performing accurate ultrasound simulations, however, requires the simulation code to exploit thousands of processor cores and work with terabytes of data while delivering the output within 24 hours.
We have recently developed an efficient full-wave ultrasound model (the parallel k-Wave toolbox) that enables realistic problems to be solved within a week using a pseudospectral model and a global slab domain decomposition (GDD). Unfortunately, GDD limits scaling to the number of 2D slabs, which is usually below 2048. Moreover, since the method relies on the fast 3D Fourier transform, the all-to-all communications concealed in matrix transpositions significantly deteriorate the performance. The imbalance between communication and computation is even more striking when graphics processing units (GPUs) are used, as the raw performance of GPUs is an order of magnitude above that of current central processing units (CPUs). In addition, transfers over the peripheral component interconnect express (PCI-E) bus have to be considered as another source of communication overhead. The most efficient implementation known to us, proposed by Gholami, reveals the fundamental communication problem of distributed GPU FFTs: for a $1024^3$ FFT calculated using 128 GPUs, the communication overhead accounts for 99% of the total execution time. Although the execution time is reduced by 8.6$\times$ for a 32$\times$ increase in the number of GPUs (giving a parallel efficiency of 27%), this overhead may not be acceptable in many applications.
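To make the communication issue concrete, the minimal NumPy sketch below (illustrative only, not the parallel k-Wave code) shows how a pseudospectral method computes a spatial derivative: every derivative needs a forward and inverse FFT over the full axis, so on a slab-decomposed 3D grid each time step turns into distributed transposes with all-to-all traffic.

```python
# Minimal, self-contained sketch of a pseudospectral spatial derivative
# (illustrative; not the parallel k-Wave implementation).
import numpy as np

def spectral_derivative(f, dx):
    """First derivative of a periodic 1D signal via the Fourier collocation method."""
    k = 2.0 * np.pi * np.fft.fftfreq(f.size, d=dx)        # angular wavenumbers
    return np.real(np.fft.ifft(1j * k * np.fft.fft(f)))   # d/dx in k-space is multiplication by ik

# The derivative of sin(x) on a periodic grid is recovered to near machine precision.
x = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
error = np.max(np.abs(spectral_derivative(np.sin(x), x[1] - x[0]) - np.cos(x)))
print(f"max error: {error:.2e}")
```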
**Proposed method**
This paper presents a novel multi-GPU implementation of the Fourier spectral method using domain decomposition based on a local Fourier basis. The fundamental idea behind this work is to replace the global all-to-all communications introduced by the FFT (used to calculate spatial derivatives) with direct neighbour exchanges. By doing so, the communication burden can be significantly reduced, at the expense of a slight reduction in numerical accuracy.
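A minimal 1D sketch of the idea is given below (the function interface and the cosine bell taper are illustrative assumptions, not the toolbox's actual implementation): each subdomain is extended by a halo received from its direct neighbours, the overlap is tapered by a smooth bell so the extended block looks periodic, and the derivative is then computed with a purely local FFT.

```python
# Illustrative 1D sketch of a spectral derivative on a local Fourier basis
# (assumed bell shape and interface; not the parallel k-Wave code).
import numpy as np

def local_fourier_derivative(local, left_halo, right_halo, dx, overlap=16):
    """Differentiate one subdomain using halos (length `overlap`) from its two neighbours."""
    block = np.concatenate([left_halo, local, right_halo])    # local data + halos
    bell = np.ones(block.size)
    taper = 0.5 * (1.0 - np.cos(np.pi * np.arange(overlap) / overlap))
    bell[:overlap] = taper                                    # smooth rise over the left halo
    bell[-overlap:] = taper[::-1]                             # smooth fall over the right halo
    k = 2.0 * np.pi * np.fft.fftfreq(block.size, d=dx)        # local wavenumbers
    dblock = np.real(np.fft.ifft(1j * k * np.fft.fft(block * bell)))
    return dblock[overlap:-overlap]                           # keep the interior only
```

In 3D, the halos are the only data that cross subdomain boundaries, so the exchange reduces to nearest-neighbour MPI transfers (and device-to-device copies on GPUs) instead of the all-to-all pattern of a distributed FFT.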
The numerical error is shown to depend on the overlap (halo) size, to be independent of the local domain size, and to grow linearly with the number of domain cuts an acoustic wave must traverse. For an overlap (halo) size of 16 grid points, the error is on the order of $10^{-3}$, which is comparable to the error introduced by the perfectly matched layer (which attenuates the signal at the domain boundaries and enforces periodicity). Consequently, the level of parallelism achievable in practice is not limited by the reduction in accuracy due to the use of the local Fourier basis. Strong scaling results demonstrate that the code scales with reasonable parallel efficiency, reaching 50% for large simulation domain sizes. However, the small amount of on-board memory ultimately limits the global domain size for a given number of GPUs. A 1D decomposition is shown to be the most efficient unless the local subdomain becomes too thin. Beyond that point, it is beneficial to use a half 2D or 3D decomposition, with only a single neighbour in a given direction, to limit the number of MPI transfers.
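The preference for a 1D cut follows from simple surface arithmetic, sketched below for the grid quoted in the Summary (the GPU counts and formulas are illustrative assumptions, not measured figures): with a slab cut the two halo faces stay fixed while the slabs shrink, so once the slab thickness approaches the 16-point overlap a higher-dimensional cut with well-shaped subdomains becomes preferable.

```python
# Back-of-the-envelope halo sizes per GPU (illustrative assumptions, not measured data).
nx, ny, nz, overlap = 1536, 1024, 2048, 16        # grid from the Summary, 16-point halo

# 1D slab cut along z: two full nx*ny halo faces per GPU, independent of the GPU count.
slab_halo = 2 * overlap * nx * ny
print(slab_halo, "halo cells per GPU;  64 GPUs -> slab thickness", nz // 64)
print(slab_halo, "halo cells per GPU; 256 GPUs -> slab thickness", nz // 256, "(thinner than the halo)")

# 2D cut into 16 x 16 pencils along y and z: four smaller halo faces per GPU.
pencil_halo = 2 * overlap * nx * (nz // 16) + 2 * overlap * nx * (ny // 16)
print(pencil_halo, "halo cells per GPU for the same 256 GPUs")
```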
An overlap size of 16 grid points is shown to be a good trade-off between speed and accuracy, with larger overlaps becoming impractical due to the overhead imposed by large MPI transfers. Compared to the CPU implementation using global domain decomposition, the GPU version is always faster for an equivalent number of nodes. For production simulations executed as part of ultrasound treatment planning, the GPU implementation reduces the simulation time by a factor of 7.5 and the simulation cost by a factor of 3.8. This is a promising result, given that the GPUs used are now close to being decommissioned.
Summary
This paper presents a novel approach to domain decomposition for spectral methods targeted at GPU clusters. By reducing the communication overhead and accepting a small numerical inaccuracy, we managed to reduce the simulation time by a factor of 7.5 and the simulation cost by a factor of 3.8 for a realistic grid size of $1536 \times 1024 \times 2048$.
Primary author
Dr Jiri Jaros (Brno University of Technology)
Co-authors
Dr Bradley E Treeby (University College London)
Mr Filip Vaverka (Brno University of Technology)