Authors:
(1) Simone Silvestri, Massachusetts Institute of Technology, Cambridge, MA, USA;
(2) Gregory Wagner, Massachusetts Institute of Technology, Cambridge, MA, USA;
(3) Christopher Hill, Massachusetts Institute of Technology, Cambridge, MA, USA;
(4) Matin Raayai Ardakani, Northeastern University, Boston, MA, USA;
(5) Johannes Blaschke, Lawrence Berkeley National Laboratory, Berkeley, CA, USA;
(6) Valentin Churavy, Massachusetts Institute of Technology, Cambridge, MA, USA;
(7) Jean-Michel Campin, Massachusetts Institute of Technology, Cambridge, MA, USA;
(8) Navid Constantinou, Australian National University, Canberra, ACT, Australia;
(9) Alan Edelman, Massachusetts Institute of Technology, Cambridge, MA, USA;
(10) John Marshall, Massachusetts Institute of Technology, Cambridge, MA, USA;
(11) Ali Ramadhan, Massachusetts Institute of Technology, Cambridge, MA, USA;
(12) Andre Souza, Massachusetts Institute of Technology, Cambridge, MA, USA;
(13) Raffaele Ferrari, Massachusetts Institute of Technology, Cambridge, MA, USA.
Table of Links
5.1 Starting from scratch with Julia
5.2 New numerical methods for finite volume fluid dynamics on the sphere
5.3 Optimization of ocean free surface dynamics for unprecedented GPU scalability
6 How performance was measured
7 Performance Results and 7.1 Scaling Results
9 Acknowledgments and References
7 Performance Results
We report both scaling results via time-to-solution, measured in simulated years per wall-clock day (SYPD), and efficiency results via energy-to-solution, measured in simulated years per megawatt-hour (SYPMWh).
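For concreteness, the sketch below (in Julia, the language of Oceananigans) shows how these two metrics relate to raw measurements. The time step, wall-clock time per step, and average power draw are illustrative placeholders, not values taken from the runs reported here.

```julia
# Hedged sketch: converting raw measurements into SYPD and SYPMWh.
# All input values below are assumed for illustration only.

seconds_per_year = 365 * 86400

Δt        = 270.0   # simulated seconds advanced per model time step (assumed)
wall_time = 0.33    # measured wall-clock seconds per time step (assumed)
avg_power = 0.1     # average cluster power draw in MW (assumed)

# SYPD: simulated years advanced per wall-clock day of computation
SYPD = (Δt / wall_time) * 86400 / seconds_per_year

# SYPMWh: simulated years advanced per megawatt-hour of energy consumed
SYPMWh = SYPD / (avg_power * 24)

println("SYPD ≈ $(round(SYPD, digits=2)), SYPMWh ≈ $(round(SYPMWh, digits=2))")
```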
7.1 Scaling Results
Realistic ocean simulations (Satori and Engaging clusters). We report strong scaling tests using the realistic global setup shown in figure 3 on two clusters: (i) the MIT Satori cluster [2], a high-performance Power 9 system composed of 64 nodes, each hosting four Nvidia V100 GPUs with 32 GB of memory, and (ii) the MIT Engaging cluster, using 8 nodes that each host four NVLink-connected A100 GPUs with 80 GB of memory. The resulting wall-clock time per time step, averaged over 1500 time steps, is presented in figure 4 for both single-precision (FP32) and double-precision (FP64) computations. On a single node, OceananigansR12 attains 0.9 SYPD in double precision and 1.4 SYPD in single precision, with a wall-clock time per time step ranging from 330 to 550 ms. When the number of nodes increases to 16 (64 GPUs), the communication overhead grows, resulting in 12.4 SYPD in single precision and 7.75 SYPD in double precision. We measure a strong scaling efficiency of 52% in single precision and 55% in double precision over 64 GPUs: the computational workload (40 ms of wall-clock time per time step) eventually becomes too short to completely mask the communication overhead.
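The efficiency figures follow from comparing the measured SYPD against ideal linear scaling from the single-node baseline. The hedged sketch below illustrates the computation using the rounded double-precision numbers quoted above, so the result lands near, though not exactly at, the reported 55%.

```julia
# Hedged sketch: strong scaling efficiency from SYPD measured at two GPU counts
# for a fixed problem size. Inputs are the rounded values quoted in the text.

strong_scaling_efficiency(sypd_base, gpus_base, sypd_scaled, gpus_scaled) =
    (sypd_scaled / sypd_base) / (gpus_scaled / gpus_base)

# Example: 0.9 SYPD on 4 GPUs (one node) versus 7.75 SYPD on 64 GPUs (16 nodes).
η = strong_scaling_efficiency(0.9, 4, 7.75, 64)
println("Strong scaling efficiency ≈ $(round(100η, digits=1))%")
```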
For higher-resolution, ocean weather-permitting simulations, scaling is nearly ideal across the range we investigate. For OceananigansR24 (FP64-V100) and OceananigansR48 (FP32-V100) we measure better-than-ideal (superlinear) scaling. This counter-intuitive result is a product of improved load balancing as the number of GPUs increases. In summary, we attain 1.94 SYPD on 120 V100 GPUs at kilometer-scale resolution (OceananigansR24) and 0.33 SYPD with an ocean weather-resolving simulation (OceananigansR48). Finally, we tested the OceananigansR48 setup on 144 Perlmutter nodes (576 A100 GPUs), reaching 0.95 SYPD. This is the first instance of a kilometer-scale ocean simulation achieving ∼1 SYPD. We also tested the OceananigansR12 setup on 17 nodes, obtaining 9.9 SYPD (see figure 5).
Aqua-planet simulation (Perlmutter cluster). We report weak scaling tests on the NERSC supercomputer Perlmutter, an HPE (Hewlett Packard Enterprise) Cray EX supercomputer whose nodes each host four A100 GPUs with 40 GB of memory, linked through an NVLink3 interconnect. All weak scaling tests are performed using the OceananigansAP setup in double precision. We use two baseline horizontal resolutions (1/12 and 1/6 of a degree), progressively increasing the resolution with the number of GPUs while maintaining 100 vertical levels. As shown in figure 5, we obtain 100% weak scaling efficiency over the whole investigated range (1 to 196 nodes, 4 to 768 A100s).
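For reference, weak scaling efficiency compares the wall-clock time per time step at a fixed workload per GPU against the baseline configuration. The sketch below illustrates the computation with placeholder timings rather than measured values.

```julia
# Hedged sketch: weak scaling efficiency when the problem size grows in proportion
# to the GPU count. Timings are illustrative placeholders, not measurements.

weak_scaling_efficiency(t_base, t_scaled) = t_base / t_scaled

t_per_step_base  = 0.12   # seconds per time step on the smallest configuration (assumed)
t_per_step_large = 0.12   # seconds per time step on the largest configuration (assumed)

# Equal time per step at fixed work per GPU corresponds to 100% efficiency.
η = weak_scaling_efficiency(t_per_step_base, t_per_step_large)
println("Weak scaling efficiency = $(round(100η, digits=1))%")
```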
This paper is available on arXiv under a CC BY 4.0 DEED license.