
Performance Results and Scaling Results


Authors:

(1) Simone Silvestri, Massachusetts Institute of Technology, Cambridge, MA, USA;

(2) Gregory Wagner, Massachusetts Institute of Technology, Cambridge, MA, USA;

(3) Christopher Hill, Massachusetts Institute of Technology, Cambridge, MA, USA;

(4) Matin Raayai Ardakani, Northeastern University, Boston, MA, USA;

(5) Johannes Blaschke, Lawrence Berkeley National Laboratory, Berkeley, CA, USA;

(6) Valentin Churavy, Massachusetts Institute of Technology, Cambridge, MA, USA;

(7) Jean-Michel Campin, Massachusetts Institute of Technology, Cambridge, MA, USA;

(8) Navid Constantinou, Australian National University, Canberra, ACT, Australia;

(9) Alan Edelman, Massachusetts Institute of Technology, Cambridge, MA, USA;

(10) John Marshall, Massachusetts Institute of Technology, Cambridge, MA, USA;

(11) Ali Ramadhan, Massachusetts Institute of Technology, Cambridge, MA, USA;

(12) Andre Souza, Massachusetts Institute of Technology, Cambridge, MA, USA;

(13) Raffaele Ferrari, Massachusetts Institute of Technology, Cambridge, MA, USA.

Table of Links

Abstract and 1 Justification

2 Performance Attributes

3 Overview of the Problem

4 Current State of the Art

5 Innovations

5.1 Starting from scratch with Julia

5.2 New numerical methods for finite volume fluid dynamics on the sphere

5.3 Optimization of ocean free surface dynamics for unprecedented GPU scalability

6 How performance was measured

7 Performance Results and 7.1 Scaling Results

7.2 Energy efficiency

8 Implications

9 Acknowledgments and References

7 Performance Results

We report both scaling results, via time-to-solution in simulated years per wall clock day (SYPD), and energy efficiency results, via energy-to-solution in simulated years per megawatt-hour (SYPMWh).
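As a quick reference for how these metrics relate to quantities measured during a run, here is a minimal sketch in Julia (the language Oceananigans is written in). The time step, wall clock time per step, and power draw used below are illustrative placeholders, not values from the paper.

```julia
# Simulated years per wall clock day (SYPD): ratio of simulated time to wall
# clock time, with simulated time measured in (365-day) years and wall clock
# time measured in days.
sypd(Δt_seconds, wall_seconds_per_step) = (Δt_seconds / wall_seconds_per_step) / 365

# Simulated years per megawatt-hour (SYPMWh): energy-to-solution, obtained by
# dividing SYPD by the energy consumed in one wall clock day at an average
# power draw of `power_MW` megawatts (24 * power_MW megawatt-hours).
sypmwh(sypd_value, power_MW) = sypd_value / (24 * power_MW)

# Hypothetical example: a 300 s time step advanced in 100 ms of wall clock
# time on a machine drawing 0.5 MW on average.
s = sypd(300.0, 0.1)   # ≈ 8.2 SYPD
e = sypmwh(s, 0.5)     # ≈ 0.68 SYPMWh
```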

7.1 Scaling Results

Figure 4: Strong scaling tests for the realistic setups OceananigansR12 (1/12◦), OceananigansR24 (1/24◦), and OceananigansR48 (1/48◦). The left plot reports simulated years per wall clock day (SYPD), while the right plot reports wall clock milliseconds per time step. All results are averaged over 1500 time steps.


Realistic ocean simulations (Satori and Engaging clusters). We report strong scaling tests using the realistic global setup shown in figure 3 on two clusters: (i) the MIT Satori cluster [2], a high-performance Power9 system composed of 64 nodes, each hosting four Nvidia V100 GPUs with 32 GB of memory, and (ii) the MIT Engaging cluster, using 8 nodes that each host four NVLink-connected A100s with 80 GB of memory. The resulting wall clock time per time step, averaged over 1500 time steps, is presented in figure 4 for both single precision (FP32) and double precision (FP64) computations. On a single node, OceananigansR12 attains 0.9 SYPD in double precision and 1.4 SYPD in single precision, with a wall clock time per time step ranging from 330 to 550 ms. When the number of nodes increases to 16 (64 GPUs), the communication overhead grows, resulting in 12.4 SYPD in single precision and 7.75 SYPD in double precision. We measure a strong scaling efficiency over 64 GPUs of 52% in single precision and 55% in double precision, because the computational workload (40 ms of wall clock time per time step) eventually becomes too short to completely mask the communication overhead.
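To make the efficiency figures above concrete, the sketch below (a hypothetical helper, not code from the paper) computes strong scaling efficiency from wall clock time per time step; with the quoted FP32 numbers, roughly 330 ms on one node (4 GPUs) and about 40 ms on 16 nodes (64 GPUs), it reproduces the reported ≈52%.

```julia
# Strong scaling efficiency: measured speedup divided by the ideal speedup
# when the same problem is distributed over more GPUs.
function strong_scaling_efficiency(t_ref, gpus_ref, t_scaled, gpus_scaled)
    ideal_speedup    = gpus_scaled / gpus_ref
    measured_speedup = t_ref / t_scaled
    return measured_speedup / ideal_speedup
end

# FP32 OceananigansR12: ≈330 ms per step on 4 GPUs vs ≈40 ms on 64 GPUs.
strong_scaling_efficiency(0.330, 4, 0.040, 64)  # ≈ 0.52, i.e. 52%
```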


For higher-resolution, ocean-weather-permitting simulations, the scaling is almost ideal across the range we investigate. For OceananigansR24 (FP64-V100) and OceananigansR48 (FP32-V100), we measure better-than-ideal scaling. This counter-intuitive result is a product of improved load balance as the number of GPUs increases. In summary, we attain 1.94 SYPD on 120 V100 GPUs with a kilometer-scale resolution (OceananigansR24) and 0.33 SYPD with an ocean-weather-resolving simulation (OceananigansR48). Finally, we have tested the OceananigansR48 setup on 144 Perlmutter nodes (576 A100 GPUs), reaching 0.95 SYPD. This is the first instance of a kilometer-scale ocean simulation achieving ∼1 SYPD. We have also tested the OceananigansR12 setup on 17 nodes, obtaining 9.9 SYPD (see figure 5).


Figure 5: Weak scaling tests performed in double precision with the OceananigansAP setup. Each GPU holds a grid equivalent to a global 1/6◦ resolution with 100 vertical layers. The weak scaling is performed up to a horizontal resolution of 1/168th of a degree (∼488 m), where we achieve 15 simulated days per wall clock day (one simulated year in roughly 25 days). The star marks the performance of OceananigansR48 (figure 3) on 144 Perlmutter GPU nodes. All results are averaged over 500 time steps.


Aqua-planet simulation (Perlmutter cluster). We report weak scaling tests on the NERSC supercomputer Perlmutter, an HPE (Hewlett Packard Enterprise) Cray EX system hosting four A100 GPUs with 40 GB of memory per node, linked through an NVLink3 interconnect. All weak scaling tests are performed in double precision using the OceananigansAP setup. We start from two different horizontal resolutions (1/12 and 1/6 of a degree) and progressively increase them with the number of GPUs while maintaining 100 vertical levels. As shown in figure 5, we obtain 100% weak scaling efficiency over the whole investigated range (1 to 196 nodes; 4 to 768 A100s).
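By analogy with the strong scaling sketch above, here is a minimal sketch of weak scaling efficiency: with the per-GPU workload held fixed, ideal scaling keeps the wall clock time per time step constant as GPUs are added. The timings below are hypothetical placeholders, not measurements from the paper.

```julia
# Weak scaling efficiency: reference wall clock time per time step divided by
# the time measured on the larger GPU count (per-GPU workload held fixed).
weak_scaling_efficiency(t_ref, t_scaled) = t_ref / t_scaled

# Hypothetical: the same ≈100 ms per step on 4 GPUs and on hundreds of GPUs
# corresponds to the 100% efficiency reported in figure 5.
weak_scaling_efficiency(0.100, 0.100)  # = 1.0, i.e. 100%
```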


This paper is available on arxiv under CC BY 4.0 DEED license.

