Stable Nonconvex-Nonconcave Training via Linear Interpolation: Experiments

by @interpolation



Too Long; Didn't Read

This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Thomas Pethick, EPFL (LIONS);

(2) Wanyun Xie, EPFL (LIONS);

(3) Volkan Cevher, EPFL (LIONS).

8 Experiments

This section demonstrates that linear interpolation can lead to an improvement over common baselines.


Synthetic examples Figures 2 and 3 demonstrate RAPP, LA-GDA and LA-CEG+ on a host of nonmonotone problems (Hsieh et al. (2021, Ex. 5.2), Pethick et al. (2022, Ex. 3(iii)), Pethick et al. (2022, Ex. 5)). See Appendix H.2 for definitions and further details.
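To convey the linear-interpolation idea behind these methods, the sketch below shows an LA-GDA-style update on a simple illustrative saddle objective. The toy function, step sizes, and interpolation parameters are placeholders chosen purely for illustration; they are not the examples or settings of Appendix H.2, and RAPP and LA-CEG+ differ in the inner solver they wrap.

```python
import numpy as np

# Illustrative nonmonotone-flavored saddle objective f(x, y) = x*y + eps*(x**3*y - x*y**3).
# This is NOT one of the paper's examples (those are defined in Appendix H.2).
def F(z, eps=0.1):
    """Gradient operator (grad_x f, -grad_y f) of the illustrative objective."""
    x, y = z
    gx = y + 3 * eps * x**2 * y - eps * y**3
    gy = x + eps * x**3 - 3 * eps * x * y**2
    return np.array([gx, -gy])

def la_gda(z0, lr=0.05, alpha=0.5, k=5, outer_steps=200):
    """Lookahead-GDA sketch: k fast GDA steps, then interpolate the slow iterate toward the fast one."""
    slow = np.array(z0, dtype=float)
    for _ in range(outer_steps):
        fast = slow.copy()
        for _ in range(k):
            fast -= lr * F(fast)           # inner gradient descent-ascent step
        slow += alpha * (fast - slow)      # linear interpolation (Lookahead outer step)
    return slow

print(la_gda([1.0, 1.0]))  # run from an arbitrary starting point
```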



Image generation We replicate the experimental setup of Chavdarova et al. (2020); Miyato et al. (2018), which uses the hinge version of the non-saturating loss and a ResNet with spectral normalization for the discriminator (see Appendix H.2 for details). To evaluate performance we rely on the commonly used Inception score (ISC) (Salimans et al., 2016) and the Fréchet inception distance (FID) (Heusel et al., 2017), and report the best iterate. We demonstrate the methods on the CIFAR10 dataset (Krizhevsky et al., 2009). The aim is not to beat the state-of-the-art, but rather to complement the already exhaustive numerical evidence provided in Chavdarova et al. (2020).
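For reference, a minimal PyTorch sketch of the standard hinge formulation of the GAN objective (as popularized by Miyato et al. (2018)) is given below; the helper names are ours, and the spectrally normalized ResNet discriminator itself is omitted.

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    """Hinge loss for the discriminator, given logits on real and generated samples."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake):
    """Non-saturating hinge-style generator loss: raise the discriminator score on fakes."""
    return -d_fake.mean()

# Spectral normalization can be applied to discriminator layers, e.g.:
# layer = torch.nn.utils.spectral_norm(torch.nn.Conv2d(3, 64, 3, padding=1))
```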


For a fair computational comparison we count the number of gradient computations instead of the number of iterations k, as done in Chavdarova et al. (2020). Perhaps surprisingly, we find that extrapolation methods such as EG and RAPP still outperform the baseline, despite having fewer effective iterations. RAPP improves over EG, which suggests that it can be worthwhile to spend more computation on refining the updates at the cost of making fewer updates.
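To make this accounting concrete, a small sketch of budgeting by gradient evaluations rather than iterations might look as follows; the per-update costs listed are illustrative assumptions (e.g., EG needs two gradient evaluations per update), not numbers taken from the paper.

```python
# Budget accounting sketch: compare methods by gradient evaluations, not iterations.
# The per-update costs below are illustrative assumptions, not figures from the paper.
grad_budget = 100_000
cost_per_update = {"GDA": 1, "EG": 2, "RAPP (inner loop)": 5}

for method, cost in cost_per_update.items():
    updates = grad_budget // cost
    print(f"{method}: {updates} parameter updates within the budget")
```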


The first experiment we conduct matches the setting of Chavdarova et al. (2020) by relying on the Adam optimizer and using an update ratio of 5:1 between the discriminator and generator. We find in Table 2 that LA-ExtraAdam+ has the highest ISC (8.08) while LA-ExtraAdam has the lowest FID (15.88). In contrast, we confirm that Adam is unstable while Lookahead prevents divergence, as is apparent from Figure 4; this is in agreement with Chavdarova et al. (2020). In addition, the outer loop of Lookahead achieves better empirical performance, which corroborates the theoretical result (cf. Remark 7.4). Notice that ExtraAdam+ converges slowly without Lookahead, which is possibly due to its 1/2-smaller stepsize.
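For readers unfamiliar with the outer loop referred to here, a minimal Lookahead wrapper around an inner PyTorch optimizer could be sketched as below; the class, hyperparameters, and usage line are illustrative and not the authors' implementation.

```python
import torch

class Lookahead:
    """Minimal Lookahead sketch: every k inner steps, interpolate slow weights toward fast ones."""
    def __init__(self, inner_optimizer, alpha=0.5, k=5):
        self.inner = inner_optimizer
        self.alpha, self.k, self.counter = alpha, k, 0
        # Snapshot of the slow (outer) weights.
        self.slow = [[p.detach().clone() for p in group["params"]]
                     for group in inner_optimizer.param_groups]

    def step(self):
        self.inner.step()                      # fast (inner) update, e.g. Adam or an extragradient variant
        self.counter += 1
        if self.counter % self.k == 0:
            for group, slow_group in zip(self.inner.param_groups, self.slow):
                for p, slow_p in zip(group["params"], slow_group):
                    # slow <- slow + alpha * (fast - slow); the fast weights are reset to slow.
                    slow_p.add_(p.detach() - slow_p, alpha=self.alpha)
                    p.data.copy_(slow_p)

# Hypothetical usage (generator shown; the discriminator would be wrapped analogously):
# opt_g = Lookahead(torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.9)))
```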


Figure 4: Adam eventually diverges on CIFAR10 while Lookahead is stable with the outer iterate enjoying superior performance.


We additionally simplify the setup by using GDA-based optimizers with an update ratio of 1:1, which avoids the complexity of diagonal adaptation, gradient history and multiple steps of the discriminator present in the Adam-based experiments. The results are found in Table 3. The learning rates are tuned for GDA and kept fixed across all other methods. Despite being tuned on GDA, we find that extragradient methods, Lookahead-based methods and RAPP all still outperform GDA in terms of FID. The biggest improvement comes from the linear-interpolation-based methods Lookahead and RAPP (see Figure 8 for further discussion on EG+). Interestingly, the Lookahead-based methods are roughly comparable with their Adam variants (Table 2), while GDA even performs better than Adam.
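For completeness, a minimal sketch of the plain GDA and extragradient (EG) updates used as 1:1 baselines is given below; the operator F and the step size are placeholders, and, following the text, the step size would be the one tuned for GDA.

```python
def gda_step(z, F, lr):
    """Plain simultaneous gradient descent-ascent step."""
    return z - lr * F(z)

def eg_step(z, F, lr):
    """Extragradient step: extrapolate with F(z), then update with the gradient at the extrapolated point."""
    z_half = z - lr * F(z)
    return z - lr * F(z_half)
```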