This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Thomas Pethick, EPFL (LIONS) [email protected];
(2) Wanyun Xie, EPFL (LIONS) [email protected];
(3) Volkan Cevher, EPFL (LIONS) [email protected].
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training. We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators. We construct a new optimization scheme called relaxed approximate proximal point (RAPP), which is the first explicit method to achieve last iterate convergence rates for the full range of cohypomonotone problems. The construction extends to constrained and regularized settings. By replacing the inner optimizer in RAPP we rediscover the family of Lookahead algorithms for which we establish convergence in cohypomonotone problems even when the base optimizer is taken to be gradient descent ascent. The range of cohypomonotone problems in which Lookahead converges is further expanded by exploiting that Lookahead inherits the properties of the base optimizer. We corroborate the results with experiments on generative adversarial networks which demonstrates the benefits of the linear interpolation present in both RAPP and Lookahead.
Stability is a major concern when training large scale models. In particular, generative adversarial networks (GANs) are known to be notoriously difficult to train. To stabilize training, the Lookahead algorithm of Zhang et al. (2019) was recently proposed for GANs Chavdarova et al. (2020) which linearly interpolates with a slow moving iterate. The mechanism has enjoyed superior empirical performance in both minimization and minimax problems, but it largely remains a heuristic with little theoretical motivation.
One major obstacle for providing a theoretical treatment, is in capturing the (fuzzy) notion of stability. Loosely speaking, a training dynamics is referred to as unstable in practice when the iterates either cycle indefinitely or (eventually) diverge—as has been observed for the Adam optimizer (see e.g. Gidel et al. (2018, Fig. 12) and Chavdarova et al. (2020, Fig. 6) respectively). Conversely, a stable dynamics has some bias towards stationary points. The notion of stability so far (e.g. in Chavdarova et al. (2020, Thm. 2-3)) is based on the spectral radius and thus inherently local.
In this work, we are interested in establishing global convergence properties, in which case some structural assumptions are needed. One (nonmonotone) structure that lends itself well to the study of stability is that of cohypomonotonicity studied in Combettes & Pennanen (2004); Diakonikolas et al. (2021), since even the extragradient method has been shown to cycle and diverge in this problem class (see Pethick et al. (2022, Fig. 1) and Pethick et al. (2023, Fig. 2) respectively). We provide a geometric intuition behind these difficulties in Figure 1. Biasing the optimization schemes towards stationary points becomes a central concern and we demonstrate in Figure 2 that Lookahead can indeed converge for such nonmonotone problems.
A principled approach to cohypomonotone problems is the extragradient+ algorithm (EG+) proposed by Diakonikolas et al. (2021). However, the only known rates are on the best iterate, which can be problematic to pick in practice. It is unclear whether last iterate rates for EG+ are possible even in the monotone case (see discussion prior to Thm. 3.3 in Gorbunov et al. (2022a)). For this reason, the community has instead resorted to showing last iterate of extragradient (EG) method of Korpelevich (1977), despite originally being developed for the monotone case. Maybe not surprisingly, EG only enjoys a last iterate guarantee under mild form of cohypomonotonicity and have so far only been studied in the unconstrained case (Luo & Tran-Dinh; Gorbunov et al., 2022b). Recently it was shown that the full range of cohypomonotone problems is achievable, but the scheme is implicit and the complexity blows up with increasing cohypomonotonicity (Gorbunov et al., 2022b). This leaves the questions: Can an explicit scheme enjoy last iterate rates for all cohypomonotone problems? Can the rate be agnostic to the degree of cohypomonotonicity? We answer both in the affirmative.
As will become clear, the main mechanism behind convergence in this nonmonotone class is the linear interpolation also used in Lookahead. This iterative interpolation is more broadly referred to as the Krasnosel’ski˘ı-Mann (KM) iteration in the theory of nonexpansive operators. We show that the extragradient+ algorithm (EG+) of Diakonikolas et al. (2021), our proposed relaxed approximate proximal point method (RAPP), and Lookahead based algorithms are all instances of the (inexact) KM iteration and provide simple proofs of these schemes in the cohypomonotone case.
More concretely we make the following contributions:
We prove global convergence rates for the last iterate of our proposed algorithm RAPP which additionally handles constrained and regularized settings. This makes RAPP the first explicit scheme to have non-asymptotic guarantees for the full range of cohypomonotone problems. As a byproduct we obtain a last iterate convergence rate for an implicit scheme that is independent of the degree of cohypomonotonicity. The last iterate rates are established by showing monotonic decrease of the operator norm–something which is not possible for EG+. This contrast is maybe surprising, since RAPP can be viewed as an extension of EG+, which simply takes multiple extrapolation steps.
By replacing the inner optimization routine in RAPP with gradient descent ascent (GDA) and extragradient (EG) we rediscover the Lookahead algorithms considered in Chavdarova et al. (2020). We obtain guarantees for the Lookahead variants by deriving nonexpansive properties of the base optimizers. By casting Lookahead as a KM iteration we find that the optimal interpolation constant is λ = 0.5. This choice corresponds to the default value used in practice for both minimization and minimax—thus providing theoretical motivation for the parameter value.
For τ = 2 inner iterations we observe that LA-GDA reduces to a linear interpolation between GDA and EG+ which allows us to obtain global convergence in ρ-comonotone problems when ρ > −1/3 √ 3L. However, for τ large, we provide a counterexample showing that LA-GDA cannot be guaranteed to converge. This leads us to instead propose LA-CEG+ which corrects the inner optimization to guarantee global convergence for ρ-comonotone problems when ρ > −1/2L.
We test the methods on a suite of synthetic examples and GAN training where we confirm the stabilizing effect. Interestingly, RAPP seems to provide a similar benefit as Lookahead, which suggest that linear interpolation could play a key role also experimentally.
An overview of the theoretical results is provided in Table 1 and Figure 5§B.