Stable Nonconvex-Nonconcave Training via Linear Interpolation: Analysis of Lookaheaby@interpolation

Stable Nonconvex-Nonconcave Training via Linear Interpolation: Analysis of Lookahea

tldt arrow

Too Long; Didn't Read

This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
featured image - Stable Nonconvex-Nonconcave Training via Linear Interpolation:  Analysis of Lookahea
The Interpolation Publication HackerNoon profile picture

This paper is available on arxiv under CC 4.0 license.


(1) Thomas Pethick, EPFL (LIONS) [email protected];

(2) Wanyun Xie, EPFL (LIONS) [email protected];

(3) Volkan Cevher, EPFL (LIONS) [email protected].

7 Analysis of Lookahead

The update in RAPP leads to a fairly conservative update in the inner loop, since it corresponds to optimizing a highly regularized subproblem as noted in Section 5. Could we instead replace the optimization procedure with gradient descent ascent (GDA)? If we replace the inner optimization routine we recover what is known as the Lookahead (LA) algorithm

We empirically demonstrate that this scheme can converge for nonmonotone problems for certain choices of parameters (see Figure 2). However, what global guarantees can we provide theoretically?

It turns out that for LA-GDA with two inner steps (τ = 2) we have an affirmative answer. After some algebraic manipulation it is not difficult to see that the update can be simplified as follow

Figure 2: LA-GDA and RAPP can converge for Hsieh et al. (2021, Ex. 5.2). Interestingly, we can set the stepsize γ larger than 1/L while RAPP remains stable. Approximate proximal point (APP) with the same stepsize diverges (the iterates of APP are deferred to Figure 6). In this example, it is apparent from the rates, that there is a benefit in replacing the conservative inner update in RAPP with GDA in LA-GDA as explored in Section 7.

This is the average of GDA and EG+ (when λ ∈ (0, 1/2)). This observation allows us to show convergence under cohypomonotonicity. This positive result for nonmonotone problems partially explains the stabilizing effect of LA-GDA.

Convergence follows from quasi-nonexpansiveness.

Remark 7.4. Even though the base optimizer Alg might not converge (since nonexpansiveness is not sufficient), Theorem 7.3 shows that the outer loop converges. Interestingly, this aligns with the benefit observed in practice of using the outer iteration of Lookahead (see Figure 4).

Monotone When only monotonicity and Lipschitz holds we may instead consider the following extragradient based version of Lookahead (first empirically investigated in Chavdarova et al. (2020))