Efficient Neural Network Approaches for Conditional Optimal Transport: Abstract & Introduction

Written by bayesianinference | Published 2024/04/15
Tech Story Tags: efficient-neural-network | neural-network-approaches | conditional-optimal-transport | static-cot | dynamic-cot | cot-maps | cot-problems | pcp-map-models

TL;DR: This paper presents two neural network approaches that approximate the solutions of static and dynamic conditional optimal transport problems, respectively.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Zheyu Oliver Wang, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA and [email protected];

(2) Ricardo Baptista, Computing + Mathematical Sciences, California Institute of Technology, Pasadena, CA and [email protected];

(3) Youssef Marzouk, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA and [email protected];

(4) Lars Ruthotto, Department of Mathematics, Emory University, Atlanta, GA and [email protected];

(5) Deepanshu Verma, Department of Mathematics, Emory University, Atlanta, GA and [email protected].

Abstract.

We present two neural network approaches that approximate the solutions of static and dynamic conditional optimal transport (COT) problems, respectively. Both approaches enable sampling and density estimation of conditional probability distributions, which are core tasks in Bayesian inference. Our methods represent the target conditional distributions as transformations of a tractable reference distribution and, therefore, fall into the framework of measure transport. COT maps are a canonical choice within this framework, with desirable properties such as uniqueness and monotonicity. However, the associated COT problems are computationally challenging, even in moderate dimensions. To improve the scalability, our numerical algorithms leverage neural networks to parameterize COT maps. Our methods exploit the structure of the static and dynamic formulations of the COT problem. PCP-Map models conditional transport maps as the gradient of a partially input convex neural network (PICNN) and uses a novel numerical implementation to increase computational efficiency compared to state-of-the-art alternatives. COT-Flow models conditional transports via the flow of a regularized neural ODE; it is slower to train but offers faster sampling. We demonstrate their effectiveness and efficiency by comparing them with state-of-the-art approaches using benchmark datasets and Bayesian inverse problems.

Key words. Measure transport, generative modeling, optimal transport, Bayesian inference, inverse problems, uncertainty quantification

MSC codes. 62F15, 62M45

Most effective sampling techniques, such as Markov chain Monte Carlo (MCMC), are limited to the conventional Bayesian setting as they require a tractable likelihood model (often given by a forward operator that maps x to y and a noise model) and prior. Even in the conventional setting, producing thousands of approximately i.i.d. samples from the conditional distribution often requires millions of forward operator evaluations. This can be prohibitive for complex forward operators, such as those in science and engineering applications based on stochastic differential equations (SDEs) or partial differential equations (PDEs). Moreover, MCMC schemes are sequential and difficult to parallelize.

Beyond the conventional Bayesian setting, one is usually limited to likelihood-free methods; see, e.g., [12]. Among these methods, measure transport provides a general framework to characterize complex posteriors using samples from the joint distribution. The key idea is to construct transport maps that push forward a simple reference (e.g., a standard Gaussian) toward a complex target distribution; see [3] for discussions and reviews. Once obtained, these transport maps provide an immediate way to generate i.i.d. samples from the target distribution by evaluating the map at samples from a reference distribution.
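
To make the push-forward idea concrete, here is a minimal sketch of sampling via a transport map; `sample_target` and `transport_map` are illustrative names, and the affine map in the example stands in for any trained map:

```python
import numpy as np

def sample_target(transport_map, n_samples, dim, rng=None):
    """Push reference samples through a transport map to obtain
    approximate i.i.d. samples from the target distribution."""
    rng = np.random.default_rng(rng)
    z = rng.standard_normal((n_samples, dim))  # reference: standard Gaussian
    return transport_map(z)                    # target samples x = T(z)

# Toy example: an affine map pushes N(0, I) forward to N(mu, diag(s^2)).
mu, s = np.array([1.0, -2.0]), np.array([0.5, 2.0])
samples = sample_target(lambda z: mu + s * z, n_samples=1000, dim=2)
```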

Under mild assumptions, there exist infinitely many transport maps that fulfill the push-forward constraint but have drastically different theoretical properties. One way to establish uniqueness is to identify the transport map that satisfies the push-forward constraint while incurring minimal transport cost. Adding a transport cost turns the measure transport problem for the conditional distribution into a conditional optimal transport (COT) problem [52].
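
For concreteness, the static COT problem with a quadratic cost can be sketched as follows; the notation (reference ρ, target conditional π(· | y), map T(·; y)) is ours for illustration and may differ from the formulation used later in the paper:

```latex
% Static COT with quadratic cost (illustrative notation):
% among all maps whose push-forward matches the target conditional
% for (almost) every y, pick the one with minimal expected cost.
\min_{T} \; \mathbb{E}_{y} \, \mathbb{E}_{z \sim \rho}
    \big[ \| T(z; y) - z \|^2 \big]
\quad \text{subject to} \quad
T(\cdot\,; y)_{\#} \, \rho = \pi(\cdot \mid y)
\ \text{for almost every } y.
```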

Solving COT problems is computationally challenging, especially when the number of parameters, n, or the number of measurements, m, is large or infinite. The curse of dimensionality affects methods that use grids or polynomials to approximate transport maps and renders most of them impractical when n + m is larger than ten. Due to their function approximation properties, neural networks are a natural candidate for parameterizing COT maps. This choice bridges the measure transport framework and deep generative modeling [38, 27]. While showing promising results, many recent neural network approaches for COT, such as [32, 9, 3], rely on adversarial training, which requires solving a challenging stochastic saddle-point problem.

This paper contributes two neural network approaches for COT that can be trained by maximizing the likelihood of the target samples, and we demonstrate their use in Bayesian inference. Our approaches exploit the known structure of the COT map in different ways. Our first approach parameterizes the map as the gradient of a PICNN [2]; we name it the partially convex potential map (PCP-Map). By construction, this yields a monotone map for any choice of network weights. When trained to sufficient accuracy, we obtain the optimal transport map with respect to an expected L2 cost, known as the conditional Brenier map [10]. Our second approach builds upon the relaxed dynamical formulation of the L2 optimal transport problem, which instead seeks a map defined by the flow map of an ODE; we name it the conditional optimal transport flow (COT-Flow). Here, we parameterize the velocity of the map as the gradient of a scalar potential and obtain a neural ODE. To ensure that the network achieves sufficient accuracy, we monitor and penalize violations of the associated optimality conditions, which are given by a Hamilton-Jacobi-Bellman (HJB) PDE.
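
The construction behind PCP-Map can be illustrated with a small PyTorch sketch: a scalar potential that is convex in z (via nonnegative, softplus-clamped weights and convex, nondecreasing activations on the z path) whose gradient in z gives a monotone conditional map. `TinyPICNN` and `conditional_map` are our illustrative names; the paper's PICNN architecture, presented in Section 3, is more elaborate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPICNN(nn.Module):
    """Minimal partially input convex network: scalar output G(z, y) that is
    convex in z for any weights, since the z path composes affine-then-convex
    layers with nonnegative (softplus-clamped) weights.
    Illustrative sketch, not the paper's architecture."""
    def __init__(self, dz, dy, width=64):
        super().__init__()
        self.ctx = nn.Linear(dy, width)                   # context (y) path
        self.Wz0 = nn.Linear(dz, width)                   # affine in z: convex
        self.Wz1 = nn.Parameter(0.1 * torch.randn(width, width))
        self.w_out = nn.Parameter(0.1 * torch.randn(1, width))
        self.b_out = nn.Parameter(torch.zeros(1))

    def forward(self, z, y):
        u = torch.tanh(self.ctx(y))                       # y may enter freely
        h = F.softplus(self.Wz0(z) + u)                   # convex in z
        h = F.softplus(h @ F.softplus(self.Wz1).T)        # nonneg. weights keep convexity
        return h @ F.softplus(self.w_out).T + self.b_out  # scalar potential G(z, y)

def conditional_map(picnn, z, y):
    """PCP-Map-style transport: for each y, the gradient of the convex
    potential in z is a monotone map z -> x."""
    z = z.requires_grad_(True)
    G = picnn(z, y).sum()                                 # sum yields per-sample gradients
    return torch.autograd.grad(G, z, create_graph=True)[0]
```

Because the gradient of a convex function is a monotone operator, monotonicity holds for any weights; training only needs to fit the potential, not enforce invertibility.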

We conduct a series of numerical experiments to evaluate PCP-Map and COT-Flow comprehensively. The first experiment demonstrates our approaches' robustness to hyperparameters and superior numerical accuracy for density estimation compared to results from other approaches in [4] using six UCI tabular datasets [29]. The second experiment demonstrates our approaches' effectiveness and efficiency by comparing them to a provably convergent approximate Bayesian computation approach on the task of conditional sampling for a Bayesian inference problem involving the stochastic Lotka-Volterra equation, which gives rise to an intractable likelihood. The third experiment compares our approaches against the flow-based neural posterior estimation (NPE) approach studied in [40] on a real-world, high-dimensional Bayesian inference problem involving the 1D shallow water equations. The final experiment demonstrates PCP-Map's improvements in computational stability and efficiency over an amortized version of the approach related to [20]. Through these experiments, we conclude that the proposed approaches characterize conditional distributions with improved numerical accuracy and efficiency. Moreover, they improve upon recent computational methods for the numerical solution of the COT problem.

Like most neural network approaches, the effectiveness of our methods relies on an adequate choice of network architecture and an accurate solution of a stochastic non-convex optimization problem. As there is little theoretical guidance for choosing the network architecture and optimization hyperparameters, our numerical experiments show the effectiveness and robustness of a simple random grid search. The choice of the optimization algorithm is a modular component in our approach; we use the common Adam method [24] for simplicity of implementation.
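
A random grid search of the kind described above can be as simple as the following sketch; the hyperparameter names and value ranges here are assumptions for illustration, not the paper's exact grid:

```python
import random

# Candidate values per hyperparameter (illustrative, not the paper's grid).
grid = {
    "width": [64, 128, 256],
    "depth": [2, 3, 4],
    "lr":    [1e-4, 5e-4, 1e-3],
    "batch": [64, 128, 256],
}

def sample_configs(grid, n_trials, seed=0):
    """Draw random configurations from the grid."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in grid.items()} for _ in range(n_trials)]

for cfg in sample_configs(grid, n_trials=20):
    # Train with Adam under cfg, record the validation likelihood,
    # and keep the best-performing configuration.
    pass
```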

The remainder of the paper is organized as follows: Section 2 contains the mathematical formulation of the conditional sampling problem and reviews related learning approaches. Section 3 presents our partially convex potential map (PCP-Map). Section 4 presents our conditional optimal transport flow (COT-Flow). Section 5 describes our effort to achieve reproducible results and procedures for identifying effective hyperparameters for neural network training. Section 6 contains a detailed numerical evaluation of both approaches using six open-source data sets and experiments motivated by Bayesian inference for the stochastic Lotka-Volterra equation and the 1D shallow water equation. Section 7 features a detailed discussion of our results and highlights the advantages and limitations of the presented approaches.

