Authors:
(1) Jongmin Lee, Department of Mathematical Science, Seoul National University;
(2) Ernest K. Ryu, Department of Mathematical Science, Seoul National University and Interdisciplinary Program in Artificial Intelligence, Seoul National University.
1.1 Notations and preliminaries
2.1 Accelerated rate for Bellman consistency operator
2.2 Accelerated rate for Bellman optimality opera
5 Approximate Anchored Value Iteration
6 Gauss–Seidel Anchored Value Iteration
7 Conclusion, Acknowledgments and Disclosure of Funding and References
We show that the classical value iteration (VI) is, in fact, suboptimal and that the anchoring mechanism accelerates VI to be optimal in the sense that the accelerated rate matches a complexity lower bound up to a constant factor of 4. We also show that the accelerated iteration provably converges to a fixed point even when γ = 1, if a fixed point exists. Being able to provide a substantive improvement upon the classical VI is, in our view, a surprising contribution.
One direction of future work is to study the empirical effectiveness of Anc-VI. Another direction is to analyze Anc-VI in a model-free setting and, more broadly, to investigate the effectiveness of the anchor mechanism in more practical RL methods.
Our results lead us to believe that many of the classical foundations of dynamic programming and reinforcement learning may be improved with a careful examination based on an optimization complexity theory perspective. The theory of optimal optimization algorithms has recently enjoyed significant developments [43–45, 66, 98], the anchoring mechanism being one such example [49, 65], and the classical DP and RL theory may benefit from a similar line of investigation on iteration complexity.
This work was supported by the the Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) [NO.2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)] and the Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02). We thank Jisun Park for providing valuable feedback.
[1] M. Akian, S. Gaubert, Z. Qu, and O. Saadi. Multiply accelerated value iteration for nonsymmetric affine fixed point problems and application to markov decision processes. SIAM Journal on Matrix Analysis and Applications, 43(1):199–232, 2022.
[2] D. G. Anderson. Iterative procedures for nonlinear integral equations. Journal of the Association for Computing Machinery, 12(4):547–560, 1965.
[3] D. Andre, N. Friedman, and R. Parr. Generalized prioritized sweeping. Neural Information Processing Systems, 1997.
[4] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. Neural Information Processing Systems, 2011.
[5] S. Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922.
[6] M. Barré, A. Taylor, and A. d’Aspremont. Convergence of a constrained vector extrapolation scheme. SIAM Journal on Mathematics of Data Science, 4(3):979–1002, 2022.
[7] M. G. Bellemare, G. Ostrovski, A. Guez, P. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. Association for the Advancement of Artificial Intelligence, 2016.
[8] R. Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957.
[9] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 2015.
[10] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume II. 4th edition, 2012.
[11] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1995.
[12] W. Bowen, X. Huaqing, Z. Lin, L. Yingbin, and Z. Wei. Finite-time theory for momentum Q-learning. Conference on Uncertainty in Artificial Intelligence, 2021.
[13] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184(1–2):71–120, 2020.
[14] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points II: first-order methods. Mathematical Programming, 185(1–2):315–355, 2021.
[15] A.-L. Cauchy. Méthode générale pour la résolution des systemes d’équations simultanées. Comptes rendus de l’Académie des Sciences, 25:536–538, 1847.
[16] V. Colao and G. Marino. On the rate of convergence of Halpern iterations. Journal of Nonlinear and Convex Analysis, 22(12):2639–2646, 2021.
[17] J. P. Contreras and R. Cominetti. Optimal error bounds for non-expansive fixed-point iterations in normed spaces. Mathematical Programming, 199(1–2):343–374, 2022.
[18] P. Dai, D. S. Weld, J. Goldsmith, et al. Topological value iteration algorithms. Journal of Artificial Intelligence Research, 42:181–209, 2011.
[19] D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization theory and Applications, 105:589–608, 2000.
[20] J. Diakonikolas. Halpern iteration for near-optimal and parameter-free monotone inclusion and strong solutions to variational inequalities. Conference on Learning Theory, 2020.
[21] J. Diakonikolas and P. Wang. Potential function-based framework for minimizing gradients in convex and min-max optimization. SIAM Journal on Optimization, 32(3):1668–1697, 2022.
[22] Q. Dong, H. Yuan, Y. Cho, and T. M. Rassias. Modified inertial Mann algorithm and inertial CQ-algorithm for nonexpansive mappings. Optimization Letters, 12(1):87–102, 2018.
[23] Y. Drori. The exact information-based complexity of smooth convex minimization. Journal of Complexity, 39:1–16, 2017.
[24] Y. Drori and O. Shamir. The complexity of finding stationary points with stochastic gradient descent. International Conference on Machine Learning, 2020.
[25] Y. Drori and A. Taylor. On the oracle complexity of smooth strongly convex minimization. Journal of Complexity, 68, 2022.
[26] Y. Drori and M. Teboulle. An optimal variant of Kelley’s cutting-plane method. Mathematical Programming, 160(1–2):321–351, 2016.
[27] M. Ermis, M. Park, and I. Yang. On Anderson acceleration for partially observable Markov decision processes. IEEE Conference on Decision and Control, 2021.
[28] M. Ermis and I. Yang. A3DQN: Adaptive Anderson acceleration for deep Q-networks. IEEE Symposium Series on Computational Intelligence, 2020.
[29] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
[30] D. Ernst, M. Glavic, P. Geurts, and L. Wehenkel. Approximate value iteration in the reinforcement learning context. Application to electrical power system control. International Journal of Emerging Electric Power Systems, 3(1), 2005.
[31] A.-m. Farahmand and M. Ghavamzadeh. PID accelerated value iteration algorithm. International Conference on Machine Learning, 2021.
[32] A.-m. Farahmand, C. Szepesvári, and R. Munos. Error propagation for approximate policy and value iteration. Neural Information Processing Systems, 2010.
[33] M. Fellows, K. Hartikainen, and S. Whiteson. Bayesian Bellman operators. Neural Information Processing Systems, 2021.
[34] M. Geist and B. Scherrer. Anderson acceleration for reinforcement learning. European Workshop on Reinforcement Learning, 2018.
[35] N. Golowich, S. Pattathil, C. Daskalakis, and A. Ozdaglar. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. Conference on Learning Theory, 2020.
[36] G. J. Gordon. Stable function approximation in dynamic programming. International Conference on Machine Learning, 1995.
[37] V. Goyal and J. Grand-Clément. A first-order approach to accelerated value iteration. Operations Research, 71(2):517–535, 2022.
[38] J. Grand-Clément. From convex optimization to MDPs: A review of first-order, second-order and quasi-newton methods for MDPs. arXiv:2104.10677, 2021.
[39] B. Halpern. Fixed points of nonexpanding maps. Bulletin of the American Mathematical Society, 73(6):957–961, 1967.
[40] R. Hannah, Y. Liu, D. O’Connor, and W. Yin. Breaking the span assumption yields fast finite-sum minimization. Neural Information Processing Systems, 2018.
[41] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. Association for the Advancement of Artificial Intelligence, 2018.
[42] F. Iutzeler and J. M. Hendrickx. A generic online acceleration scheme for optimization algorithms via relaxation and inertia. Optimization Methods and Software, 34(2):383–405, 2019.
[43] D. Kim. Accelerated proximal point method for maximally monotone operators. Mathematical Programming, 190(1–2):57–87, 2021.
[44] D. Kim and J. A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1–2):81–107, 2016.
[45] D. Kim and J. A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, 188(1):192–219, 2021.
[46] J. Lee, C. Park, and E. K. Ryu. A geometric structure of acceleration and its role in making gradients small fast. Neural Information Processing Systems, 2021.
[47] S. Lee and D. Kim. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. Neural Information Processing Systems, 2021.
[48] L. Leustean. Rates of asymptotic regularity for Halpern iterations of nonexpansive mappings. Journal of Universal Computer Science, 13(11):1680–1691, 2007.
[49] F. Lieder. On the convergence rate of the Halpern-iteration. Optimization Letters, 15(2):405– 418, 2021.
[50] M. Lutter, S. Mannor, J. Peters, D. Fox, and A. Garg. Value iteration in continuous actions, states and time. International Conference on Machine Learning, 2021.
[51] P.-E. Maingé. Convergence theorems for inertial KM-type algorithms. Journal of Computational and Applied Mathematics, 219(1):223–236, 2008.
[52] A. massoud Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. American Control Conference, 2009.
[53] H. B. McMahan and G. J. Gordon. Fast exact planning in Markov decision processes. International Conference on Automated Planning and Scheduling, 2005.
[54] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[55] A. W. Moore and C. G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103–130, 1993.
[56] R. Munos. Error bounds for approximate value iteration. Association for the Advancement of Artificial Intelligence, 2005.
[57] R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(27):815–857, 2008.
[58] A. S. Nemirovski. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153–175, 1992.
[59] Y. Nesterov. Lectures on Convex Optimization. Springer, 2nd edition, 2018.
[60] Y. Nesterov, A. Gasnikov, S. Guminov, and P. Dvurechensky. Primal–dual accelerated gradient methods with small-dimensional relaxation oracle. Optimization Methods and Software, 36(4):773–810, 2021.
[61] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k2 ). Doklady Akademii Nauk SSSR, 269(3):543–547, 1983.
[62] S. Niu, S. Chen, H. Guo, C. Targonski, M. Smith, and J. Kovacevi ˇ c. Generalized value itera- ´ tion networks: Life beyond lattices. Association for the Advancement of Artificial Intelligence, 2018.
[63] C. Nota and P. Thomas. Is the policy gradient a gradient? International Conference on Autonomous Agents and Multiagent Systems, 2020.
[64] C. Park, J. Park, and E. K. Ryu. Factor-√ 2 acceleration of accelerated gradient methods. Applied Mathematics & Optimization, 2023.
[65] J. Park and E. K. Ryu. Exact optimal accelerated complexity for fixed-point iterations. International Conference on Machine Learning, 2022.
[66] J. Park and E. K. Ryu. Accelerated infeasibility detection of constrained optimization and fixed-point iterations. International Conference on Machine Learning, 2023.
[67] M. Park, J. Shin, and I. Yang. Anderson acceleration for partially observable Markov decision processes: A maximum entropy approach. arXiv:2211.14998, 2022.
[68] J. Peng and R. J. Williams. Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1(4):437–454, 1993.
[69] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.
[70] S. Reich. Strong convergence theorems for resolvents of accretive operators in Banach spaces. Journal of Mathematical Analysis and Applications, 75(1):287–292, 1980.
[71] E. K. Ryu, K. Yuan, and W. Yin. Ode analysis of stochastic gradient methods with optimism and anchoring for minimax problems. arXiv:1905.10899, 2019.
[72] S. Sabach and S. Shtern. A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2):640–660, 2017.
[73] A. Salim, L. Condat, D. Kovalev, and P. Richtárik. An optimal algorithm for strongly convex minimization under affine constraints. International Conference on Artificial Intelligence and Statistics, 2022.
[74] D. Scieur, A. d’Aspremont, and F. Bach. Regularized nonlinear acceleration. Mathematical Programming, 179(1–2):47–83, 2020.
[75] Y. Shehu. Convergence rate analysis of inertial Krasnoselskii–Mann type iteration with applications. Numerical Functional Analysis and Optimization, 39(10):1077–1091, 2018.
[76] W. Shi, S. Song, H. Wu, Y.-C. Hsu, C. Wu, and G. Huang. Regularized Anderson acceleration for off-policy deep reinforcement learning. Neural Information Processing Systems, 2019.
[77] O. Shlakhter, C.-G. Lee, D. Khmelev, and N. Jaber. Acceleration operators in the value iteration algorithms for Markov decision processes. Operations Research, 58(1):193–202, 2010.
[78] J. J. Suh, J. Park, and E. K. Ryu. Continuous-time analysis of anchor acceleration. Neural Information Processing Systems, 2023.
[79] K. Sun, Y. Wang, Y. Liu, B. Pan, S. Jui, B. Jiang, L. Kong, et al. Damped Anderson mixing for deep reinforcement learning: Acceleration, convergence, and stabilization. Neural Information Processing Systems, 2021.
[80] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
[81] R. S. Sutton and A. G. Barto. Reinforcement Learning: An introduction. MIT press, 2nd edition, 2018.
[82] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems, 1999.
[83] Q. Sykora, M. Ren, and R. Urtasun. Multi-agent routing value iteration network. International Conference on Machine Learning, 2020.
[84] C. Szepesvári. Algorithms for Reinforcement Learning. Springer, 1st edition, 2010.
[85] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel. Value iteration networks. Neural Information Processing Systems, 2016.
[86] A. Taylor and Y. Drori. An optimal gradient method for smooth strongly convex minimization. Mathematical Programming, 199(1-2):557–594, 2023.
[87] S. Tosatto, M. Pirotta, C. d’Eramo, and M. Restelli. Boosted fitted Q-iteration. International Conference on Machine Learning, 2017.
[88] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
[89] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. Association for the Advancement of Artificial Intelligence, 2016.
[90] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234–244, 2006.
[91] B. Van Scoy, R. A. Freeman, and K. M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49–54, 2018.
[92] N. Vieillard, B. Scherrer, O. Pietquin, and M. Geist. Momentum in reinforcement learning. International Conference on Artificial Intelligence and Statistics, 2020.
[93] H. F. Walker and P. Ni. Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis, 49(4):1715–1735, 2011.
[94] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
[95] D. Wingate, K. D. Seppi, and S. Mahadevan. Prioritization methods for accelerating MDP solvers. Journal of Machine Learning Research, 6(25):851–881, 2005.
[96] R. Wittmann. Approximation of fixed points of nonexpansive mappings. Archiv der Mathematik, 58(5):486–491, 1992.
[97] H.-K. Xu. Iterative algorithms for nonlinear operators. Journal of the London Mathematical Society, 66(1):240–256, 2002.
[98] T. Yoon and E. K. Ryu. Accelerated algorithms for smooth convex-concave minimax problems with O(1/k2 ) rate on squared gradient norm. International Conference on Machine Learning, 2021.
[99] T. Yoon and E. K. Ryu. Accelerated minimax algorithms flock together. arXiv:2205.11093, 2022.
[100] Y. Zeng, F. Feng, and W. Yin. AsyncQVI: Asynchronous-parallel Q-value iteration for discounted Markov decision processes with near-optimal sample complexity. International Conference on Artificial Intelligence and Statistics, 2020.
[101] J. Zhang, B. O’Donoghue, and S. Boyd. Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM Journal on Optimization, 30(4):3170–3197, 2020.
[102] K. Zhou, L. Tian, A. M.-C. So, and J. Cheng. Practical schemes for finding near-stationary points of convex finite-sums. International Conference on Artificial Intelligence and Statistics, 2022.
This paper is available on arxiv under CC BY 4.0 DEED license.