
Why Anc-VI is Crucial for Undiscounted Reinforcement Learning

by Anchoring

January 14th, 2025

Too Long; Didn't Read

Anc-VI converges to fixed points of the Bellman consistency and optimality operators in undiscounted MDPs (γ = 1), providing solutions where traditional value iteration struggles.
STORY’S CREDIBILITY: Academic Research Paper

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Authors:

(1) Jongmin Lee, Department of Mathematical Science, Seoul National University;

(2) Ernest K. Ryu, Department of Mathematical Science, Seoul National University and Interdisciplinary Program in Artificial Intelligence, Seoul National University.

Abstract and 1 Introduction

1.1 Notations and preliminaries

1.2 Prior works

2 Anchored Value Iteration

2.1 Accelerated rate for Bellman consistency operator

2.2 Accelerated rate for Bellman optimality operator

3 Convergence when γ = 1

4 Complexity lower bound

5 Approximate Anchored Value Iteration

6 Gauss–Seidel Anchored Value Iteration

7 Conclusion, Acknowledgments and Disclosure of Funding and References

A Preliminaries

B Omitted proofs in Section 2

C Omitted proofs in Section 3

D Omitted proofs in Section 4

E Omitted proofs in Section 5

F Omitted proofs in Section 6

G Broader Impacts

H Limitations

3 Convergence when γ = 1

Undiscounted MDPs are not commonly studied in the DP and RL theory literature due to the following difficulties: the Bellman consistency and optimality operators may not have fixed points; VI is a nonexpansive (not contractive) fixed-point iteration and may not converge to a fixed point even if one exists; and the interpretation of a fixed point as the (optimal) value function becomes unclear when the fixed point is not unique. However, many modern deep RL setups do not actually use discounting, [2] and this empirical practice makes the theoretical analysis with γ = 1 relevant.
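
To make the second difficulty concrete, here is a minimal sketch (ours, not from the paper) of plain VI on a two-state undiscounted MDP: the policy deterministically swaps the two states with zero reward, so every constant vector is a fixed point of the Bellman consistency operator, yet VI started from a non-constant point oscillates forever.

```python
import numpy as np

# Two-state undiscounted MDP: the policy deterministically moves
# state 0 -> state 1 and state 1 -> state 0, with zero reward.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = np.zeros(2)

def T(V):
    # Bellman consistency operator T V = r + gamma * P V with gamma = 1.
    # Nonexpansive in the sup-norm, but not a contraction.
    return r + P @ V

# Every constant vector (c, c) is a fixed point of T, yet plain VI
# from a non-constant start just swaps the two entries forever.
V = np.array([1.0, 0.0])
for k in range(6):
    V = T(V)
    print(k, V)   # alternates between [0. 1.] and [1. 0.]
```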


In this section, we show that Anc-VI converges to fixed points of the Bellman consistency and optimality operators of undiscounted MDPs. While a full treatment of undiscounted MDPs is beyond the scope of this paper, we show that fixed points, when they exist, can be found, and we therefore argue that the inability to find fixed points should not be considered an obstacle to studying the γ = 1 setup.
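
As a hedged illustration of this claim, the sketch below runs the anchored update V^k = β_k V^0 + (1 − β_k) T V^(k−1) with β_k = 1/(k+1), which is what the Anc-VI recursion of Section 2 reduces to at γ = 1 (this specialization is our reading, stated here as an assumption), on the oscillating example above. The anchor term averages out the oscillation, and the Bellman error ‖T V^k − V^k‖_∞ decays on the order of 1/k.

```python
import numpy as np

# Same two-state swap MDP as above; gamma = 1.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = np.zeros(2)

def T(V):
    # Bellman consistency operator T V = r + P V at gamma = 1.
    return r + P @ V

V0 = np.array([1.0, 0.0])   # anchor point (also the initial iterate)
V = V0.copy()
for k in range(1, 101):
    beta = 1.0 / (k + 1)            # assumed beta_k = 1/(k+1) at gamma = 1
    V = beta * V0 + (1.0 - beta) * T(V)
    if k % 20 == 0:
        bellman_err = np.max(np.abs(T(V) - V))
        print(k, V, bellman_err)    # error shrinks roughly like 1/k
```

On this toy example the iterates drift toward a constant vector (a fixed point of T), whereas plain VI never settles; the printed Bellman error at the sampled iterations is 1/(k+1), matching the O(1/k) behavior claimed for Anc-VI.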


We first state our convergence result for finite state-action spaces.


[Images in the original: the formal convergence theorem statements, not transcribed in this web version.]


This paper is available on arxiv under CC BY 4.0 DEED license.




[3] Well-definedness of T requires a σ-algebra on the state and action spaces, well-defined expectations with respect to the transition probabilities and the policy, boundedness and measurability of the output of the Bellman operator, etc.
