Introduction
RL, which is given a thorough explanation by Sutton and Barto (2018), is summarized in this section. RL is a continuous feedback loop between an agent and its environment. At each time step, the RL agent, in a current state $s$, takes an action $a$ and receives a reward $r$. The objective is to maximize the expected future return, $G_t$, which, at time step $t$ for an episodic task of $T$ discrete steps, is computed as

$$G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} r_k,$$

noting that $\gamma$ is the discount factor used to avoid infinite returns. For any state $s$, the corresponding action is given by the agent's policy, $\pi(s) \to a$. Policies are evaluated by computing value functions such as the $Q$-function, which maps state-action pairs to their expected returns under the current policy, i.e.,

$$Q_\pi(s, a) = \mathbb{E}_\pi\left[\,G_t \mid s_t = s, a_t = a\,\right].$$
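To make the interaction loop and the return above concrete, the following is a minimal Python sketch; the `env` and `policy` objects are hypothetical stand-ins (an environment exposing `reset`/`step` and a callable policy), not components described in the paper.

```python
# Minimal sketch of the RL feedback loop and the discounted return G_t.
# `env` and `policy` are hypothetical stand-ins, not objects from the paper.

def run_episode(env, policy, gamma=0.99):
    """Roll out one episode and compute the discounted return from t = 0."""
    state = env.reset()
    rewards = []
    done = False
    while not done:
        action = policy(state)                  # pi(s) -> a
        state, reward, done = env.step(action)  # environment transition
        rewards.append(reward)
    # G_0 = sum_{k=1}^{T} gamma^(k-1) * r_k
    return sum(gamma**k * r for k, r in enumerate(rewards))
```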
SARSA and $Q$-learning are tabular methods, which require storing the $Q$-value for each state-action pair and the policy for each state in a lookup table. As a tabular $Q$-function requires the storage of $|\mathcal{S}| \times |\mathcal{A}|$ values, high-dimensional problems become intractable. However, the $Q$-function may be approximated with function approximators such as NNs, recalling that this combination of NNs and RL is called DRL. DRL was popularized by the work of Mnih et al. (2013), who approximate the $Q$-function with a NN, specifically denoting each $Q$-value by $Q(s, a; \theta)$, where $\theta$ is a vector of the NN parameters that signifies the weights and biases connecting the nodes within the neural network. Mnih et al. (2013) show that this method of deep Q-networks (DQN) can be used to train an agent to master multiple Atari games.
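The sketch below contrasts a tabular $Q$-function with a small neural-network approximator $Q(s, a; \theta)$. The layer sizes, the use of PyTorch, and the example state/action dimensions are illustrative assumptions, not the architecture of Mnih et al. (2013).

```python
import numpy as np
import torch.nn as nn

# Tabular Q-function: one entry per (state, action) pair, i.e. |S| x |A| values.
n_states, n_actions = 500, 4
q_table = np.zeros((n_states, n_actions))

# Neural-network approximation Q(s, a; theta): the lookup table is replaced by
# a parameterized function whose weights and biases form the vector theta.
class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per discrete action
        )

    def forward(self, state):
        return self.net(state)
```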
The Q-network parameters stored in $\theta$ are optimized by minimizing a loss function between the Q-network output and the target value. The loss function for iteration $i$ is given by

$$L_i(\theta_i) = \mathbb{E}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right],$$

and the parameters are updated by gradient descent,

$$\theta_{i+1} = \theta_i - \alpha \nabla_{\theta_i} L_i(\theta_i),$$

where $\alpha$ is the learning rate for the neural network, and $\nabla_{\theta_i} L_i(\theta_i)$ is the gradient of the loss with respect to the network weights. In addition to $Q$-network approximation, another component of DQN is the use of the experience replay buffer (Mnih et al. 2013). Experience replay, originally proposed for RL by Lin (1992), stores encountered transitions (state, action, reward, and next state) in a memory buffer. Mnih et al. (2013) sample uniformly from the replay buffer to compute the target $Q$-value for iteration $i$, $y_i$.
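The sketch below shows how uniform sampling from the replay buffer and the squared-error loss fit together in a single DQN update step. The buffer layout (a list of transition tuples), the batch size, the terminal-state masking, and the use of a detached bootstrap target are illustrative assumptions rather than details taken from the paper.

```python
import random
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, replay_buffer, batch_size=32, gamma=0.99):
    """One DQN update: sample transitions uniformly, form targets y_i,
    and take a gradient step on L_i(theta_i) = E[(y_i - Q(s, a; theta_i))^2]."""
    batch = random.sample(replay_buffer, batch_size)        # uniform sampling
    states, actions, rewards, next_states, dones = map(
        torch.as_tensor, zip(*batch))

    # Target y_i = r + gamma * max_a' Q(s', a'), detached so gradients
    # flow only through Q(s, a; theta_i).
    with torch.no_grad():
        next_q = q_net(next_states.float()).max(dim=1).values
        targets = rewards.float() + gamma * (1 - dones.float()) * next_q

    q_sa = q_net(states.float()).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, targets)    # squared-error loss L_i(theta_i)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                    # theta_{i+1} = theta_i - alpha * grad
    return loss.item()
```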
To update the $Q$-function in value methods such as DQN, a max operation on the $Q$-value over all possible next actions is required. Further, as discussed above, improving the policy in value methods requires an argmax operation over the entire action-space. When the action-space is continuous, evaluating each potential next action becomes intractable. In financial option hedging, discretizing the action-space restricts the available hedging decisions. While hedging does require the acquisition of an integer number of shares, a continuous action-space for the optimal hedge provides more accuracy, as the hedge is not limited to a discrete set of choices. As such, continuous action-spaces are much more prevalent in the DRL hedging literature (Pickard and Lawryshyn 2023).
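To illustrate why the argmax step limits value methods to discrete actions, the sketch below contrasts greedy selection over a discretized hedge grid with an actor network that outputs a continuous hedge position directly. The grid size, the $[-1, 1]$ hedge bound, and the network shapes are illustrative assumptions, not choices made in the paper.

```python
import torch
import torch.nn as nn

# Value-based (DQN-style) selection: only possible over a finite action set,
# e.g. a discretized grid of hedge positions.
hedge_grid = torch.linspace(-1.0, 1.0, steps=21)       # 21 allowed hedge levels

def greedy_hedge(q_net, state):
    q_values = q_net(state)                             # one Q-value per grid point
    return hedge_grid[q_values.argmax()]                # argmax over the grid

# Continuous alternative: an actor maps the state directly to a hedge position,
# so no max/argmax over the action-space is required.
class Actor(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),            # hedge in [-1, 1]
        )

    def forward(self, state):
        return self.net(state)
```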
Authors:
(1) Reilly Pickard, Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, ON M5S 3G8, Canada ([email protected]);
(2) Finn Wredenhagen, Ernst & Young LLP, Toronto, ON, M5H 0B3, Canada;
(3) Julio DeJesus, Ernst & Young LLP, Toronto, ON, M5H 0B3, Canada;
(4) Mario Schlener, Ernst & Young LLP, Toronto, ON, M5H 0B3, Canada;
(5) Yuri Lawryshyn, Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, ON M5S 3E5, Canada.