Authors:
(1) Xiao-Yang Liu, Hongyang Yang, Columbia University (xl2427,[email protected]);
(2) Jiechao Gao, University of Virginia ([email protected]);
(3) Christina Dan Wang (Corresponding Author), New York University Shanghai ([email protected]).
2 Related Works and 2.1 Deep Reinforcement Learning Algorithms
2.2 Deep Reinforcement Learning Libraries and 2.3 Deep Reinforcement Learning in Finance
3 The Proposed FinRL Framework and 3.1 Overview of FinRL Framework
3.5 Training-Testing-Trading Pipeline
4 Hands-on Tutorials and Benchmark Performance and 4.1 Backtesting Module
4.2 Baseline Strategies and Trading Metrics
4.5 Use Case II: Portfolio Allocation and 4.6 Use Case III: Cryptocurrencies Trading
5 Ecosystem of FinRL and Conclusions, and References
We review state-of-the-art DRL algorithms, relevant open-source libraries, and applications of DRL in quantitative finance.
Many DRL algorithms have been developed; they fall into three categories: value-based, policy-based, and actor-critic algorithms.
A value-based algorithm estimates a state-action value function that guides the optimal policy. Q-learning [49] approximates the Q-value (expected return) by iteratively updating a Q-table, which works for problems with small, discrete state and action spaces. Deep neural networks have since been used to approximate the Q-value function, e.g., the deep Q-network (DQN) and its variants, double DQN and dueling DQN [1].
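To make the tabular case concrete, here is a minimal sketch of one Q-learning update and an epsilon-greedy action choice on a toy discrete problem; the state/action sizes and hyperparameters are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Toy problem sizes and hyperparameters (illustrative assumptions).
n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration
Q = np.zeros((n_states, n_actions))      # the Q-table

def q_update(s, a, r, s_next, done):
    """One Q-learning step: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def epsilon_greedy(s):
    """Explore with probability epsilon, otherwise act greedily on Q."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())
```

DQN replaces the table `Q` with a neural network and the per-entry update with gradient descent on the same target, which is what makes large or continuous state spaces tractable.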
A policy-based algorithm directly updates the parameters of a policy through the policy gradient [45]. Instead of estimating values, a neural network models the policy itself: its input is a state and its output is a probability distribution over actions, from which the agent samples an action at that state.
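A minimal sketch of such a policy network in PyTorch is shown below; the state dimension, action count, and architecture are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not from the paper).
state_dim, n_actions = 8, 3

# Policy network: state in, probability distribution over actions out.
policy = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
    nn.Softmax(dim=-1),
)

state = torch.randn(1, state_dim)                          # a dummy state
probs = policy(state)                                      # action probabilities
action = torch.distributions.Categorical(probs).sample()   # sample an action
```

Policy-gradient training then adjusts the network's parameters to increase the probability of actions that led to higher returns.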
An actor-critic based algorithm combines the advantages of value-based and policy-based algorithms. It maintains two neural networks: an actor network that updates the policy (a probability distribution over actions) and a critic network that estimates the state-action value function. During training, the actor network takes actions and the critic network evaluates them. State-of-the-art actor-critic algorithms include deep deterministic policy gradient (DDPG), proximal policy optimization (PPO), asynchronous advantage actor-critic (A3C), advantage actor-critic (A2C), soft actor-critic (SAC), multi-agent DDPG, and twin-delayed DDPG (TD3) [1].
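In practice, these actor-critic algorithms are available behind a uniform interface in libraries such as Stable Baselines3, one of the DRL libraries FinRL supports. The sketch below is a hedged illustration of that usage on a placeholder Gymnasium environment; the environment name, timestep budget, and default hyperparameters are assumptions for the example, not settings from the paper.

```python
import gymnasium as gym
from stable_baselines3 import PPO, SAC

# Placeholder continuous-action environment; in FinRL this would be a
# market/trading environment instead.
env = gym.make("Pendulum-v1")

# Train two of the listed actor-critic algorithms with library defaults.
ppo_agent = PPO("MlpPolicy", env, verbose=0)
ppo_agent.learn(total_timesteps=10_000)

sac_agent = SAC("MlpPolicy", env, verbose=0)
sac_agent.learn(total_timesteps=10_000)
```

Swapping PPO for SAC, A2C, DDPG, or TD3 requires only changing the class, which is why a common wrapper layer makes it easy to benchmark several actor-critic algorithms on the same trading task.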
This paper is available on arxiv under CC BY 4.0 DEED license.