This is Part 8 of an 11-part series based on the research paper “Reinforcement Learning In Agent-based Market Simulation: Unveiling Realistic Stylized Facts And Behavior”. Use the table of links below to navigate to the next part.
Part 1: Abstract & Introduction
Part 4: Agents & Simulation Details
Part 8: Market and Agent Responsiveness to External Events
Part 9: Conclusion & References
Part 10: Additional Simulation Results
Part 11: Simulation Configuration
In the first experiment, we introduce a rule-based agent that performs a sequence of “flash sales”. During a flash sale event, this agent sends a sell order of 300 lots every second for 5 seconds. It then remains inactive for 400 seconds to allow the market to recover. The goal is to study how the market moves during and after the flash sale events and how the agents’ behavior changes after training with these events.
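The flash sale agent's timing rule is simple enough to sketch directly. The following is a minimal Python sketch, not the authors' implementation: `FlashSaleConfig` and `flash_sale_schedule` are hypothetical names, and the sketch assumes that a 5-second burst produces five market sell orders (at seconds 0 through 4 of the burst) and that the 400-second pause begins when the burst ends.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FlashSaleConfig:
    # Parameter values as stated in the text.
    lots_per_order: int = 300   # size of each market sell order
    order_interval_s: int = 1   # one order per second during a burst
    burst_duration_s: int = 5   # each flash sale lasts 5 seconds
    cooldown_s: int = 400       # inactivity after each burst


def flash_sale_schedule(
    cfg: FlashSaleConfig, start_s: int, n_events: int
) -> List[Tuple[int, str, int]]:
    """Return (time_s, side, lots) tuples for n_events consecutive flash sales.

    Hypothetical helper: assumes orders at offsets 0..burst_duration_s-1
    and that the cooldown starts when the burst ends.
    """
    orders: List[Tuple[int, str, int]] = []
    t = start_s
    for _ in range(n_events):
        # Burst phase: one market sell order per interval.
        for offset in range(0, cfg.burst_duration_s, cfg.order_interval_s):
            orders.append((t + offset, "SELL", cfg.lots_per_order))
        # Advance past the burst and the recovery window.
        t += cfg.burst_duration_s + cfg.cooldown_s
    return orders
```

Under these assumptions, two events starting at time 0 produce ten orders at seconds 0-4 and 405-409, i.e., 1,500 lots sold per event.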
Figure 7 depicts the price movement after the start of the flash sale events. To create these plots, we use the average price across multiple simulations, normalized by the price at which the flash sale started. The untrained group shows a significant and permanent price drop, in contrast to the testing and continual training groups. In the testing group, prices recover to pre-flash-sale levels. The continual training group, however, shows more realistic trajectories: prices drop and then recover to levels below the pre-sale price. This pattern matches the descriptions of permanent and temporary price impact in [24].
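The normalization behind these plots can be illustrated with a short sketch. It assumes one reading of the text: each simulation's price window is divided by its own price at the flash-sale start, and the normalized windows are then averaged point by point. The function name `normalized_average_path` and the `horizon` parameter are hypothetical.

```python
from typing import List, Sequence


def normalized_average_path(
    paths: Sequence[Sequence[float]],
    start_indices: Sequence[int],
    horizon: int,
) -> List[float]:
    """Align each simulation at its flash-sale start, normalize, then average.

    paths[i] is the price series of simulation i and start_indices[i] is the
    index at which its flash sale starts. Each window of `horizon` prices is
    divided by the price at the start, so every normalized window begins at 1.0.
    """
    if not paths or len(paths) != len(start_indices):
        raise ValueError("paths must be non-empty with one start index per path")
    windows: List[List[float]] = []
    for prices, s in zip(paths, start_indices):
        if s + horizon > len(prices):
            raise ValueError("not enough data after the flash-sale start")
        p0 = prices[s]
        windows.append([p / p0 for p in prices[s : s + horizon]])
    # Point-wise mean across simulations.
    n = len(windows)
    return [sum(w[k] for w in windows) / n for k in range(horizon)]
```

Because every averaged curve starts at 1.0, a curve that settles below 1.0 after the shock corresponds to a price that did not fully recover, i.e., a permanent component of price impact.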
We are also interested in how the RL agents’ policy functions are affected when trained with flash sale events. Recall that the symmetric tweak and asymmetric tweak actions control the prices of the limit orders sent by the MMs (see (1) and the illustration in Figure 1). Intuitively, a larger symmetric tweak makes the agent place orders farther from the best bid and ask, i.e., more conservatively. A positive asymmetric tweak indicates that the MM expects future prices to be higher than the current one, while a negative value indicates that the MM expects the market to go down. When a flash sale event happens, market sell orders take liquidity from the bid side of the LOB, which leaves the LOB imbalanced; in a balanced LOB, by contrast, the volumes on the buy and sell sides are comparable. To determine the MM agents’ reaction, we collect states with balanced and imbalanced LOBs and feed these states to the policy functions of the different agent groups.
Figure 8 shows the symmetric tweak values in the top row and the asymmetric tweak values in the bottom row. When the order book is balanced, the three groups output similar distributions of actions (Figures 8b and 8d). However, when fed imbalanced LOBs (Figures 8a and 8c), the agents in the continual training group output higher symmetric price tweaks and lower asymmetric price tweaks than the agents in the two other groups. Thus, when a flash sale event is detected, the MM agents in the continual training group tend to place orders deeper in the LOB and move the expected price equilibrium downward. This behavior aligns with findings in the literature [25, 26, 27]. These results show that the agents can adapt to different market conditions through continual learning.
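The two ingredients of this analysis, sorting states by LOB imbalance and interpreting the tweak actions, can be sketched in Python. Both parts are assumptions for illustration: the imbalance measure (bid volume minus ask volume, over total volume) and the 0.3 cutoff are not taken from the paper, and `quote_prices` is a hypothetical quoting rule in integer ticks, not equation (1). It only reproduces the qualitative behavior described in the text: a larger symmetric tweak widens both quotes, and a negative asymmetric tweak shifts both quotes down.

```python
from typing import Tuple


def lob_imbalance(bid_volume: float, ask_volume: float) -> float:
    """Return (bid - ask) / (bid + ask), a value in [-1, 1].

    Assumed measure: values near 0 mean a balanced book; strongly negative
    values mean the bid side has been depleted, as after a flash sale.
    """
    total = bid_volume + ask_volume
    if total <= 0:
        return 0.0  # empty book: treat as balanced
    return (bid_volume - ask_volume) / total


def classify_state(bid_volume: float, ask_volume: float, threshold: float = 0.3) -> str:
    # The 0.3 cutoff is a hypothetical choice, not a value from the paper.
    if abs(lob_imbalance(bid_volume, ask_volume)) > threshold:
        return "imbalanced"
    return "balanced"


def quote_prices(
    mid: int, half_spread: int, sym_tweak: int, asym_tweak: int
) -> Tuple[int, int]:
    """Hypothetical quoting rule in integer ticks (NOT equation (1) of the paper).

    sym_tweak moves both quotes away from the mid (more conservative);
    asym_tweak shifts both quotes up (> 0) or down (< 0).
    """
    bid = mid - half_spread - sym_tweak + asym_tweak
    ask = mid + half_spread + sym_tweak + asym_tweak
    return bid, ask
```

For example, with a mid of 10000 ticks and a base half-spread of 1 tick, `quote_prices(10000, 1, 0, 0)` gives `(9999, 10001)`, while `quote_prices(10000, 1, 2, -1)` gives `(9996, 10002)`: both quotes move away from the mid and the quote center drops by one tick, which is the qualitative reaction of the continual training group to an imbalanced book.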
In the next experiment, we use only continual learning Group A MM agents, since they need to adapt to changing market conditions. The LT agents are also from Group A, but their reward function changes as the target buy/sell parameters evolve. Figure 9 shows the price process resulting from the activity of these informed LT agents. The four phases, separated by dashed red lines, are: Sell (0.3/0.4), Buy (0.4/0.35), Balanced buy and sell (0.4/0.4), and finally Buy (0.4/0.3), where the numbers in parentheses indicate (buy fraction/sell fraction). We expect the price movement to align with the target buy/sell parameters.
Additionally, as in the previous study, we collect states in the first and last phases (i.e., steps 0-10,000, when the price goes down, and steps 30,000-36,000, when the price goes up). We feed these states to the MM’s policy function at the beginning of the trading day (Before) and at the end of the trading day (After). Figure 10 shows the distributions of the symmetric and asymmetric tweak outputs. In both scenarios, whether the price is going down or up, the symmetric tweak (top row) gets larger, so the MM agent becomes more conservative and tends to widen the spread. This aligns with the findings in the previous experiment. The distribution of the asymmetric tweak also shifts negative when prices are going down and positive when prices are going up. This means the learned agents update their expectations of future prices along with the observed market direction. In contrast, the average asymmetric tweak is close to zero for agents that are not continually trained. This helps explain why the price tends to return to pre-flash-sale values for the agents in Group B in Figure 7.
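The phase schedule and the expected sign of the asymmetric tweak can be encoded compactly. In this hypothetical sketch, `Phase`, `expected_direction`, `in_window`, and `asym_tweak_agrees` are illustrative names. The buy/sell fractions and the two step windows come from the text (the windows are treated as half-open ranges, an assumption); the boundaries of the middle phases are not given, so they are not encoded. The sketch only compares the two fractions and does not model how the LT reward actually uses them.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Phase(NamedTuple):
    name: str
    buy_fraction: float   # target buy fraction of the informed LT agents
    sell_fraction: float  # target sell fraction of the informed LT agents


# The four phases in the order given in the text, as (buy fraction/sell fraction).
PHASES: List[Phase] = [
    Phase("Sell", 0.3, 0.4),
    Phase("Buy", 0.4, 0.35),
    Phase("Balanced", 0.4, 0.4),
    Phase("Buy", 0.4, 0.3),
]

# Step windows used to collect states, read here as half-open [start, end).
DOWN_WINDOW: Tuple[int, int] = (0, 10_000)
UP_WINDOW: Tuple[int, int] = (30_000, 36_000)


def expected_direction(phase: Phase, tol: float = 1e-9) -> int:
    """+1 if the buy target exceeds the sell target, -1 if below, 0 if equal."""
    diff = phase.buy_fraction - phase.sell_fraction
    if diff > tol:
        return 1
    if diff < -tol:
        return -1
    return 0


def in_window(step: int, window: Tuple[int, int]) -> bool:
    start, end = window
    return start <= step < end


def asym_tweak_agrees(tweaks: Sequence[float], direction: int) -> bool:
    """True if the mean asymmetric tweak has the sign of a nonzero direction."""
    if not tweaks or direction == 0:
        raise ValueError("need at least one tweak and a nonzero direction")
    mean = sum(tweaks) / len(tweaks)
    return mean * direction > 0
```

Applied to the four phases, `expected_direction` yields -1, +1, 0, +1. The sketch therefore predicts a negative mean asymmetric tweak for states in the first window and a positive one for states in the last window, which is the direction of the shift reported for the continually trained MM in Figure 10.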
Authors:
(1) Zhiyuan Yao, Stevens Institute of Technology, Hoboken, New Jersey, USA;
(2) Zheng Li, Stevens Institute of Technology, Hoboken, New Jersey, USA;
(3) Matthew Thomas, Stevens Institute of Technology, Hoboken, New Jersey, USA;
(4) Ionut Florescu, Stevens Institute of Technology, Hoboken, New Jersey, USA.
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.