Recently we were investigating Reinforcement Learning (RL) for a sequential decision-making problem: we have a supply chain with multiple customers to satisfy, and the customers themselves have a wide range of demands. Over time, if the supply chain fails to meet a customer's product requests, that customer's satisfaction decreases and, consequently, their future orders decrease as well.
Thrown into this mix are the stochastic nature of customer demand, a priority ranking of customers, constraints on our production capacity, and sudden, unexpected bursts in demand. Meanwhile, the action the supply chain can take at any instance is to choose between maximizing the plain average of satisfaction across all customers or maximizing a weighted average of customer satisfaction.
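To make the two choices concrete, here is a minimal sketch of the two reward signals the agent picks between. The names (`satisfaction`, `priority`) and the NumPy formulation are illustrative assumptions, not our actual implementation.

```python
import numpy as np

# Hypothetical reward signals: one satisfaction score per customer, plus a
# priority weight per customer for the weighted variant.

def average_satisfaction(satisfaction: np.ndarray) -> float:
    """Plain average of satisfaction across all customers."""
    return float(satisfaction.mean())

def weighted_satisfaction(satisfaction: np.ndarray, priority: np.ndarray) -> float:
    """Satisfaction averaged with weights derived from the customer priority ranking."""
    weights = priority / priority.sum()
    return float(np.dot(weights, satisfaction))

# e.g. three customers, the third being the highest priority
s = np.array([0.9, 0.7, 0.4])
p = np.array([1.0, 2.0, 3.0])
print(average_satisfaction(s), weighted_satisfaction(s, p))
```

The priority ranking of customers is what makes the weighted variant differ from the plain average.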
You can catch up on the work we've done so far here.
The RL agent/bot’s job here is to decide which action to take whenever customer requests come in. Interestingly, in hindsight, we observed that maximizing the weighted average of customer satisfaction is the best action to take in every instance. One would expect the RL agent to figure this out, but its performance in picking actions was much lower than expected, leading us to a multitude of questions. The following are some of our observations.
A large class of RL algorithms has no theoretical guarantee that the solution it provides is the best (i.e. optimal) among all possible solutions. A bit of background on RL is in order here: most of the theory on RL is based on Markov Decision Processes (MDPs), so we’ll quickly brush up on them.
Let’s say a person intends to go to his office either by tram or on foot. Now if it’s cloudy outside, there’s a 50% likelihood that he’ll go by tram and a 50% likelihood that he’ll walk. Similarly, if it’s sunny, the likelihood of going by tram is maybe 30% and of walking is 70%. Likewise, there’s some likelihood of getting back home from the office by tram or on foot, depending on the weather outside.
Now, depending on the action taken, the person incurs a penalty in terms of money spent (0 bitcoins for walking, > 0 bitcoins for the tram). The states of this system are:
a) cloudy outside,
b) sunny outside.
The actions of this system are:
a) traveling by tram
b) traveling on foot.
This system, combined with a few additional subtleties, is called a Markov Decision Process (MDP).
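Written out as data, the example boils down to a couple of small tables. A minimal sketch follows; the numbers simply restate the ones above, and the tram penalty is an arbitrary placeholder for "> 0 bitcoins".

```python
# The commuting example written out as data.

states = ["cloudy", "sunny"]
actions = ["tram", "walk"]

# Likelihood of each action given the weather, as quoted above
likelihood = {
    "cloudy": {"tram": 0.5, "walk": 0.5},
    "sunny":  {"tram": 0.3, "walk": 0.7},
}

# Penalty (money spent) for each action; the tram cost is a placeholder
penalty = {"tram": -1.0, "walk": 0.0}
```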
In such systems with a set of states and actions, RL is a class of algorithms for inferring conclusions (most importantly, what the best action to take in any given state is). As long as the number of states and actions is finite (as above), theoretical guarantees exist showing that RL algorithms can arrive at the optimal action for each state.
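As a sketch of what this looks like in the tabular setting, below is a minimal tabular Q-learning loop on an assumed two-state, two-action version of the commuting example. The transition probabilities and rewards are illustrative assumptions; with finite tables like these, convergence to the optimal action per state is guaranteed under standard conditions (sufficient exploration and appropriately decaying learning rates).

```python
import numpy as np

# Tabular Q-learning on a tiny, assumed version of the commuting MDP:
# states {cloudy, sunny}, actions {tram, walk}.
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
P = np.array([[[0.5, 0.5], [0.5, 0.5]],   # P[s, a, s']: assumed weather transitions
              [[0.3, 0.7], [0.3, 0.7]]])
R = np.array([[-1.0, 0.0],                # R[s, a]: tram costs money, walking is free
              [-1.0, 0.0]])

Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.9, 0.1, 0.1
s = 0
for _ in range(50_000):
    # epsilon-greedy action selection
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s_next = rng.choice(n_states, p=P[s, a])
    # standard Q-learning update
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("optimal action per state:", Q.argmax(axis=1))  # expected: walk (index 1) in both states
```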
However, the vast majority of interesting real-life problems consist of an enormous, effectively infinite number of states, actions, or both (like playing a video game). It becomes computationally prohibitive to store these huge numbers of states and actions in memory.
In such cases, we need some sort of approximator to represent the state/action of the system (built from incomplete perceptions of the environment). If we use a non-linear approximator here (think: a neural network), there is no guarantee of reaching the optimal solution, but we get a pretty decent approximation.
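As a sketch of what such a non-linear approximator looks like in practice (in the spirit of DQN-style agents), a small neural network can map an observed state vector to one estimated Q-value per action, and the agent acts greedily on those estimates. The layer sizes and dimensions below are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one estimated Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),  # one Q-value estimate per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# The agent acts greedily on the approximation; nothing guarantees these
# estimates converge to the true optimal Q-values.
q_net = QNetwork(state_dim=8, n_actions=2)
state = torch.randn(1, 8)
action = q_net(state).argmax(dim=1)
```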
This lack of a theoretical guarantee of arriving at the best action for a given state means that approximate RL agents have to be experimented with heavily (parameter tuning) on a case-by-case basis to make sure they work as intended.
Since the RL agent didn’t work as expected in our test case, we investigated where RL has become so popular, and specifically, the types of situations in which it has been effective.
The majority of Reinforcement Learning’s success has come from learning to play games (like Atari games, chess, and Go). Here the number of possible actions and states is so humongous that it is not humanly possible to go through all of them while making a move (imagine iterating through all possible moves and their future consequences before making a move in a game of chess).
Furthermore, let’s say in a game of chess you’ve made your 10th move. The positioning of the pieces on the board by the 20th move will, in all likelihood, be heavily influenced by your 10th move. So future states are greatly impacted by your current action.
In the case of our supply chain example, this long-term effect of the current action is feeble at best. And the disparity between outcomes of the possible actions is minimal. It goes without saying that we’ve made these observations in hindsight.
Since the number of states in our supply chain example is infinitely large, we use a neural-network approximation in our RL algorithm. Because the disparity between the outcomes of the possible actions is quite low (i.e. the RL agent sees very little difference in reward between the actions), we were interested in investigating whether there is an explainable lower bound on the efficiency of neural networks. We restrict our investigation to the behavior of systems that take in numerical values and compute an output based on a relation between them (i.e. we ignore image and natural language processing).
We performed a simple experiment: getting a feed-forward neural network to learn the mathematical function of addition, a function whose outcome is straightforward (f(x, y) = x + y). Here, the human accuracy of ‘predicting’ the output is 100% (we know how to add two numbers). Interestingly enough, in the case of the neural network, the in-train accuracy is almost always < 100%, as shown in the figure below.
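For illustration, a minimal sketch of this kind of experiment is below; the sampling range, hyperparameters, and the tolerance used to define ‘accuracy’ are assumptions for the sketch, not the exact setup behind the figure.

```python
import torch
import torch.nn as nn

# A small feed-forward network trained to approximate f(x, y) = x + y.
torch.manual_seed(0)
X = torch.rand(10_000, 2) * 100          # assumed sampling range for (x, y)
y = X.sum(dim=1, keepdim=True)           # exact targets x + y

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# "Accuracy": fraction of predictions within an assumed tolerance of the true
# sum; a human adding two numbers would score 1.0 by this measure.
with torch.no_grad():
    accuracy = (torch.abs(model(X) - y) < 0.5).float().mean().item()
print(f"in-train accuracy: {accuracy:.3f}")
```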
We can see that the neural network learns ‘how to add’ reasonably well, but compared to a human it is sub-par. From this, we hypothesize that for systems that aren’t too complex for human cognizance, neural networks don’t offer any improvement; their value lies in approximating complex systems beyond our interpretation. This explains why our RL agent wasn’t able to learn to take the most obvious action in all states. But if the neural network truly learned, this subpar performance shouldn’t be the outcome. So it’s worth pondering the much larger, philosophical question of whether neural networks truly learn or merely fit a curve to data points.
The featured image for this article was generated with Kadinsky 2
Prompt: Illustrate a chain.