Sometimes we run into extravagant ideas that end up bearing no fruits. I spent the good part of the last day trying to nail down the dropout regularization technique in neural networks to derive from the Braess Paradox in Game theory congestion games. Sadly enough for my Academic Machine learning career, this doesn’t seem to hold.
I will shortly explain what is Dropout, what is the Braess Paradox, Why would they be related and what did I find experimenting with it.
Dropout is a standard regularization technique when training Neural network to prevent overfitting. Instead of training the whole fully connected layers at once, you sample network paths with probability p, train the sub-neural-nets created in the process, then later combine them together to your final neural net on which you test and perform actual predictions. It is shown in the research paper and in many later usages of the technique that it does in fact improve the test accuracy and prevents overfitting.
The Braess paradox is a paradox in the theory of Congestion games. Like other phenomenon as the tragedy of the commons and the prisoner’s dilemma, it shows how things can go wrong in multi-player games where the players aim to maximize their gain. The weird thing about the Braess paradox though, is that by adding more edges to a transportation network, presumably ‘improving the infrastructure’, congestion can grow by up to 4/3.
In the example in the illustration, which I gratefully taken from the article here, say the edges marked with x make up for ½ an hour latency for a single player but 1 hour if both players use them. You can examine that in the case on the right, equilibrium will be for the two players to use each a different colored path, one using the red and one using the blue. In the case on the left though, before removing the diagonal connective edge actually make both players use the same route, ending up with 2h latency. This is the equilibrium and no player will switch to a different route.
The Braess paradox have been since investigated thoroughly as it could have real-life implications. What’s the use of adding roads to the public highway system if it could actually worsen congestion ? Instead of being a linear straight-forward question of resource allocation, building new infrastructure becomes worrisome. The Braess paradox can also appear in any network — communication networks, digital currencies exchange rates and there are even a bunch of articles about it appearing in physics.
It was especially the Physical examples that drove my attention to the question -
Could the Braess Paradox appear in Neural nets ?
As in physics it’s hard to claim there are selfish agents whose action cause a non-optimal equilibrium to appear. Combining this and the fact that in Neural nets:
- The standard way of operating is arranging the neurons in layers, rather than a full mesh network
- We have the Dropout technique mentioned before, where removing network edges helps optimization
- We have CNNs and other structures where locality and reduced connectivity helps optimization
I had the Eureka moment of: Aha ! The two must be related.
As I said before, not all Eureka moments were born equal, and this one proved to be rather superficial than profound. Who knows though, maybe someone reading this post will find the missing link.
To check this hypothesis, I set out with the following code:
What I basically did was build a small neural net in the shape of the Braess network, and by playing with different regularization, learning rates and especially initial weights, find examples where the network optimizes better or faster when the diagonal edge is removed. Although I did find many examples for it, the expected behavior in a case where the source of trouble is the Braess paradox is that the diagonal link will draw all the weights to itself. So, in a way, while optimizing the neural network weights and bias to fit the data, the analogue of latency will be loss, and with the correct — or rather incorrect — set of weights the optimization process will neglect all other paths but the one involving the diagonal path.
Sadly, examining 100K different initial weights I found no example of the weights behaving this way by the end of the Neural net optimization. When the reduced-edge net performed better than the dense-net, the weights on the diagonal path were very actually very low, in one or two orders of magnitude than other weights in the network.
Thinking about it, the neural net is not really susceptible for the Braess paradox, for two main reasons:
- The functions on the edges don’t behave so linearly and nice. But this does not imply immediately the Braess paradox can’t appear there, and also there are examples of Braess paradox in more complex functions.
- The optimization of the Neural net as done in Backpropagation is done all at once. All the weights are updated according to one loss function that propagates back rather than to each path by itself. Though I’m not a physics expert, I believe that this differs even from the physics settings of Braess paradox, where the different springs and strings or electrons behave in a way like independent agents optimizing their own path. If the neural net was optimized by a greedy algorithm of some sort, it might have happened that the Braess paradox would appear explicitly.
Anyway, it was fun experimenting and thinking harder about how Neural nets should be constructed.
For more about the Braess paradox please read at: