6 Results
Since we are interested in the phenomenon of catastrophic forgetting itself, we report the learning systems’ performance only in terms of the metrics described in Section 4 and omit their performance on the underlying problems; the curious reader can refer to Appendix C of our supplementary material for that information. The left side of Figure 1 shows the retention and relearning of the four optimizers in the MNIST testbed, and the right side shows the retention of the four optimizers in the Fashion MNIST testbed. Recall that retention is defined as the learning system’s accuracy on the first task after it has been trained to mastery on the first task and then on the second, and relearning is defined as the length of the first training phase relative to the length of the third. When comparing the retention displayed by the optimizers in the MNIST testbed, RMSProp vastly outperformed the other three. However, when comparing relearning instead, SGD was the clear leader. In the Fashion MNIST testbed, retention was less than 0.001 with each of the four optimizers. Nevertheless, the same trend with regard to
relearning seen in the MNIST testbed results can also be observed in the Fashion MNIST testbed results. Also notable, Adam displayed particularly poor performance in all cases. The left side of Figure 2 shows the activation overlap and pairwise interference of the four optimizers in the MNIST testbed, and the right side shows the same in the Fashion MNIST testbed. Note that, in Figure 2, lines stop when at least half of the runs for a given optimizer have moved on to the next phase. Also note that activation overlap should be expected to increase as training progresses, since the network’s representation of samples starts as random noise. Interestingly, the results under the MNIST and Fashion MNIST testbeds are similar. Consistent with the retention and relearning metrics, Adam exhibited the highest amount of activation overlap. In contrast to those metrics, however, RMSProp exhibited the second highest. Only minimal amounts were displayed by both SGD and SGD with Momentum.
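Activation overlap admits a simple formalization: the fraction of hidden units that are simultaneously active (nonzero after a ReLU) for two inputs. The sketch below illustrates this under assumed details; the single random layer, its weights, and the normalization are hypothetical stand-ins, not the exact measure used in our experiments.

```python
import numpy as np

def activation_overlap(h1, h2):
    # Fraction of hidden units active (nonzero) for both inputs.
    # One plausible normalization; the exact measure may differ.
    return np.mean((h1 > 0) & (h2 > 0))

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 8))            # hypothetical hidden layer
x1, x2 = rng.normal(size=8), rng.normal(size=8)
h1 = np.maximum(W @ x1, 0.0)            # ReLU features for sample 1
h2 = np.maximum(W @ x2, 0.0)            # ReLU features for sample 2
overlap = activation_overlap(h1, h2)    # value in [0, 1]
```

Higher overlap means the two samples share more of the network’s representation, so an update made for one sample is more likely to disturb what was learned for the other.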
When compared with activation overlap, the pairwise interference reported in Figure 2 agrees much more closely with the retention and relearning metrics: SGD displays less pairwise interference than RMSProp, which, in turn, displays much less than either Adam or SGD with Momentum. Figure 3 shows the activation overlap and pairwise interference of each of the four optimizers in the Mountain Car and Acrobot testbeds at the end of each episode. In Mountain Car, Adam exhibited both the highest mean and the highest final activation overlap, whereas SGD with Momentum exhibited the least. In Acrobot, however, SGD with Momentum exhibited both the highest mean and the highest final activation overlap. Looking at the post-episode pairwise interference values shown in Figure 3, some disagreement is again observed: while SGD with Momentum did well in both Mountain Car and Acrobot, vanilla SGD did well only in Acrobot and did the worst in Mountain Car. Notably, pairwise interference in Mountain Car is the only instance, under any of the metrics or testbeds, of Adam being among the better two optimizers.
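Pairwise interference can be formalized as the change in loss on one sample caused by a single SGD update on another: positive values indicate interference (the update hurt the other sample), negative values indicate helpful transfer. The linear model and squared-error loss below are illustrative assumptions, not the networks or loss used in our testbeds.

```python
import numpy as np

def loss(w, x, y):
    # Squared error of a linear model -- a stand-in for the network's loss.
    return 0.5 * (w @ x - y) ** 2

def grad(w, x, y):
    return (w @ x - y) * x

def pairwise_interference(w, xi, yi, xj, yj, lr=0.01):
    # Change in loss on sample i after one SGD step on sample j.
    # Positive: the update on j hurt i; negative: it helped i.
    w_after = w - lr * grad(w, xj, yj)
    return loss(w_after, xi, yi) - loss(w, xi, yi)

w = np.array([1.0, -1.0])
x = np.array([1.0, 1.0])
# A step on a sample reduces that sample's own loss (negative value).
same = pairwise_interference(w, x, 1.0, x, 1.0)
# With orthogonal inputs, a linear model shows no interference at all.
ortho = pairwise_interference(w, x, 1.0, np.array([1.0, -1.0]), 0.0)
```

Averaging this quantity over sampled pairs gives a scalar that can be tracked over training, which is how the curves in Figures 2 and 3 should be read.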
Figure 4 shows the retention and relearning in the MNIST testbed for SGD with Momentum as a function of the momentum parameter, and for RMSProp as a function of the coefficient of the moving average. As would be expected given the results with vanilla SGD, lower values of momentum produce less forgetting. Conversely, lower coefficients produce worse retention in RMSProp but seem to have less effect on relearning. Note that, under all the variations shown here, in no instance does SGD with Momentum or RMSProp outperform vanilla SGD with respect to relearning. Similar to Figure 4, Figure 5 shows the retention and relearning of the four optimizers as a function of α. While α unsurprisingly has a large effect on both metrics, the effect is smooth, with similar values of α producing similar values of retention and relearning.
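For concreteness, the three-phase protocol behind the retention and relearning numbers can be sketched as follows. Everything here is an illustrative assumption: a logistic-regression model, synthetic two-blob tasks where the second task flips the labels of the first, a 95% mastery threshold, and plain full-batch gradient descent; the actual networks, tasks, optimizers, and thresholds in our experiments differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(flip):
    # Two Gaussian blobs separable along the first axis; `flip` swaps the
    # labels, making the second task directly conflict with the first.
    Xa = rng.normal([+2.0, 0.0], 0.5, size=(100, 2))
    Xb = rng.normal([-2.0, 0.0], 0.5, size=(100, 2))
    X = np.vstack([Xa, Xb])
    y = np.array([flip] * 100 + [1 - flip] * 100)
    return X, y

def accuracy(w, X, y):
    return float(np.mean((X @ w > 0) == (y == 1)))

def train_to_mastery(w, X, y, thresh=0.95, lr=0.1, max_steps=10_000):
    # Logistic-regression gradient descent until the mastery threshold is hit.
    steps = 0
    while accuracy(w, X, y) < thresh and steps < max_steps:
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w = w - lr * X.T @ (p - y) / len(y)
        steps += 1
    return w, steps

task1, task2 = make_task(0), make_task(1)
w = rng.normal(0.0, 0.1, size=2)
w, phase1 = train_to_mastery(w, *task1)   # phase 1: master task 1
w, phase2 = train_to_mastery(w, *task2)   # phase 2: master task 2
retention = accuracy(w, *task1)           # accuracy on task 1 after phase 2
w, phase3 = train_to_mastery(w, *task1)   # phase 3: relearn task 1
relearning = phase1 / max(phase3, 1)      # phase-length ratio
```

Because the two toy tasks directly conflict, mastering the second drives accuracy on the first toward zero, mirroring the near-zero retention observed in the Fashion MNIST testbed.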
