This Is How Your Model Forgets What It Just Learned

Written by adamoptimizer | Published 2026/03/18
Tech Story Tags: ai-model-stability | catastrophic-forgetting | neural-networks-memory-loss | continual-learning-ai | ai-training | adam-optimizer | reinforcement-learning | machine-learning-evaluation

TL;DR: This section outlines experiments designed to measure catastrophic forgetting in neural networks across supervised and reinforcement learning tasks. Using the MNIST and Fashion MNIST datasets and the Mountain Car and Acrobot domains, the study evaluates how different training setups and optimizers (SGD, SGD with Momentum, RMSProp, Adam) influence a model's ability to retain and relearn information. The findings aim to clarify how architectural and training choices affect long-term learning stability in AI systems.

Abstract

1 Introduction

2 Related Work

3 Problem Formulation

4 Measuring Catastrophic Forgetting

5 Experimental Setup

6 Results

7 Discussion

8 Conclusion

9 Future Work and References

5 Experimental Setup

In this section, we design the experiments that will help answer our earlier questions: (1) how should we quantify catastrophic forgetting, and (2) to what degree do the choices we make when designing learning systems affect the amount of catastrophic forgetting? To address these questions, we apply the four metrics from the previous section to four different testbeds. For brevity, we defer less relevant details of our experimental setup to Appendix A of our supplementary material.

The first two testbeds build on the MNIST (LeCun et al., 1998) and Fashion MNIST (Xiao et al., 2017) datasets, respectively, to create two four-class image classification supervised learning tasks. In both tasks, we separate the overall task into two distinct subtasks: the first subtask only includes examples from the first and second classes, and the second subtask only includes examples from the third and fourth classes. The learning system learns these subtasks in four phases, where the first and third phases contain only the first subtask, and the second and fourth phases contain only the second subtask. Each phase transitions to the next only when the learning system has achieved mastery in that phase; here, that means the learning system must maintain a running accuracy of 90% in that phase for five consecutive steps. All learning here, as in the other two testbeds, is fully online and incremental.
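The mastery-based phase schedule above can be sketched in a few lines. This is a hypothetical reconstruction: the 90% threshold and five-step streak come from the text, while the 100-example accuracy window, the learner's `predict`/`update` interface, and the `OracleLearner` demo class are our own assumptions.

```python
from collections import deque

# Mastery criterion from the text; the window size below is an assumption.
MASTERY_ACC = 0.90
MASTERY_STREAK = 5

def run_phase(learner, stream):
    """Train fully online on `stream`; stop once the running accuracy has
    been at least 90% for five consecutive steps. Returns the number of
    steps taken in the phase."""
    streak, steps = 0, 0
    window = deque(maxlen=100)           # running-accuracy window (assumed size)
    for x, y in stream:
        window.append(int(learner.predict(x) == y))
        learner.update(x, y)             # one incremental update per example
        steps += 1
        acc = sum(window) / len(window)
        streak = streak + 1 if acc >= MASTERY_ACC else 0
        if streak >= MASTERY_STREAK:
            break
    return steps

class OracleLearner:
    """Toy learner that is always correct, for demonstration only."""
    def predict(self, x):
        return x % 2
    def update(self, x, y):
        pass
```

The four phases then simply alternate between the two subtask streams (first, second, first, second), calling `run_phase` once per phase; an always-correct learner masters a phase in exactly five steps.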

The third and fourth testbeds draw examples from an agent operating under a fixed policy in a standard undiscounted episodic reinforcement learning domain. For the third testbed, we use the Mountain Car domain (Moore, 1990; Sutton and Barto, 1998) and, for the fourth, the Acrobot domain (DeJong and Spong, 1994; Spong and Vidyasagar, 1989; Sutton, 1995). The learning system's goal in both testbeds is to learn, at each timestep, the value of the current state; these are thus both reinforcement learning value estimation testbeds (Sutton and Barto, 2018, p. 74).

There are several significant differences between the four testbeds worth noting. First, the MNIST and Fashion MNIST testbeds' data streams consist of multiple phases, each containing only i.i.d. examples, whereas the Mountain Car and Acrobot testbeds have only one phase each, and that phase contains strongly temporally-correlated examples. One consequence of this difference is that only intra-task catastrophic forgetting metrics can be used in the Mountain Car and Acrobot testbeds, so the retention and relearning metrics of Section 4 can only be measured in the MNIST and Fashion MNIST testbeds. While it is theoretically possible to derive semantically similar metrics for the Mountain Car and Acrobot testbeds, doing so is non-trivial: in addition to these testbeds consisting of only a single phase, it is somewhat unclear what mastery means in these contexts. Another difference is that in the MNIST testbed, since the network is solving a four-class image classification problem in four phases with not all digits appearing in each phase, some weights connected to the output units of the network will be protected from modification in some phases. This property of such experimental testbeds has been noted previously by Farquhar and Gal (2018, Section 6.3.2). In the Mountain Car and Acrobot testbeds, no such weight protection exists.
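To make the value estimation objective concrete, here is a minimal stand-in: tabular, undiscounted TD(0) on the classic five-state random walk under a fixed uniform-random policy. This is an illustration of on-policy value estimation in general, not the study's actual setup, which trains shallow ANNs on Mountain Car and Acrobot.

```python
import random

def td0_random_walk(n_states=5, alpha=0.05, episodes=5000, seed=0):
    """Estimate state values under a fixed policy with undiscounted TD(0).
    Left terminal gives reward 0, right terminal gives reward 1."""
    rng = random.Random(seed)
    V = [0.5] * n_states                     # value estimate per non-terminal state
    for _ in range(episodes):
        s = n_states // 2                    # every episode starts in the middle
        while True:
            s2 = s + rng.choice((-1, 1))     # fixed policy: step left/right uniformly
            if s2 < 0:                       # left terminal: reward 0, episode ends
                V[s] += alpha * (0.0 - V[s])
                break
            if s2 >= n_states:               # right terminal: reward 1, episode ends
                V[s] += alpha * (1.0 - V[s])
                break
            V[s] += alpha * (V[s2] - V[s])   # undiscounted TD(0) update
            s = s2
    return V
```

The true values in this toy domain are (i + 1)/6 for states i = 0..4, so the estimates should increase from left to right and sit near 0.5 in the middle.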

For each of the four testbeds, we use shallow feedforward ANNs trained through backpropagation (Rumelhart et al., 1986), and we experiment with four different optimizers: (1) SGD, (2) SGD with Momentum (Qian, 1999; Rumelhart et al., 1986), (3) RMSProp (Hinton et al., n.d.), and (4) Adam (Kingma and Ba, 2014). For Adam, in accordance with the recommendations of its creators (Kingma and Ba, 2014), we set β1, β2, and ε to 0.9, 0.999, and 10^-8, respectively. As Adam can be roughly viewed as a union of SGD with Momentum and RMSProp, we may expect that if one of the two is particularly susceptible to catastrophic forgetting, so too would be Adam. Thus, there is some understanding to be gained by aligning their hyperparameters with those used by Adam: in RMSProp, we set the coefficient of the moving average to 0.999 and ε to 10^-8, and, in SGD with Momentum, we set the momentum parameter to 0.9. In the MNIST and Fashion MNIST testbeds, we select one step size α for each of the above optimizers by trying each of 2^-3, 2^-4, ..., 2^-18 and selecting whichever value minimized the average number of steps needed to complete the four phases. As the Mountain Car and Acrobot testbeds are likely to be harder for the ANN to learn, we select one α for each of these testbeds by trying each of 2^-3, 2^-3.5, ..., 2^-18 and selecting whichever value minimized the average area under the curve of the post-episode mean squared value error. We provide a sensitivity analysis for the coefficient of the moving average in RMSProp, for the momentum parameter in SGD with Momentum, and for our selection of α with each of the four optimizers. We limit this sensitivity analysis to the retention and relearning metrics in the MNIST testbed, and extend it to the other metrics and testbeds in Appendix E of our supplementary material.
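The optimizer settings and step-size grids above can be collected into a small configuration sketch. The dictionary layout and the `select_alpha` helper are our own; only the numeric values (0.9, 0.999, 10^-8, and the powers-of-two grids) come from the text.

```python
# Hyperparameters from the text; keys and structure are our own convention.
OPTIMIZERS = {
    "sgd":      {},
    "momentum": {"momentum": 0.9},                 # aligned with Adam's beta1
    "rmsprop":  {"avg_coef": 0.999, "eps": 1e-8},  # aligned with Adam's beta2, eps
    "adam":     {"beta1": 0.9, "beta2": 0.999, "eps": 1e-8},
}

# Supervised testbeds: alpha drawn from 2^-3, 2^-4, ..., 2^-18.
ALPHAS_SUPERVISED = [2.0 ** -k for k in range(3, 19)]
# RL testbeds use a finer grid: 2^-3, 2^-3.5, ..., 2^-18.
ALPHAS_RL = [2.0 ** (-k / 2) for k in range(6, 37)]

def select_alpha(alphas, criterion):
    """Return the step size minimizing `criterion`, e.g. average steps to
    complete all four phases, or area under the value-error curve."""
    return min(alphas, key=criterion)
```

For instance, `select_alpha(ALPHAS_SUPERVISED, steps_to_mastery)` would pick the supervised step size, given a hypothetical `steps_to_mastery` evaluation function.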

This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

