Proc Natl Acad Sci U S A. 2017 Mar 28;114(13):3521-3526. doi: 10.1073/pnas.1611835114. Epub 2017 Mar 14.

Overcoming Catastrophic Forgetting in Neural Networks

James Kirkpatrick et al. Proc Natl Acad Sci U S A. 2017.

Abstract

The ability to learn tasks in a sequential fashion is crucial to the development of artificial intelligence. Until now neural networks have not been capable of this, and it has been widely thought that catastrophic forgetting is an inevitable feature of connectionist models. We show that it is possible to overcome this limitation and train networks that can maintain expertise on tasks that they have not experienced for a long time. Our approach remembers old tasks by selectively slowing down learning on the weights important for those tasks. We demonstrate our approach is scalable and effective by solving a set of classification tasks based on a hand-written digit dataset and by learning several Atari 2600 games sequentially.
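
In the paper, this selective slowing takes the form of a quadratic penalty, elastic weight consolidation (EWC): when training on a new task B after task A, each weight is pulled back toward its task-A value in proportion to how important it was for task A, as measured by the diagonal of the Fisher information matrix,

L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_{A,i})²

where θ*_A are the weights found after training on task A, F_i is the Fisher information of weight i under task A, and λ sets how strongly the old task is protected relative to the new one.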

Keywords: artificial intelligence; continual learning; deep learning; stability plasticity; synaptic consolidation.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
EWC ensures task A is remembered while training on task B. Training trajectories are illustrated in a schematic parameter space, with parameter regions leading to good performance on task A (gray) and on task B (cream color). After learning the first task, the parameters are at θA. If we take gradient steps according to task B alone (blue arrow), we will minimize the loss of task B but destroy what we have learned for task A. On the other hand, if we constrain each weight with the same coefficient (green arrow), the restriction imposed is too severe and we can remember task A only at the expense of not learning task B. EWC, conversely, finds a solution for task B without incurring a significant loss on task A (red arrow) by explicitly computing how important weights are for task A.
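
A minimal sketch of this per-weight constraint, assuming a PyTorch model and a diagonal Fisher estimated from squared gradients on task A (the helper names estimate_diag_fisher, ewc_penalty, fisher, and theta_a are illustrative, not the paper's code):

import torch

def estimate_diag_fisher(model, data_loader, loss_fn):
    # Empirical diagonal Fisher: average squared gradient of the task-A loss.
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher, theta_a, lam):
    # Quadratic pull toward the task-A solution, weighted by importance F_i.
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - theta_a[n]) ** 2).sum()
    return 0.5 * lam * penalty

After task A one would snapshot theta_a = {n: p.detach().clone() for n, p in model.named_parameters()} and then minimize loss_fn(model(x), y) + ewc_penalty(model, fisher, theta_a, lam) on task B; this corresponds to the red arrow in the schematic, whereas replacing fisher[n] with a single shared coefficient gives the overly restrictive green-arrow constraint.
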
Fig. 2.
Log-log plot of the SNR for recalling the first pattern after observing t random patterns. If no penalty is applied (blue), the SNR decays as (n/t)^0.5 only when t is smaller than the number of synapses n = 1,000 and then decays exponentially. When EWC is applied (red), the decay takes a power-law form for all times. The dashed and solid lines show the analytic solutions derived in Eqs. S28 and S30. The fraction of memories retained (Bottom) is defined as the fraction of patterns whose SNR exceeds 1. EWC results in a higher fraction of memories being retained when the network is at capacity (t ≈ n). After network capacity is exceeded (Right), EWC performs worse than gradient descent (Discussion). More detailed plots can be found in the Supporting Information, Figs. S1 and S2.
Fig. S1.
Signal (Bottom row) and noise (Top row) terms for the EWC (Left column) and gradient descent (Right column) cases as a function of time t. The blue curves are the results of numeric simulations, whereas the red curves show the analytic results. Each panel contains stimuli observed at different times i. (Top Left) (noise in the EWC case) The solid red curve is the full form of Eq. S27, the dashed line is Eq. S25 (which is valid for small times), and the dashed-dotted line is the long time approximation of the noise in Eq. S29. The different solid blue curves correspond to the noise from patterns observed at times 1, 50, 100, and 500. (Top Right) (noise in the gradient descent case) The red curve shows Eq. S25. (Bottom Left) (signal in the EWC case) The red curve is Eq. S13. (Bottom Right) (signal for the gradient descent case) The red curves are Eq. S11.
Fig. S2.
Comparison of EWC (red lines) with gradient descent at different learning rates (green, α = 1.0; blue, α = 0.5). Bottom shows that these two learning rates correspond to the learning rate used in EWC at the first pattern and at network capacity (t = n = 1,000). Top shows a log-log plot of the SNR for the first pattern observed. The black lines show the analytic expressions from Eqs. S13, S27, and S32. Note that EWC has a power-law decay for the SNR, whereas gradient descent eventually decays exponentially, albeit at a later time for the lower learning rate. Middle shows the fraction of memories retained (i.e., with SNR > 1) in the three cases. Note that the lower learning rate has a moderately higher fraction of memories retained than the larger one, but that EWC still has higher memory retention.
Fig. 3.
Results on the permuted MNIST task. (A) Training curves for three random permutations A, B, and C, using EWC (red), L2 regularization (green), and plain SGD (blue). Note that only EWC is capable of maintaining a high performance on old tasks, while retaining the ability to learn new tasks. (B) Average performance across all tasks, using EWC (red) or SGD with dropout regularization (blue). The dashed line shows the performance on a single task only. (C) Similarity between the Fisher information matrices as a function of network depth for two different amounts of permutation. Either a small square of 8 × 8 pixels in the middle of the image is permuted (gray) or a large square of 26 × 26 pixels is permuted (black). Note how the more different the tasks are, the smaller the overlap in Fisher information matrices in early layers.
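
The overlap in C can be probed with a hedged sketch like the one below, reusing a diagonal Fisher estimate per task; the normalized inner product used here is only an illustrative similarity score, and the paper's exact overlap metric may differ:

def layer_fisher_overlap(fisher_a, fisher_b):
    # Per-layer similarity of two diagonal Fisher estimates (one per task).
    # Values near 1 mean both tasks rely on the same weights in that layer;
    # values near 0 mean the sets of important weights barely overlap.
    overlap = {}
    for name in fisher_a:
        f1 = fisher_a[name].flatten()
        f2 = fisher_b[name].flatten()
        denom = f1.norm() * f2.norm() + 1e-12
        overlap[name] = (f1 @ f2 / denom).item()
    return overlap

Per the caption, the early-layer scores should shrink as the permuted square grows from 8 × 8 to 26 × 26 pixels, since the tasks then differ more.
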
Fig. 4.
Results on the Atari task. (A) Schedule of games. Black bars indicate the sequential training periods (segments) for each game. After each training segment, performance on all games is measured. The EWC constraint is activated to protect an agent’s performance on each game only once the agent has experienced 20 million frames in that game. (B) Total human-normalized scores for each method across all games. The score is averaged across random seeds and over the choice of which 10 games are played (Fig. S3). The human-normalized score for each game is clipped to 1. The red curve denotes the network that infers the task labels using the FMN algorithm; the brown curve is the network provided with the task labels. The EWC and SGD curves start to diverge when games that have been protected by EWC start being played again. (C) Sensitivity of a single-game DQN, trained on Breakout, to noise added to its weights. The performance on Breakout is shown as a function of the magnitude (standard deviation) of the weight perturbation. The weight perturbation is drawn from a zero-mean Gaussian with covariance that is either uniform (black; i.e., targeting all weights equally), the inverse Fisher ((F + λI)^-1; blue; i.e., mimicking the weight changes allowed by EWC), or uniform within the nullspace of the Fisher (orange; i.e., targeting weights to which, according to the Fisher, the network output is entirely invariant). To evaluate the score, we ran the agent for 10 full game episodes, drawing a new random weight perturbation for every time step.
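
A minimal sketch of the perturbation protocol in C, assuming a diagonal Fisher so that the (F + λI)^-1 covariance reduces to a per-weight variance; the function name perturb_weights, the mode strings, and the null_tol threshold are illustrative:

import torch

def perturb_weights(model, fisher, mode, sigma, lam=1e-3, null_tol=1e-8):
    # Add zero-mean Gaussian noise to the weights with one of three shapes:
    #   "uniform"    - isotropic noise, targets all weights equally (black)
    #   "inv_fisher" - variance sigma^2 / (F_i + lam), i.e. larger moves where the
    #                  diagonal Fisher says the output is insensitive (blue)
    #   "nullspace"  - noise only on weights with near-zero Fisher entries (orange)
    with torch.no_grad():
        for name, p in model.named_parameters():
            noise = sigma * torch.randn_like(p)
            if mode == "inv_fisher":
                noise = noise / torch.sqrt(fisher[name] + lam)
            elif mode == "nullspace":
                noise = noise * (fisher[name] < null_tol).float()
            p.add_(noise)

In the caption's protocol, a fresh perturbation is drawn at every time step, so the unperturbed weights would be restored before each call; the agent is then evaluated over 10 full episodes per noise magnitude.
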
Fig. S3.
Score in the individual games as a function of steps played in that game. The black baseline curves show learning on individual games alone.

Comment in

  • Avoiding Catastrophic Forgetting.
    Hasselmo ME. Trends Cogn Sci. 2017 Jun;21(6):407-408. doi: 10.1016/j.tics.2017.04.001. Epub 2017 Apr 23. PMID: 28442279
  • Reply to Huszár: The elastic weight consolidation penalty is empirically valid.
    Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, Hassabis D, Clopath C, Kumaran D, Hadsell R. Proc Natl Acad Sci U S A. 2018 Mar 13;115(11):E2498. doi: 10.1073/pnas.1800157115. Epub 2018 Feb 20. PMID: 29463734. Free PMC article. No abstract available.
  • Note on the quadratic penalties in elastic weight consolidation.
    Huszár F. Proc Natl Acad Sci U S A. 2018 Mar 13;115(11):E2496-E2497. doi: 10.1073/pnas.1717042115. Epub 2018 Feb 20. PMID: 29463735. Free PMC article. No abstract available.
