The ability to learn tasks in a sequential fashion is crucial to the development of artificial intelligence. Until now neural networks have not been capable of this and it has been widely thought that catastrophic forgetting is an inevitable feature of connectionist models. We show that it is possible to overcome this limitation and train networks that can maintain expertise on tasks that they have not experienced for a long time. Our approach remembers old tasks by selectively slowing down learning on the weights important for those tasks. We demonstrate our approach is scalable and effective by solving a set of classification tasks based on a hand-written digit dataset and by learning several Atari 2600 games sequentially.
artificial intelligence; continual learning; deep learning; stability plasticity; synaptic consolidation.
Conflict of interest statement
The authors declare no conflict of interest.
EWC ensures task A is remembered while training on task B. Training trajectories are illustrated in a schematic parameter space, with parameter regions leading to good performance on task A (gray) and on task B (cream color). After learning the first task, the parameters are at θ*_A. If we take gradient steps according to task B alone (blue arrow), we will minimize the loss of task B but destroy what we have learned for task A. On the other hand, if we constrain each weight with the same coefficient (green arrow), the restriction imposed is too severe and we can remember task A only at the expense of not learning task B. EWC, conversely, finds a solution for task B without incurring a significant loss on task A (red arrow) by explicitly computing how important weights are for task A.
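The per-weight constraint described in this caption can be sketched as a quadratic penalty on the distance from θ*_A, weighted by each parameter's importance for task A (the diagonal Fisher information). This is a minimal illustrative sketch, not the authors' code; the function name, λ value, and toy Fisher values are assumptions.

```python
import numpy as np

def ewc_penalty(theta, theta_A_star, fisher, lam=1.0):
    """Quadratic EWC-style penalty: (lam/2) * sum_i F_i * (theta_i - theta*_A,i)^2.

    fisher holds one importance value per parameter (diagonal Fisher);
    large F_i means parameter i mattered for task A. Illustrative only.
    """
    return 0.5 * lam * np.sum(fisher * (theta - theta_A_star) ** 2)

# Toy check: moving a weight that is important for task A by 1.0
# costs more than moving an unimportant weight by the same amount.
theta_A_star = np.zeros(2)
fisher = np.array([10.0, 0.1])  # weight 0 matters for task A, weight 1 barely does
cost_important = ewc_penalty(np.array([1.0, 0.0]), theta_A_star, fisher)    # 5.0
cost_unimportant = ewc_penalty(np.array([0.0, 1.0]), theta_A_star, fisher)  # 0.05
```

Minimizing the task-B loss plus this penalty is what lets the red trajectory in the figure reach the task-B region without leaving the task-A region.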
Log-log plot of the SNR for recalling the first pattern after observing t random patterns, for a network of n = 1,000 synapses. If no penalty is applied (blue), the SNR decays as (n/t)^0.5 only while t is smaller than the number of synapses n and then decays exponentially. When EWC is applied (red), the decay takes a power-law form for all times. The dashed and solid lines show the analytic solutions derived in Eqs. S28 and S30. The fraction of memories retained (Bottom) is defined as the fraction of patterns whose SNR exceeds 1. EWC results in a higher fraction of memories being retained when the network is at capacity (t ≈ n). After network capacity is exceeded (Right), EWC performs worse than gradient descent (Discussion). More detailed plots can be found in the Supporting Information, Figs. S1 and S2.
Signal (Bottom row) and noise (Top row) terms for the EWC (Left column) and gradient descent (Right column) cases as a function of time t. The blue curves are the results of numeric simulations, whereas the red curves show the analytic results. Each panel contains stimuli observed at different times t_i. (Top Left) Noise in the EWC case: the solid red curve is the full form of Eq. S27, the dashed line is Eq. S25 (which is valid for small times), and the dashed-dotted line is the long-time approximation of the noise in Eq. S29. The different solid blue curves correspond to the noise from patterns observed at times 1, 50, 100, and 500. (Top Right) Noise in the gradient descent case: the red curve shows Eq. S25. (Bottom Left) Signal in the EWC case: the red curve is Eq. S13. (Bottom Right) Signal in the gradient descent case: the red curves are Eq. S11.
Comparison of EWC (red lines) with gradient descent with different learning rates (green, α = 0.5; blue, α = 1.0). Bottom shows that these two learning rates correspond to the learning rate used in EWC at the first pattern and at network capacity (t = n = 1,000). Top shows a log-log plot of the SNR for the first pattern observed. The black lines show the analytic expressions from Eqs. S13, S27, and S32. Note that EWC has a power-law decay for the SNR, whereas gradient descent eventually decays exponentially, albeit at a later time for the lower learning rate. Middle shows the fraction of memories retained (i.e., with SNR > 1) in the three cases. Note that the lower rate has a moderately higher fraction of memories retained than the larger one, but that EWC still has a higher memory retention.
Results on the permuted MNIST task. (A) Training curves for three random permutations A, B, and C, using EWC (red), L2 regularization (green), and plain SGD (blue). Note that only EWC is capable of maintaining a high performance on old tasks while retaining the ability to learn new tasks. (B) Average performance across all tasks, using EWC (red) or SGD with dropout regularization (blue). The dashed line shows the performance on a single task only. (C) Similarity between the Fisher information matrices as a function of network depth for two different amounts of permutation. Either a small square of 8 × 8 pixels in the middle of the image is permuted (gray) or a large square of 26 × 26 pixels is permuted (black). Note how the more different the tasks are, the smaller the overlap in Fisher information matrices in early layers.
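Each permuted MNIST task is constructed by applying one fixed random shuffling of pixel positions to every image, so each task is equally hard but requires different features. The sketch below shows this construction under assumed names (`make_permutation_task` is a hypothetical helper, and random data stands in for real MNIST).

```python
import numpy as np

def make_permutation_task(images, rng):
    """Create one permuted task: apply a single fixed random pixel
    permutation to every flattened image. Illustrative helper only."""
    perm = rng.permutation(images.shape[1])
    return images[:, perm], perm

rng = np.random.default_rng(0)
images = rng.random((5, 784))            # stand-in for flattened 28x28 MNIST digits
task_B, perm = make_permutation_task(images, rng)
# The permutation rearranges pixels but preserves their values,
# so each permuted image contains exactly the same multiset of pixels.
```

Repeating this with fresh permutations yields the sequence of tasks A, B, C in the figure.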
Results on the Atari task. (A) Schedule of games. Black bars indicate the sequential training periods (segments) for each game. After each training segment, performance on all games is measured. The EWC constraint is activated only to protect an agent's performance on each game once the agent has experienced 20 million frames in that game. (B) Total human-normalized scores for each method across all games. The score is averaged across random seeds and over the choice of which 10 games are played (Fig. S3). The human-normalized score for each game is clipped to 1. The red curve denotes the network that infers the task labels using the FMN algorithm; the brown curve is the network provided with the task labels. The EWC and SGD curves start diverging once games that have been protected by EWC are played again. (C) Sensitivity of a single-game DQN, trained on Breakout, to noise added to its weights. The performance on Breakout is shown as a function of the magnitude (standard deviation) of the weight perturbation. The weight perturbation is drawn from a zero-mean Gaussian with covariance that is either uniform (black; i.e., targets all weights equally), the inverse Fisher ((F + λI)^-1; blue; i.e., mimicking weight changes allowed by EWC), or uniform within the nullspace of the Fisher (orange; i.e., targets weights that the Fisher estimates the network output is entirely invariant to). To evaluate the score, we ran the agent for 10 full game episodes, drawing a new random weight perturbation for every time step.
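The inverse-Fisher perturbation in C can be sketched, under a diagonal approximation of the Fisher, as Gaussian noise whose per-weight standard deviation scales as σ/√(F_i + λ): weights the Fisher marks as important receive small perturbations. The function name, λ, and toy Fisher values below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sample_perturbation(fisher_diag, sigma, lam=1e-3, rng=None):
    """Zero-mean Gaussian noise with diagonal covariance proportional
    to (F + lam*I)^-1: large F_i -> small perturbation of weight i."""
    rng = rng or np.random.default_rng()
    std = sigma / np.sqrt(fisher_diag + lam)
    return rng.normal(0.0, std)

fisher = np.array([100.0, 0.01])        # weight 0 is important, weight 1 is not
std = 1.0 / np.sqrt(fisher + 1e-3)      # per-weight noise scale for sigma = 1.0
eps = sample_perturbation(fisher, sigma=1.0)
```

This mimics the weight changes EWC permits: the important weight is barely moved, while the unimportant one absorbs most of the noise, consistent with the blue curve degrading more slowly than the black one.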
Score in the individual games as a function of steps played in that game. The black baseline curves show learning on individual games alone.