Comparative Study

Proc Natl Acad Sci U S A. 2018 Oct 30;115(44):E10313-E10322. doi: 10.1073/pnas.1800755115. Epub 2018 Oct 15.

Comparing continual task learning in minds and machines

Timo Flesch et al.

Abstract

Humans can learn to perform multiple tasks in succession over the lifespan ("continual" learning), whereas current machine learning systems fail. Here, we investigated the cognitive mechanisms that permit successful continual learning in humans and harnessed our behavioral findings for neural network design. Humans categorized naturalistic images of trees according to one of two orthogonal task rules that were learned by trial and error. Training regimes that focused on individual rules for prolonged periods (blocked training) improved human performance on a later test involving randomly interleaved rules, compared with control regimes that trained in an interleaved fashion. Analysis of human error patterns suggested that blocked training encouraged humans to form "factorized" representations that optimally segregated the tasks, especially for those individuals with a strong prior bias to represent the stimulus space in a well-structured way. By contrast, standard supervised deep neural networks trained on the same tasks suffered catastrophic forgetting under blocked training, due to representational interference in the deeper layers. However, augmenting deep networks with an unsupervised generative model that allowed them to first learn a good embedding of the stimulus space (similar to that observed in humans) reduced catastrophic forgetting under blocked training. Building artificial agents that first learn a model of the world may be one promising route to solving continual task performance in artificial intelligence research.
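The blocked-versus-interleaved contrast described above can be illustrated with a minimal sketch. The toy model below is not the paper's CNN: it is a small numpy MLP on two-dimensional stimuli with a one-hot context cue, and the network size, learning rate, and trial counts are arbitrary choices. Trained blocked (all of task 0, then all of task 1), the shared weights are overwritten by the second task; trained interleaved, both tasks are retained.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_batch(task, n=32):
    # stimuli: two feature dimensions, kept away from the category boundary
    x = rng.uniform(0.2, 1.0, (n, 2)) * rng.choice([-1.0, 1.0], (n, 2))
    ctx = np.zeros((n, 2))
    ctx[:, task] = 1.0                          # one-hot context cue
    y = (x[:, task] > 0).astype(float)          # task 0 judges dim 0, task 1 dim 1
    return np.hstack([x, ctx]), y

def init():
    return {"W1": rng.normal(0, 0.5, (4, 32)), "b1": np.zeros(32),
            "W2": rng.normal(0, 0.5, 32), "b2": 0.0}

def forward(p, X):
    h = np.maximum(0.0, X @ p["W1"] + p["b1"])  # ReLU hidden layer
    return h, 1.0 / (1.0 + np.exp(-(h @ p["W2"] + p["b2"])))

def sgd_step(p, X, y, lr=0.1):
    h, out = forward(p, X)
    d_out = out - y                             # grad of cross-entropy wrt logit
    d_h = np.outer(d_out, p["W2"]) * (h > 0)    # backprop through ReLU
    p["W2"] -= lr * h.T @ d_out / len(y)
    p["b2"] -= lr * d_out.mean()
    p["W1"] -= lr * X.T @ d_h / len(y)
    p["b1"] -= lr * d_h.mean(axis=0)

def accuracy(p, task):
    X, y = make_batch(task, 1000)
    return float(((forward(p, X)[1] > 0.5).astype(float) == y).mean())

blocked = init()                                # blocked: task 0 fully, then task 1
for t in (0, 1):
    for _ in range(2000):
        sgd_step(blocked, *make_batch(t))

inter = init()                                  # interleaved: tasks randomly mixed
for _ in range(4000):
    sgd_step(inter, *make_batch(int(rng.integers(2))))

print(f"task-0 test accuracy  blocked: {accuracy(blocked, 0):.2f}"
      f"  interleaved: {accuracy(inter, 0):.2f}")
```

The blocked network typically ends near chance on the first task while the interleaved one stays near ceiling, mirroring the catastrophic forgetting pattern reported for the deep networks in experiment 3.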

Keywords: catastrophic forgetting; categorization; continual learning; representational similarity analysis; task factorization.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Task design, experiment 1. (A) Naturalistic tree stimuli were parametrically varied along two dimensions (leafiness and branchiness). (B) All participants engaged in a virtual gardening task with two different gardens (north and south). Via trial and error, they had to learn which type of tree grows best in each garden. (C) Each training trial consisted of a cue, stimulus, response, and feedback period. At the beginning of each trial, an image of one of the two gardens served as contextual cue. Next, the context was blurred (to direct the attention toward the task-relevant stimulus while still providing information about the contextual cue), and the stimulus (tree) appeared together with a reminder of the key mapping (“accept” vs. “reject,” corresponding to “plant” vs. “don’t plant”) in the center of the screen. Once the participant had communicated her decision via button press (left or right arrow key), the tree would either be planted inside the garden (“accept”) or disappear (“reject”). In the feedback period, the received and counterfactual rewards were displayed above the tree, with the received one being highlighted, and the tree would either grow or shrink, proportionally to the received reward. Test trials had the same structure, but no feedback was provided. Key mappings were counterbalanced across participants. (D) Unbeknownst to the participants a priori, there were clear mappings of feature dimensions onto rewards. In experiment 1a (cardinal group), each of the two feature dimensions (branchiness or leafiness) was mapped onto one task rule (north or south). The sign of the rewards was counterbalanced across participants (see Methods). (E) In experiment 1b (diagonal group), feature combinations were mapped onto rewards, yielding nonverbalizable rules. Once again, we counterbalanced the sign of the rewards across participants. (F) Experiments 1a and 1b were between-group designs. 
All four groups were trained on 400 trials (200 per task) and evaluated on 200 trials (100 per task). The groups differed in the temporal autocorrelation of the tasks during training, ranging from “blocked 200” (200 trials of one task, thus only one switch) to “interleaved” (randomly shuffled and thus unpredictable task switches). Importantly, all four groups were evaluated on interleaved test trials. The order of tasks for the blocked groups was counterbalanced across participants.
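The feature-to-reward mappings in panels D and E can be sketched as a small function. This is a hedged illustration, not the paper's implementation: feature levels are assumed to run from -2 to +2, and the actual reward magnitudes and the sign counterbalancing described in Methods are not reproduced.

```python
def reward(leafiness, branchiness, garden, rule="cardinal"):
    # Sketch of Fig. 1 D/E. Feature levels assumed in [-2, 2]; actual reward
    # values and sign counterbalancing from the paper are not reproduced.
    if rule == "cardinal":       # experiment 1a: one feature dimension per garden
        return leafiness if garden == "north" else branchiness
    # experiment 1b ("diagonal"): a nonverbalizable feature combination per garden
    combo = leafiness + branchiness if garden == "north" else leafiness - branchiness
    return combo / 2
```

Under the cardinal rule only one dimension matters in each garden, whereas under the diagonal rule both dimensions jointly determine the reward, which is what makes the diagonal boundary nonverbalizable.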
Fig. 2.
Results of experiment 1a. All error bars depict SEM. (A) Training curves of mean accuracy, averaged over 50 trials, and averaged test-phase accuracy. Performance of all groups plateaued by the end of training. At test, the B200 group performed significantly better than the B2 and Interleaved groups. (B) Mean test performance on task switch and task stay trials. Even on switch trials, the B200 group outperformed the Interleaved and B2 groups, despite having experienced only one task switch during training. (C) Sigmoid fits to the test-phase choice proportions of the task-relevant (solid lines) and task-irrelevant dimensions (dashed lines). Higher sensitivity (i.e., steeper slope) to the task-relevant dimension was observed for the B200, compared with the Interleaved group. There was stronger intrusion from the task-irrelevant dimension in Interleaved compared with B200. (D) Conceptual choice models. The factorized model (Left) predicted that participants learned two separate boundaries, one for each task, corresponding to the rewards that were assigned to each dimension in trees space. The linear model (Right) assumed that participants had learned the same linear boundary for both tasks, separating the trees space roughly into two halves that yielded equal rewards and penalties in both tasks. (E) Results of RDM model correlations on test-phase data. While the factorized model provided a better fit to the data for all groups, its benefit over the linear model was greater for the B200 than for the B2 and Interleaved groups. (F) Bayesian model selection for the unconstrained and constrained psychophysical models. The estimated model frequencies support the RSA findings, as we observed an interaction of group with model type. (G) Mean angular distances between true and subjective boundary, estimated by the 2-boundary model. A significantly stronger bias for Interleaved compared with B200 suggests that blocked training optimizes boundary estimation.
(H) Mean lapse rates, obtained from the same 2-boundary model. There were no significant differences between groups. *P < 0.05; **P < 0.01; ***P < 0.001.
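The psychophysical model behind panels C, G, and H can be sketched as a logistic choice rule. This is an assumed parameterization, not the paper's exact 2-boundary model: the probability of "accept" depends mostly on the task-relevant feature, with a slope term for intrusion from the irrelevant dimension and a lapse term for unspecific random errors.

```python
import numpy as np

def choice_prob(relevant, irrelevant, slope_rel, slope_irrel, lapse):
    # P("accept") as a logistic function of the task-relevant feature value;
    # slope_irrel models intrusion from the task-irrelevant dimension, and
    # lapse mixes in unspecific random errors (a sketch, not the paper's fit).
    logit = slope_rel * relevant + slope_irrel * irrelevant
    return lapse / 2 + (1 - lapse) / (1 + np.exp(-logit))
```

A steeper slope_rel produces the sharper sigmoids seen for B200 in panel C, a nonzero slope_irrel produces the dashed intrusion curves, and a larger lapse flattens the whole curve without shifting the boundary, which is how equal accuracy can coexist with better factorization.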
Fig. 3.
Results of experiment 1b. All error bars depict SEM. (A) Training curves and averaged test-phase performance. At the end of training, performance plateaued for all groups. At test, in contrast to experiment 1a, there was no significant difference in performance between groups. (B) No performance difference between task switch and stay trials. (C) Sigmoid fits to the test-phase choice proportions of the task-relevant (solid lines) and task-irrelevant dimensions (dashed lines). No sensitivity differences were observed along the relevant dimension. However, once again, there was stronger intrusion from the task-irrelevant dimension for Interleaved compared with B200. (D) Conceptual model RDMs. The same reasoning applies as described in Fig. 2D. (E) RDM model correlations at test. Despite equal test performance, the relative advantage of the factorized over the linear model is stronger for B200 than for B2 or Interleaved, suggesting that blocked training did result in better task separation. (F) Bayesian model comparison between unconstrained and constrained models supports the RSA findings. The unconstrained model fits best for the B200 group, whereas the constrained model fits best for the Interleaved group. (G) Mean bias of the decision boundary obtained by the unconstrained model. The bias was smallest for B200, indicating that this group estimated the boundaries with high precision. (H) Mean lapse rates. The B200 group made a higher number of unspecific random errors during the test phase, compared with the Interleaved group, which explains equal test performance despite evidence for successful task factorization. We suspect that limited experience with task switches is more detrimental when rules are nonverbalizable. Asterisks denote significance: *P < 0.05; **P < 0.01; ***P < 0.001.
Fig. 4.
Task design, experiment 2. Before and after the main experiment (identical to experiment 1), participants engaged in a dissimilarity rating arena task, in which they had to rearrange trees via mouse drag and drop inside a circular aperture to communicate subjective dissimilarity (see Methods). We obtained one RDM per subject and phase, depicting how dissimilarly the trees were perceived. Correlation of the RDMs from the "Pre" phase with a model RDM that assumed a perfect grid-like arrangement (branchiness × leafiness) yielded a grid prior (Kendall tau) for each participant.
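The grid-prior computation can be sketched in a few lines. The grid size, noise level, and arrangement below are hypothetical stand-ins for a participant's drag-and-drop layout, and scipy's `kendalltau` computes tau-b (the caption does not specify which tau variant the paper used).

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import kendalltau

rng = np.random.default_rng(1)

# hypothetical 5x5 grid of (branchiness, leafiness) levels
grid = np.array([(b, l) for b in range(5) for l in range(5)], dtype=float)
model_rdm = pdist(grid)          # condensed model RDM: perfect grid arrangement

# hypothetical subject arrangement: grid-like, plus placement noise
subject_rdm = pdist(grid + rng.normal(0, 0.3, grid.shape))

# grid prior = rank correlation between subject RDM and grid-model RDM
grid_prior, _ = kendalltau(model_rdm, subject_rdm)
print(f"grid prior (Kendall tau): {grid_prior:.2f}")
```

A participant who arranges the trees close to the true branchiness × leafiness grid gets a tau near 1; a shuffled arrangement would give a tau near 0.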
Fig. 5.
Results of experiment 2. All error bars depict SEM. (A) Experiment 2a (cardinal boundary): median split of test performance. The benefit of blocked training was significantly stronger for participants with a higher prior on the structure of the trees space. (B) Experiment 2a: median split of correlations between choice probabilities and factorized model (Fig. 1D). Under blocked training, participants with a strong prior showed significantly stronger evidence of task factorization. (C) Experiment 2b (diagonal boundary). There was no difference between low and high grid priors on mean test accuracy. (D) Experiment 2b. The correlation coefficients of the factorized model did not differ between groups. An ANCOVA (see Results) revealed a main effect of the prior on task factorization, but no interaction with group. *P < 0.05.
Fig. 6.
Results of experiment 3. All error bars depict SEM across independent runs. (A) Experiment 3a (cardinal boundary): mean performance of the CNN on independent test data, calculated after the first and second half of training, separately for the first and second task and blocked vs. interleaved training. Interleaved training quickly resulted in ceiling performance. In contrast, the network trained with a blocked regime performed at ceiling for the first task, but dropped back to chance level after it had been trained on the second task, on which it also achieved ceiling performance. (B) Experiment 3b (diagonal boundary): mean test performance. Similar patterns as for the cardinal boundary were found: Blocked training resulted in catastrophic interference, whereas interleaved training allowed the network to learn both tasks equally well. Interestingly, the CNNs performed slightly worse on the diagonal boundary, as did our human participants. (C) Experiment 3a, blocked training. Layer-wise correlations between RDMs obtained from activity patterns and model RDMs. The correlation with the pixel dissimilarity model decreases with depth, whereas the correlation with the catastrophic interference model increases. Neither the factorized nor the linear model explains the data well, indicating that blocked training did not result in task factorization or convergence toward a single linear boundary. (D) Experiment 3b, blocked training. Again, correlations with the pixel model decrease and correlations with the interference model increase with network depth.
Fig. 7.
Results of experiment 4. (A) Experiment 4a. (Top) Example of tree images used for training the autoencoder. (Bottom) A 2D latent-space traversal of the trained autoencoder (see Methods). For each x,y coordinate pair, we plot a tree image sampled from the generative model, revealing that the autoencoder learned a disentangled low-dimensional representation of branchiness and leafiness. (B) Experiment 4b (cardinal), blocked training: comparison of performance on the first task after training on the second task, between the model from experiment 3 ("vanilla" CNN, without priors) and the model from experiment 4 ("pretrained" CNN, with priors from the VAE encoder). Unsupervised pretraining partially mitigated catastrophic interference. (C) Experiment 4b (cardinal), blocked training: comparison of layer-wise RDM correlations with the factorized model, between networks without and with unsupervised pretraining. Pretraining yielded stronger correlations with the factorized model in each layer. (D) Experiment 4b (cardinal), blocked training: comparison of layer-wise RDM correlations with the interference model. Likewise, pretraining significantly reduced correlations with the catastrophic interference model in each layer. (E) Experiment 4b (diagonal), blocked training: mean accuracy on the first task after training on the second task, for vanilla and pretrained CNN. Again, pretraining mitigated catastrophic interference. (F) Experiment 4b (diagonal), blocked training. RDM correlations with the factorized model only increased in the output layer. (G) Experiment 4b (diagonal), blocked training. RDM correlations with the interference model increased significantly in each layer. All error bars indicate SEM across independent runs. *P < 0.05; **P < 0.01.
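The unsupervised-pretraining idea can be sketched with a toy example. The paper uses a VAE on tree images; the sketch below substitutes a plain untied linear autoencoder on stimuli generated from two latent factors through an assumed random linear "rendering," so it recovers the 2D latent subspace but, unlike the paper's generative model, not necessarily disentangled leafiness/branchiness axes.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stimuli: two latent factors (leafiness, branchiness) rendered through
# a fixed random linear map into 16 "pixels" (an assumption for illustration)
mix = rng.normal(0, 1, (2, 16))
X = rng.uniform(-1, 1, (500, 2)) @ mix

# untied linear autoencoder (16 -> 2 -> 16), trained on reconstruction error
enc = rng.normal(0, 0.1, (16, 2))
dec = rng.normal(0, 0.1, (2, 16))

def recon_err():
    return float(((X @ enc @ dec - X) ** 2).mean())

err_before = recon_err()
lr = 0.02
for _ in range(5000):
    H = X @ enc                      # 2D latent code
    err = H @ dec - X                # reconstruction residual
    g_dec = H.T @ err / len(X)       # gradient descent on squared error
    g_enc = X.T @ (err @ dec.T) / len(X)
    dec -= lr * g_dec
    enc -= lr * g_enc
err_after = recon_err()
print(f"reconstruction error: {err_before:.3f} -> {err_after:.3f}")
```

After training, `enc` maps stimuli into a low-dimensional embedding aligned with the generative factors; in the spirit of experiment 4, such an encoder could initialize the early layers of the supervised network before blocked task training.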
