Optimism and pessimism in optimised replay

Georgy Antonov et al. PLoS Comput Biol. 2022 Jan 12;18(1):e1009634.
doi: 10.1371/journal.pcbi.1009634. eCollection 2022 Jan.

Abstract

The replay of task-relevant trajectories is known to contribute to memory consolidation and improved task performance. A wide variety of experimental data show that the content of replayed sequences is highly specific and can be modulated by reward as well as other prominent task variables. However, the rules governing which sequences are chosen for replay remain poorly understood. One recent theoretical suggestion is that the prioritisation of replay experiences in decision-making problems is based on their effect on the choice of action. We show that this implies that subjects should replay the sub-optimal actions that they dysfunctionally choose, rather than the optimal ones, when forgetting leaves them with large amounts of uncertainty in their internal models of the world. We use this to account for recent experimental data demonstrating exactly such pessimal replay, fitting model parameters to the individual subjects' choices.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Task structure and replay modelling.
(A) Structure of the state-space. Numbers in black and grey circles denote the number of reward points associated with that state before and after the reward-association change between blocks 2 and 3, respectively. Grey arrows show the spatial re-arrangement that took place between blocks 4 and 5. Note that the stimulus images shown here differ from those which the subjects actually saw. (B) Change in the probability of choosing a different move when in the same state as a function of the sequenceness of the just-experienced transitions, measured from the MEG data in subjects with non-negligible sequenceness (n = 25). High sequenceness was defined as above the median and low sequenceness as below the median. Analysis of the correlation between decoded sequenceness and the probability of policy change indicated a significant dependency (Spearman correlation, M = 0.04, SEM = 0.02, p = 0.04, Bootstrap test). Vertical lines show the standard error of the mean (SEM). (C) Performance of the human subjects and of the agent with parameters fit to the individual subjects. Unfilled hexagons show epochs which contained trials without feedback. Shaded area shows SEM. (D) Pessimism bias in the replay choices of human subjects for whom our model predicted sufficient replay (n = 20), as reflected in the average number of replays of recent sub-optimal and optimal transitions at the end of each trial (sub-optimal vs optimal, Wilcoxon rank-sum test, W = 2.49, p = 0.013). ** p < 0.01.
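The caption does not spell out the statistical pipeline behind panel (B); the sketch below shows one plausible reading of it (per-subject Spearman correlations between sequenceness and policy change, followed by a bootstrap over subjects). It is written in Python with illustrative names and is not the authors' code.

```python
import numpy as np
from scipy.stats import spearmanr

def sequenceness_policy_change_test(sequenceness, policy_change,
                                    n_boot=10_000, seed=0):
    """For each subject, rank-correlate trial-wise sequenceness of the
    just-experienced transitions with whether a different move was chosen
    the next time the same state was visited (0/1), then bootstrap the
    mean correlation across subjects.

    sequenceness  : list of per-subject 1-D arrays (one value per trial)
    policy_change : list of matching 0/1 arrays
    """
    rng = np.random.default_rng(seed)
    rhos = np.array([spearmanr(s, c).correlation
                     for s, c in zip(sequenceness, policy_change)])
    boot = np.array([rng.choice(rhos, size=rhos.size, replace=True).mean()
                     for _ in range(n_boot)])
    sem = rhos.std(ddof=1) / np.sqrt(rhos.size)
    p = np.mean(boot <= 0)   # one-sided bootstrap p-value for mean rho > 0
    return rhos.mean(), sem, p
```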
Fig 2
Fig 2. Algorithm description and the effects of replay and forgetting on model performance.
(A) Schematic illustration of the algorithm in the behavioural task. Upon completing each trial, the algorithm uses its knowledge of the transition structure of the environment to replay the possible outcomes. Note that in 1-move trials the algorithm replays only single moves, while in 2-move trials it considers both single and coupled moves (thus optimizing this choice). (B) Effect of MF forgetting and replay on MF Q-values. After acting and learning on-line towards true reward R (white blocks; controlled by learning rate, η), the algorithm learns off-line by means of replay (green blocks). Immediately after each replay bout, the algorithm forgets its MF Q-values towards the average reward experienced from the beginning of the task (red blocks; controlled by MF forgetting rate, ϕMF). Note that after trials 5 and 6, the agent chooses to replay the objectively optimal action, whereas after trials 8 and 9 it replays the objectively sub-optimal action. (C) Left: without MB forgetting, the algorithm’s estimate of reward obtained for a given move corresponds to the true reward function. Right: with MB forgetting (controlled by MB forgetting rate, ϕMB), the algorithm’s estimate of reward becomes an expectation of the reward function under its state-transition model. The state-transition model’s probabilities for the transitions are shown as translucent lines. (D) Steady-state performance (proportion of available reward obtained) of the algorithm in the behavioural task as a function of MF forgetting, ϕMF, and MB forgetting, ϕMB. Note how the agent still achieves high performance with substantial MF forgetting (high ϕMF) when its state-transition model accurately represents the transition probabilities (low ϕMB).
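A minimal sketch of the two forgetting processes just described, assuming (consistent with the action-entropy description in Fig 5A) that MB forgetting pushes the transition model towards a uniform distribution over successor states. Array shapes and function names are illustrative rather than the authors' implementation.

```python
import numpy as np

def forget_mf(q_mf, avg_reward, phi_mf):
    """MF forgetting (red blocks in panel B): decay MF Q-values towards
    the average reward experienced since the beginning of the task.
    phi_mf = 0 leaves the values intact; phi_mf = 1 erases them."""
    return (1.0 - phi_mf) * q_mf + phi_mf * avg_reward

def forget_mb(transitions, phi_mb):
    """MB forgetting (panel C, right): decay the state-transition model
    towards a uniform distribution over successor states."""
    uniform = np.ones_like(transitions) / transitions.shape[-1]
    return (1.0 - phi_mb) * transitions + phi_mb * uniform

def q_mb_values(transitions, reward):
    """MB Q-values (panel C): expectation of the reward function under the
    (possibly forgotten) state-transition model.
    transitions: (n_states, n_actions, n_states); reward: (n_states,)."""
    return transitions @ reward   # -> (n_states, n_actions)
```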
Fig 3
Fig 3. Incremental model comparison.
(A) Average performance (cumulative proportion of available reward obtained) of agents with varying degree of model complexity. (B) Evolution of MF Q-values during learning. Dashed grey lines indicate true reward R for each action. Blue and orange lines indicate MF Q-values for the objectively sub-optimal and optimal actions respectively. (C) Number of replays in each trial. (D) Maximal gain for objectively sub-optimal and optimal actions as estimated by the agents in each trial. Shaded areas show 95% confidence intervals.
Fig 4
Fig 4. How MF forgetting influences gain estimation.
(A) Estimated gain as a function of the difference between the agent’s current MF Q-value and the model-estimated MB Q-value, Q̂MB − QMF, for varying degrees of MF forgetting, ϕMF. The dashed grey lines show the x- and y-intercepts. Note that the estimated gain is negative whenever the model-generated Q̂MB estimates are worse than the current MF Q-values. (B) Current MF Q-values for the optimal and sub-optimal actions with varying MF forgetting rate, coloured in the same way as above. The horizontal solid black bar is the average reward experienced so far, towards which MF values tend. The true Q-value for each action is shown in dashed black.
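The caption does not give the gain term explicitly; the sketch below follows the expected-value-of-backup formulation of Mattar and Daw (2018), on which this model builds, and should be read as an assumption about the general shape of the computation rather than the authors' exact definition. Names and the softmax policy are illustrative.

```python
import numpy as np

def softmax(q, beta):
    """Softmax policy over Q-values with inverse temperature beta."""
    e = np.exp(beta * (q - np.max(q)))
    return e / e.sum()

def estimated_gain(q_mf, a, q_mb_a, beta):
    """Gain of replaying action `a` in a state: the change in expected
    value at that state if the policy is re-derived after Q(s, a) is moved
    to its model-based estimate `q_mb_a`.

    q_mf   : 1-D array of current MF Q-values for the state's actions
    q_mb_a : model-based estimate that replaying `a` would write in
    beta   : inverse temperature of the softmax policy
    """
    q_new = np.array(q_mf, dtype=float)
    q_new[a] = q_mb_a
    pi_old = softmax(np.asarray(q_mf, dtype=float), beta)
    pi_new = softmax(q_new, beta)
    # Both policies are evaluated under the updated values; the gain comes
    # out negative when the devalued action would still be chosen (panel A).
    return float(pi_new @ q_new - pi_old @ q_new)
```

In this reading, heavier MF forgetting drags QMF towards the average experienced reward (panel B), widening Q̂MB − QMF for the optimal action and hence raising its estimated gain.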
Fig 5
Fig 5. Action entropy limits estimated gain.
(A) Left: under high action entropy, the distribution over the potential states to which the agent can transition given the current state and a chosen action is close to uniform. Right: under low action entropy, the agent is more certain about the state to which a chosen action will transition it. (B) Left: for an objectively sub-optimal action, the gain is positive throughout most action entropy values. Right: for an objectively optimal action, the gain becomes positive only when the state-transition model is sufficiently accurate. With heavier MF forgetting (higher ϕMF), however, the intercept shifts such that the agent is able to benefit from a less accurate model (grey dashed lines show the x- and y-intercepts). The inset magnifies the estimated gain for the optimal action. Moreover, note how the magnitude of the estimated gain for an objectively optimal action is lower than that of a sub-optimal one, which is additionally influenced by the asymmetry of MF forgetting and on-line learning.
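For concreteness, a sketch (same illustrative notation as above) of the action-entropy quantity and of why high entropy caps the model-based estimate, and with it the gain of replaying an optimal action:

```python
import numpy as np

def action_entropy(transitions, s, a):
    """Entropy (in nats) of the predicted successor-state distribution
    P(s' | s, a). A near-uniform prediction (panel A, left) approaches the
    maximum, log(n_states); an accurate deterministic model gives zero."""
    p = transitions[s, a]
    p = p[p > 0]                      # drop zeros to avoid log(0)
    return float(-np.sum(p * np.log(p)))

def q_mb_estimate(transitions, reward, s, a):
    """Model-based estimate for (s, a): expected reward under P(s' | s, a).
    High action entropy pulls this towards the mean reward over successor
    states, limiting how valuable an objectively optimal action can appear
    to be, and hence its estimated gain (panel B, right)."""
    return float(transitions[s, a] @ reward)
```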
Fig 6
Fig 6. Epistemology of replay.
(A) An example move predicted by our agent with subject-specific parameters. (B) State-transition model of the agent after executing the move in (A) and the associated action entropy values. Objectively optimal actions are shown as arrows with orange outlines; sub-optimal—with blue outlines. (C) State of MF and MB knowledge of the agent. The arrows above the leftmost bar plot indicate the directions of the corresponding actions in each plot. The horizontal black lines represent the true reward obtainable for each action. The agent’s knowledge at the state where the trial began is highlighted in a purple dashed box and is additionally magnified above. The blue bar for the MF Q-value that corresponds to the predicted move in (A) shows what the agent knew before executing the move, and the neighbouring green bar—what the agent has learnt on-line after executing the move (note that the agent always learnt on-line towards the true reward). (D) Replay choices of the agent. (E) Changes in the objective value function (relative to the true obtainable reward) of each state as a result of the replay in (D), not drawn to scale. (F) Same as in (D) but across the entire experiment and averaged over all states. (G-H) Average replay statistics over the entire experiment. (G) Just-experienced transitions; (H) Other transitions. First column: proportion of sub-optimal and optimal trials in which objectively sub-optimal or optimal action(s) were replayed. Second column: proportion of action entropy values at which the replays were executed. Upper and lower y-axes show the action entropy distribution for 1-move and 2-move trials respectively. ** p < 0.01, *** p < 0.001, ns: not significant.
Fig 7
Fig 7. Overall on-task replay statistics across MI subjects.
(A) Left: average number of replays of just-experienced optimal and sub-optimal actions; right: proportion of action entropy values at which just-experienced optimal and sub-optimal actions were replayed. (B) Same as above but for other transitions. (C) Average change in objective value function due to replay. (D) Average change in the probability of selecting an (objectively) optimal action due to replay. * p < 0.05, ** p < 0.01, *** p < 0.001.
