PLoS Comput Biol. 2019 Sep 6;15(9):e1007334.
doi: 10.1371/journal.pcbi.1007334. eCollection 2019 Sep.

Learning the structure of the world: The adaptive nature of state-space and action representations in multi-stage decision-making



Amir Dezfouli et al. PLoS Comput Biol.

Abstract

State-space and action representations form the building blocks of decision-making processes in the brain; states map external cues to the current situation of the agent whereas actions provide the set of motor commands from which the agent can choose to achieve specific goals. Although these factors differ across environments, it is currently unknown whether or how accurately state and action representations are acquired by the agent because previous experiments have typically provided this information a priori through instruction or pre-training. Here we studied how state and action representations adapt to reflect the structure of the world when such a priori knowledge is not available. We used a sequential decision-making task in rats in which they were required to pass through multiple states before reaching the goal, and for which the number of states and how they map onto external cues were unknown a priori. We found that, early in training, animals selected actions as if the task was not sequential and outcomes were the immediate consequence of the most proximal action. During the course of training, however, rats recovered the true structure of the environment and made decisions based on the expanded state-space, reflecting the multiple stages of the task. Similarly, we found that the set of actions expanded with training, although the emergence of new action sequences was sensitive to the experimental parameters and specifics of the training procedure. We conclude that the profile of choices shows a gradual shift from simple representations to more complex structures compatible with the structure of the world.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Different phases of the experiment.
The experiment started with two magazine training sessions (phase 1), followed by several lever training sessions (phase 2), in which animals learned that pressing each lever (left and right levers corresponding to ‘L’ and ‘R’ in the figure) would deliver a reward (represented by ‘O’ in the figure). The next phase was discrimination training (phase 3), in which animals learned that when stimulus S1 was presented, action ‘L’ should be taken to earn a reward, and when S2 was presented, action ‘R’ should be taken to earn a reward. S1 and S2 were a constant and a blinking house light, respectively. The final phase of the experiment was two-stage training, in which animals were trained on a two-stage decision-making task. This phase comprised multiple training sessions and, in the middle or at the end of these sessions, several ‘probe sessions’ were inserted.
Fig 2
(a) The flow of events in the two-stage task. Trials started in state S0, which was signalled by the absence of the house light. After an action (‘L’ or ‘R’) was taken at stage 1, either the constant or the blinking house light came on (S1 or S2). Next, subjects could take another action at stage 2 (‘L’ or ‘R’), which led either to the delivery of the outcome or to no outcome. Actions taken in S0 immediately led to the presentation of either S1 or S2, and actions taken in S1 or S2 immediately led to the outcome or no outcome. The inter-trial interval (ITI) was zero in this experiment, but in the experiments reported in the S5, S6 and S7 Figs it was greater than zero, as detailed in S2 Text. (b) The structure of the task. Stage 1 actions in S0 led to the stage 2 stimuli (S1/S2) in a deterministic manner. The rewarding stage 2 state changed with a probability of 0.14 after an outcome was earned (indicated by ‘reversal’ in the graph). ‘O’ represents the outcome, and ‘X’ represents no outcome. (c) The structure of the probe sessions. The probe sessions were similar to the training sessions (panel (b)), except that stage 1 actions led to the stage 2 states in a probabilistic manner. Taking action ‘L’ led to state S2 commonly (80% of the time) and to state S1 rarely (dashed lines). Taking action ‘R’ led to state S1 commonly (80% of the time) and to state S2 rarely (dashed lines).
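To make the task structure in panels (b) and (c) concrete, the following Python sketch simulates a single trial under the stated parameters: deterministic stage 1 transitions in training, 80/20 transitions in probe sessions, and a 0.14 reversal probability for the rewarding stage 2 state. The function name, the default ‘L’→S2 / ‘R’→S1 mapping, and the assumption that the rewarded stage 2 action follows the discrimination-training mapping (‘L’ in S1, ‘R’ in S2) are illustrative choices, not code from the paper.

```python
import random

# Illustrative simulation of one trial of the two-stage task (Fig 2).
# Assumptions (ours): default stage 1 mapping 'L' -> S2 and 'R' -> S1, and the
# rewarded stage 2 action follows discrimination training ('L' in S1, 'R' in S2).

REVERSAL_PROB = 0.14   # probability the rewarding stage 2 state reverses after an outcome
COMMON_PROB = 0.80     # probability of the common transition in probe sessions

def run_trial(a1, a2, rewarding_state, probe=False):
    """Simulate one trial given stage 1 action a1 and stage 2 action a2."""
    common = 'S2' if a1 == 'L' else 'S1'
    rare = 'S1' if common == 'S2' else 'S2'
    # Deterministic transition in training sessions; 80/20 in probe sessions.
    state = rare if (probe and random.random() > COMMON_PROB) else common

    # The outcome is delivered only in the currently rewarding stage 2 state,
    # and only for the correct action in that state.
    correct_action = 'L' if state == 'S1' else 'R'
    rewarded = (state == rewarding_state) and (a2 == correct_action)

    # After an earned outcome, the rewarding state reverses with probability 0.14.
    if rewarded and random.random() < REVERSAL_PROB:
        rewarding_state = 'S1' if rewarding_state == 'S2' else 'S2'
    return state, rewarded, rewarding_state
```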
Fig 3
(a) Logarithm of the ratio of the odds of staying on the same stage 1 action after a rewarded trial to the odds after an unrewarded trial. The zero point on the y-axis represents the indifference point (equal probability of staying on the same stage 1 action after reward or no reward). Each bar represents the odds ratio for a single training session. In the sessions marked with ‘#’ in panel (a), the contingency between stage 1 actions and stage 2 states was reversed (‘L’ leads to S1 and ‘R’ to S2). ‘Strict sequence’ refers to sessions in which a trial was aborted if the animal entered the magazine between the stage 1 and stage 2 actions. Sessions marked with ‘*’ are probe sessions, in which the task involved both rare and common transitions. (b) Reaction times (RT) averaged over subjects. RT refers to the delay between performing the stage 1 and stage 2 actions. Each dot represents a training session. (c) An example of how the performance of action sequences can be detected in the probe session. On a certain trial a rat has earned a reward by taking ‘L’ at stage 1 and ‘R’ at stage 2. The subject then repeats the whole action sequence (‘L’ and then ‘R’), even though after executing ‘L’ it ends up in S1 (due to a rare transition), and action ‘R’ is never rewarded in that state. (d) The probability of staying on the same stage 2 action in the probe session, averaged over subjects, as a function of whether the previous trial was rewarded (reward/no reward) and whether subjects stayed on the same stage 1 action (stay/switch). As shown in panel (c), only trials in which the stage 2 state differs from that of the previous trial are included. (e) The probability of staying on the same stage 1 action in the probe session, averaged over subjects, as a function of whether the previous trial was rewarded (reward/no reward) and whether the transition in the previous trial was common or rare. (f) Model simulations depicting the probability of staying on the same stage 1 action when the model uses action sequences exclusively. (g) Model simulations depicting the probability of staying on the same stage 1 action when the model uses the true state-space of the task but not action sequences. (h) Simulation of stage 2 choices, and (i) stage 1 choices, using the best-fitting parameters for each subject. Error bars represent ±1 SEM.
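As a minimal sketch of the statistic plotted in panel (a), the function below computes the log odds ratio of repeating the stage 1 action after a rewarded versus an unrewarded previous trial from a list of trial records. The record format and the 0.5 smoothing term are illustrative assumptions, not the authors' analysis code.

```python
import math

def stay_log_odds_ratio(trials):
    """Log odds of staying on the stage 1 action after reward, minus after no reward.

    `trials` is a list of dicts with boolean fields 'rewarded' (previous trial
    rewarded) and 'stayed' (same stage 1 action as on the previous trial).
    Zero corresponds to the indifference point on the y-axis of Fig 3a.
    """
    def log_odds(subset):
        stays = sum(t['stayed'] for t in subset)
        switches = len(subset) - stays
        return math.log((stays + 0.5) / (switches + 0.5))  # 0.5 avoids log(0)

    rewarded = [t for t in trials if t['rewarded']]
    unrewarded = [t for t in trials if not t['rewarded']]
    return log_odds(rewarded) - log_odds(unrewarded)
```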
Fig 4
Fig 4. Negative log model-evidence (−log p(D|M); lower values indicate better models) for the best eight models in each family of computational models.
Different models are shown on the y-axis using different colours for better visualisation.
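To illustrate how the criterion in Fig 4 orders models (lower −log p(D|M) is better), a trivial ranking example with hypothetical model names and values, which do not correspond to the models reported in the figure:

```python
# Hypothetical negative log model-evidence values; names and numbers are
# illustrative only, not results from the paper.
neg_log_evidence = {
    "model A": 4210.0,
    "model B": 4105.0,
    "model C": 4550.0,
}

# Rank models from best (smallest -log p(D|M)) to worst.
for name, value in sorted(neg_log_evidence.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.1f}")
```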




Grants and funding

AD and this research were supported by grants DP150104878 and FL0992409 from the Australian Research Council to BWB. BWB was supported by a Senior Principal Research Fellowship from the National Health & Medical Research Council of Australia (GNT1079561). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.