Combined model-free and model-sensitive reinforcement learning in non-human primates

Bruno Miranda et al.

PLoS Comput Biol. 2020 Jun 22;16(6):e1007944. doi: 10.1371/journal.pcbi.1007944. eCollection 2020 Jun.

Abstract

Contemporary reinforcement learning (RL) theory suggests that potential choices can be evaluated by strategies that may or may not be sensitive to the computational structure of tasks. A paradigmatic model-free (MF) strategy simply repeats actions that have been rewarded in the past; by contrast, model-sensitive (MS) strategies exploit richer information associated with knowledge of task dynamics. MF and MS strategies should typically be combined, because they have complementary statistical and computational strengths; however, this tradeoff between MF and MS RL has mostly been demonstrated only in humans, often with only modest numbers of trials. We trained rhesus monkeys to perform a two-stage decision task designed to elicit and discriminate the use of MF and MS methods. A descriptive analysis of choice behaviour revealed directly that the structure of the task (of MS importance) and the reward history (of MF and MS importance) significantly influenced both choice and response vigour. A detailed, trial-by-trial computational analysis confirmed that choices were made according to a combination of strategies, with a dominant influence of a particular form of model sensitivity that persisted over weeks of testing. The residuals from this model necessitated the development of a new combined RL model, which incorporates a particular credit assignment weighting procedure. Finally, response vigour exhibited a subtly different collection of MF and MS influences. These results shed new light on RL behavioural processes in non-human primates.
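To make the MF/MS distinction concrete, here is a minimal simulation sketch of a weighted hybrid agent in the style of two-step-task models: the MS system evaluates first-stage actions through the known transition structure, while the MF system updates the chosen action directly from reward. All constants and names here are illustrative assumptions; this is not the paper's fitted Hybrid+ model, which additionally incorporates a credit assignment weighting procedure.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative task and agent constants (assumptions, not fitted values)
    T = np.array([[0.8, 0.2],        # P(second-stage state | first-stage action)
                  [0.2, 0.8]])
    ALPHA, BETA, W = 0.3, 3.0, 0.6   # learning rate, inverse temperature, MS weight

    def softmax(q, beta):
        p = np.exp(beta * (q - q.max()))
        return p / p.sum()

    def simulate(reward_fn, n_trials=200):
        """Simulate first-stage choices under a weighted MF/MS mixture."""
        q_mf = np.zeros(2)   # model-free first-stage action values
        q2 = np.zeros(2)     # learned second-stage state values
        for _ in range(n_trials):
            q_mb = T @ q2                          # MS values use the known transitions
            q = W * q_mb + (1 - W) * q_mf          # hybrid mixture; W = 0 is purely MF
            a = rng.choice(2, p=softmax(q, BETA))
            s2 = rng.choice(2, p=T[a])             # sampled state transition
            r = reward_fn(s2)
            q2[s2] += ALPHA * (r - q2[s2])         # second-stage prediction error
            q_mf[a] += ALPHA * (r - q_mf[a])       # MF update ignores the transition
            yield a, s2, r

For instance, list(simulate(lambda s2: float(rng.random() < (0.7, 0.3)[s2]))) would run the agent against a static Bernoulli outcome schedule; the weight W interpolates between purely MF (W = 0) and purely model-sensitive (W = 1) behaviour.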


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Two-stage decision task.
(A) Timeline of events. Eye fixation was required while a red fixation cue was shown; otherwise subjects could saccade freely and indicate their decision (arrow shown as an example) by moving a manual joystick in the direction of the chosen stimulus. Once the second-stage choice had been made, the nature of the outcome was revealed by a secondary reinforcer cue (here, the pause symbol represents high reward). Once this cue was off the screen, there was a fixed 500 ms delay, followed by a possible further delay (for medium and low rewards) and then juice delivery (for high and medium rewards). (B) The state-transition structure (kept fixed throughout the experiment). Each second-stage stimulus had an independent reward structure: the outcome level (defined by the magnitude of the reward and the delay to its delivery) remained the same for a minimum number of trials (a uniformly distributed pseudorandom integer between 5 and 9) and then either stayed at the same level (with one-third probability) or changed randomly to one of the other two possible outcome levels.
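As a concrete reading of that schedule, the following sketch generates the outcome-level sequence for one second-stage stimulus (the function name and seeding are my own; the hold lengths and switching probabilities follow the caption):

    import numpy as np

    rng = np.random.default_rng(1)

    def outcome_level_schedule(n_trials, n_levels=3):
        """One stimulus's outcome-level sequence: each level is held for a
        uniform-random 5-9 trials, then kept with probability 1/3 or
        switched to one of the other two levels."""
        levels = []
        level = int(rng.integers(n_levels))
        while len(levels) < n_trials:
            levels.extend([level] * int(rng.integers(5, 10)))  # hold 5-9 trials
            if rng.random() >= 1.0 / 3.0:                      # 2/3 of the time: switch
                level = int(rng.choice([l for l in range(n_levels) if l != level]))
        return np.array(levels[:n_trials])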
Fig 2
Fig 2. The impact of both reward and transition information on first-stage choice behaviour.
(A) Likelihood of first-stage choice repetition, averaged across sessions, as a function of reward and transition on the previous trial. Error bars depict SEM. (B-C) Logistic regression results on first-stage choice, with the contributions of the reward main effect (B) and the reward × transition interaction (C) from the five previous trials. Dots represent fixed-effects coefficients for each session (red when p < 0.05, grey otherwise). (D-F) Similar results obtained from simulations (100 runs per session, respecting the exact reward structure the subjects experienced) using the best-fit Hybrid+ model. Bar and error bar values correspond, respectively, to mixed-effects coefficients and their SE. Dashed lines illustrate the exponential best fit on the mean fixed-effects coefficients of each trial into the past. ** α = 0.01 and * α = 0.05 in two-tailed one-sample t-test with null-hypothesis mean equal to zero for the fixed-effects estimates.
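A sketch of how the lagged regression behind panels B-C might be set up, using statsmodels; the ±1 coding of the regressors and the single fixed-effects fit per session are simplifying assumptions (the paper additionally reports mixed-effects estimates):

    import numpy as np
    import statsmodels.api as sm

    def stay_regression(stay, reward, common, n_back=5):
        """Logistic regression of first-stage 'stay' choices on reward and
        reward x transition from the previous n_back trials.

        stay   : 1 if the first-stage choice repeated the previous choice, else 0
        reward : signed reward regressor per trial (e.g. +1 rewarded, -1 not)
        common : +1 for a common transition, -1 for a rare one
        (all inputs are NumPy arrays of equal length, one entry per trial)
        """
        n = len(stay)
        cols = []
        for lag in range(1, n_back + 1):
            r = reward[n_back - lag : n - lag]   # reward at trial t - lag
            c = common[n_back - lag : n - lag]   # transition at trial t - lag
            cols += [r, r * c]                   # main effect and interaction
        X = sm.add_constant(np.column_stack(cols))
        return sm.Logit(stay[n_back:], X).fit(disp=0)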
Fig 3
Fig 3. The impact of both reward and transition information on first-stage choice reaction time.
(A) The z-scored first-stage reaction time (RT) difference between previous-common and previous-rare trials, averaged across sessions, as a function of reward on the previous trial (high values indicate faster responses when the previous transition was rare). Error bars depict SEM. (B-C) Multiple linear regression results on first-stage reaction time, with the contributions of the reward main effect (B) and the reward × transition interaction term (C) from the five previous trials. Dots represent the fixed-effects coefficients for each session (red when p < 0.05, grey otherwise). Bar and error bar values correspond, respectively, to the mixed-effects coefficients and their SE. Dashed lines illustrate the exponential best fit on the mean fixed-effects coefficients of each trial into the past. ** α = 0.01 and * α = 0.05 in two-tailed one-sample t-test with null-hypothesis mean equal to zero.
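The RT analysis has the same lagged design with a continuous outcome; under the same assumptions as the sketch above, it amounts to swapping the logistic link for ordinary least squares:

    import numpy as np
    import statsmodels.api as sm

    def rt_regression(rt_z, reward, common, n_back=5):
        """OLS analogue of stay_regression for z-scored first-stage RTs."""
        n = len(rt_z)
        cols = []
        for lag in range(1, n_back + 1):
            r = reward[n_back - lag : n - lag]
            c = common[n_back - lag : n - lag]
            cols += [r, r * c]
        X = sm.add_constant(np.column_stack(cols))
        return sm.OLS(rt_z[n_back:], X).fit()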
