J Neurosci. 2009 Aug 5;29(31):9861-74.
doi: 10.1523/JNEUROSCI.6157-08.2009.

Validation of decision-making models and analysis of decision variables in the rat basal ganglia

Makoto Ito et al. J Neurosci. 2009.

Abstract

Reinforcement learning theory plays a key role in understanding the behavioral and neural mechanisms of choice behavior in animals and humans. In particular, intermediate variables of learning models estimated from behavioral data, such as the expectation of reward for each candidate choice (action value), have been used in searches for the neural correlates of computational elements in learning and decision making. The aims of the present study are as follows: (1) to test which computational model best captures the choice learning process in animals and (2) to elucidate how action values are represented in different parts of the corticobasal ganglia circuit. We compared different behavioral learning algorithms to predict the choice sequences generated by rats during a free-choice task and analyzed associated neural activity in the nucleus accumbens (NAc) and ventral pallidum (VP). The major findings of this study were as follows: (1) modified versions of an action-value learning model captured a variety of choice strategies of rats, including win-stay-lose-switch and persevering behavior, and predicted rats' choice sequences better than the best multistep Markov model; and (2) information about action values and future actions was coded in both the NAc and VP, but was less dominant than information about trial types, selected actions, and reward outcome. The results of our model-based analysis suggest that the primary role of the NAc and VP is to monitor information important for updating choice behaviors. Information represented in the NAc and VP might contribute to a choice mechanism that is situated elsewhere.
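To make the "action value" idea concrete, the sketch below shows a standard Q-learning update for a two-alternative choice task. The softmax choice rule, the parameter values (alpha, beta), and the initial values are illustrative assumptions for this sketch; the models actually compared in the paper are specified in its Materials and Methods.

```python
import numpy as np

def softmax_choice(q_left, q_right, beta, rng):
    """Return the chosen side ('L' or 'R') under a softmax over action values."""
    p_left = 1.0 / (1.0 + np.exp(-beta * (q_left - q_right)))
    return 'L' if rng.random() < p_left else 'R'

def q_update(q, reward, alpha):
    """Move the chosen action's value toward the obtained reward (0 or 1)."""
    return q + alpha * (reward - q)

# Toy run: left choice rewarded with p = 0.9, right with p = 0.5 (one block type from Figure 1)
rng = np.random.default_rng(0)
alpha, beta = 0.3, 3.0            # assumed learning rate and inverse temperature
q = {'L': 0.5, 'R': 0.5}          # assumed initial action values
for t in range(200):
    action = softmax_choice(q['L'], q['R'], beta, rng)
    reward = int(rng.random() < (0.9 if action == 'L' else 0.5))
    q[action] = q_update(q[action], reward, alpha)
print(q)                          # the value of 'L' should drift toward 0.9
```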


Figures

Figure 1.
A, Schematic illustration of the experimental chamber. The chamber was equipped with three holes for nose poking (L, left hole; C, center hole; R, right hole) and a pellet dish (D) on the opposite wall. B, Schematic representation of conditional free-choice task. After a rat maintained a nose poke in the center hole for 500–1000 ms, one of two discriminative stimuli, tone A or tone B, was stochastically chosen and presented. For tone A presentation, the rat was required to perform a left or right nose poke (choice trial). After the left or right nose poke, a sucrose pellet was delivered stochastically with a certain probability depending on the rat's choice (for example, 90% reward probability for the left choice and 50% reward probability for the right choice). Reward availability was informed by different tone signals, which were presented immediately after the left or right nose poke. The reward probability in choice trials was fixed in a block, and the block was changed to the next block with a different reward probability when the average of the last 20 choices reached 80% optimal. One of four types of reward probability [(left, right), (90, 50%), (50, 90%), (50, 10%), and (10, 50%)] was used for each block. For tone B presentation, a pellet was delivered deterministically 1000 ms after the exit from the center hole (no-choice trial).
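The block-advance criterion described above (reward probabilities fixed within a block, with the block changing once the last 20 choices reach 80% optimal) can be written down directly. The sketch below assumes that "optimal" means the side with the higher reward probability and uses an illustrative fixed block order; only the four (left, right) probability pairs come from the caption.

```python
from collections import deque

BLOCKS = [(0.9, 0.5), (0.5, 0.9), (0.5, 0.1), (0.1, 0.5)]   # (p_reward_left, p_reward_right)

def should_switch(recent_choices, block):
    """True once at least 80% of the last 20 choices were of the higher-probability side."""
    if len(recent_choices) < 20:
        return False
    optimal = 'L' if block[0] > block[1] else 'R'
    frac = sum(c == optimal for c in recent_choices) / len(recent_choices)
    return frac >= 0.8

recent = deque(maxlen=20)          # rolling window of the last 20 choices
block_idx = 0
for choice in ['L'] * 25:          # toy choice stream in a (90, 50) block
    recent.append(choice)
    if should_switch(recent, BLOCKS[block_idx]):
        block_idx += 1
        recent.clear()
        break
print(block_idx)                   # 1: the block advanced after 20 optimal choices
```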
Figure 2.
A, Representative example of a rat's performance during one session of the conditional free-choice task. The blue and red vertical lines indicate individual choices in choice trials. The orange and black vertical lines indicate no-choice trials and error trials, respectively. The long lines and short lines represent rewarded and no-reward trials, respectively. The light blue trace in the middle indicates the probability of a left choice in choice trials (average of the last 20 choice trials). B–E, The rat's strategy in choice trials, represented by left choice probabilities after different experiences, with 99% confidence intervals (shaded bands). B, The left choice probability for all possible experiences in one and two previous trials. The four types of experience in one trial [left (L) or right (R), each either rewarded (1) or not rewarded (0)] are represented by different colors and line types. For instance, the left choice probability after R0 is indicated by the right edge of a red broken line (green arrowhead), and the left choice probability after R0 L1 (R0 and then L1) is indicated by the right edge of a blue solid line connecting to the red broken line (blue arrowhead). C, Left choice probabilities for frequently occurring sequences of four experiences. These patterns show that rewarded experiences gradually reinforced the selected action. The blue arrowhead and blue arrow represent the same data as the blue arrow and arrowhead in B. D, Left choice probabilities for sequences of four no-reward experiences. No-reward experiences tended to switch the rat's choices. E, Left choice probabilities for persevering behavior. An increase in the probability of a selected action after a no-reward outcome suggests that rats tended to repeat the same choice regardless of the no-reward outcome.
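The statistics in B–E are conditional choice probabilities, i.e., the frequency of a left choice given a particular recent history of choices and outcomes. A minimal single-trial-history version might look like the sketch below; the 'L1'/'R0' encoding follows the caption, and the toy sequence is illustrative.

```python
from collections import defaultdict

def left_prob_after(experiences, choices):
    """P(next choice is left | previous experience), e.g. experience 'R0' = right choice, no reward."""
    counts = defaultdict(lambda: [0, 0])            # experience -> [left-choice count, total count]
    for prev, nxt in zip(experiences[:-1], choices[1:]):
        counts[prev][0] += int(nxt == 'L')
        counts[prev][1] += 1
    return {e: n_left / n for e, (n_left, n) in counts.items()}

# Toy sequence: each experience is the chosen side plus the outcome (1 reward, 0 none)
exp = ['L1', 'L1', 'R0', 'L1', 'R0', 'R1', 'R1', 'L0']
cho = [e[0] for e in exp]
print(left_prob_after(exp, cho))
```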
Figure 3.
Model fitting to the rats' strategy by a least-squares method. The choice probabilities predicted by the local matching law (A), the standard Q-learning model (B), the F-Q-learning model (C), and the DF-Q-learning model (D) are shown for repeated rewarded choices and the sequences after one unrewarded choice (top panel) and for repeated unrewarded choices (bottom panel). The broken lines indicate the choice probabilities of the rats (same as in Fig. 2C,D), and the solid lines indicate the choice probabilities predicted by each model from the choice and reward sequences of the corresponding color. In the lower panel of A, the green solid line completely overlaps the orange solid line. The free parameters of each model were determined so that the squared errors of the choice probabilities between the model and the rats were minimized (Table 1). The numbers of free parameters, including the initial action value [Q0 = QL(1) = QR(1)], are 2, 3, 4, and 5 for the local matching law, standard Q-learning, F-Q-learning, and DF-Q-learning models, respectively.
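For readers unfamiliar with the model family, the sketch below shows one plausible form of a "forgetting" action-value update, in which the chosen action's value moves toward the outcome while the unchosen action's value decays. It is only an illustration of the general idea; the exact F-Q and DF-Q update equations and their parameters are defined in the paper's Materials and Methods and are not reproduced here.

```python
def forgetting_q_update(q_left, q_right, choice, reward, alpha_learn, alpha_forget):
    """One plausible forgetting-type update: learn on the chosen side, decay the other.
    alpha_learn / alpha_forget are illustrative names, not the paper's notation."""
    if choice == 'L':
        q_left = (1 - alpha_learn) * q_left + alpha_learn * reward
        q_right = (1 - alpha_forget) * q_right
    else:
        q_right = (1 - alpha_learn) * q_right + alpha_learn * reward
        q_left = (1 - alpha_forget) * q_left
    return q_left, q_right

# After a rewarded left choice, the left value rises and the right value decays slightly
print(forgetting_q_update(0.6, 0.4, 'L', 1, alpha_learn=0.3, alpha_forget=0.1))
```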
Figure 4.
Examples of trial-by-trial predictions of rats' choices based on reinforcement learning algorithms. A–F, Representative examples of trial-by-trial predictions using the standard Q-learning model (A, B), the F-Q-learning model (C, D), and the DF-Q-learning model (E, F). In all models, the parameters were assumed to be variable (see Materials and Methods). Different choice data were used for the left and right panels. The probability that a rat would select left at trial t was estimated from the rat's past experiences e(1), …, e(t − 1) and plotted at trial t. The rat's actual choice at each trial is represented by a vertical line. The top lines and bottom lines indicate left and right choices, respectively. The black and gray colors indicate rewarded and unrewarded trials, respectively. G, H, Estimated model parameters of the F-Q- (broken lines) and DF-Q-learning (solid lines) models during the predictions.
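The trial-by-trial prediction uses only past experience: at trial t the model emits P(left) before seeing that trial's outcome, and is updated afterward. A minimal sketch of that loop, using a standard Q-learning update and a softmax readout as stand-ins, with assumed parameter values:

```python
import numpy as np

def predict_left_probabilities(choices, rewards, alpha=0.3, beta=3.0, q0=0.5):
    """Return P(left) at each trial, computed before that trial's outcome is seen."""
    q_left, q_right = q0, q0
    p_left = []
    for choice, reward in zip(choices, rewards):
        p_left.append(1.0 / (1.0 + np.exp(-beta * (q_left - q_right))))   # prediction first
        if choice == 'L':                                                 # then update
            q_left += alpha * (reward - q_left)
        else:
            q_right += alpha * (reward - q_right)
    return p_left

print(predict_left_probabilities(['L', 'L', 'R', 'L'], [1, 1, 0, 1]))
```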
Figure 5.
Accuracy of each model in trial-by-trial prediction of rats' choices. The prediction accuracy was defined by the normalized likelihood of the test data. The free parameters of each model were determined by maximizing the likelihood of the training data. Numbers following each model name indicate the number of free parameters of that model. “const” means that the parameters of the model, such as the learning rate, were assumed to be constant for all sessions, and “variable” means that the parameters were assumed to be variable. The double and single asterisks indicate a significant difference from the prediction accuracy of the F-Q-learning model (variable); p < 0.01 and p < 0.05 in paired-sample Wilcoxon's signed rank tests, respectively.
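One common way to define a "normalized likelihood" is the geometric mean of the per-trial probabilities assigned to the choices actually made, i.e., the likelihood raised to 1/T, so that chance level for a two-alternative task is 0.5. The sketch below assumes that definition; the paper's exact definition and its train/test split are described in its Materials and Methods.

```python
import numpy as np

def normalized_likelihood(p_left, choices):
    """Geometric mean of the probabilities the model assigned to the observed choices."""
    probs = [p if c == 'L' else 1.0 - p for p, c in zip(p_left, choices)]
    return float(np.exp(np.mean(np.log(probs))))

# 0.5 corresponds to chance-level prediction for a two-alternative choice
print(normalized_likelihood([0.7, 0.6, 0.4], ['L', 'L', 'R']))
```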
Figure 6.
Tracks of accepted electrode bundles for all rats are illustrated by rectangles. Each diagram represents a coronal section referenced to the bregma (Paxinos and Watson, 1998). Data recorded from the sites in A and B were treated as neuronal activity in the NAc and VP, respectively. core, Nucleus accumbens core; sh, nucleus accumbens shell; VP, ventral pallidum.
Figure 7.
Examples of neuronal activity in the NAc (A, C, E) and VP (B, D, F) modulated by various task events. A, B, D, and E are neuronal responses recorded in the same session. Of these, A and E are data from the same neuron. A, B, Examples of neuronal activity modulated by the selected action (action-coding neurons). Top (bottom) rasters show spikes and events on choice trials in which a left (right) nose poke was selected. The perievent time histograms in the bottom panels are aligned with the exit from the center hole. C, D, Examples of neuronal activity modulated by the availability of reward (reward-coding neurons). The perievent time histograms are aligned with the onset of a reward tone or no-reward tone. E, F, Examples of neuronal activity coding the reward probability for one of the two actions (action value-coding neurons). The perievent time histograms for the last 20 choice trials in four different blocks are shown in different colors. E, The histograms were aligned with entry to the center hole. There is a significant difference in the activity between block (50, 10) and block (50, 90), but no difference between block (90, 50) and block (10, 50), around the entry (yellow bins, p < 0.01, from −1 to 1 s). This suggests that the activity codes the reward probability for the right action. Because this neuron also coded the selected action (as shown in A), the firing rates were significantly different between (90, 50) and (10, 50), and between (50, 90) and (50, 10), 3 s after the entry to the center hole (pink bins). F, The histograms were aligned with the exit from the center hole. This VP neuron coded the reward probability for the left action. The yellow bins indicate a significant difference in the firing rate between block (10, 50) and block (90, 50) (p < 0.01) and no difference between blocks (50, 90) and (50, 10). The green bands in the rasters show the time of presentation of tone A. The pink bands behind the green bands represent the time periods of center nose pokes. The blue and red bands represent left and right nose pokes, respectively. The green and black diamonds indicate the onset of reward and no-reward tones, respectively. The red triangles indicate the time at which a rat picked up a sucrose pellet from the pellet dish. Each perievent time histogram was constructed with 100 ms bins (A–D) or 500 ms bins (E, F). The yellow bins in the histograms show significant differences in firing rate (Mann–Whitney U test, p < 0.01).
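The perievent time histograms in this figure align spikes to a task event on every trial and count them in fixed-width bins (100 ms in A–D, 500 ms in E and F). A minimal sketch with toy spike and event times; the window and the toy data are illustrative assumptions.

```python
import numpy as np

def perievent_histogram(spike_times_per_trial, event_times, window=(-2.0, 2.0), bin_size=0.1):
    """Average firing rate (spikes/s) in bins aligned to an event on each trial."""
    edges = np.arange(window[0], window[1] + bin_size, bin_size)
    counts = np.zeros(len(edges) - 1)
    for spikes, event in zip(spike_times_per_trial, event_times):
        counts += np.histogram(np.asarray(spikes) - event, bins=edges)[0]
    return edges[:-1], counts / (len(event_times) * bin_size)

spikes = [[1.10, 1.35, 2.00], [0.90, 1.60]]    # toy spike times (s) for two trials
events = [1.00, 1.20]                          # toy event times (s), one per trial
bins, rate = perievent_histogram(spikes, events)
print(rate.max())
```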
Figure 8.
Information coded in the NAc and VP. A, Time bins in which neuronal activity was examined: 1 s before the onset of a nose poke at the center hole (phase 1), after the onset of the cue tone (phase 2), before initiation of the action (phase 3), after the action onset (phase 4), and after the onset of the reward or no-reward tone (phase 5). B, The population of neurons that showed significant selectivity (Mann–Whitney U test, p < 0.01) for each event. State-coding neurons are defined as neurons that showed a significantly different firing rate in choice and no-choice trials for 1 s after the onset of the cue tone (phase 2). The neurons coding action values for left or right choices (Fig. 7E,F) were detected for three different time bins, phases 1–3. QLn and QRn indicate the action values for left and right during phase n, respectively. Note that these action value-coding neurons were detected by simple comparisons of firing rates in different blocks, not by using computational models. Action command (AC)-coding neurons are defined as neurons that showed action selectivity during the 1 s before initiation of the action (phase 3). Action-coding neurons are neurons showing action selectivity during the 1 s after the action onset (phase 4) (Fig. 7A,B). Reward-coding neurons are neurons that showed different firing rates between rewarded and no-reward trials during the 1 s after the onset of the reward or no-reward tone (phase 5) (Fig. 7C,D). C, The population of neurons coding the action values, detected by a linear regression analysis. The reward probabilities for left and right were used as regressors (a model-free analysis). Neurons with a significant coefficient for the reward probability for either left or right were defined as the action value-coding neurons for left or right, respectively. QLn and QRn indicate the action values for left and right during phase n, respectively. D, The population of neurons coding the state value and the policy, detected by a linear regression analysis. The sum of the reward probabilities for both actions and their difference were used as regressors (a model-free analysis). Neurons with a significant coefficient for either the sum or the difference were defined as state value- and policy-coding neurons, respectively. Vn indicates the state value during phase n, and Pn the policy during phase n. All populations were significantly larger than chance level (binomial test, p < 0.01). The single and double asterisks indicate significant differences in the percentages of coding neurons between the NAc and VP; p < 0.05 and p < 0.01, respectively, in the Mann–Whitney U test.
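The "model-free" regression in C treats the block's reward probabilities as regressors for the firing rate in a given phase; a neuron with a significant coefficient for one side's probability is classed as coding that side's action value. The sketch below shows only the least-squares fit on toy data; the significance test and the paper's exact criteria are not reproduced here.

```python
import numpy as np

def fit_rate_regression(firing_rate, p_reward_left, p_reward_right):
    """Regress the firing rate on the block reward probabilities for left and right."""
    X = np.column_stack([np.ones_like(p_reward_left), p_reward_left, p_reward_right])
    coef, *_ = np.linalg.lstsq(X, firing_rate, rcond=None)
    return coef    # [intercept, weight on P(reward | left), weight on P(reward | right)]

# Toy data: one firing-rate sample per block, with the rate tracking the left reward probability
rate = np.array([7.6, 5.4, 5.6, 3.4, 7.4, 5.5])
p_left = np.array([0.9, 0.5, 0.5, 0.1, 0.9, 0.5])
p_right = np.array([0.5, 0.9, 0.1, 0.5, 0.5, 0.1])
print(fit_rate_regression(rate, p_left, p_right))
```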
Figure 9.
Information coded in the NAc and VP. The mutual information per 1 s between neuronal firing and each event was calculated using a sliding time window (duration, 500 ms; step size, 100 ms) and averaged across all neurons recorded in the NAc and VP. The mutual information about action values (QL and QR) was calculated using the action values estimated by the F-Q-learning model with time-varying parameters. A and B are aligned with the onset of the discriminative tone and the initiation of the action (exit time from the center hole), respectively. The black lines close to the horizontal axes show the threshold for significant information (p < 0.01). The bottom panels show the normalized distributions of onset times (solid lines) and offset times (broken lines) for nose pokes at C, L, and R, presentation of tone A, and the sensor detecting a pellet on the dish.
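The measure plotted here is the mutual information between a neuron's firing in a sliding window and a task variable. For a single window and a binary variable (e.g., left vs right choice), a plug-in estimate looks like the sketch below; the paper's estimator, bias handling, and significance threshold are not reproduced here, and the toy data are illustrative.

```python
import numpy as np

def mutual_information(spike_counts, labels):
    """Plug-in estimate of I(spike count; label) in bits for one time window."""
    spike_counts = np.asarray(spike_counts)
    labels = np.asarray(labels)
    mi = 0.0
    for x in np.unique(spike_counts):
        for y in np.unique(labels):
            p_xy = np.mean((spike_counts == x) & (labels == y))
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (np.mean(spike_counts == x) * np.mean(labels == y)))
    return mi

# Toy case: the counts perfectly separate left from right trials, so the information is 1 bit
counts = [0, 1, 3, 4, 0, 1, 4, 3]
labels = ['L', 'L', 'R', 'R', 'L', 'L', 'R', 'R']
print(mutual_information(counts, labels))
```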
