PLoS Comput Biol. 2015 Nov 3;11(11):e1004540. doi: 10.1371/journal.pcbi.1004540. eCollection 2015 Nov.

Parallel Representation of Value-Based and Finite State-Based Strategies in the Ventral and Dorsal Striatum


Makoto Ito et al. PLoS Comput Biol. 2015.

Abstract

Previous theoretical studies of animal and human behavioral learning have focused on the dichotomy of the value-based strategy, which uses action value functions to predict rewards, and the model-based strategy, which uses internal models to predict environmental states. However, animals and humans often adopt simple procedural behaviors, such as the "win-stay, lose-switch" strategy, without explicit prediction of rewards or states. Here we consider another strategy, the finite state-based strategy, in which a subject selects an action depending on its discrete internal state and updates the state depending on the action chosen and the reward outcome. By analyzing choice behavior of rats in a free-choice task, we found that the finite state-based strategy fitted their behavioral choices more accurately than the value-based and model-based strategies did. When the fitted models were run autonomously on the same task, only the finite state-based strategy could reproduce the key feature of the choice sequences. Analyses of neural activity recorded from the dorsolateral striatum (DLS), the dorsomedial striatum (DMS), and the ventral striatum (VS) identified significant fractions of neurons in all three subareas whose activities were correlated with individual states of the finite state-based strategy. The signal of internal states at the time of choice was found in the DMS, and that of clusters of states in the VS. In addition, action values and state values of the value-based strategy were encoded in the DMS and the VS, respectively. These results suggest that both the value-based strategy and the finite state-based strategy are implemented in the striatum.
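To make the finite state-based strategy concrete, the following minimal sketch simulates such an agent: the per-state action probabilities, the transition table, and the task reward probabilities below are illustrative values chosen by me, not the parameters fitted in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state agent: each internal state has its own probability of
# choosing left, and the state is updated stochastically from the chosen
# action and the reward outcome. All numbers are made up for illustration.
p_left = np.array([0.9, 0.1])        # P(choose left | state)

# transition[state, action, outcome] is a probability vector over next states
# (action: 0 = left, 1 = right; outcome: 0 = no reward, 1 = reward).
# Here the update is roughly "stay after a reward, switch after no reward".
transition = np.array([
    [[[0.2, 0.8], [0.9, 0.1]],       # from state 0 after a left choice
     [[0.2, 0.8], [0.9, 0.1]]],      # from state 0 after a right choice
    [[[0.8, 0.2], [0.1, 0.9]],       # from state 1 after a left choice
     [[0.8, 0.2], [0.1, 0.9]]],      # from state 1 after a right choice
])

def trial(state, reward_prob=(0.9, 0.5)):
    """One trial: choose an action from the current state, draw a stochastic
    reward, and move to the next internal state."""
    action = 0 if rng.random() < p_left[state] else 1
    reward = int(rng.random() < reward_prob[action])
    next_state = rng.choice(len(p_left), p=transition[state, action, reward])
    return action, reward, next_state

state = 0
for t in range(10):
    action, reward, state = trial(state)
    print(t, "LR"[action], reward, "-> state", state)
```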

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Design of the choice task.
(A) A schematic illustration of the experimental chamber. The chamber was equipped with three holes for nose poking (L, left hole; C, center hole; R, right hole) and a pellet dish (D). (B) The time sequence of the choice task. When a rat performed a nose poke in the center hole for 500–1,000 ms, a cue tone (white noise) was presented. The rat had to maintain the nose poke during the presentation of the cue tone; otherwise the trial was terminated as an error trial after presentation of an error tone. After the cue tone, the rat was required to perform a nose poke in either the left or the right hole. Then either a reward tone or a no-reward tone was presented stochastically, depending on the rat's choice and the current left-right probability block. The reward tone was followed by delivery of a sucrose pellet to the pellet dish. Reward probabilities for left and right nose pokes were selected from four pairs [(left, right) = (90%, 50%), (50%, 90%), (50%, 10%), and (10%, 50%)]. The probability pair was fixed during a block and was changed when the choice frequency of the more advantageous side during the last 20 choice trials reached 80%; for this criterion, a block was held until at least 20 choice trials were completed. A session consisted of four blocks, and the sequence of the reward probability pairs was given in a pseudorandom order, so that all four pairs were used once per session. (C) Decision trees averaged over all rats: the left choice probability for all possible experiences in the one and two previous trials, in the higher reward probability blocks (left) and in the lower reward probability blocks (right). Four types of experiences in one trial [left (L) or right (R), rewarded (1) or not rewarded (0)] are represented by different colors and line types. For instance, the left choice probability after L1, P(L|L1), is indicated by the right edge of the blue solid line (upper black solid arrow in the left panel), and the left choice probability after R1 L0 (R1 and then L0), P(L|R1 L0), is indicated by the right edge of the blue broken line connected to the red solid line (green arrow). Values at trials = 0 (x-axis) represent the left choice probability over all trials. Shaded bands indicate 95% confidence intervals. Significant differences in left choice probabilities after one previous trial between the higher and lower reward probability blocks are marked by brown circles in the right panel (thick circles, p < 0.01; a thin circle, p < 0.05; chi-squared tests).
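A minimal sketch of the block structure described in (B), assuming a simple session loop: the function names and the random-agent example are mine, while the probability pairs and the 80%/20-trial criterion come from the caption.

```python
import random
from collections import deque

# Reward probability pairs (left, right) and the block-switch rule from the
# caption: a block ends once at least 20 choice trials have been completed
# and 80% of the last 20 choices went to the more advantageous side.
PAIRS = [(0.9, 0.5), (0.5, 0.9), (0.5, 0.1), (0.1, 0.5)]

def run_session(choose, seed=0):
    """Simulate one four-block session; `choose` is any callable returning
    0 (left) or 1 (right)."""
    rng = random.Random(seed)
    history = []
    for p_left, p_right in rng.sample(PAIRS, len(PAIRS)):   # pseudorandom order
        better = 0 if p_left > p_right else 1
        last20 = deque(maxlen=20)
        trials = 0
        while True:
            action = choose()
            reward = rng.random() < (p_left if action == 0 else p_right)
            history.append((action, int(reward)))
            last20.append(action)
            trials += 1
            if trials >= 20 and last20.count(better) / 20 >= 0.8:
                break                                        # 80% criterion met
    return history

# Example: a random chooser needs many trials to finish each block.
hist = run_session(lambda: random.randint(0, 1))
print(len(hist), "trials in the session")
```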
Fig 2. Comparison of model fits.
(A) Normalized likelihoods for the finite state-based and model-based strategies. For comparison, the likelihoods of the Markov models and the value-based strategy, published in Ito & Doya 2015, are also shown. The fitness of the models was measured by the normalized likelihood of the test data, which was obtained as the geometric average of the prediction accuracy for unknown data. Numbers in parentheses on the upper x-axis correspond to arithmetic averages of prediction accuracy. The number following the name of each model indicates the number of free parameters in that model. "const" or "variable" means that the parameters of the model were assumed to be constant or variable, respectively. Green and brown asterisks indicate a significant difference from the normalized likelihood of the FSA model with 8 states (green arrow) and the FQ-learning model with variable parameters (brown arrow), respectively; ** for p < 0.01 and * for p < 0.05 in a paired-sample Wilcoxon test (see Materials and Methods). (B, C) Averaged likelihoods and standard errors (shaded bands) over the last 20 trials in the higher (B) and the lower (C) reward probability blocks for the FQ-learning model with variable parameters (red), the FSA model with 8 states (green), and the ESE model with variable parameters (purple).
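The normalized likelihood used here is the geometric mean of the per-trial probabilities that a model assigned to the rat's actual choices in held-out data; a one-function sketch (the function name is mine):

```python
import numpy as np

def normalized_likelihood(p_chosen):
    """Geometric mean of the per-trial probabilities the model assigned to the
    choices the rat actually made (test trials only). Chance level for a
    two-alternative task is 0.5."""
    p = np.asarray(p_chosen, dtype=float)
    return float(np.exp(np.mean(np.log(p))))

# A model that predicts every chosen side with probability 0.7:
print(normalized_likelihood([0.7] * 100))   # -> 0.7
```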
Fig 3. An example of behavioral performance and model fits.
(A) An example of behavioral performance and predictions made by the models. Vertical black lines indicate the rat's choice behavior: left and right choices are represented by upper and lower bars, respectively, and rewarded and non-rewarded outcomes by long and short bars, respectively. Model fits, i.e., the predicted probability that the rat selects left at trial t, were estimated from the choices and reward outcomes of trials 1 to t−1 using the FQ-learning model (red line) or the FSA model with 8 states (green line). (B) Estimated action values and varying parameters of the FQ-learning model, with standard deviations of the posterior probabilities: Q_L and Q_R, action values for left and right; α, the learning rate for the selected action (= forgetting rate for the action not chosen); κ_1, the strength of reinforcement by reward; and κ_2, the strength of aversion resulting from a no-reward outcome. (C) Posterior probabilities of the internal states (upper panel) and clusters (lower panel) of the FSA model with 8 states, shown as stacked graphs. The indices of states and clusters correspond to those in Fig 4C.
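The parameters listed in (B) suggest a forgetting-Q update of the following form. This is a hedged sketch consistent with the parameter descriptions only, not the paper's exact equations; the softmax choice rule and the inverse temperature β are my assumptions.

```python
import numpy as np

def fq_update(q, action, reward, alpha, kappa1, kappa2):
    """One trial of an assumed forgetting-Q update: the chosen action's value
    moves toward +kappa1 after a reward and toward -kappa2 after no reward
    with rate alpha; the unchosen action's value decays toward zero with the
    same rate (the 'forgetting')."""
    q = q.copy()
    target = kappa1 if reward else -kappa2
    q[action] += alpha * (target - q[action])
    q[1 - action] *= (1.0 - alpha)
    return q

def choice_prob_left(q, beta=1.0):
    """Assumed softmax (logistic) choice probability for the left action."""
    return 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))

q = np.zeros(2)                      # [Q_L, Q_R]
q = fq_update(q, action=0, reward=1, alpha=0.3, kappa1=1.0, kappa2=0.5)
print(q, choice_prob_left(q))
```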
Fig 4. Estimated parameters of finite state agent (FSA) models.
Each state is represented by a blue or red circle; the numbers inside each circle give the state index and the action probabilities (%) for left and right. States for which the probability of choosing left (right) is larger than that of right (left) are shown in blue (red). Each arrow with a number indicates the transition probability (%) after left (blue) or right (red) is chosen and a reward is obtained (solid) or not obtained (dashed). For simplicity, only transition probabilities greater than 5% are shown. These parameters were estimated under symmetric constraints. States form clusters that represent different sub-strategies (cluster left, cluster right, and win-stay, lose-switch); see Materials and Methods for the mathematical definition of the clusters. (A) FSA model with 4 states, (B) with 6 states, and (C) with 8 states.
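Given per-state action probabilities and (action, reward)-conditioned transition probabilities like those in this figure, state posteriors such as those plotted in Fig 3C can be obtained by standard forward filtering over the hidden states. The sketch below assumes that generic estimation scheme and uses made-up parameters, not the fitted ones.

```python
import numpy as np

def filter_step(prior, action, reward, p_left, transition):
    """One forward-filtering step for a finite state agent: weight each state
    by the probability that it produced the observed choice, renormalize, then
    propagate through the (action, reward)-conditioned transition matrix.
    This is generic HMM filtering, assumed here as the estimation method."""
    like = p_left if action == 0 else 1.0 - p_left     # P(choice | state)
    post = prior * like
    post /= post.sum()
    return post @ transition[:, action, reward, :]     # belief over next state

n = 8
p_left = np.linspace(0.9, 0.1, n)                # illustrative action probabilities
transition = np.full((n, 2, 2, n), 0.5 / n)      # illustrative: mostly stay put
for i in range(n):
    transition[i, :, :, i] += 0.5

belief = np.full(n, 1.0 / n)
for action, reward in [(0, 1), (0, 0), (1, 1)]:
    belief = filter_step(belief, action, reward, p_left, transition)
print(belief.round(3))
```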
Fig 5. Comparison of simulated model behaviors with actual rat behaviors.
(A, B) Distributions of the number of trials needed to reach the 80% optimality criterion for rats (gray), the FSA with 8 states (green), FQ with constant parameters (red), and ESE with constant parameters (purple), for blocks with higher reward probabilities (A) and blocks with lower reward probabilities (B). (C, D) The mean number of trials in one block. Data from rats are indicated by blue vertical lines, and confidence intervals ((100 − 5/6)%; Bonferroni method) of the hypothesis that the behavioral data were replicated by each model are represented by horizontal lines. Confidence intervals are drawn in red when the behavioral data fall within them. (E, F) The mean probability that the same action is selected after rewarded (solid blue lines) and non-rewarded (dashed blue lines) trials, and the corresponding confidence intervals of the models (horizontal lines).
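The interval level in (C-F) works out to (100 − 5/6)% ≈ 99.2% per comparison, i.e. a family-wise 5% split over six comparisons by the Bonferroni rule. A sketch of how such a replication interval could be computed from repeated model simulations (the function name and the simulated numbers are illustrative):

```python
import numpy as np

def replication_interval(simulated_values, family_alpha=0.05, n_tests=6):
    """Percentile confidence interval for one summary statistic (e.g. mean
    trials per block) computed from many simulated model runs, with the
    per-test alpha Bonferroni-corrected for `n_tests` comparisons."""
    alpha = family_alpha / n_tests            # 0.05 / 6 -> ~0.83% per test
    lo, hi = np.percentile(simulated_values,
                           [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# e.g. 1000 simulated sessions of a model, each giving a mean block length
sims = np.random.default_rng(1).normal(loc=35, scale=4, size=1000)
lo, hi = replication_interval(sims)
observed = 36.0                               # hypothetical rat value
print(f"{lo:.1f}-{hi:.1f}",
      "replicated" if lo <= observed <= hi else "not replicated")
```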
Fig 6. Examples of neuronal activities correlated with internal variables in the FSA model with 8 states.
(A) Firing activity of a DMS neuron whose activity was correlated with the posterior probability of state 5 of the FSA model with 8 states. Firing rates in trials where the estimated posterior probability of state 5 was high or low are shown as green and gray event-aligned spike histograms (EASHs; see Materials and Methods), respectively. (B) Information coded by the neuron shown in (A). Blue and red time bins for each regressor indicate time bins where neuronal activity was positively and negatively correlated with the regressor, respectively. Of the regressors selected by lasso from the 30 regressors for each time bin, only those detected in more than two adjacent time bins are shown (two regressors in this case). (C) The correlation between firing rate and the posterior probability of state 5. The firing rate in the yellow time bins shown in (A) and the posterior probability of state 5 for each trial are plotted as gray lines in the upper and lower panels, respectively; black lines show the same data smoothed with a Gaussian filter with a standard deviation of three trials. (D, E, F) Firing activity of a DMS neuron that was correlated with state 7 at the next trial, estimated by the FSA model with 8 states. (G, H, I) Firing activity of a VS neuron that was correlated with the sub-strategy (win-stay, lose-switch; WSLS) estimated by the FSA model with 8 states.
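The smoothing in (C) is an ordinary Gaussian filter applied across trials; a minimal sketch with made-up trial series (the variable names and data are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(2)
firing_rate = rng.poisson(5, size=200).astype(float)   # spikes per trial (made up)
posterior = rng.random(200)                            # P(state 5) per trial (made up)

# Smooth both trial series with a Gaussian of SD = 3 trials, as in Fig 6C,
# then correlate them.
fr_smooth = gaussian_filter1d(firing_rate, sigma=3)
po_smooth = gaussian_filter1d(posterior, sigma=3)
print(np.corrcoef(fr_smooth, po_smooth)[0, 1])
```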
Fig 7. Proportions of neurons coding variables of the value-based and finite state-based strategies.
Proportions of neurons showing significant correlations (p < 0.01, t test) with variables of the value-based strategy (FQ-learning) (A, B, C) and the finite state-based strategy (FSA model with 8 states) (D, E, F). These neurons were detected by lasso regularization of a Poisson regression model, conducted for the 500 ms before and after each trial event (entry into the center hole, tone onset, tone offset, exit from the center hole, entry into the L/R hole, and exit from the L/R hole) for DLS (blue), DMS (green), and VS (pink). Colored disks indicate proportions significantly higher than chance (p < 0.05, binomial test). (A) Neurons coding state values, the average of the action values. (B) Neurons coding action values, Q_L and/or Q_R. (C) Neurons coding chosen values, the action value of the selected action. (D) Neurons coding at least one cluster (sub-strategy) of the FSA model with 8 states: cluster left, and/or cluster right, and/or win-stay, lose-switch. (E) Neurons coding at least one current state, from x_1(t) to x_8(t), of the FSA model. (F) Neurons coding at least one next state, from x_1(t+1) to x_8(t+1), of the FSA model.
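A minimal sketch of a lasso-regularized Poisson regression of single-bin spike counts on trial-by-trial regressors, in the spirit of the analysis described here: the regressor set, penalty strength, and simulated data are illustrative, and statsmodels' elastic-net fit with L1_wt=1 is used to obtain a pure L1 (lasso) penalty.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_trials = 300

# Illustrative regressors for one time bin: two action values and the
# posterior probability of one internal state (all made up).
X = rng.random((n_trials, 3))
true_beta = np.array([0.0, 0.0, 1.5])             # only the state regressor matters
y = rng.poisson(np.exp(0.5 + X @ true_beta))      # spike counts in the bin

X = sm.add_constant(X)
model = sm.GLM(y, X, family=sm.families.Poisson())
fit = model.fit_regularized(alpha=0.05, L1_wt=1.0)   # L1_wt=1 -> lasso penalty
print(fit.params.round(2))                           # near-zero weights are dropped
```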
Fig 8. Breakdowns of state-coding neurons shown in Fig 7E and 7F.
(A, B) Proportions of neurons coding x(t) during the 500 ms before entry into the L/R hole (A) and x(t+1) during the 500 ms after exit from the L/R hole (B). The color of each state showing a significant proportion (p < 0.05, binomial test) corresponds to the color in the simplified diagram of the state transitions of the FSA model with 8 states shown in (C). Proportions below chance level are shown in gray.
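The binomial test here asks whether the number of neurons detected exceeds what the per-neuron significance threshold would yield by chance; a minimal sketch with made-up counts and an assumed 1% chance rate:

```python
from scipy.stats import binomtest

# e.g. 9 of 120 recorded neurons significantly coded a given state; test this
# against an assumed chance detection rate of 1% (the per-neuron p < 0.01
# threshold). Counts and chance rate are made up for illustration.
result = binomtest(k=9, n=120, p=0.01, alternative='greater')
print(result.pvalue)      # small value -> proportion above chance
```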



Grants and funding

This work was supported by MEXT KAKENHI Grant Number 23120007 (KD), MEXT KAKENHI Grant Number 26120729 (MI), and JSPS KAKENHI Grant Number 25430017 (MI). KAKENHI grants cover a full range of creative and pioneering research from basic to applied fields across the humanities, social sciences, and natural sciences (MEXT KAKENHI: http://www.mext.go.jp/english/; JSPS KAKENHI: https://www.jsps.go.jp/english/index.html). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
