Front Neurosci. 2015 Feb 13;9:27. doi: 10.3389/fnins.2015.00027. eCollection 2015.

Condition interference in rats performing a choice task with switched variable- and fixed-reward conditions

Akihiro Funamizu et al. Front Neurosci. 2015.

Abstract

Because humans and animals encounter various situations, the ability to adaptively decide upon responses to any situation is essential. To date, however, decision processes and the underlying neural substrates have been investigated under specific conditions; thus, little is known about how various conditions influence one another in these processes. In this study, we designed a binary choice task with variable- and fixed-reward conditions and investigated neural activities of the prelimbic cortex and dorsomedial striatum in rats. Variable- and fixed-reward conditions induced flexible and inflexible behaviors, respectively; one of the two conditions was randomly assigned in each trial to test the possibility of condition interference. Rats were successfully conditioned such that they could find the better-reward holes in both variable-reward-condition and fixed-reward-condition trials. A learning interference model, which updated the expected rewards (i.e., values) used in variable-reward-condition trials on the basis of the combined experiences of both conditions, fit choice behaviors better than conventional models that updated values in each condition independently. Thus, although rats distinguished the trial condition, they updated values in a condition-interfering manner. Our electrophysiological study suggests that this interfering value updating is mediated by the prelimbic cortex and dorsomedial striatum. First, some prelimbic cortical and striatal neurons represented action-reward associations irrespective of trial condition. Second, striatal neurons kept tracking the values of the variable-reward condition even in fixed-reward-condition trials, suggesting that values may have been updated in an interfering manner even in the fixed-reward condition.

Keywords: Q-learning; goal-directed; habit; prefrontal cortex; reinforcement learning; striatum; task switching.


Figures

Figure 1
Free choice task. Variable- and fixed-reward conditions were randomly assigned to each trial at a 70%/30% ratio, respectively. In both conditions, a trial was initiated when a rat poked its nose into the center hole (C), after which it had to keep poking for 1600–2600 ms until a “go” tone sounded (Hold). In the fixed-reward condition only, a short light stimulus was presented during the center-hole poking to inform the rats that the trial was a fixed-reward-condition trial. After the presentation of the “go” tone (Go), rats had to choose either the left (L) or the right (R) hole, and a food-pellet reward was dispensed stochastically (D) (Choice). In the variable-reward condition, reward probabilities were selected randomly (90–50%, 50–10%, 10–50%, and 50–90% for left–right choices), and the reward setting changed based on the rat's choice performance. In the fixed-reward condition, the reward probability was constant at either 90–50% or 50–90% across all sessions, and the reward setting was pre-determined for each rat.
Figure 2
Interference reinforcement learning model. Our reinforcement learning models assumed that rats estimated the expected rewards (values) of left and right choices in both the variable- and fixed-reward conditions, i.e., QV and QF; each model therefore had four action values in total. All action values were updated in both variable-reward-condition (A) and fixed-reward-condition (B) trials. α was the learning rate for the selected option or the forgetting rate for the unselected option; α depended on the trial condition and the value condition. k was the reinforcer strength of the outcome. Choice probability was predicted with a soft-max equation based on all values. The soft-max equation had a free parameter, G, which adjusted the contribution of the non-trial-condition action values to the choice prediction.
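
To make the value-updating and choice rules concrete, the sketch below implements a generic forgetting-Q-learning update and a soft-max that mixes values from both conditions. It is only an illustration of the scheme described in the caption, not the authors' code: the inverse-temperature parameter beta, the decay-to-zero form of forgetting, and the additive weighting by G are assumptions.

import numpy as np

def update_values(q, choice, reward, alpha_learn=0.2, alpha_forget=0.05, k=1.0):
    # Update the two action values of one condition after a single trial.
    # The chosen value moves toward k * reward (k = reinforcer strength);
    # the unchosen value decays toward zero at the forgetting rate.
    # This decay-to-zero form is assumed for illustration.
    q = np.array(q, dtype=float)
    unchosen = 1 - choice
    q[choice] += alpha_learn * (k * reward - q[choice])
    q[unchosen] += alpha_forget * (0.0 - q[unchosen])
    return q

def choice_probability(q_trial, q_other, g=0.5, beta=3.0):
    # Soft-max over combined values; g (the caption's G) scales the
    # contribution of the non-trial-condition values, and beta is an
    # assumed inverse-temperature parameter not named in the caption.
    combined = beta * (np.asarray(q_trial) + g * np.asarray(q_other))
    combined -= combined.max()  # numerical stability
    p = np.exp(combined)
    return p / p.sum()          # [P(left), P(right)]

In the interference model, both QV and QF would be passed through an update of this kind on every trial, regardless of which condition was presented, with α and G depending on the trial and value conditions as the caption states; that shared updating is what makes the model "interfering".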
Figure 3
Example of choice behaviors. Vertical bars in the upper portion of the inset indicate the left (L) and right (R) choices in each trial. Tall and short bars show rewarded and non-rewarded trials, respectively. Dark blue and pink bars indicate trials with the variable- and fixed-reward conditions, respectively, and the lines in the center indicate the rat's left-choice frequency over the last 20 trials. The reward probability of the fixed-reward condition was 50–90% for the left–right choice. The reward-probability setting of each variable-reward-condition block is shown at the top. In blocks shaded light blue, the more-rewarding choices of the variable- and fixed-reward conditions were identical. Rats succeeded in distinguishing the variable- and fixed-reward conditions for action learning.
Figure 4
Choices in the variable- and fixed-reward conditions. (A) Extinction phase. Probabilities of the optimal choice were quantified before and during extinction-phase trials, which were introduced in the variable- and fixed-reward conditions. Means and standard errors are shown. Before the extinction phase, the reward probabilities of the variable- and fixed-reward conditions were identical: *p < 0.05; **p < 0.01 in a Mann–Whitney U-test. (Bi) Example of a conditional-probability calculation in repeated (a) and interleaved (b) sequences. Depending on the action-outcome experience at trial t-1 for (a) and trial t-2 for (b), the conditional probability at trial t was analyzed. In these examples, the conditional probability of a variable-reward-condition trial (Var.) was analyzed on the basis of the action-outcome experience in the last and the next-to-last variable-reward-condition trial in (a) and (b), respectively. In (b), the experience in the interleaved trial t-1 with the fixed-reward condition was ignored. Action-outcome experiences were of four types: optimal choice rewarded (Opt. rewarded); optimal choice not rewarded (Opt. not rewarded); non-optimal choice rewarded (Non-opt. rewarded); non-optimal choice not rewarded (Non-opt. not rewarded). If the choices of the variable- and fixed-reward conditions were learned independently, the conditional probabilities of repeated and interleaved sequences would be the same. (ii,iii) Comparison of conditional probabilities in the variable- (ii) and fixed-reward (iii) conditions. Conditional probabilities of making a choice to the optimal side of the fixed-reward condition were compared between repeated (white bars) and interleaved (black bars) sequences. Means and standard errors of the probabilities are shown. The dotted line shows the average choice probability. Significant differences between white and black bars under some action-outcome experiences indicate that the interleaved trial interfered with the choices: **p < 0.01 in a Mann–Whitney U-test.
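
The repeated/interleaved comparison in (B) amounts to a simple counting procedure. The sketch below is one possible implementation following the logic of panel (Bi): the choice at a variable-reward-condition trial t is conditioned on the action-outcome experience of the most recent variable-reward-condition trial, either the immediately preceding trial (repeated) or the trial before an intervening fixed-reward-condition trial (interleaved). The trial-record keys ('condition', 'optimal', 'rewarded') are hypothetical.

def conditional_optimal_prob(trials, experience, sequence="repeated"):
    # trials: list of dicts with 'condition' ('var' or 'fix'),
    #         'optimal' (True if the optimal side was chosen),
    #         'rewarded' (True if a pellet was delivered)  -- assumed format
    # experience: (optimal_chosen, rewarded) pair defining the conditioning event
    hits, total = 0, 0
    for t in range(2, len(trials)):
        if trials[t]['condition'] != 'var':
            continue
        if sequence == "repeated":
            prev = trials[t - 1]
            if prev['condition'] != 'var':
                continue
        else:  # interleaved: skip the intervening fixed-reward-condition trial
            if trials[t - 1]['condition'] != 'fix':
                continue
            prev = trials[t - 2]
            if prev['condition'] != 'var':
                continue
        if (prev['optimal'], prev['rewarded']) != experience:
            continue
        total += 1
        hits += trials[t]['optimal']
    return hits / total if total else float('nan')

If learning in the two conditions were independent, the repeated and interleaved estimates would agree for every experience type; differences between them are the signature of interference the figure tests for.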
Figure 5
Learning interference model. (A) Prediction of choice probability. The learning interference model predicted the choice behaviors of the rat in Figure 3 between trials 289 and 493. These trials covered all four reward-probability settings of the variable-reward condition and the extinction phase. FQ-learning and fixed-choice models were employed as the learning rules of the variable- and fixed-reward conditions, respectively. Free parameters of the learning interference model were set to achieve the maximum likelihood in this session. Dark blue and pink lines show the predicted left-choice probabilities of the model in the variable- and fixed-reward conditions, respectively. Other lines and symbols are as in Figure 3. The learning interference model accurately predicted the choice behaviors of the rat. (B) Model variables. (i) Action values of the variable-reward condition were predicted by FQ-learning. (ii) Choice probabilities in the fixed-reward condition were predicted by the fixed-choice model.
Figure 6
Model fitting. (A) Trials with the variable- (i) and fixed-reward (ii) conditions. Results of 2-fold cross-validation were compared among the four models: F-C, fixed-choice model; Q, standard Q-learning; FQ, forgetting Q-learning; DFQ, differential forgetting Q-learning. Means and standard errors are shown. The number of free parameters in each model is shown in parentheses: **p < 0.01 in a two-sided paired t-test. (B) All trials. FQ-learning and fixed-choice models were employed as the learning rules of the variable- and fixed-reward conditions, respectively.
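
The model comparison rests on 2-fold cross-validation: parameters are fit on one half of the data and the resulting model is scored by the likelihood of the choices in the held-out half. A minimal sketch of that procedure is given below; the fit_model and predict_choice_prob interfaces are placeholders, and the paper's exact fitting, data splitting, and scoring details are not reproduced here.

import numpy as np

def two_fold_cv_loglik(trials, fit_model, predict_choice_prob):
    # trials: sequence of per-trial records (format assumed)
    # fit_model(train_trials) -> params  (e.g., a maximum-likelihood fit)
    # predict_choice_prob(params, test_trials) -> array of P(observed choice)
    n = len(trials)
    halves = [trials[:n // 2], trials[n // 2:]]
    loglik, count = 0.0, 0
    for i in (0, 1):
        params = fit_model(halves[i])                    # fit on one half
        p = predict_choice_prob(params, halves[1 - i])   # score the other half
        loglik += np.sum(np.log(np.clip(p, 1e-12, 1.0)))
        count += len(halves[1 - i])
    return loglik / count  # average held-out log-likelihood per trial

A model with more free parameters (the counts shown in parentheses in the figure) can only win under this criterion by generalizing to held-out trials, which is why cross-validated likelihood rather than raw fit is compared across models.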
Figure 7
Tracks of electrode bundles. Each diagram shows a coronal section referenced to bregma (Paxinos and Watson, 1997). Data recorded from the sites in (A,B) were treated as neuronal activities from the prelimbic cortex and dorsomedial striatum, respectively. The gray level of the boxes distinguishes the electrode tracks of each rat.
Figure 8
Coding of the response-outcome (R-O) association. (A) The proportions of neurons coding actions (responses: R) (i), rewards (outcomes: O) (ii), and R-O associations (iii). In the regression analysis with variable-reward-condition trials, the proportion of neurons with significant regression coefficients is shown: p < 0.01 in a two-sided Student's t-test. Orange and green lines indicate neurons in the prelimbic cortex and dorsomedial striatum, respectively. When the proportion of neurons exceeded the threshold (32.3%), we concluded that the prelimbic cortex or the striatum significantly represented the variable (z-test, p < 0.05). Each column aligns the proportions of neurons to a different timing: left, initiation of center-hole poking; middle, end of center-hole poking; right, onset of the reward or no-reward cue (R/NR cue). Gray bars in the middle column show the timing of right- or left-hole poking in 99.3% of all trials. Small black triangles indicate significant differences between the proportions of neurons in the prelimbic cortex and dorsomedial striatum: p < 0.01 in a χ2-test. (B) Representative prelimbic cortical neuron encoding the R-O association. Neural activities at the onset of the reward or no-reward cues in the variable-reward condition are shown, as in the right column of (A). Green and orange colors show left and right choices in rewarded trials, respectively. Light blue and red colors show non-rewarded trials. Raster plot: spike colors differ according to the actions and outcomes of trials. Tone presentations and poking periods are shown with gray boxes. The lower part shows the average spike density function smoothed with a Gaussian kernel with a standard deviation of 50 ms.
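
The spike density functions shown in (B) (and in the later figures) come from smoothing event-aligned spike trains with a 50 ms Gaussian kernel. The sketch below shows one standard way to compute such a curve; the analysis window and 1 ms bin size are assumptions, only the 50 ms kernel standard deviation is taken from the caption.

import numpy as np

def spike_density(spike_times_ms, t_start=-1000, t_stop=2000, bin_ms=1, sigma_ms=50):
    # spike_times_ms: spike times (ms) aligned to the event of interest
    # sigma_ms: kernel standard deviation (50 ms in the figure captions)
    edges = np.arange(t_start, t_stop + bin_ms, bin_ms)
    counts, _ = np.histogram(spike_times_ms, bins=edges)
    rate = counts / (bin_ms / 1000.0)                  # instantaneous rate, spikes/s
    half = int(4 * sigma_ms / bin_ms)                  # truncate kernel at +/- 4 SD
    x = np.arange(-half, half + 1) * bin_ms
    kernel = np.exp(-x**2 / (2 * sigma_ms**2))
    kernel /= kernel.sum()                             # unit-area kernel
    smoothed = np.convolve(rate, kernel, mode='same')
    centers = edges[:-1] + bin_ms / 2.0
    return centers, smoothed                           # time axis (ms) and rate

Averaging these single-trial curves within a trial type (e.g., rewarded left choices) yields the average spike density functions plotted in the lower parts of the raster panels.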
Figure 9
Coding of the stimulus-outcome (S-O) association. (A) Proportions of neurons coding rewards (outcomes: O) (i), conditions (stimuli: S) (ii), and S-O associations (iii). Regression analysis was performed on data from trials in which rats made a choice to the optimal side of the fixed-reward condition. The proportion of neurons that had significant regression coefficients is shown: p < 0.01 in a two-sided Student's t-test. Lines and symbols as in Figure 8A. (B) Representative dorsomedial striatal neuron encoding the S-O association. Neural activities at the onset of the reward or no-reward cues are shown, as in the right column of (A). Green and orange colors show activities in rewarded trials in the variable- and fixed-reward conditions, respectively. Light blue and red colors show non-rewarded trials. Lines and symbols as in Figure 8B. (C) Detailed neural coding of S-O associations in the dorsomedial striatum. The proportion of neurons encoding one of the four associations is shown before and after the onset of the reward or no-reward cues, as in (B). Colors correspond to (B). Many striatal neurons encoded the no-reward cue in the variable-reward condition, indicating that they did not differentiate the reward and no-reward cues in the fixed-reward condition.
Figure 10
Representative value-coding neurons. Representative neurons coding the values of the variable-reward condition are shown from the prelimbic cortex (A) and dorsomedial striatum (B). Average spike density functions during both the variable-reward-condition and fixed-reward-condition trials (i) and during the fixed-reward-condition trials (ii) are shown, smoothed with a Gaussian kernel with a standard deviation of 50 ms. Each colored line shows the activity during a reward-probability block in the variable-reward condition; reward probabilities for left–right choices are shown in the inset. Activities are aligned to different timings as in Figure 8A. The prelimbic cortical neuron represented state values after center-hole poking (A), while the dorsomedial striatal neuron represented state and action values during and after center-hole poking, respectively (B). Both neurons represented values of the variable-reward condition even during fixed-reward-condition trials (ii).
Figure 11
Coding of values in the variable-reward condition. (A) The proportions of neurons coding the values of the presented-trial condition (i) and of the variable-reward condition (ii). In the regression analysis with all trials, the proportion of neurons that had significant regression coefficients is shown: p < 0.01 in a two-sided Student's t-test. Values consisted of five variables of the learning interference model (i.e., the action values for left and right choices, the state value, the chosen value, and the policy). Lines and symbols as in Figure 8A. Dorsomedial striatal neurons mainly encoded the values of the variable-reward condition. (B) The proportion of neurons encoding values of the variable-reward condition in fixed-reward-condition trials. Neurons representing values of the variable-reward condition were investigated with a regression analysis using fixed-reward-condition trials. Lines and symbols as in (A).
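
The value-coding analysis regresses each neuron's trial-by-trial firing rate on the five model variables listed above and tests the coefficients at p < 0.01. The sketch below shows one way to run such a regression with ordinary least squares; the per-trial input format is assumed, and the paper's exact regressors, time windows, and statistical procedure are not reproduced here.

import numpy as np
from scipy import stats

def value_regression(firing_rate, q_left, q_right, state_value, chosen_value, policy, alpha=0.01):
    # Multiple linear regression of a neuron's trial-by-trial firing rate on
    # five model variables (one entry per trial in every argument; assumed format).
    # Returns coefficients, two-sided p-values, and a significance mask at p < alpha.
    y = np.asarray(firing_rate, dtype=float)
    X = np.column_stack([np.ones_like(y), q_left, q_right,
                         state_value, chosen_value, policy])
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)    # OLS fit
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof                          # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)                 # coefficient covariance
    t_vals = beta / np.sqrt(np.diag(cov))
    p_vals = 2 * stats.t.sf(np.abs(t_vals), dof)          # two-sided t-test
    return {'beta': beta, 'p': p_vals, 'significant': p_vals < alpha}

Counting, across the recorded population, how many neurons have at least one significant value coefficient gives the proportions plotted in (A); restricting the regressors to the variable-reward-condition values while using only fixed-reward-condition trials gives the analysis in (B).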

References

    1. Balleine B. W. (2005). Neural bases of food-seeking: affect, arousal and reward in corticostriatolimbic circuits. Physiol. Behav. 86, 717–730. doi: 10.1016/j.physbeh.2005.08.061
    2. Balleine B. W., Delgado M. R., Hikosaka O. (2007). The role of the dorsal striatum in reward and decision-making. J. Neurosci. 27, 8161–8165. doi: 10.1523/JNEUROSCI.1554-07.2007
    3. Balleine B. W., Dickinson A. (1998). Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology 37, 407–419. doi: 10.1016/S0028-3908(98)00033-1
    4. Balleine B. W., Killcross S. (2006). Parallel incentive processing: an integrated view of amygdala function. Trends Neurosci. 29, 272–279. doi: 10.1016/j.tins.2006.03.002
    5. Bishop C. M. (2006). Pattern Recognition and Machine Learning. New York, NY: Springer.
