
Instructional Control of Reinforcement Learning: A Behavioral and Neurocomputational Investigation


Bradley B Doll et al. Brain Res. 1299, 74-94.

Abstract

Humans learn how to behave directly through environmental experience and indirectly through rules and instructions. Behavior-analytic research has shown that instructions can control behavior, even when such behavior leads to suboptimal outcomes (Hayes, S. (Ed.). 1989. Rule-governed behavior: cognition, contingencies, and instructional control. Plenum Press.). Here we examine the control of behavior through instructions in a reinforcement learning task known to depend on striatal dopaminergic function. Participants selected between probabilistically reinforced stimuli, and were (incorrectly) told that a specific stimulus had the highest (or lowest) reinforcement probability. Despite experience to the contrary, instructions drove choice behavior. We present neural network simulations that capture the interactions between instruction-driven and reinforcement-driven behavior via two potential neural circuits: one in which the striatum is inaccurately trained by instruction representations coming from prefrontal cortex/hippocampus (PFC/HC), and another in which the striatum learns the environmentally based reinforcement contingencies, but is "overridden" at decision output. Both models capture the core behavioral phenomena but, because they differ fundamentally on what is learned, make distinct predictions for subsequent behavioral and neuroimaging experiments. Finally, we attempt to distinguish between the proposed computational mechanisms governing instructed behavior by fitting a series of abstract "Q-learning" and Bayesian models to subject data. The best-fitting model supports one of the neural models, suggesting the existence of a "confirmation bias" in which the PFC/HC system trains the reinforcement system by amplifying outcomes that are consistent with instructions while diminishing inconsistent outcomes.
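To make the confirmation-bias mechanism concrete, the following Python sketch shows how an instruction-consistent distortion could enter a standard Q-learning update. The parameter names and gain values (alpha, amp, dim) are illustrative assumptions, not the authors' fitted quantities.

import random

def q_update(q, chosen, reward, instructed, alpha=0.1, amp=2.0, dim=0.5):
    """One Q-learning step; reward is +1 (positive feedback) or -1 (negative).

    Confirmation bias: outcomes for the instructed stimulus are amplified
    when they agree with the instruction and diminished when they do not.
    """
    if chosen == instructed:
        reward *= amp if reward > 0 else dim
    q[chosen] += alpha * (reward - q[chosen])
    return q

# With F reinforced on only 40% of trials, the biased value of F still
# converges to a positive number (expected distorted reward per trial:
# 0.4 * 2.0 - 0.6 * 0.5 = +0.5), mirroring persistent instruction-following.
q = {"E": 0.0, "F": 0.0}
for _ in range(500):
    q = q_update(q, "F", 1 if random.random() < 0.4 else -1, instructed="F")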

Figures

Fig. 1
Probabilistic selection task. Example stimulus pairs, which minimize explicit verbal encoding by using Japanese Hiragana characters. Each pair is presented separately in different trials. The three different pairs are presented in random order to create blocks of 60 trials (20 per stimulus pair). Instructed subjects were misinformed either that F would have the highest probability of being correct or that E would have the lowest probability of being correct. Correct choices are determined probabilistically, with percent positive/negative feedback shown in parentheses for each stimulus. When reward was programmed for a given stimulus, a punishment was programmed for its paired alternative. A test (transfer) phase follows in which all possible stimulus pairs are presented and no feedback is given after choices are made. The effect of instructions on learning is measured by performance on all pairs featuring the instructed stimulus. “Choose F” refers to test pairs in which choice of stimulus F is optimal according to reinforcement probabilities, whereas “avoid F” refers to pairs in which the optimal choice is to select the alternative stimulus. Deviations from the accurate response in these pairs (choose F, avoid F) indicate instructional control.
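The feedback schedule described in this caption can be sketched in a few lines of Python. The E (60%) and F (40%) values are given in the Fig. 2 caption; the AB and CD percentages below follow the standard version of this task (A 80%, B 20%, C 70%, D 30%) and are assumptions here, since the exact figure parentheses are not reproduced in the text.

import random

P_CORRECT = {"A": 0.8, "B": 0.2, "C": 0.7, "D": 0.3, "E": 0.6, "F": 0.4}
PAIRS = [("A", "B"), ("C", "D"), ("E", "F")]

def training_trial(choose):
    """Run one training trial; choose maps a pair to a selected stimulus."""
    s1, s2 = random.choice(PAIRS)
    # One stimulus per pair is programmed correct on each trial; its paired
    # alternative is then programmed incorrect, per the caption.
    programmed = s1 if random.random() < P_CORRECT[s1] else s2
    choice = choose((s1, s2))
    return choice, choice == programmed  # (choice, positive feedback?)

# Example usage with a random-choice policy:
choice, rewarded = training_trial(lambda pair: random.choice(pair))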
Fig. 2
(a) Instructed subjects frequently chose stimulus F in the last block of the training phase, despite the repeated negative feedback that resulted from doing so. These subjects were told either that the F stimulus (40% correct) would have the highest probability of being correct, or that the E stimulus (60% correct) would have the lowest probability of being correct. In fact, the E stimulus was more likely to be correct. The instructions did not affect learning of the uninstructed pairs, AB and CD. Performance in the last 20 trials of each stimulus pair is shown here. Historical controls (Frank et al., 2007c) plotted here show rough probability matching on all stimulus pairs. (b) Experience with the true contingencies reduced the influence of instructions on choice. However, by the end of training, subjects continued to choose more in accordance with the instructions than with the true probabilistic contingencies.
Fig. 3
(a) Subjects instructed that F had the highest probability of being correct were more likely to choose F in the test phase when it was statistically suboptimal according to reinforcement probabilities (avoid F condition), and were just as likely as uninstructed subjects to select F when it was optimal. (b) Subjects instructed that E had the lowest probability of being correct were marginally more likely to avoid E in the test phase even when choosing it was actually optimal (choose E condition), and were just as likely as uninstructed subjects to avoid E when it was suboptimal.
Fig. 4
(a) Complete (dual projection) model performance on a reduced probabilistic selection task involving four stimuli. When presented with stimulus S1 (S2), response R1 (R2) is positively reinforced on 80% of trials. For S3 (S4), R1 (R2) is reinforced on 60% of trials. Instructed models were “misled” in an initial instructed trial that R1 would be correct in response to the critical (instructed) stimulus S4*. The instructed model shows the expected matching behavior on all but the instructed stimulus-response mapping. Choice on the instructed stimulus is suboptimal with respect to actual reinforcement probability, as in human subjects. (b) The instructed model, like human subjects, shows some learning of the true probabilities over time. Over 10 epochs performance on the instructed stimulus drifts up to match the allocation of F stimulus responses seen in human subjects. The uninstructed model begins somewhat below 50%. This occurs because the model does not always clearly choose a specific response early in training, instead producing a blend of responses (which is counted as incorrect). As feedback accumulates in training, the model begins to probability match the S4 stimulus.
Fig. 5
Striatal Go and NoGo unit activation-based receptive fields in the test phase when presented with the instructed stimulus. Here positive values indicate greater Go than NoGo activity for selecting R1 compared to R2. Uninstructed models show negative values, indicating a correct preference for R2 over R1 in response to the instructed stimulus, S4. Although both single-projection models behaviorally chose response R1 (consistent with the instructions but inconsistent with reinforcement probabilities), their test-phase striatal activations show that they learned fundamentally differently. Whereas the striatum in the PFC-MC (override) model appropriately learned NoGo to the instructed response, the PFC-BG (bias) model was biased to learn Go.
Fig. 6
The basic BG model (Frank, 2005, 2006) simulates effects of dopaminergic manipulation on a variety of probabilistic learning tasks using the same network parameters. Stimuli presented in the input layer directly (but weakly) activate motor cortex. In order to execute an action, the motor cortex response requires bottom-up thalamic activation, which occurs via action selection in the BG. When activated, striatal Go units (in the left half of the Striatum) encode stimulus-response conjunctions and inhibit the internal segment of the globus pallidus (GPi). Because the GPi is normally tonically active and inhibits the thalamus, the effect of the striatal Go signal is to release the thalamus from tonic inhibition, allowing it to become activated by top-down projections from motor cortex (PreSMA). In turn, thalamic activation reciprocally amplifies PreSMA activity, thereby generating a response. Striatal NoGo units have the opposite effect, via additional inhibitory projections to the external segment of the globus pallidus (GPe), effectively preventing a response from being selected. The net Go–NoGo activity difference is computed for each response in parallel by the BG circuitry, and the response with the greatest difference is generally selected. (The subthalamic nucleus (STN) additionally modulates the threshold at which a response is executed, in proportion to cortical response conflict; it is included here for consistency but is not required for the effects reported in this paper.)
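Abstracting away the network dynamics, the gating arithmetic in this caption reduces to a per-response comparison of net Go minus NoGo support. A minimal sketch under that simplification (the activity values in the example are illustrative, not model outputs):

def select_response(go, nogo):
    """Pick the response with the greatest net Go - NoGo striatal support.

    Go activity inhibits GPi, releasing the thalamus from tonic inhibition;
    NoGo activity inhibits GPe, which disinhibits GPi and suppresses the
    response. Both routes are collapsed here into one difference score.
    """
    net = [g - n for g, n in zip(go, nogo)]
    return net.index(max(net))

# Example: R1 has stronger Go support, R2 stronger NoGo suppression -> R1.
assert select_response(go=[0.8, 0.4], nogo=[0.2, 0.6]) == 0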
Fig. 7
Alternative pathways by which rule-based representations can bias responding in the network. (a) In the PFC-MC model, the PFC/HC “rule” layer projects to the motor cortex, but not to the striatum. (b) In the PFC-BG model, the PFC/HC layer projects to the Striatum, but not to the motor cortex. The complete model features both of these projections.
Fig. 8
The effect of different learning rates for the instructed trial on each network model. For each model type, we report the results for the learning rate that provided the best fit to the human subject data. Proportion correct is the proportion of trials on which the model chose according to the actual contingencies (60% reinforcement for the critical stimulus) rather than according to the instructions. Higher learning rates in instructed trials generally produce more rule-following and less accurate responding.
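In abstract terms, the manipulation in this figure amounts to treating the instruction as a single pre-training update whose learning rate is swept. A hypothetical Q-learning rendering of that idea (the network models themselves use synaptic weight updates, not Q values, so this is an analogy rather than the authors' implementation):

def apply_instruction(q, instructed, alpha_instr):
    """Treat the instructed trial as one reward-like update of size alpha_instr.

    A larger alpha_instr plants a stronger initial bias toward the instructed
    response, yielding more rule-following and less accurate responding.
    """
    q[instructed] += alpha_instr * (1.0 - q[instructed])
    return q

# Example: a strong instruction before any environmental feedback.
q = apply_instruction({"R1": 0.0, "R2": 0.0}, "R1", alpha_instr=0.8)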
Fig. 9
Bayesian override model testing the possibility that subjects abruptly abandon the rule upon accumulating sufficient evidence. Though the model fit the test data poorly compared to other models, it fit the training data well. The diversity of fits in the training phase indicates individual differences. Data here are smoothed with a 5-point moving average. (a) Subjects fit poorly by this model appeared to shift gradually from choosing according to instructions to choosing according to contingencies (this subject: pseudo-R2 = 0.03). (b) Subjects best fit by this model also showed learning curves most indicative of “insight” (this subject: pseudo-R2 = 0.21).
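A minimal sketch of the abrupt-abandonment idea, assuming Beta-Bernoulli posteriors over each stimulus's reinforcement probability and a fixed posterior threshold (both are assumptions; the caption does not specify the model's parameterization):

import numpy as np

def still_follows_rule(wins_e, losses_e, wins_f, losses_f,
                       threshold=0.95, n_samples=10_000, seed=0):
    """Follow the 'choose F' rule until P(p_E > p_F | data) exceeds threshold.

    Posteriors are Beta(1 + wins, 1 + losses); the comparison probability is
    estimated by Monte Carlo sampling. Prior and threshold are assumptions.
    """
    rng = np.random.default_rng(seed)
    p_e = rng.beta(1 + wins_e, 1 + losses_e, n_samples)
    p_f = rng.beta(1 + wins_f, 1 + losses_f, n_samples)
    return (p_e > p_f).mean() <= threshold  # True = keep following the rule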
Fig. 10
Plots of representative estimated posterior distributions for the E and F stimuli. The basic Bayesian model computes optimally inferred probability distributions based on individual subject data. This model revealed that 4 subjects did not receive sufficient evidence to discriminate between the E and F stimuli. (a) A typical subject discriminated the relationship within the EF stimulus pair, E being correct more reliably than F. (b) One of the four subjects who were unable to infer the correct relationship between E and F based on the probabilistic feedback received.
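The posteriors plotted here can be reproduced in outline with a Beta-Bernoulli model, assuming a uniform prior over each stimulus's probability of being correct (the prior, and the feedback counts in the example, are assumptions; the caption does not state them):

import numpy as np
from scipy.stats import beta

def posterior(wins, losses):
    """Beta posterior over a stimulus's probability of being correct."""
    return beta(1 + wins, 1 + losses)

# Illustrative counts giving separated E/F posteriors, as in panel (a).
post_e, post_f = posterior(12, 8), posterior(8, 12)
print(post_e.mean(), post_f.mean())  # ~0.59 vs ~0.41

# Sparse or unlucky feedback leaves the posteriors overlapping, as for the
# four subjects like the one in panel (b); a crude overlap index:
x = np.linspace(0, 1, 201)
overlap = np.minimum(post_e.pdf(x), post_f.pdf(x)).mean()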
