PLoS Comput Biol. 2019 Mar 12;15(3):e1006827.
doi: 10.1371/journal.pcbi.1006827. eCollection 2019 Mar.

Optimizing the depth and the direction of prospective planning using information values


Can Eren Sezener et al.

Abstract

Evaluating the future consequences of actions can be achieved by simulating a mental search tree into the future. Expanding deep trees, however, is computationally taxing. Therefore, machines and humans use a plan-until-habit scheme that simulates the environment up to a limited depth and then exploits habitual values as proxies for consequences that may arise further in the future. Two outstanding questions in this scheme are "in which directions should the search tree be expanded?" and "when should the expansion stop?". Here we propose a principled solution to these questions based on a speed/accuracy tradeoff: deeper expansion in the appropriate directions leads to more accurate planning, but at the cost of slower decision-making. Our simulation results show how this algorithm expands the search tree effectively and efficiently in a grid-world environment. We further show that the algorithm can explain several behavioral patterns in animals and humans, namely the effect of time pressure on the depth of planning, the effect of reward magnitudes on the direction of planning, and the gradual shift from goal-directed to habitual behavior over the course of training. The algorithm also provides several predictions testable in animal/human experiments.
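To make the plan-until-habit scheme concrete, here is a minimal Python sketch of depth-limited evaluation: the environment model is simulated for a fixed number of steps, after which habitual values stand in for all deeper consequences. The names and the toy chain task below (model, habit_value, a goal at state 5) are illustrative assumptions, not the paper's implementation.

    GAMMA = 0.9
    ACTIONS = (0, 1)  # e.g., step left / step right in a toy chain

    def model(state, action):
        """Deterministic toy model: returns (reward, next_state)."""
        next_state = state + (1 if action == 1 else -1)
        reward = 1.0 if next_state == 5 else 0.0
        return reward, next_state

    def habit_value(state):
        """Habitual (model-free) proxy: states closer to the goal look better."""
        return 1.0 / (1.0 + abs(5 - state))

    def plan_until_habit_value(state, action, depth):
        """Evaluate an action by simulating `depth` steps ahead, then using
        the habitual value as a proxy for everything beyond the horizon."""
        reward, nxt = model(state, action)
        if depth <= 1:
            return reward + GAMMA * habit_value(nxt)
        return reward + GAMMA * max(
            plan_until_habit_value(nxt, a, depth - 1) for a in ACTIONS
        )

    # Deeper planning can change which action looks best:
    for d in (1, 3):
        best = max(ACTIONS, key=lambda a: plan_until_habit_value(0, a, d))
        print(f"depth {d}: preferred action = {best}")

Varying the depth trades planning accuracy against computation, which is the speed/accuracy tradeoff the proposed algorithm manages.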


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Overview of the pruning scheme, illustrated via an example.
(A) A snapshot of the search tree. Nodes of the tree represent states, and each state has a number of available actions, denoted by circles, that lead to next states. Blue graphs show value distributions for the leaves of the tree, estimated by the model-free (MF) or any other heuristic system. Green graphs show the immediate rewards for previously expanded state-actions, estimated via the model-based (MB) system. (B) Each path from the root to a leaf forms a strategy, Ai, with a corresponding value distribution. These distributions are obtained by summing the value distributions of the leaves with the immediate-reward distributions accumulated along the way. (C) To compute the value of uncertainty resolution (vur), say for A3, the agent assumes that one further expansion would result in a sharper value distribution (one of the black/grey distributions). The location (i.e., the mean) of the new distribution cannot be known in advance, but it can be treated as a random variable whose distribution can be obtained analytically (Eq 14). The vur for A3 is therefore the expected value, over all possible sharper distributions (grey curves), of the additional reward that could be obtained by a policy improvement in light of that potential new information (i.e., the sharper distribution). (D) After computing vur for all strategies Ai, the highest vur (in this case, for A3) is compared to the cost of expansion. If it exceeds the cost, the tree is expanded along the direction of that strategy. This corresponds to loading a new node, the successor state of the leaf of A3, from the MB system and adding it to the tree.
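A hedged sketch of the expansion rule in panels (C) and (D): assuming each strategy's value belief is Gaussian and that one expansion would re-center it on a new mean that is itself Gaussian-distributed (a simplification in the spirit of, but not a transcription of, Eq 14), the expected gain from resolving uncertainty has a closed form.

    import math

    def pdf(z):  # standard normal density
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

    def cdf(z):  # standard normal cumulative distribution
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def vur(mu, sigma, i):
        """Myopic value of resolving the uncertainty about strategy i.
        Assumes the post-expansion mean of strategy i is distributed as
        N(mu[i], sigma[i]**2); an illustrative simplification, not Eq 14."""
        best_other = max(m for j, m in enumerate(mu) if j != i)
        s = sigma[i]
        if s == 0.0:
            return 0.0
        z = (mu[i] - best_other) / s
        # E[max(new_mean, best_other)] for a Gaussian new_mean:
        expected_max = mu[i] * cdf(z) + best_other * cdf(-z) + s * pdf(z)
        return expected_max - max(mu[i], best_other)

    def expansion_step(mu, sigma, cost):
        """Expand along the strategy with the highest vur iff it beats the cost."""
        scores = [vur(mu, sigma, i) for i in range(len(mu))]
        i_best = max(range(len(mu)), key=scores.__getitem__)
        return (i_best, scores[i_best]) if scores[i_best] > cost else None

    # Example: the uncertain runner-up (strategy 1) is worth expanding.
    print(expansion_step(mu=[1.0, 0.9, 0.2], sigma=[0.1, 0.8, 0.1], cost=0.05))

In this scheme the agent keeps expanding in the direction of the highest vur until no strategy's vur exceeds the expansion cost, at which point deliberation stops and the current best strategy is executed.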
Fig 2
Fig 2. Grid-world pruning simulation results.
Reaching the bottom-right corner of the map in the minimum number of moves is rewarding. The heatmaps show the frequencies of state visits during tree expansion when the agent starts from the middle of the map, and (A) the agent has had no prior exposure to the environment, or (B) after some exposure (i.e., 10 trajectory samples from each state), resulting in more accurate estimates of the model-free values.
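A minimal grid-world with this structure can be written in a few lines; the Manhattan-distance heuristic below is an illustrative stand-in for the trained model-free values, not the paper's learned estimates.

    H, W = 9, 9
    GOAL = (H - 1, W - 1)     # bottom-right corner
    START = (H // 2, W // 2)  # middle of the map
    MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

    def step(state, move):
        """Deterministic move with wall clamping; each move costs -1."""
        dr, dc = MOVES[move]
        row = min(max(state[0] + dr, 0), H - 1)
        col = min(max(state[1] + dc, 0), W - 1)
        return -1.0, (row, col)

    def mf_value(state):
        """Stand-in habitual value: negative Manhattan distance to the goal."""
        return -(abs(GOAL[0] - state[0]) + abs(GOAL[1] - state[1]))

    # A purely habitual (greedy, zero-depth) agent already heads for the goal:
    state, moves = START, 0
    while state != GOAL:
        best = max(MOVES, key=lambda m: mf_value(step(state, m)[1]))
        _, state = step(state, best)
        moves += 1
    print(f"habitual rollout reached the goal in {moves} moves")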
Fig 3
Fig 3. Example search trees from [12].
A: Starting at state 3, subjects make three consecutive decisions (pressing ‘U’ or ‘I’), each of which is associated with a gain or loss. Two trajectories maximize the cumulative rewards in this example and achieve −20. B and C: State transition frequencies of subjects. Higher frequencies are illustrated with thicker lines. If a transition is not taken by any of the subjects, it is illustrated with a dashed line. Yellow backgrounds show the optimal trajectories. The colors red, black, green, and blue denote the transition rewards of P, −20, +20, and +140, respectively. B: P = −140 condition. It can be seen that the subjects avoid the action associated with the large punishment. C: P = −70 condition. Subjects are eager to take transitions with large losses when such transitions lead to large gains (i.e., +140), which in fact is the optimal strategy. Reprinted with permission from [12].
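The task's key ingredient, a large gain reachable only through a large loss, can be illustrated with a small deterministic transition table. The table below is a made-up stand-in rather than the actual matrix from [12]; it only demonstrates how exhaustively scoring all depth-3 action sequences can flip the optimum as P varies, which need not match the real task's structure.

    from itertools import product

    def build_transitions(P):
        """Illustrative transitions: (state, action) -> (reward, next_state).
        The +140 gain sits behind the large loss P, mimicking the task's flavor."""
        return {
            (3, 'U'): (-20, 1),  (3, 'I'): (+20, 4),
            (1, 'U'): (P, 2),    (1, 'I'): (-20, 5),
            (4, 'U'): (-20, 5),  (4, 'I'): (-20, 6),
            (2, 'U'): (+140, 6), (2, 'I'): (+20, 5),
            (5, 'U'): (-20, 6),  (5, 'I'): (+20, 3),
            (6, 'U'): (-20, 3),  (6, 'I'): (-20, 1),
        }

    def sequence_return(T, start, actions):
        """Cumulative reward of a fixed action sequence from `start`."""
        state, total = start, 0
        for a in actions:
            reward, state = T[(state, a)]
            total += reward
        return total

    # Exhaustively score every depth-3 sequence from state 3 in both conditions:
    for P in (-70, -140):
        T = build_transitions(P)
        best = max(product('UI', repeat=3), key=lambda s: sequence_return(T, 3, s))
        print(f"P={P}: best sequence {''.join(best)}, "
              f"return {sequence_return(T, 3, best)}")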
Fig 4
Fig 4. The frequency of pruning the branch with the large punishment.
The black area on the right is the region where the agent does not prune (i.e., expands) the punishment branch. Each condition is averaged over 300 simulations.
Fig 5
Fig 5. The top panels show the effect of different factors on choosing the optimal sequence of actions.
The panels are adapted from [12]. The x-axis denotes the number of actions the subjects were supposed to take, which determines the maximum depth of the search tree. The y-axis denotes the probability of choosing the Optimal Lookahead sequence. The blue lines represent the conditions in which the optimal sequence of actions included a big loss, and the green lines represent the conditions in which it did not. The size of the big loss varies among the panels and is indicated by "Group X" above each panel, where X denotes the size of the big loss (X = −140, −100, −70). The bottom panels are similar to the top panels, but use data obtained from simulations of the model in the same settings.

References

    1. Aurelius M. Meditations. Great Britain: Penguin Books; 2014.
    2. Sutton RS, Barto AG. Introduction to Reinforcement Learning. 1st ed. Cambridge, MA, USA: MIT Press; 1998.
    3. Russell SJ, Norvig P. Artificial Intelligence: A Modern Approach. 2nd ed. Prentice Hall; 2002.
    4. Russell S, Wefald E. Do the Right Thing: Studies in Limited Rationality. MIT Press; 1991.
    5. Schultz W, Dayan P, Montague PR. A Neural Substrate of Prediction and Reward. Science. 1997;275:1593–1599. doi: 10.1126/science.275.5306.1593.

Grants and funding

AD was supported by grant DP150104878 from the Australian Research Council, and MK by the Gatsby Charitable Foundation and the Max Planck Society. We acknowledge support by the German Research Foundation and the Open Access Publication Fund of TU Berlin. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.