Mastering Atari, Go, chess and shogi by planning with a learned model

Julian Schrittwieser; Ioannis Antonoglou; Thomas Hubert; Karen Simonyan; Laurent Sifre; Simon Schmitt; Arthur Guez; Edward Lockhart; Demis Hassabis; Thore Graepel; Timothy Lillicrap; David Silver

doi:10.1038/s41586-020-03051-4

Nature. 2020 Dec;588(7839):604-609. doi: 10.1038/s41586-020-03051-4. Epub 2020 Dec 23.

Authors

Julian Schrittwieser^#¹, Ioannis Antonoglou^#¹,², Thomas Hubert^#¹, Karen Simonyan¹, Laurent Sifre¹, Simon Schmitt¹, Arthur Guez¹, Edward Lockhart¹, Demis Hassabis¹, Thore Graepel¹,², Timothy Lillicrap¹, David Silver^#³,⁴

Affiliations

¹ DeepMind, London, UK.
² University College London, London, UK.
³ DeepMind, London, UK. davidsilver@google.com.
⁴ University College London, London, UK. davidsilver@google.com.

^# Contributed equally.

PMID: 33361790
DOI: 10.1038/s41586-020-03051-4

Abstract

Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess¹ and Go², where a perfect simulator is available. However, in real-world problems, the dynamics governing the environment are often complex and unknown. Here we present the MuZero algorithm, which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. The MuZero algorithm learns an iterable model that produces predictions relevant to planning: the action-selection policy, the value function and the reward. When evaluated on 57 different Atari games³ (the canonical video game environment for testing artificial intelligence techniques, in which model-based planning approaches have historically struggled⁴), the MuZero algorithm achieved state-of-the-art performance. When evaluated on Go, chess and shogi (canonical environments for high-performance planning), the MuZero algorithm matched, without any knowledge of the game dynamics, the superhuman performance of the AlphaZero algorithm⁵ that was supplied with the rules of the game.
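
To make the phrase "iterable model" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the model can be split into three functions, named here `represent`, `dynamics` and `predict` after the terminology of the full paper as we understand it. The toy functions are hypothetical stand-ins for trained networks, and the tree search that would consume these predictions is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[float, ...]
# One unrolled step: (action taken, predicted reward, policy, value).
Step = Tuple[Optional[int], float, List[float], float]


@dataclass
class LearnedModel:
    """Hypothetical container for the three learned functions (assumed split)."""
    represent: Callable[[Sequence[float]], State]          # observation -> hidden state
    dynamics: Callable[[State, int], Tuple[float, State]]  # (state, action) -> (reward, next state)
    predict: Callable[[State], Tuple[List[float], float]]  # state -> (policy, value)


def unroll(model: LearnedModel, observation: Sequence[float],
           actions: Sequence[int]) -> List[Step]:
    """Apply the model iteratively along a hypothetical action sequence.

    The environment is never queried after the first observation; every
    later reward, policy and value comes from the model itself.
    """
    state = model.represent(observation)
    policy, value = model.predict(state)
    steps: List[Step] = [(None, 0.0, policy, value)]  # root: no action, no reward yet
    for action in actions:
        reward, state = model.dynamics(state, action)
        policy, value = model.predict(state)
        steps.append((action, reward, policy, value))
    return steps


# Toy stand-ins (assumptions for illustration only, not trained networks).
def toy_represent(obs: Sequence[float]) -> State:
    return tuple(obs)


def toy_dynamics(state: State, action: int) -> Tuple[float, State]:
    next_state = tuple(s + action for s in state)
    return float(sum(next_state)), next_state


def toy_predict(state: State) -> Tuple[List[float], float]:
    return [0.5, 0.5], float(sum(state))  # uniform policy over two actions


model = LearnedModel(toy_represent, toy_dynamics, toy_predict)
steps = unroll(model, [0.0, 1.0], [1, 0])
```

A planner would call `dynamics` and `predict` many times from the same root state to evaluate candidate action sequences. The sketch shows only a single unrolled trajectory.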