Constrained Deep Q-Learning Gradually Approaching Ordinary Q-Learning

Shota Ohnishi et al. Front Neurorobot. 13:103.

Abstract

A deep Q network (DQN) (Mnih et al., 2013), a typical deep reinforcement learning method, is an extension of Q learning. In DQN, a Q function expresses all action values under all states, and it is approximated using a convolutional neural network. Using the approximated Q function, an optimal policy can be derived. In DQN, a target network, which calculates a target value and is updated by the Q function at regular intervals, is introduced to stabilize the learning process. Less frequent updates of the target network result in a more stable learning process. However, because the target value is not propagated unless the target network is updated, DQN usually requires a large number of samples. In this study, we propose Constrained DQN, which uses the difference between the outputs of the Q function and the target network as a constraint on the target value. Constrained DQN updates parameters conservatively when this difference is large and aggressively when it is small. In the proposed method, the constraint is activated less often as learning progresses; consequently, the update rule gradually approaches conventional Q learning. We found that Constrained DQN converges with fewer training samples than DQN and that it is robust against changes in the update frequency of the target network and in the settings of a certain optimizer parameter. Although Constrained DQN alone does not outperform integrated or distributed methods, experimental results show that it can be used as an additional component of those methods.
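The core update lends itself to a short sketch. The snippet below is a minimal PyTorch-style illustration, not the authors' implementation: it assumes the constraint is imposed as a penalty with weight λ that is activated only when the gap between the online Q function and the target network, |Q_θ(s, a) − Q_θ⁻(s, a)|, exceeds a tolerance η (the λ and η that appear in Figures 6-8). The exact loss used by Constrained DQN is defined in the paper; all names and default values here are illustrative.

# Hypothetical sketch of a constrained Q-learning update; the penalty form and
# the roles of lambda_ and eta are assumptions for illustration, not the
# authors' exact loss.
import torch
import torch.nn.functional as F

def constrained_dqn_loss(q_net, target_net, batch, gamma=0.99,
                         lambda_=1.0, eta=1e-5):
    s, a, r, s_next, done = batch  # tensors from a replay buffer; done is a 0/1 float tensor

    # Ordinary DQN target computed with the slowly updated target network.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    td_loss = F.smooth_l1_loss(q_sa, target)

    # Constraint term: penalize the gap between the online Q function and the
    # target network only when it exceeds the tolerance eta.  Early in training
    # the gap is often large and the penalty keeps updates conservative; as the
    # gap shrinks, the penalty is activated less often.
    with torch.no_grad():
        q_sa_frozen = target_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    gap = (q_sa - q_sa_frozen).abs()
    penalty = torch.clamp(gap - eta, min=0.0).mean()

    return td_loss + lambda_ * penalty

Because the penalty term vanishes once the gap stays below η, the loss reduces to the ordinary temporal-difference loss late in training, which is the sense in which the update gradually approaches ordinary Q learning.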

Keywords: constrained reinforcement learning; deep Q network; deep reinforcement learning; learning stabilization; regularization; target network.

Figures

Figure 1
MNIST maze task. The agent aims to reach goal “5” on the 3 × 3 maze by selecting an up, down, left, or right movement. The lines separating the squares are yellow, green, red, or pink, and they do not change over the episodes. It is impossible to pass through a pink wall; if the agent selects the direction of a pink wall, the movement is canceled and the agent's position does not change. If the agent reaches “5” across the green line, a +1 reward is provided, and if the agent reaches “5” across the red line, a −1 reward is provided. The agent observes the 24 × 24 pixel image of the cell in which it resides. The number assignment is fixed for all episodes, but the image for each number is changed at the onset of each episode. For example, the upper left tile is always a “1,” but the image of “1” is randomly selected from the training data set of MNIST handwritten digits at the onset of each episode.
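The transition and reward rules above can be made concrete with a short sketch. The following is a hypothetical, minimal Python rendering of those rules only; the wall layout, digit placement, and start cell are not given in the caption, so they are passed in as arguments or flagged as assumptions, and the MNIST images are stubbed with random arrays.

# Hypothetical sketch of the maze dynamics described in Figure 1; layouts and
# the start cell are assumptions, only the movement/reward rules follow the text.
import numpy as np

ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

class MnistMaze:
    def __init__(self, digit_of_cell, pink_walls, green_edges, red_edges, goal_digit=5):
        self.digit_of_cell = digit_of_cell  # {(row, col): digit}, fixed across all episodes
        self.pink_walls = pink_walls        # impassable edges: {((r, c), (r2, c2)), ...}
        self.green_edges = green_edges      # edges into the goal that give +1
        self.red_edges = red_edges          # edges into the goal that give -1
        self.goal_digit = goal_digit

    def reset(self, start=(2, 2)):
        # A new image is drawn for each digit at the start of every episode;
        # real MNIST samples are stubbed here with random 24x24 arrays.
        self.episode_images = {d: np.random.rand(24, 24) for d in self.digit_of_cell.values()}
        self.pos = start  # the start cell is an assumption; the caption does not specify it
        return self.episode_images[self.digit_of_cell[self.pos]]

    def step(self, action):
        dr, dc = ACTIONS[action]
        nxt = (self.pos[0] + dr, self.pos[1] + dc)
        edge = (self.pos, nxt)
        off_grid = not (0 <= nxt[0] < 3 and 0 <= nxt[1] < 3)
        if off_grid or edge in self.pink_walls:
            nxt = self.pos  # blocked moves are canceled; the position does not change
        reward, done = 0.0, False
        if nxt != self.pos and self.digit_of_cell[nxt] == self.goal_digit:
            # Reaching "5" is assumed to end the episode; the reward sign
            # depends on the color of the crossed line.
            reward = 1.0 if edge in self.green_edges else -1.0 if edge in self.red_edges else 0.0
            done = True
        self.pos = nxt
        return self.episode_images[self.digit_of_cell[self.pos]], reward, done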
Figure 2
Network structures for the Q function. (A) MNIST maze task. (B) Mountain-Car task. (C) Robot navigation task. Every convolutional layer is represented by its type, channel size, kernel size, and stride size. Other layers are represented by their types and dimensions.
Figure 3
Learning curves of DQN (red), Q learning (green), DQN with TC-loss (cyan), and Constrained DQN (blue) on the MNIST maze task. Here, Q learning refers to DQN without the use of experience replay or the target network. Horizontal axis denotes the number of learning episodes. Vertical axis denotes the moving average of the total reward received in each learning episode. Lightly colored zone represents the standard deviation.
Figure 4
Comparison of values of the Q function for each state-action pair on the MNIST maze task. (A) Average Q value obtained by Constrained DQN after 50,000 training steps (not training episodes). (B) Average Q value obtained by DQN. (C) True Q value. The values for the up, down, left, and right actions at each state are shown on the corresponding sides of the number's position in Figure 1.
Figure 5
Comparison of the variance of the Q function for each state-action pair on the MNIST maze task. (A) Variance of the Q value obtained by Constrained DQN after 50,000 training steps. (B) Variance of the Q value obtained by DQN. The values for the up, down, left, and right actions at each state are shown on the corresponding sides of the number's position in Figure 1.
Figure 6
Effects of the update frequency of the target network on the MNIST maze task. (A) Learning curves of DQN. (B) Learning curves of Constrained DQN with λ = 1 and η = 10⁻⁵. (C) Those of Constrained DQN with λ = 1 and η = 10⁻². (D) Those of Constrained DQN with λ = 2 and η = 10⁻⁵. Horizontal axis denotes the number of learning episodes on a logarithmic scale. Vertical axis denotes the moving average of the reward received in each learning episode. The legend indicates the update frequency of the target network. Shaded area represents the standard deviation. Each experiment was performed for 1,000,000 learning steps, and the results of up to 100,000 episodes are displayed.
Figure 7
Comparison of the number of steps in which the constraint was violated, for different update frequencies of the target network, on the MNIST maze task. Horizontal axis represents the number of learning steps. Vertical axis represents the number of steps in which the constraint was activated within a fixed number of steps. The left column shows the results for η = 10⁻⁵ and the right column those for η = 10⁻² (λ = 1 in both columns). (A) The case of C = 10,000 and η = 10⁻⁵, (B) that of C = 10,000 and η = 10⁻², (C) that of C = 1,000 and η = 10⁻⁵, (D) that of C = 1,000 and η = 10⁻², (E) that of C = 100 and η = 10⁻⁵, (F) that of C = 100 and η = 10⁻², (G) that of C = 10 and η = 10⁻⁵, and (H) that of C = 10 and η = 10⁻².
Figure 8
Total rewards across different parameter settings on the MNIST maze task. Darker colors depict lower total rewards and lighter colors depict higher ones. In each panel, the horizontal and vertical axes denote the update frequency of the target network and the λ value, respectively. (A) The case of η = 10⁻², (B) that of η = 10⁻⁵, and (C) that of η = 0.
Figure 9
Learning curves of DQN (red), Q learning (green), DQN with TC-loss (cyan), and Constrained DQN (blue) on the Mountain-Car task. Horizontal axis denotes the number of learning episodes. Vertical axis denotes the moving average of the total reward received in each learning episode. The shaded area represents the standard deviation.
Figure 10
Comparison of learning curves with different random seeds on the Mountain-Car task. Each color indicates the learning curve for one random seed. We examined DQN and our Constrained DQN with two settings of ξ, a parameter of the optimizer (RMSProp). Horizontal axis denotes the number of learning steps (not learning episodes). Vertical axis denotes the moving average of the reward received in each learning episode. (A) The learning curves of DQN (ξ = 1), (B) those of DQN (ξ = 0.01), (C) those of Constrained DQN (ξ = 1), and (D) those of Constrained DQN (ξ = 0.01).
Figure 11
Comparison of the L1-norm of the gradients of the last fully connected layer with different random seeds on the Mountain-Car task. We examined DQN and our Constrained DQN with two different settings of ξ, a parameter of the optimizer (RMSProp). Horizontal axis denotes the number of learning steps, and vertical axis denotes the moving average of the L1-norm of the gradients for the last fully connected layer. (A) The L1-norms of the gradients for DQN (ξ = 1), (B) those for DQN (ξ = 0.01), (C) those for Constrained DQN (ξ = 1), and (D) those for Constrained DQN (ξ = 0.01).
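For context, a rough sketch of where such a parameter typically enters RMSProp may help: in common formulations, a small constant is added inside the denominator, and its size bounds the effective step when the running mean of squared gradients is small. The snippet below assumes ξ plays that role; this is an illustrative reading, and the learning rate and decay values are placeholders, not the paper's settings.

# Minimal RMSProp step, assuming xi is the stabilizing constant in the
# denominator (an assumption; the paper defines xi precisely).
import numpy as np

def rmsprop_step(param, grad, mean_sq, lr=2.5e-4, rho=0.95, xi=0.01):
    # Running average of squared gradients.
    mean_sq = rho * mean_sq + (1.0 - rho) * grad ** 2
    # A larger xi caps the per-step magnitude when mean_sq is small, which
    # damps the kind of gradient spikes compared across panels in Figure 11.
    param = param - lr * grad / np.sqrt(mean_sq + xi)
    return param, mean_sq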
Figure 12
Robot navigation task. Three mobile robots (TurtleBot3 Waffle Pi), six green trash cans, and various objects were placed in the environment. The objective of each robot is to move to one of the green trash cans without colliding with other objects, including obstacles.
Figure 13
Learning curves of Constrained DQN, DQN, DQN with TC-loss, Q learning, Double DQN (DDQN), and Soft Q learning (SQL) on the robot navigation task. Here, Q learning refers to DQN without the use of experience replay. (A) Number of steps to reach the green trash can. (B) Number of collisions with obstacles. Horizontal axis denotes the number of learning episodes. Vertical axes denote, respectively, the number of steps and the number of collisions. Shaded area represents the standard deviation.
Figure 14
Average performance over ten experiments of the standard Constrained DQN (C-DQN), C-DQN with the dueling architecture (C-DQN w/ DA), C-DQN with the entropy regularization (C-DQN w/ ER), the standard TC-loss, TC-loss with the dueling architecture (TC-loss w/ DA), and TC-loss with the entropy regularization (TC-loss w/ ER). Here, TC-loss refers to DQN with TC-loss. (A) Ms. PacMan. (B) Seaquest. Shaded area represents one standard deviation from the mean.
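For readers unfamiliar with the abbreviations: the dueling architecture (DA) splits the Q function into a state-value stream and an advantage stream that are recombined into action values, and entropy regularization (ER) adds an entropy bonus to the objective in the spirit of soft Q learning. The sketch below shows only the standard dueling recombination; the layer sizes are placeholders and this is not the network used in the paper.

# Generic dueling head, shown only to illustrate the "w/ DA" variants in
# Figure 14; hidden sizes are placeholders, not the authors' architecture.
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, in_features, n_actions, hidden=512):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)                    # state value V(s)
        a = self.advantage(features)                # advantages A(s, a)
        # Subtracting the mean advantage keeps V and A identifiable.
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s, a)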
