A deep Q-network (DQN) (Mnih et al., 2013) extends Q-learning and is a representative deep reinforcement learning method. In DQN, the Q function represents the action values of all actions in all states and is approximated by a convolutional neural network. From the approximated Q function, an optimal policy can be derived. To stabilize learning, DQN introduces a target network, which computes the target value and is updated from the Q function at regular intervals. Less frequent updates of the target network make learning more stable. However, because the target value is not propagated until the target network is updated, DQN usually requires a large number of samples. In this study, we propose Constrained DQN, which uses the difference between the outputs of the Q function and the target network as a constraint on the target value. Constrained DQN updates the parameters conservatively when this difference is large and aggressively when it is small. As learning progresses, the constraint is activated less often, so the update rule gradually approaches conventional Q-learning. We found that Constrained DQN converges with fewer training samples than DQN and that it is robust against changes in the update frequency of the target network and in the settings of a certain optimizer parameter. Although Constrained DQN alone does not outperform integrated approaches or distributed methods, our experimental results show that it can be combined with those methods as an additional component.
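The constrained update described above can be illustrated with a short sketch. The following PyTorch code is a minimal, hypothetical example, not the authors' exact formulation: the hinge-style penalty on the gap between the online Q function and the target network, and the `margin` and `penalty` parameters, are assumptions used only to show how a large disagreement between the two networks suppresses the update while a small disagreement leaves the ordinary Q-learning step largely unchanged.

```python
import torch
import torch.nn.functional as F

def constrained_dqn_update(q_net, target_net, optimizer, batch,
                           gamma=0.99, margin=1.0, penalty=1.0):
    """One illustrative Constrained-DQN-style update (sketch, not the paper's exact loss).

    q_net, target_net: torch.nn.Module Q-networks mapping states to action values.
    batch: (states, actions, rewards, next_states, dones) tensors.
    margin, penalty: hypothetical constraint parameters, not values from the paper.
    """
    states, actions, rewards, next_states, dones = batch

    # Online Q-values for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Standard DQN target computed from the (infrequently updated) target network.
        next_q = target_net(next_states).max(1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q
        # Target-network output for the same state-action pairs.
        q_frozen = target_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Ordinary TD loss toward the target value.
    td_loss = F.smooth_l1_loss(q_values, td_target)

    # Constraint term: penalize the gap between the online Q function and the
    # target network once it exceeds `margin`, so the update is conservative
    # when the networks disagree strongly and nearly unconstrained otherwise.
    gap = (q_values - q_frozen).abs()
    constraint = torch.clamp(gap - margin, min=0.0).pow(2).mean()

    loss = td_loss + penalty * constraint
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As learning progresses and the online network stays close to the target network, the `constraint` term is zero for most samples and the update reduces to the usual TD step, mirroring the abstract's claim that the method gradually approaches conventional Q-learning.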
Keywords: constrained reinforcement learning; deep Q network; deep reinforcement learning; learning stabilization; regularization; target network.
Achiam J., Knight E., Abbeel P. (2019). Towards characterizing divergence in deep Q-learning. arXiv [Preprint]. arXiv:1903.08894.
Andrychowicz M., Wolski F., Ray A., Schneider J., Fong R., Welinder P., et al. (2017). “Hindsight experience replay,” in Advances in Neural Information Processing Systems, Vol. 30, eds I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Long Beach, CA: Curran Associates, Inc.), 5048–5058.
Anschel O., Baram N., Shimkin N. (2017). “Averaged-DQN: variance reduction and stabilization for deep reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning (Sydney, NSW), 176–185.
Azar M. G., Munos R., Ghavamzadeh M., Kappen H. J. (2011). “Speedy Q-learning,” in Advances in Neural Information Processing Systems, Vol. 24, eds J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Granada: Curran Associates, Inc.), 2411–2419.
Baird L. (1995). “Residual algorithms: reinforcement learning with function approximation,” in Proceedings of the 12th International Conference on Machine Learning (Montreal, QC), 30–37. doi: 10.1016/B978-1-55860-377-6.50013-X