Differentiable Logic Policy for Interpretable Deep Reinforcement Learning: A Study From an Optimization Perspective

IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):11654-11667. doi: 10.1109/TPAMI.2023.3285634. Epub 2023 Sep 5.

Abstract

The interpretability of policies remains an important challenge in Deep Reinforcement Learning (DRL). This paper explores interpretable DRL by representing policies with Differentiable Inductive Logic Programming (DILP) and provides a theoretical and empirical study of DILP-based policy learning from an optimization perspective. We first establish that DILP-based policy learning must be treated as a constrained policy optimization problem. We then propose Mirror Descent for Policy Optimization (MDPO) to handle the constraints imposed by DILP-based policies. We derive a closed-form regret bound for MDPO with function approximation, which informs the design of DRL frameworks. Moreover, we analyze the convexity of DILP-based policies to further verify the benefits gained from MDPO. Empirically, we evaluate MDPO, its on-policy variant, and three mainstream policy learning methods, and the results corroborate our theoretical analysis.
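
As a rough illustration of the style of update that mirror descent performs on a constrained policy, the sketch below applies an exponentiated-gradient (negative-entropy mirror map) step to a tabular policy on the probability simplex. The function name, step size, and advantage values are illustrative assumptions for exposition only, not the paper's MDPO implementation or its DILP representation.

    import numpy as np

    def mirror_descent_step(policy_probs, advantage, step_size=0.1):
        """One mirror descent (exponentiated-gradient) update on a tabular policy.

        With the negative-entropy mirror map, the update multiplies each action
        probability by exp(step_size * advantage) and renormalizes, so the
        iterate stays on the probability simplex without an explicit projection.
        (Hypothetical sketch; not the authors' MDPO code.)
        """
        logits = np.log(policy_probs) + step_size * advantage
        logits -= logits.max()            # subtract max for numerical stability
        new_probs = np.exp(logits)
        return new_probs / new_probs.sum()

    # Illustrative usage: 4 actions, a crude advantage estimate favoring action 2.
    policy = np.full(4, 0.25)
    advantage = np.array([-0.5, 0.1, 1.0, -0.2])
    policy = mirror_descent_step(policy, advantage)
    print(policy)                          # still a valid probability distribution

The point of the sketch is that the mirror-map update respects the simplex constraint by construction, which is the kind of constraint handling the abstract attributes to MDPO for DILP-based policies.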