Deep reinforcement learning (RL) algorithms typically parameterize the policy as a deep network that outputs either a deterministic action or a stochastic action modeled as a Gaussian distribution, which restricts learning to a single behavioral mode. Diffusion models, meanwhile, have emerged as a powerful framework for multimodal learning. However, using diffusion policies in online RL is hindered by the intractability of policy likelihood approximation, as well as by the greedy RL objective, which can easily skew the policy toward a single mode. This paper presents Deep Diffusion Policy Gradient (DDiffPG), a novel actor-critic algorithm that learns multimodal policies parameterized as diffusion models from scratch while discovering and maintaining versatile behaviors. DDiffPG explores and discovers multiple modes through off-the-shelf unsupervised clustering combined with novelty-based intrinsic motivation. It forms a multimodal training batch and uses mode-specific Q-learning to mitigate the inherent greediness of the RL objective, ensuring that the diffusion policy improves across all modes. Our approach further allows the policy to be conditioned on mode-specific embeddings, giving explicit control over the learned modes. Empirical studies validate DDiffPG's ability to master multimodal behaviors in complex, high-dimensional continuous-control tasks with sparse rewards, and showcase proof-of-concept dynamic online replanning when navigating mazes with unseen obstacles.
Method
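The abstract names two mechanisms at the core of DDiffPG: a multimodal training batch and mode-specific Q-learning. The sketch below illustrates one plausible way to realize both; it is not the authors' code, and every name in it (ModeBuffer, q_nets, policy, the transition layout) is a hypothetical placeholder.

```python
# Minimal sketch (not the authors' implementation) of (i) a multimodal training
# batch that samples every discovered mode equally, and (ii) mode-specific TD
# targets so the greedy RL objective cannot collapse the diffusion policy onto
# a single mode. All names here are hypothetical placeholders.
import random
from collections import defaultdict

import torch


class ModeBuffer:
    """Replay buffer that groups transitions by the behavior mode assigned
    to their trajectory by the clustering step."""

    def __init__(self):
        self.data = defaultdict(list)  # mode_id -> list of transitions

    def add(self, mode_id, transition):
        self.data[mode_id].append(transition)

    def sample_multimodal(self, batch_size):
        """Draw an equal share of transitions from every known mode."""
        modes = list(self.data.keys())
        per_mode = max(1, batch_size // max(1, len(modes)))
        batch = []
        for m in modes:
            batch.extend((m, t) for t in random.choices(self.data[m], k=per_mode))
        return batch


def mode_specific_td_targets(batch, q_nets, policy, gamma=0.99):
    """Compute TD targets with each transition's own mode-specific critic,
    so every mode is improved against its own value estimate."""
    targets = []
    for mode_id, (s, a, r, s_next, done) in batch:
        with torch.no_grad():
            a_next = policy(s_next, mode_id)          # mode-conditioned action
            q_next = q_nets[mode_id](s_next, a_next)  # critic of this mode
        targets.append(r + gamma * (1.0 - done) * q_next)
    return torch.stack(targets)
```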
AntMazes
DDiffPG demonstrates consistent exploration and acquisition of multiple behaviors. In the AntMaze environments, multiple paths lead to the goal, and DDiffPG learns all of them and can freely execute any of them.
AntMazes with random obstacles
When unseen obstacles block a previously learned path, the multimodal policy supports dynamic online replanning by falling back on an alternative learned path.
Robotic control
Reaching, Peg-Insertion, Drawer-Close, Cabinet-Open
DDiffPG masters distinct behaviors in challenging robotic manipulation tasks. For example, in Cabinet-Open, the agent can move the arm to either layer of the cabinet and then pull the door open.
Performance
DDiffPG achieves performance comparable to the baselines on all eight tasks while acquiring multimodal behaviors. In the AntMaze tasks, the sample efficiency of DDiffPG, DIPO, TD3, and SAC is similar. DDiffPG is generally less sample efficient than the baselines in tasks that pose significant exploration challenges; this is expected, since our method strives to discover multiple solutions rather than a single one.
Analysis
We demonstrate DDiffPG's stronger exploration via exploration density maps and state-coverage rates in the AntMaze environments. In selected results for AntMaze-v3, DDiffPG explores multiple paths to the two separate goal positions, in sharp contrast to the baselines, which typically discover only a single path.
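For concreteness, a state-coverage rate of this kind can be computed by discretizing visited positions into a grid and counting the fraction of cells visited. The sketch below assumes 2D agent positions in a maze; the grid bounds and resolution are illustrative choices, not the paper's settings.

```python
# Hedged sketch of a state-coverage metric: fraction of grid cells visited at
# least once. Bounds and bin count below are placeholders, not the paper's.
import numpy as np


def coverage_rate(xy_positions, lo=(-4.0, -4.0), hi=(4.0, 4.0), bins=32):
    """xy_positions: (N, 2) array of visited agent positions."""
    hist, _, _ = np.histogram2d(
        xy_positions[:, 0], xy_positions[:, 1],
        bins=bins, range=[(lo[0], hi[0]), (lo[1], hi[1])],
    )
    # `hist` itself is the exploration density map; the coverage rate is the
    # share of cells with at least one visit.
    return float((hist > 0).sum()) / hist.size
```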
We also show that DDiffPG effectively overcomes local minima when learning a multimodal policy. The key intuition is that, unlike methods that collapse onto the first solution they find, DDiffPG continuously explores and seeks different solutions, enabling it to escape suboptimal local minima.
Given a collection of goal-reaching trajectories, each a sequence of state-action pairs, we group them into clusters and treat each cluster as a behavior mode (in the visualizations, each color denotes a mode). In practice, we use unsupervised hierarchical clustering with a Dynamic Time Warping (DTW) distance, as sketched below.
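The following is a minimal sketch of this clustering step: pairwise DTW distances between trajectories followed by agglomerative (average-linkage) clustering. The distance threshold and the cost metric are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch: hierarchical clustering of trajectories under a DTW distance.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance between two
    trajectories given as (T, d) arrays of states (or state-action pairs)."""
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]


def cluster_trajectories(trajs, distance_threshold=10.0):
    """Assign each trajectory a mode id via average-linkage hierarchical
    clustering on the pairwise DTW distance matrix."""
    k = len(trajs)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw(trajs[i], trajs[j])
    # squareform converts the symmetric matrix to the condensed form linkage expects
    tree = linkage(squareform(dist), method="average")
    labels = fcluster(tree, t=distance_threshold, criterion="distance")
    return labels  # labels[i] is the mode of trajectory i
```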
Impact of hyper-parameters
We investigate the impact of the number of diffusion steps, the batch size, the action-gradient learning rate, and the update-to-data (UTD) ratio. These hyperparameters are of particular interest given the diffusion policy and our learning procedure.
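For reference, the knobs varied in this study can be collected in a single configuration object, as sketched below. The default values are placeholders for illustration only, not the paper's reported settings.

```python
# Hedged sketch of the hyperparameters varied in the ablation; the values are
# illustrative placeholders, not the paper's reported defaults.
from dataclasses import dataclass


@dataclass
class DDiffPGAblationConfig:
    diffusion_steps: int = 20     # number of denoising steps in the diffusion policy
    batch_size: int = 256         # size of the (multimodal) training batch
    action_grad_lr: float = 3e-4  # learning rate for the action-gradient update
    utd_ratio: int = 1            # gradient updates per environment step
```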