Dichotomous Diffusion Policy Optimization

Abstract

Diffusion-based policies have gained growing popularity for solving a wide range of decision-making tasks due to their superior expressiveness and controllable generation at inference time. However, effectively training large diffusion policies with reinforcement learning (RL) remains challenging. Existing methods either suffer from unstable training caused by directly maximizing value objectives, or face computational issues from relying on crude Gaussian likelihood approximations, which require a large number of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which yields a desirable weighted regression objective for diffusion policy extraction but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme, which naturally decomposes the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, and the other focuses on reward minimization. Under this design, optimized actions can be generated by linearly combining the scores of the dichotomous policies during inference, enabling flexible control over the level of greediness. Evaluations in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.
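The abstract's inference-time idea can be illustrated with a minimal toy sketch, assuming the paper's setup: the denoising score of the optimized policy is a linear combination of the two dichotomous policies' scores, with a weight controlling greediness. The 1-D scores, the Euler update, and all function names below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of inference-time score combination (not DIPOLE's
# actual implementation). A weight w trades off greediness: the sampler
# extrapolates toward the reward-maximizing policy's score and away from
# the reward-minimizing one.

def combined_score(s_max, s_min, w):
    # w = 0 recovers the reward-maximizing policy's score alone;
    # larger w pushes samples further from the reward-minimizing mode.
    return (1.0 + w) * s_max - w * s_min

def sample_action(w, steps=100, step_size=0.1):
    # Toy 1-D scores: each pulls the action toward a different mode.
    s_max = lambda x: 1.0 - x    # reward-maximizing policy, mode at +1
    s_min = lambda x: -1.0 - x   # reward-minimizing policy, mode at -1
    x = 0.0
    for _ in range(steps):       # toy Euler-style denoising loop
        x += step_size * combined_score(s_max(x), s_min(x), w)
    return x

print(round(sample_action(w=0.0), 2))  # ≈ 1.0: plain max-policy sample
print(round(sample_action(w=1.0), 2))  # greedier: extrapolates past +1
```

With real diffusion policies, `s_max` and `s_min` would be learned score networks evaluated at each denoising step; the key point is that the combination is a cheap inference-time operation, so greediness can be tuned without retraining.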

Publication
In the 14th International Conference on Learning Representations (ICLR 2026)
Ruiming Liang
Research Intern
Yinan Zheng
PhD Candidate
Kexin Zheng
Research Intern

Undergrad student at The Chinese University of Hong Kong, Hong Kong

Tianyi Tan
Research Intern
Jianxiong Li
PhD Candidate
Liyuan Mao
PhD student at Shanghai Jiao Tong University

Zhihao Wang
Research Intern
Xianyuan Zhan
Faculty Member