This repository contains a PyTorch implementation of Stepwise Diffusion Policy Optimization (SDPO), as presented in our paper "Aligning Few-Step Diffusion Models with Dense Reward Difference Learning".
- [2026.02] Our paper has been accepted by IEEE TPAMI 🎉🎉🎉
SDPO is a novel reinforcement learning framework tailored for aligning few-step diffusion models with downstream objectives.
Few-step diffusion models enable efficient high-resolution image synthesis but struggle to align with specific downstream objectives due to limitations of existing reinforcement learning (RL) methods in low-step regimes with limited state spaces and suboptimal sample quality. To address this, we propose Stepwise Diffusion Policy Optimization (SDPO), a novel RL framework tailored for few-step diffusion models. SDPO introduces a dual-state trajectory sampling mechanism, tracking both noisy and predicted clean states at each step to provide dense reward feedback and enable low-variance, mixed-step optimization. For further efficiency, we develop a latent similarity-based dense reward prediction strategy to minimize costly dense reward queries. Leveraging these dense rewards, SDPO optimizes a dense reward difference learning objective that enables more frequent and granular policy updates. Additional refinements, including stepwise advantage estimates, temporal importance weighting, and step-shuffled gradient updates, further enhance long-term dependency, low-step priority, and gradient stability. Our experiments demonstrate that SDPO consistently delivers superior reward-aligned results across diverse few-step settings and tasks.
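At the heart of SDPO is a regression-style objective over dense, per-step rewards. The snippet below is a minimal, hypothetical PyTorch sketch of such a dense reward difference learning objective, included only to make the idea concrete: the function name, tensor shapes, and the scale parameter `eta` are our own assumptions, not the repository's actual implementation.

```python
import torch

def dense_reward_difference_loss(logp_new, logp_old, stepwise_advantages, eta=1.0):
    """Hypothetical sketch of a dense reward difference learning objective.

    logp_new:            (B, T) log-probs of each denoising step under the current policy
    logp_old:            (B, T) log-probs of the same steps under the sampling policy
    stepwise_advantages: (B, T) stepwise advantage estimates derived from dense rewards
    eta:                 assumed regression scale hyperparameter
    """
    # Per-step log-ratio between the current policy and the sampling policy.
    log_ratio = logp_new - logp_old                                     # (B, T)

    # Pairwise differences across samples in the batch, kept separate per step.
    ratio_diff = log_ratio.unsqueeze(0) - log_ratio.unsqueeze(1)        # (B, B, T)
    adv_diff = stepwise_advantages.unsqueeze(0) - stepwise_advantages.unsqueeze(1)

    # Regress scaled log-ratio differences onto advantage differences at every step.
    return ((ratio_diff / eta - adv_diff) ** 2).mean()
```

In the paper's framing, the per-step advantages come from dense rewards (predicted via latent similarity where direct reward queries would be too costly), so a regression target of this kind is available at every denoising step rather than only at the end of the trajectory.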
To set up this repository, clone it, create a new conda environment, and install all dependencies within it:
```bash
# Clone this repository
git clone https://github.com/ZiyiZhang27/sdpo.git
cd sdpo

# Create and activate a new conda environment (Python 3.10+)
conda create -n sdpo python=3.10 -y
conda activate sdpo

# Install dependencies
pip install -e .

# Configure accelerate based on your hardware setup
accelerate config
```

We provide pre-configured setups for multiple reward functions. Choose one of the following commands to start running SDPO:
- Aesthetic Score:

  ```bash
  accelerate launch scripts/train_sdpo.py --config config/config_sdpo.py:aesthetic
  ```

- ImageReward:

  ```bash
  accelerate launch scripts/train_sdpo.py --config config/config_sdpo.py:imagereward
  ```

- HPSv2:

  ```bash
  accelerate launch scripts/train_sdpo.py --config config/config_sdpo.py:hpsv2
  ```

- PickScore:

  ```bash
  accelerate launch scripts/train_sdpo.py --config config/config_sdpo.py:pickscore
  ```
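The launch commands above use whatever hardware setup you chose during `accelerate config`. If you need to override the process count for a particular run, Accelerate's standard `--num_processes` flag can be passed at launch time (the Aesthetic Score task is used here only as an example):

```bash
# Example: run the Aesthetic Score task on 2 GPUs instead of the configured count
accelerate launch --num_processes 2 scripts/train_sdpo.py --config config/config_sdpo.py:aesthetic
```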
💡 Tip: You can modify hyperparameters in the configuration files as needed:
- `config/base_sdpo.py` - Base configuration with default values
- `config/config_sdpo.py` - Task-specific configurations
Note: Task-specific values in `config/config_sdpo.py` override the defaults in `config/base_sdpo.py`. The default configuration is optimized for 4× GPUs with 24GB+ memory each; adjust batch sizes and gradient accumulation steps to fit your hardware, as illustrated in the sketch below.
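As a rough illustration, the following sketch shows how a task-specific entry in `config/config_sdpo.py` might override the base defaults for smaller GPUs, assuming the ml_collections-style config layout used by DDPO-PyTorch; the exact field names (e.g., `sample.batch_size`, `train.gradient_accumulation_steps`) are assumptions, so check the actual configuration files before editing.

```python
# Hypothetical excerpt of a task-specific config in config/config_sdpo.py.
# Field names follow the DDPO-PyTorch ml_collections convention and are
# assumptions -- consult config/base_sdpo.py for the real names.
from config import base_sdpo

def aesthetic():
    # Start from the shared defaults, then apply task-specific overrides.
    config = base_sdpo.get_config()
    config.reward_fn = "aesthetic_score"

    # Smaller GPUs: shrink per-GPU batch sizes and recover the effective
    # batch size through more gradient accumulation steps.
    config.sample.batch_size = 4
    config.train.batch_size = 1
    config.train.gradient_accumulation_steps = 8
    return config
```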
If you find this work useful in your research, please consider citing our paper:
```bibtex
@article{zhang2026sdpo,
  title={Aligning Few-Step Diffusion Models with Dense Reward Difference Learning},
  author={Zhang, Ziyi and Shen, Li and Zhang, Sen and Ye, Deheng and Luo, Yong and Shi, Miaojing and Shan, Dongjing and Du, Bo and Tao, Dacheng},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2026}
}
```

This repository builds upon several excellent open-source projects:
- DDPO-PyTorch - Foundation for RL-based diffusion model finetuning
- D3PO - Foundation for DPO-based diffusion model finetuning
- RLCM - DDPO and REBEL implementations for LCM finetuning
- ImageReward, HPSv2, and PickScore - Reward function implementations
We thank the authors of these projects for their valuable contributions to the community.

