
Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation

1Peking University; 2Beijing Academy of Artificial Intelligence (BAAI); 3University of Sydney; 4Institute of Automation;
*Equal contribution; Project leaders; Corresponding author
Project Video

 

Abstract

The primary obstacle to applying reinforcement learning (RL) in real-world robotics is the design of effective reward functions. While recently proposed learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these limitations, we introduce Dopamine-Reward, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building on Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically sound Policy-Invariant Reward Shaping method, enabling the agent to exploit dense rewards for efficient self-improvement without altering the optimal policy and thereby fundamentally avoiding the semantic trap. Extensive experiments across diverse simulated and real-world tasks validate our approach: GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in one shot from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from a near-zero success rate to 95% with only 150 online rollouts, while retaining strong generalization across tasks.

Overview of Robo-Dopamine

Robo-Dopamine integrates large-scale reward modeling with a robust policy learning algorithm. (Left) We construct a General Reward Model (GRM) using our Dopamine-Reward, a step-aware, fine-grained reward modeling method, and train it on a large, diverse 35M-sample dataset spanning real-world robotic, simulation, and human-centric videos. The GRM learns to predict fine-grained, relative progress between states and thus to accurately assess task progression. (Bottom Right) The pre-trained GRM is adapted to new tasks and provides dense reward signals to our Dopamine-RL framework. Using a theoretically sound Policy-Invariant Reward Shaping method, Dopamine-RL efficiently guides the policy during online interaction without misaligning the task objective. (Top Right) The integrated approach establishes a new state-of-the-art in reward accuracy (radar chart) and trains efficiently, significantly boosting policy success rates in both simulation and the real world (bar chart).
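This page does not spell out the shaping rule, but "policy-invariant reward shaping" is classically realized with potential-based shaping (Ng et al., 1999), which provably leaves the optimal policy unchanged. The minimal Python sketch below assumes that formulation and uses the GRM's fused progress estimate as the potential; the names grm_progress and shaped_reward are illustrative placeholders, not the paper's actual API.

GAMMA = 0.99  # discount factor of the underlying RL algorithm

def grm_progress(observation) -> float:
    """Placeholder for the GRM's fused task-progress prediction in [0, 1]."""
    raise NotImplementedError

def shaped_reward(env_reward: float, obs, next_obs, done: bool) -> float:
    """Add a potential-based shaping bonus to the (possibly sparse) env reward."""
    phi_s = grm_progress(obs)
    # The potential of terminal states is conventionally set to 0 so that the
    # shaping terms telescope over an episode and cannot change the optimum.
    phi_next = 0.0 if done else grm_progress(next_obs)
    return env_reward + GAMMA * phi_next - phi_s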

Method of Robo-Dopamine

Robo-Dopamine is composed of two core components: (a) the Dopamine-Reward Modeling Method and (b) the Dopamine-RL Training Framework. (a) At the heart of our reward modeling is the General Reward Model (GRM), a vision-language model that is prompted with a task description and conditioned on multi-view images of the initial, goal, "BEFORE," and "AFTER" states to predict a relative progress (or regress) hop. To ensure a stable and accurate signal, we employ Multi-Perspective Progress Fusion, which combines incremental, forward-anchored, and backward-anchored predictions into a final fused reward. (b) The Dopamine-RL framework first adapts the pre-trained GRM to a novel task from a single demonstration (One-Shot GRM Adaptation). It then uses a theoretically sound Policy-Invariant Reward Shaping method to convert the GRM's dense output into a reward signal that accelerates learning without altering the optimal policy, and it is compatible with a wide range of RL algorithms.
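The exact rule behind Multi-Perspective Progress Fusion is not given on this page; the sketch below shows one plausible reading in which the incremental, forward-anchored, and backward-anchored estimates are mapped onto a common progress scale and simply averaged. The names (FusionInputs, fuse_progress) and the averaging rule itself are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class FusionInputs:
    hop: float               # predicted progress change from BEFORE to AFTER
    progress_before: float   # running progress estimate of the BEFORE state
    forward_anchored: float  # progress of AFTER measured from the initial state
    backward_anchored: float # progress of AFTER implied by distance to the goal

def fuse_progress(x: FusionInputs) -> float:
    """Combine three views of AFTER's progress into a single estimate in [0, 1]."""
    incremental = x.progress_before + x.hop  # local hop added to the running estimate
    fused = (incremental + x.forward_anchored + x.backward_anchored) / 3.0
    return min(max(fused, 0.0), 1.0)         # clamp to a valid progress value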

Overview of GRM Training Data

(Left) The hierarchical composition of our 35M-sample training corpus. The dataset is derived from episodes spanning Real-World Robotics, Simulation, and Human-Centric domains, and is further expanded via multi-view augmentation. (Right) The long-tail distribution of task categories sorted by episode count (log scale). The dataset covers a broad spectrum of manipulation skills, ranging from atomic primitives to complex, multi-stage, long-horizon tasks.
GRM Evaluation & Results

Evaluation on Different Data Sources

Fold the Pants (AgiLex).

Clean the Table (AgiLex).

Fill the Mug (RoboCasa).

Close the Drawer (RoboCasa).

Place Bowl on the Plate (LIBERO).

Open the Drawer (LIBERO).

Stack the Blocks (Franka).

Stack the Blocks (Human).

Evaluation on Different Sampling Intervals

Sampling Interval = 100

Sampling Interval = 50

Sampling Interval = 25

Sampling Interval = 10

Analysis of Trajectory VOC and Status Detection

(Left) We evaluate reward models under three temporal sampling strategies: Sparse (S), Medium (M), and Dense (D). Our GRM variants (Ours-3B and Ours-8B) consistently outperform prior work; notably, the Ours-8B (Multi-View) model sets a new state-of-the-art across all benchmarks and sampling densities, showing strong robustness and progress understanding. (Right) Task completion classification accuracy, reported as the number of correct classifications out of 60 rollouts. Our GRM classifies the final outcomes of robot rollouts more accurately than both specialized reward models and large generalist models.

A Challenging Real-World Rollout (Insert the Square Block)

We plot three reward signals along the same trajectory: the reference reward from human annotations, the VLAC baseline, and our GRM. Our GRM tracks the reference signal most faithfully, sharply penalizing incorrect insertions, low positions, and misalignments, and assigning high reward only near successful task completion.
Real-World RL Demos

Insert the Square Block.

Trigger the Circuit.

Cap the Pen.

Human-in-the-Loop Hardware Setup

Our multi-view hardware platform combines the Pika teleoperation system with calibrated ZED cameras, providing synchronized wrist and third-person observations for GRM training and policy learning.

Analysis of Generalization and Efficiency

(Left) The table compares success counts (out of 20 trials) for Behavioral Cloning (BC) and our method. The final row, Avg. Relative Drop (∆), quantifies the average relative drop in success rate from in-distribution (ID) performance when tested in out-of-distribution (OOD) settings. (Right) Dopamine-RL achieves significantly higher performance with fewer human demonstrations. Sample efficiency is measured as the number of episodes needed to reach 80% of the final success rate (lower is better).
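As a concrete reading of this sample-efficiency metric, the short sketch below counts the first episode at which a success-rate curve reaches 80% of its final value; the list-of-per-episode-success-rates input is an assumption made purely for illustration.

from typing import Sequence

# Sketch of the sample-efficiency metric above: the number of online episodes
# needed to first reach 80% of the run's final success rate (lower is better).
def episodes_to_reach_fraction(success_rates: Sequence[float],
                               fraction: float = 0.8) -> int:
    """Return the 1-indexed episode at which the curve first reaches
    `fraction` of its final value, or len(success_rates) if it never does."""
    threshold = fraction * success_rates[-1]
    for episode, rate in enumerate(success_rates, start=1):
        if rate >= threshold:
            return episode
    return len(success_rates)

# Example: a policy whose success rate converges to 0.95.
curve = [0.0, 0.1, 0.3, 0.55, 0.78, 0.9, 0.95]
print(episodes_to_reach_fraction(curve))  # -> 5, since 0.78 >= 0.8 * 0.95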

Robustness to Artificial Disturbance

We visualize a rollout of the converged policy (Insert the Square Block, success rate > 95%) under human interference. Each sub-figure shows the third-person view, the ego-centric view, and the real-time GRM inference (Top: Hop, Bottom: Progress). (a) Artificial Disturbance Position: A human hand intervenes and shifts the target board while the robot is approaching it. (b) Fall Into Misalignment: The robot misses the new position; note that the GRM Progress curve drops significantly (red dot in the bottom inset), reflecting the failure state. (c) Misalignment Recovery: The policy reacts to the visual feedback and the drop in reward, adjusting the end-effector position. (d) Move to the Top: The robot realigns directly above the target slot. (e) Align with the Slot: Precise fine-tuning before insertion. (f) Successful Insertion: The task is completed, with the progress estimation reaching its peak.

Citation

If you find our work helpful, feel free to cite it:


@article{tan2025robodopamine,
    title={Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation}, 
    author={Tan, Huajie and Chen, Sixiang and Xu, Yijie and Wang, Zixiao and Ji, Yuheng and Chi, Cheng and Lyu, Yaoxu and Zhao, Zhongxia and Chen, Xiansheng and Co, Peterson and Xie, Shaoxuan and Yao, Guocai and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang},
    journal={arXiv preprint arXiv:2512.23703},
    year={2025}
}