
Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation

1Peking University; 2Beijing Academy of Artificial Intelligence (BAAI); 3University of Sydney; 4Institute of Automation;
*Equal contribution; Project leaders; Corresponding author
Project Video

 

Abstract

The primary obstacle to applying reinforcement learning (RL) in real-world robotics is the design of effective reward functions. While recently proposed learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these limitations, we introduce Dopamine-Reward, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building on Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically sound Policy-Invariant Reward Shaping method, enabling the agent to exploit dense rewards for efficient self-improvement without altering the optimal policy and thereby fundamentally avoiding the semantic trap. Extensive experiments across diverse simulated and real-world tasks validate our approach: GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in one shot from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from a near-zero success rate to 95% with only 150 online rollouts, while retaining strong generalization across tasks.

Overview of Robo-Dopamine

Robo-Dopamine integrates large-scale reward modeling with a robust policy learning algorithm. (Left) We construct a General Reward Model (GRM) using our Dopamine-Reward, a step-aware, fine-grained reward modeling method, and train it on a large, diverse 35M-sample dataset spanning real-world robotic, simulation, and human-centric videos. The GRM learns to predict fine-grained, relative progress between states and thus to accurately assess task progression. (Bottom Right) The pre-trained GRM is adapted to new tasks and provides dense reward signals to our Dopamine-RL framework. Using a theoretically sound Policy-Invariant Reward Shaping method, Dopamine-RL efficiently guides the policy during online interaction without misaligning the task objective. (Top Right) The integrated approach establishes a new state-of-the-art in reward accuracy (radar chart) and trains efficiently, significantly boosting policy success rates in both simulation and the real world (bar chart).
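This page does not spell out the shaping rule, but "policy-invariant reward shaping" is classically realized with potential-based shaping (Ng et al., 1999), which provably leaves the optimal policy unchanged. The minimal Python sketch below assumes that formulation and uses the GRM's fused progress estimate as the potential; the names grm_progress and shaped_reward are illustrative placeholders, not the paper's actual API.

GAMMA = 0.99  # discount factor of the underlying RL algorithm

def grm_progress(observation) -> float:
    """Placeholder for the GRM's fused task-progress prediction in [0, 1]."""
    raise NotImplementedError

def shaped_reward(env_reward: float, obs, next_obs, done: bool) -> float:
    """Add a potential-based shaping bonus to the (possibly sparse) env reward."""
    phi_s = grm_progress(obs)
    # The potential of terminal states is conventionally set to 0 so that the
    # shaping terms telescope over an episode and cannot change the optimum.
    phi_next = 0.0 if done else grm_progress(next_obs)
    return env_reward + GAMMA * phi_next - phi_s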

Method of Robo-Dopamine

Robo-Dopamine is composed of two core components: (a) the Dopamine-Reward Modeling Method and (b) the Dopamine-RL Training Framework. (a) At the heart of our reward modeling is the General Reward Model (GRM), a vision-language model that is prompted with a task description and conditioned on multi-view images of the initial, goal, "BEFORE," and "AFTER" states to predict a relative progress (or regress) hop. To ensure a stable and accurate signal, we employ Multi-Perspective Progress Fusion, which combines incremental, forward-anchored, and backward-anchored predictions into a final fused reward. (b) The Dopamine-RL framework first adapts the pre-trained GRM to a novel task from a single demonstration (One-Shot GRM Adaptation). It then uses a theoretically sound Policy-Invariant Reward Shaping method to convert the GRM's dense output into a reward signal that accelerates learning without altering the optimal policy, and it is compatible with a wide range of RL algorithms.
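The exact rule behind Multi-Perspective Progress Fusion is not given on this page; the sketch below shows one plausible reading in which the incremental, forward-anchored, and backward-anchored estimates are mapped onto a common progress scale and simply averaged. The names (FusionInputs, fuse_progress) and the averaging rule itself are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class FusionInputs:
    hop: float               # predicted progress change from BEFORE to AFTER
    progress_before: float   # running progress estimate of the BEFORE state
    forward_anchored: float  # progress of AFTER measured from the initial state
    backward_anchored: float # progress of AFTER implied by distance to the goal

def fuse_progress(x: FusionInputs) -> float:
    """Combine three views of AFTER's progress into a single estimate in [0, 1]."""
    incremental = x.progress_before + x.hop  # local hop added to the running estimate
    fused = (incremental + x.forward_anchored + x.backward_anchored) / 3.0
    return min(max(fused, 0.0), 1.0)         # clamp to a valid progress value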

Overview of GRM Training Data

(Left) The hierarchical composition of our 35M-sample training corpus. The dataset is derived from episodes spanning Real-World Robotics, Simulation, and Human-Centric domains, and is further expanded via multi-view augmentation. (Right) The long-tail distribution of task categories sorted by episode count (log scale). The dataset covers a broad spectrum of manipulation skills, ranging from atomic primitives to complex, multi-stage, long-horizon tasks.
GRM Evaluation & Results

Evaluation on Different Data Sources

Fold the Pants (AgiLex).

Clean the Table (AgiLex).

Fill the Mug (RoboCasa).

Close the Drawer (RoboCasa).

Place Bowl on the Plate (LIBERO).

Open the Drawer (LIBERO).

Stack the Blocks (Franka).

Stack the Blocks (Human).

Evaluation on Different Sampling Intervals

Sampling Interval = 100

Sampling Interval = 50

Sampling Interval = 25

Sampling Interval = 10

Analysis of Trajectory VOC and Status Detection

(Left) We evaluate reward models under three temporal sampling strategies: Sparse (S), Medium (M), and Dense (D). Our GRM variants (Ours-3B and Ours-8B) consistently outperform prior work; notably, the Ours-8B (Multi-View) model sets a new state-of-the-art across all benchmarks and sampling densities, showing strong robustness and progress understanding. (Right) Task completion classification accuracy, reported as the number of correct classifications out of 60 rollouts. Our GRM classifies the final outcomes of robot rollouts more accurately than both specialized reward models and large generalist models.

A Challenging Real-World Rollout (Insert the Square Block)

We plot three reward signals along the same trajectory: the reference reward from human annotations, the VLAC baseline, and our GRM. Our GRM tracks the reference signal most faithfully, sharply penalizing incorrect insertions, low positions, and misalignments, and assigning high reward only near successful task completion.
Real-World RL Demos

Insert the Square Block.

Trigger the Circuit.

Cap the Pen.

Human-in-the-Loop Hardware Setup

Our multi-view hardware platform combines the Pika teleoperation system with calibrated ZED cameras, providing synchronized wrist and third-person observations for GRM training and policy learning.

Analysis of Generalization and Efficiency

(Left) The table compares success counts (out of 20 trials) for Behavioral Cloning (BC) and our method. The final row, Avg. Relative Drop (∆), quantifies the average relative drop in success rate from in-distribution (ID) performance when tested in out-of-distribution (OOD) settings. (Right) Dopamine-RL achieves significantly higher performance with fewer human demonstrations. Sample efficiency is measured as the number of episodes needed to reach 80% of the final success rate (lower is better).
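As a concrete reading of this sample-efficiency metric, the short sketch below counts the first episode at which a success-rate curve reaches 80% of its final value; the list-of-per-episode-success-rates input is an assumption made purely for illustration.

from typing import Sequence

# Sketch of the sample-efficiency metric above: the number of online episodes
# needed to first reach 80% of the run's final success rate (lower is better).
def episodes_to_reach_fraction(success_rates: Sequence[float],
                               fraction: float = 0.8) -> int:
    """Return the 1-indexed episode at which the curve first reaches
    `fraction` of its final value, or len(success_rates) if it never does."""
    threshold = fraction * success_rates[-1]
    for episode, rate in enumerate(success_rates, start=1):
        if rate >= threshold:
            return episode
    return len(success_rates)

# Example: a policy whose success rate converges to 0.95.
curve = [0.0, 0.1, 0.3, 0.55, 0.78, 0.9, 0.95]
print(episodes_to_reach_fraction(curve))  # -> 5, since 0.78 >= 0.8 * 0.95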

Robustness to Artificial Disturbance

We visualize a rollout of the converged policy (Insert the Square Block, success rate > 95%) under human interference. Each sub-figure shows the third-person view, the ego-centric view, and the real-time GRM inference (Top: Hop, Bottom: Progress). (a) Artificial Disturbance Position: A human hand intervenes and shifts the target board while the robot is approaching it. (b) Fall Into Misalignment: The robot misses the new position; note that the GRM Progress curve drops significantly (red dot in the bottom inset), reflecting the failure state. (c) Misalignment Recovery: The policy reacts to the visual feedback and the drop in reward, adjusting the end-effector position. (d) Move to the Top: The robot realigns directly above the target slot. (e) Align with the Slot: Precise fine-tuning before insertion. (f) Successful Insertion: The task is completed, with the progress estimation reaching its peak.

Citation

If you find our work helpful, feel free to cite it:


@article{tan2025robodopamine,
    title={Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation}, 
    author={Tan, Huajie and Chen, Sixiang and Xu, Yijie and Wang, Zixiao and Ji, Yuheng and Chi, Cheng and Lyu, Yaoxu and Zhao, Zhongxia and Chen, Xiansheng and Co, Peterson and Xie, Shaoxuan and Yao, Guocai and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang},
    journal={arXiv preprint arXiv:2512.23703},
    year={2025}
}