Leaderboard

OPD metrics on RoboChallenge Table30. Each model runs 10 rollouts per task.

Benchmark: RoboChallenge Table30
Last Updated: March 20, 2026
Evaluator: Robo-Dopamine (GRM-2.0-8B-Preview)

Evaluator note: The leaderboard scores on this page are produced with Robo-Dopamine (GRM-2.0-8B-Preview). Our current RoboChallenge results are reproduced by auditing publicly available execution videos from the official benchmark release. We warmly welcome stronger PRM evaluators from the community to improve robustness, coverage, and fairness of dense trajectory auditing.

Methodology and metric definitions are described in our technical blog, and we will keep this leaderboard synchronized as new benchmark collaborations go live.

Community Call For Transparent Evaluation

We encourage benchmark organizers and model developers to release transparent execution videos together with leaderboard submissions. Open rollout evidence makes it possible to inspect not only whether a policy succeeded, but also how it succeeded or failed.

PRM-as-a-Judge is not limited to RoboChallenge. We are actively looking to collaborate with other benchmark teams and researchers to co-build a broader, cross-benchmark, transparent evaluation ecosystem. We are happy to provide dense OPD-based auditing reports (progress depth, execution quality, and failure diagnosis) on shared trajectories, and we also welcome stronger PRM models from the community to co-develop this open evaluation stack.

We welcome collaboration and feedback from the robotics community. If you have questions or would like to work with us on dense trajectory evaluation, please reach out at jiyuheng2023@ia.ac.cn.

RoboChallenge Table30 — Overall Performance

Average OPD metrics across all 30 tasks. Sorted by Avg MC@100 (task completion rate).
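The aggregation described above (per-task metrics averaged across tasks, then models ranked by Avg MC@100) can be sketched roughly as follows. This is only an illustrative sketch, not the leaderboard's actual pipeline: the function name `rank_models`, the nested-dictionary layout, the model names, the task names, and all numeric values are hypothetical placeholders, and MC@100 is treated simply as a per-task number in [0, 1].

```python
from statistics import mean


def rank_models(per_task_scores, sort_metric="MC@100"):
    """Average each metric over tasks and rank models, best first.

    per_task_scores: {model_name: {task_name: {metric_name: value}}}
    Assumes (hypothetically) that every task reports the same metrics.
    """
    rows = []
    for model, tasks in per_task_scores.items():
        # Take the metric names from the first task of this model.
        metric_names = list(next(iter(tasks.values())).keys())
        row = {"model": model}
        for name in metric_names:
            # Unweighted mean over tasks, matching "average across all tasks".
            row[f"Avg {name}"] = mean(task[name] for task in tasks.values())
        rows.append(row)
    # Higher completion rate ranks first.
    return sorted(rows, key=lambda r: r[f"Avg {sort_metric}"], reverse=True)


# Placeholder data: two made-up models on two made-up tasks.
example_scores = {
    "model_a": {
        "task_1": {"MC@100": 0.25},
        "task_2": {"MC@100": 0.75},
    },
    "model_b": {
        "task_1": {"MC@100": 1.0},
        "task_2": {"MC@100": 0.5},
    },
}

if __name__ == "__main__":
    for row in rank_models(example_scores):
        print(row["model"], row["Avg MC@100"])
```

An unweighted mean is used here because every task has the same number of rollouts (10 per task), so averaging per-task rates and pooling all rollouts would give the same result for a rate-style metric.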

Per-Task Results

Select a task to view OPD metrics for each model on that specific task.


Citation

If this leaderboard or evaluation pipeline helps your work, please cite us:

@article{ji2026prmjudge,
  title   = {PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing},
  author  = {Ji, Yuheng and Liu, Yuyang and Tan, Huajie and Huang, Xuchuan and Huang, Fanding and Xu, Yijie and Chi, Cheng and Zhao, Yuting and Lyu, Huaihai and Co, Peterson and others},
  journal = {arXiv preprint arXiv:2603.21669},
  year    = {2026}
}