World Model as Evaluator: Scoring Criteria

Human evaluation protocol for submissions that use a world model (WM) as a VLA evaluator. Each video receives a score from 0 to 3.

Evaluation method

Annotators play each generated video alongside the reference (ground-truth, GT) video, which serves as the baseline. They judge whether the generated video agrees with the reference in outcome: both should show success, or both should show failure.

Each evaluated video receives a score from 0 to 3, following the detailed criteria below. Reference videos for each task are provided with the official dataset.

Aggregation & final score

  • For each team and each task, scores are collected from three annotators. Each annotator may be assigned up to 10 videos.
  • The per-task score is a weighted average over all scored videos in that task: (3×N₃ + 2×N₂ + 1×N₁) / N_total, where N_k is the number of videos scored k and N_total is the total number of videos, including those scored 0. If a submission contains fewer videos than the GT set for that task, the denominator is the total number of GT videos for that task, so missing videos contribute nothing to the numerator.
  • The team's final score is the average of the per-task scores across the 8 tasks.
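
The aggregation rule above can be sketched in Python as follows. This is an illustration only, not the organizers' scoring script. The function names `per_task_score` and `final_score` are hypothetical, the sketch assumes that all annotator scores for a task are pooled into one list, and it assumes that the denominator is the larger of the scored-video count and the GT video count (the source only specifies the case where the submission has fewer videos than GT).

```python
from collections import Counter

NUM_TASKS = 8  # the source states that the final score averages 8 tasks


def per_task_score(scores, gt_total):
    """Weighted average (3*N3 + 2*N2 + 1*N1) / N_total for one task.

    scores   -- list of integer scores (0..3), one per scored video
                (assumption: scores from all annotators are pooled)
    gt_total -- number of GT videos for this task
    """
    for s in scores:
        if s not in (0, 1, 2, 3):
            raise ValueError(f"score must be an integer in 0..3, got {s!r}")

    counts = Counter(scores)
    weighted_sum = 3 * counts[3] + 2 * counts[2] + 1 * counts[1]

    # If fewer videos were submitted than exist in GT, divide by the GT count.
    # Using max() for the opposite case is an assumption of this sketch.
    n_total = max(len(scores), gt_total)
    if n_total == 0:
        raise ValueError("task has no videos")
    return weighted_sum / n_total


def final_score(task_scores):
    """Average of the per-task scores across all tasks."""
    if len(task_scores) != NUM_TASKS:
        raise ValueError(f"expected {NUM_TASKS} per-task scores")
    return sum(task_scores) / NUM_TASKS
```

For example, `per_task_score([3, 3, 2, 1, 0], gt_total=5)` gives 9 / 5 = 1.8, while `per_task_score([3, 3], gt_total=4)` gives 6 / 4 = 1.5 because the two missing videos still count in the denominator.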

Score definitions & reference examples

In each example below, the first row is the real-robot reference video and the second row is the video to be evaluated.

3 points

The action matches the reference, the final object state matches the reference, no object visibly deforms during the episode, and physics and collisions look plausible.

Example (score 3)

Reference: insert the ear into the box. The evaluated video also completes the insertion, with no noticeable deformation along the way → 3.

2 points

The final object state matches the reference, but the trajectory shows visible object deformation, unrealistic physics or collisions, or motion that is inconsistent with the reference.

Example A (score 2)

Reference: place the banana in the basket. The evaluated video ends with the banana in the basket (state matches), but the banana deforms during the process → 2.

Example B (score 2)

Reference: the banana is not placed in the basket. The evaluated video also ends in failure (state matches), but the banana and other objects deform or interact unnaturally during the process → 2.

1 point

The robot arm's overall motion is broadly aligned with the reference, but the final object state disagrees with it, for example a success vs. failure mismatch or an object that disappears.

Example A (score 1)

Reference: the banana is not in the basket. The evaluated video ends with the banana in the basket (state mismatch), while the arm motion still resembles the reference → 1.

Example B (score 1)

Reference: the banana is in the basket. The evaluated video shows the grasp, but the banana disappears from the basket at the end (state mismatch), while the arm motion looks broadly normal → 1.

0 points

Both the action and the final object state disagree with the reference. This also covers videos with severely chaotic visuals, where the arm motion is incoherent relative to the reference and the final state is inconsistent as well.
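
The four score definitions can be read as a small decision procedure. The sketch below is one possible reading and is not part of the official protocol: the function `rubric_score` and its three boolean inputs are hypothetical, and collapsing each human judgment into a single yes/no flag is a simplifying assumption. In particular, it treats the final object state as the first branch point, which matches the definitions above but leaves borderline cases to the annotator.

```python
def rubric_score(action_matches, final_state_matches, artifacts_visible):
    """Map three annotator judgments to a 0..3 score (illustrative only).

    action_matches      -- arm motion is broadly consistent with the reference
    final_state_matches -- final object state agrees with the reference
                           (both success or both failure)
    artifacts_visible   -- visible deformation or implausible
                           physics/collisions during the episode
    """
    if final_state_matches:
        # Same outcome: full marks only if the trajectory is also clean.
        if action_matches and not artifacts_visible:
            return 3
        return 2
    # Outcome differs: partial credit only if the motion still resembles
    # the reference; otherwise both action and state disagree.
    return 1 if action_matches else 0
```

Under this reading, a video that reaches the correct final state but shows a deforming banana maps to `rubric_score(True, True, True)`, which returns 2, in line with Example A for 2 points.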