Evaluation Criteria | World Model Track | GigaBrain Challenge 2026

Evaluation method

Annotators watch each generated video alongside its reference (ground-truth, GT) video, which serves as the baseline. They judge whether the generated video is consistent with the reference in outcome: both should show success, or both should show failure.

Each evaluated video receives a score from 0 to 3; the detailed criteria are given below. Reference videos for each task are provided with the official dataset.

Aggregation & final score

For each team and each task, scores are collected from three annotators. Each annotator may be assigned up to 10 videos.
The per-task score is a weighted average over all scored videos in that task: (3×N₃ + 2×N₂ + 1×N₁) / N_total, where N_k is the number of videos scored k and N_total is the total number of videos, including those scored 0. If a submission contains fewer videos than the GT set for that task, N_total is the number of GT videos for that task.
The team's final score is the average of the per-task scores across the 8 tasks.
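
The aggregation described above can be sketched in Python as follows. This is a minimal illustration, not the organizers' scoring script. The function names per_task_score and final_score are hypothetical, the sketch assumes one score per video, and it assumes that a submission with more videos than GT is divided by its own video count (the rules only state the fewer-than-GT case).

```python
from collections import Counter


def per_task_score(scores, gt_video_count):
    """Weighted average (3*N3 + 2*N2 + 1*N1) / N_total for one task.

    scores: one integer in {0, 1, 2, 3} per scored video (assumed layout).
    gt_video_count: number of GT videos for the task.
    """
    for s in scores:
        if s not in (0, 1, 2, 3):
            raise ValueError(f"score must be 0, 1, 2 or 3, got {s!r}")

    counts = Counter(scores)
    # If fewer videos were submitted than GT, the GT count is the denominator.
    # Using max() for the opposite case is an assumption of this sketch.
    n_total = max(len(scores), gt_video_count)
    if n_total == 0:
        raise ValueError("task has no videos")

    weighted_sum = 3 * counts[3] + 2 * counts[2] + 1 * counts[1]
    return weighted_sum / n_total


def final_score(task_scores):
    """Plain average of the per-task scores (8 tasks in this challenge)."""
    if not task_scores:
        raise ValueError("no task scores given")
    return sum(task_scores) / len(task_scores)
```

For example, a task with 4 GT videos for which only two videos were submitted, scored 3 and 2, yields (3 + 2) / 4 = 1.25.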

Score definitions & reference examples

In each example below, the first row is the real-robot reference video; the second row is the video to be evaluated.

3 points

The action matches the reference, the final object state matches, there is no obvious object deformation during the episode, and physics and collisions look plausible.

Example (score 3)

Reference: insert the ear into the box. The evaluated video also completes the insertion without noticeable deformation → 3.

2 points

The final object state matches the reference, but during the trajectory there is visible object deformation, unrealistic physics or collisions, or motion that is inconsistent with the reference.

Example A (score 2)

Reference: the banana is placed in the basket. The evaluated video ends with the banana in the basket (state matches), but the banana deforms during the process → 2.

Example B (score 2)

Reference: the banana is not placed in the basket. The evaluated video also ends in failure (consistent outcome), but objects deform or interact unnaturally during the process → 2.

1 point

The robot arm's motion is broadly aligned with the reference, but the final object state disagrees with it, for example a success vs. failure mismatch or an object that disappears.

Example A (score 1)

Reference: the banana is not in the basket. The evaluated video ends with the banana in the basket (wrong outcome), while the arm motion still resembles the reference → 1.

Example B (score 1)

Reference: the banana is in the basket. The evaluated video shows the grasp, but the banana has disappeared from the basket by the end (wrong state), while the arm motion looks normal → 1.

0 points

Both the action and the final object state disagree with the reference. This also covers videos with severely chaotic visuals in which the arm motion is incoherent relative to the reference and the final state is inconsistent as well.
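
The four score levels can be summarized as a small decision function. This is only a sketch of how the rubric reads, since the actual scores come from human annotators watching the videos. The function name rubric_score and its three boolean inputs are hypothetical simplifications: action_matches collapses "matches" and "broadly aligned" into one flag, and has_artifacts stands for any visible deformation or implausible physics or collisions.

```python
def rubric_score(action_matches: bool,
                 final_state_matches: bool,
                 has_artifacts: bool) -> int:
    """Map three annotator judgments to a 0-3 score (illustrative only)."""
    if final_state_matches:
        # Outcome agrees with the reference: 3 only if the motion also
        # matches and nothing looks deformed or physically implausible.
        if action_matches and not has_artifacts:
            return 3
        return 2
    # Outcome disagrees with the reference: 1 if the arm motion is still
    # broadly aligned, otherwise 0.
    return 1 if action_matches else 0
```

Under this reading, the final object state decides between the upper pair (3 or 2) and the lower pair (1 or 0), and the motion and visual quality decide within each pair.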