Evaluations
Run, view, and compare evaluation results across models and tasks.
| Run Name | Model | Agent | Tasks | Accuracy | Duration | Status | Date | Actions |
|---|---|---|---|---|---|---|---|---|
| live_eval_0320 | qwen3.5-9b-lora-v3 | planner-grounder | 6/7 | 85.7% | 42 min | Completed | Mar 20, 2026 09:15 | |
| live_eval_0318 | qwen3.5-9b-lora-v2 | planner-grounder | 5/7 | 71.4% | 38 min | Completed | Mar 18, 2026 14:30 | |
| live_eval_0315 | qwen3-7b | api-claude | 4/7 | 57.1% | 45 min | Completed | Mar 15, 2026 10:00 | |
| live_eval_0320_exp | qwen3.5-9b-grpo | planner-grounder | --/4 | -- | -- | Running | Mar 20, 2026 11:30 | |
| live_eval_0312 | qwen3-7b | api-claude | 3/7 | 42.9% | 51 min | Failed | Mar 12, 2026 16:45 | |
| live_eval_0310 | qwen3-7b-base | api-claude | 2/7 | 28.6% | 55 min | Completed | Mar 10, 2026 09:00 | |
| live_eval_0308 | qwen3-7b-base | api-claude | 1/7 | 14.3% | 48 min | Completed | Mar 8, 2026 11:20 | |
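
In the rows above, the Accuracy value appears to be the Tasks cell (passed/total) expressed as a percentage, rounded to one decimal place. The sketch below reproduces that arithmetic under that assumption. The function name `accuracy_from_tasks` is hypothetical and not part of any documented API, and returning `None` for a run that has not reported a passed count (shown as `--`) is also an assumption.

```python
from typing import Optional


def accuracy_from_tasks(tasks: str) -> Optional[float]:
    """Convert a 'passed/total' cell into a percentage rounded to one decimal.

    Returns None when either side is not a number (for example '--/4' on a
    run that is still in progress) or when the total is zero.
    """
    passed, _, total = tasks.partition("/")
    if not passed.isdigit() or not total.isdigit() or int(total) == 0:
        return None
    return round(100 * int(passed) / int(total), 1)


# Tasks cells copied from the table above, keyed by run name.
runs = {
    "live_eval_0320": "6/7",
    "live_eval_0318": "5/7",
    "live_eval_0320_exp": "--/4",
    "live_eval_0308": "1/7",
}

for name, tasks in runs.items():
    acc = accuracy_from_tasks(tasks)
    print(name, "--" if acc is None else f"{acc}%")
```

Comparing two runs then reduces to subtracting their percentages, for example `accuracy_from_tasks("6/7") - accuracy_from_tasks("5/7")` for live_eval_0320 against live_eval_0318.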