Evaluations

Run, view, and compare evaluation results across models and tasks.

| Run Name | Model | Agent | Tasks | Accuracy | Duration | Status | Date |
|---|---|---|---|---|---|---|---|
| live_eval_0320 | qwen3.5-9b-lora-v3 | planner-grounder | 6/7 | 85.7% | 42 min | Completed | Mar 20, 2026 09:15 |
| live_eval_0318 | qwen3.5-9b-lora-v2 | planner-grounder | 5/7 | 71.4% | 38 min | Completed | Mar 18, 2026 14:30 |
| live_eval_0315 | qwen3-7b | api-claude | 4/7 | 57.1% | 45 min | Completed | Mar 15, 2026 10:00 |
| live_eval_0320_exp | qwen3.5-9b-grpo | planner-grounder | --/4 | -- | -- | Running | Mar 20, 2026 11:30 |
| live_eval_0312 | qwen3-7b | api-claude | 3/7 | 42.9% | 51 min | Failed | Mar 12, 2026 16:45 |
| live_eval_0310 | qwen3-7b-base | api-claude | 2/7 | 28.6% | 55 min | Completed | Mar 10, 2026 09:00 |
| live_eval_0308 | qwen3-7b-base | api-claude | 1/7 | 14.3% | 48 min | Completed | Mar 8, 2026 11:20 |
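
The Accuracy values listed here appear to be the Tasks fraction (passed/total) expressed as a percentage and rounded to one decimal place; for example, 6/7 gives 85.7%. The Python sketch below reproduces that arithmetic under this assumption. The function name `task_accuracy` and the `runs` dictionary are hypothetical helpers for illustration, not part of any evaluation tool shown on this page.

```python
from typing import Optional


def task_accuracy(tasks: str) -> Optional[float]:
    """Return accuracy in percent for a 'passed/total' string.

    Returns None when the string has no numeric passed count,
    e.g. '--/4' for a run that has not finished yet.
    """
    passed, _, total = tasks.partition("/")
    if not passed.isdigit() or not total.isdigit() or int(total) == 0:
        return None
    return round(100 * int(passed) / int(total), 1)


# Tasks values copied from the listing, keyed by run name (illustrative subset).
runs = {
    "live_eval_0320": "6/7",
    "live_eval_0318": "5/7",
    "live_eval_0320_exp": "--/4",
}

for name, tasks in runs.items():
    acc = task_accuracy(tasks)
    print(name, "--" if acc is None else f"{acc}%")
```

Returning `None` for an unfinished run keeps the "--" placeholder distinct from a genuine 0% result when comparing runs.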