Leaderboard
Compare model performance across all task categories
Skill Effectiveness
Average scores per skill/variant within each task category
Resource Consumption
Average total tokens per agent (prompt + completion + cache) — one chart per task
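As a rough illustration of how the token figure above could be computed, the sketch below sums prompt, completion, and cache tokens for each run, then averages those totals per agent within each task. The record layout, field names, and the `average_total_tokens` helper are hypothetical and not taken from the dashboard itself.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-run usage records; the field names are assumptions.
runs = [
    {"task": "coding", "agent": "agent-a", "prompt": 1200, "completion": 300, "cache": 500},
    {"task": "coding", "agent": "agent-a", "prompt": 800, "completion": 200, "cache": 0},
    {"task": "coding", "agent": "agent-b", "prompt": 1000, "completion": 500, "cache": 500},
    {"task": "search", "agent": "agent-a", "prompt": 400, "completion": 100, "cache": 0},
]


def average_total_tokens(records):
    """Return {task: {agent: average total tokens per run}}."""
    totals = defaultdict(list)
    for r in records:
        # Total tokens for one run = prompt + completion + cache.
        total = r["prompt"] + r["completion"] + r["cache"]
        totals[(r["task"], r["agent"])].append(total)

    # Group the per-agent averages by task, one group per chart.
    by_task = defaultdict(dict)
    for (task, agent), values in totals.items():
        by_task[task][agent] = mean(values)
    return dict(by_task)


charts = average_total_tokens(runs)
for task, agents in charts.items():
    print(task, agents)
```

Each top-level key in `charts` corresponds to one chart, with one bar or point per agent.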