Leaderboard

Compare model performance across all task categories

Skill Effectiveness

Average scores per skill/variant within each task category
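
As a rough illustration of the aggregation this caption describes, the sketch below groups scores by task category, skill, and variant, then averages each group. The record layout and the field names (`category`, `skill`, `variant`, `score`) are assumptions made for the example, and the sample values are invented; the page does not specify how the underlying data is stored.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical evaluation records; field names and values are illustrative only.
results = [
    {"category": "coding", "skill": "skill-x", "variant": "v1", "score": 0.8},
    {"category": "coding", "skill": "skill-x", "variant": "v1", "score": 0.6},
    {"category": "coding", "skill": "skill-x", "variant": "v2", "score": 0.9},
    {"category": "writing", "skill": "skill-x", "variant": "v1", "score": 0.5},
]


def average_scores(records):
    """Return the mean score for each (category, skill, variant) group."""
    grouped = defaultdict(list)
    for record in records:
        key = (record["category"], record["skill"], record["variant"])
        grouped[key].append(record["score"])
    return {key: mean(scores) for key, scores in grouped.items()}


averages = average_scores(results)
for (category, skill, variant), avg in sorted(averages.items()):
    print(f"{category} / {skill} / {variant}: {avg:.2f}")
```

Keying on the category first keeps the averages separate per task category, so a skill that helps in one category does not mask a weak result in another.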

Resource Consumption

Average total tokens per agent (prompt + completion + cache) — one chart per task
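
The metric in this caption can be sketched as follows, assuming each run records prompt, completion, and cache token counts separately. The field names (`agent`, `task`, `prompt`, `completion`, `cache`) and the sample numbers are hypothetical; only the sum "prompt + completion + cache" and the per-agent, per-task averaging come from the caption itself.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records; field names and token counts are illustrative only.
runs = [
    {"agent": "agent-a", "task": "task-1", "prompt": 1200, "completion": 300, "cache": 500},
    {"agent": "agent-a", "task": "task-1", "prompt": 1400, "completion": 300, "cache": 500},
    {"agent": "agent-b", "task": "task-1", "prompt": 900, "completion": 400, "cache": 100},
]


def average_total_tokens(records):
    """Return the mean total tokens per agent, nested by task (one chart per task)."""
    totals = defaultdict(lambda: defaultdict(list))
    for record in records:
        # Total tokens for one run: prompt + completion + cache.
        total = record["prompt"] + record["completion"] + record["cache"]
        totals[record["task"]][record["agent"]].append(total)
    return {
        task: {agent: mean(values) for agent, values in agents.items()}
        for task, agents in totals.items()
    }


per_task = average_total_tokens(runs)
for task, agents in per_task.items():
    for agent, avg in agents.items():
        print(f"{task} / {agent}: {avg}")
```

Nesting the result by task mirrors the one-chart-per-task layout: each inner dictionary holds the bars for a single chart.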