About

OpenSkillEval is a dynamic benchmark for skill-augmented agent systems. It measures how the open-source skill ecosystem actually changes what LLM agents produce on real knowledge work.

Why Skills, Why Now

Recent LLMs and agent client frameworks have made it practical to deploy models for structured knowledge work such as reports, slides, websites, and data visualizations. Practitioners increasingly distill their own workflows into step-by-step prompts and toolkits called skills. These encode reusable expertise and can substantially lift model performance.

The community has produced hundreds of such skills in just a few months. But growth has outpaced understanding: it remains unclear how skills interact with different models, which ones generalize, and how to choose among competing options given their cost–quality tradeoff. Low-quality and redundant submissions also bloat the ecosystem.

Our Approach

Instead of a static benchmark, OpenSkillEval dynamically generates test cases that track evolving user needs, then evaluates both effectiveness and efficiency of model × skill × agent-framework combinations. By holding the task fixed and varying the skill, we make controlled comparisons of skill quality, robustness, and transferability across models.
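
To make the controlled comparison concrete, here is a minimal sketch in Python. It is an illustration only, not OpenSkillEval's actual pipeline: the `Run` record, its field names, the `compare_skills` helper, and the sample values are all hypothetical. The idea is to group runs by (task, model, framework), keep only the groups in which every skill was run, and then average a quality score (effectiveness) and token count (efficiency) per skill, so each skill is judged on exactly the same tasks.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    # All fields are hypothetical names chosen for this sketch.
    task: str       # identifier of the generated test case
    model: str
    framework: str
    skill: str
    score: float    # assumed quality score (effectiveness)
    tokens: int     # assumed token usage (efficiency)


def compare_skills(runs):
    """Average score and tokens per skill over cells where every skill was run."""
    # One cell per (task, model, framework); the skill is the only thing that varies.
    cells = defaultdict(dict)
    for r in runs:
        cells[(r.task, r.model, r.framework)][r.skill] = r

    all_skills = {r.skill for r in runs}
    scores, tokens = defaultdict(list), defaultdict(list)
    for by_skill in cells.values():
        if set(by_skill) != all_skills:
            continue  # skip incomplete cells so all skills share the same tasks
        for skill, r in by_skill.items():
            scores[skill].append(r.score)
            tokens[skill].append(r.tokens)

    return {
        s: {"mean_score": mean(scores[s]), "mean_tokens": mean(tokens[s]), "n": len(scores[s])}
        for s in scores
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Run("case-1", "model-x", "framework-y", "skill_a", 0.8, 1000),
        Run("case-1", "model-x", "framework-y", "skill_b", 0.6, 500),
        Run("case-2", "model-x", "framework-y", "skill_a", 0.4, 2000),
    ]
    for skill, stats in sorted(compare_skills(sample).items()):
        print(skill, stats)
```

Dropping incomplete cells is one simple way to keep the comparison fair; a real analysis might instead report coverage per skill or use paired differences within each cell.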

Coverage

5 task families — Data Visualization, Poster Generation, PPT Generation, Report Generation, Web Design
600+ dynamically generated cases spanning business, science, health, engineering, and creative briefs
30 community skills covering popular publishing and design toolkits
10 frontier models across 4 agent frameworks — Claude Code, Codex, Gemini CLI, Kimi CLI

What This Site Shows

The Showcase presents 100 curated cases with side-by-side outputs across all skill variants and models. The Leaderboard aggregates the full benchmark results. Click any case to see per-model previews, judge-model evaluation breakdowns, and token / runtime details.