About

OpenSkillEval is a dynamic benchmark for skill-augmented agent systems — measuring how the open-source skill ecosystem actually changes what LLM agents produce on real knowledge work.

Why Skills, Why Now

Recent LLMs and agent client frameworks have made it practical to deploy models for structured knowledge work such as reports, slides, websites, and data visualizations. Practitioners increasingly distill their own workflows into step-by-step prompts and toolkits called skills. Skills encode reusable expertise and can substantially lift model performance.

The community has produced hundreds of such skills in just a few months. But growth has outpaced understanding: it remains unclear how skills interact with different models, which ones generalize, and how to choose among competing options when trading cost against quality. Low-quality and redundant submissions also bloat the ecosystem.

Our Approach

Instead of a static benchmark, OpenSkillEval dynamically generates test cases that track evolving user needs, then evaluates both effectiveness and efficiency of model × skill × agent-framework combinations. By holding the task fixed and varying the skill, we make controlled comparisons of skill quality, robustness, and transferability across models.
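
The controlled comparison described above can be sketched in a few lines. This is a minimal illustration, not OpenSkillEval's actual implementation: the RunResult record, its quality and tokens fields, and the build_grid and compare_skills helpers are all hypothetical names chosen for this example. It enumerates the task × model × framework × skill grid, then groups runs so that only the skill varies within each group.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunResult:
    """One evaluated run (hypothetical record layout)."""
    task: str
    model: str
    framework: str
    skill: str
    quality: float  # assumed judge score, higher is better
    tokens: int     # assumed efficiency measure, lower is better


def build_grid(tasks, models, frameworks, skills):
    """Enumerate every task x model x framework x skill combination."""
    return list(product(tasks, models, frameworks, skills))


def compare_skills(results):
    """Group runs so that only the skill varies within each group.

    Returns {(task, model, framework): [RunResult, ...]}, each list sorted
    by quality (descending), with ties broken by fewer tokens.
    """
    groups = defaultdict(list)
    for r in results:
        groups[(r.task, r.model, r.framework)].append(r)
    return {
        key: sorted(runs, key=lambda r: (-r.quality, r.tokens))
        for key, runs in groups.items()
    }
```

Because every group shares the same task, model, and framework, any difference in ranking inside a group can be attributed to the skill. Comparing a skill's rank across groups that differ only in the model gives a rough view of its transferability.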

Coverage

What This Site Shows

The Showcase presents 100 curated cases with side-by-side outputs across all skill variants and models. The Leaderboard aggregates the full benchmark results. Click any case to see per-model previews, judge-model evaluation breakdowns, and token and runtime details.