Overall Results
Radar view of each model's score by task category
Cost–Performance Frontier
Average score plotted against tokens consumed per task. The upper left, meaning a high score at a low token cost, is where you want to be.
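One way to read this chart is as a Pareto frontier: a model sits on the frontier if no other model scores at least as well while using no more tokens. The sketch below is a minimal illustration of that idea, not the method used to draw the chart. The `Result` type, the `pareto_frontier` function, the model labels, and all numbers are hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str      # hypothetical model label
    tokens: float   # average tokens consumed per task
    score: float    # average benchmark score


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return the results that no other result beats on both tokens and score."""
    # Sort by tokens ascending; on ties, put the higher score first.
    ordered = sorted(results, key=lambda r: (r.tokens, -r.score))
    frontier: list[Result] = []
    best_score = float("-inf")
    for r in ordered:
        # A point joins the frontier only if it outscores every cheaper point.
        if r.score > best_score:
            frontier.append(r)
            best_score = r.score
    return frontier


# Made-up numbers for illustration only; not taken from the benchmark.
example = [
    Result("model-a", tokens=12_000, score=0.61),
    Result("model-b", tokens=30_000, score=0.74),
    Result("model-c", tokens=28_000, score=0.58),
    Result("model-d", tokens=55_000, score=0.73),
]
frontier_models = [r.model for r in pareto_frontier(example)]
```

With these made-up inputs, `model-c` and `model-d` drop out because a cheaper model already scores higher than each of them.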
Open Skills
Specialized prompt-and-tool kits that agents can invoke to do real work, sourced from the open ecosystem and evaluated on the same benchmark.