Large language models (LLMs) have achieved impressive performance across various natural language benchmarks, prompting a continual need to curate more difficult datasets for larger LLMs, which is costly and time-consuming.
In this paper, we propose to automate dataset updating and provide a systematic analysis of its effectiveness in addressing the benchmark leakage issue, difficulty control, and stability. Thus, once the current benchmark has been mastered or leaked, we can update it for timely and reliable evaluation. We propose two updating strategies: 1) a mimicking strategy that generates similar samples based on the original data, preserving their stylistic and contextual essence, and 2) an extending strategy that further expands existing samples at varying cognitive levels by adapting Bloom's taxonomy of educational objectives. Extensive experiments on updated MMLU and BIG-Bench demonstrate the stability of the proposed strategies and show that the mimicking strategy can effectively alleviate the overestimation caused by benchmark leakage. In cases where the more efficient mimicking strategy fails, our extending strategy still shows promising results. Additionally, by controlling the difficulty, we can better discern the models' performance and enable fine-grained analysis; neither too difficult nor too easy an exam can fairly judge students' learning status.
To the best of our knowledge, we are the first to automate benchmark updating for reliable and timely evaluation.
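Both strategies amount to prompting a strong LLM to rewrite or extend each benchmark sample. The sketch below illustrates how such prompts might be assembled; the prompt wording, the function names, and the `generate` callable are illustrative assumptions rather than the paper's actual implementation.

```python
# Illustrative sketch (assumed, not the paper's exact prompts): the two updating
# strategies expressed as prompt builders. `generate` is any text-generation
# callable supplied by the caller (e.g., a wrapper around an LLM API client).

# Cognitive levels from the revised Bloom's taxonomy, used by the extending strategy.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


def mimic_prompt(question: str, answer: str) -> str:
    """Mimicking strategy: request a new sample in the same style and context."""
    return (
        "Here is a benchmark question and its answer:\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Write a new question and answer that test the same knowledge and skill, "
        "keeping the original style, topic, and difficulty."
    )


def extend_prompt(question: str, answer: str, level: str) -> str:
    """Extending strategy: request a follow-up question at a chosen cognitive level."""
    if level not in BLOOM_LEVELS:
        raise ValueError(f"unknown Bloom's taxonomy level: {level}")
    return (
        "Here is a benchmark question and its answer:\n"
        f"Question: {question}\nAnswer: {answer}\n"
        f"Write a follow-up question at the '{level}' level of Bloom's taxonomy, "
        "together with a reference answer."
    )


def update_sample(sample: dict, generate, strategy: str = "mimic", level: str = "apply") -> str:
    """Build the updating prompt for one sample and return the generated text."""
    if strategy == "mimic":
        prompt = mimic_prompt(sample["question"], sample["answer"])
    else:
        prompt = extend_prompt(sample["question"], sample["answer"], level)
    return generate(prompt)
```

In practice, the generated text would still need to be parsed back into question and answer fields and filtered for quality before it replaces or extends a benchmark sample.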
Clearly, a major concern is whether the updated benchmarks can produce consistent evaluation results. To answer this stability question, we choose the mimicking strategy, iterate the update process four times on tasks from MMLU and BIG-Bench, and conduct experiments with seven open-source models (Llama-2-7b-chat, Llama-2-13b-chat, Llama-3-8b-Instruct, Mistral-7B-Instruct-v0.2, Mixtral-8x7B-Instruct-v0.1, Yi-6b-chat, and Yi-34b-chat) and four closed-source models (GPT-4, ChatGPT, Claude2, and Gemini-pro). Compared with the baselines' performance on the original datasets, their performance on our updated datasets is similar: the difference between the two scores is 5% on average. Across the four mimicked datasets, the standard deviation is limited, ranging from 0% to 3%. This demonstrates the stability of our dataset updating strategy.
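Concretely, the stability check reduces to comparing each model's accuracy on the original benchmark with its accuracies on the four mimicked versions. A minimal sketch of that computation is shown below; the model names and accuracy values are placeholders, not results reported here.

```python
# Minimal sketch of the stability analysis: the average gap between original and
# mimicked scores, and the spread across the four mimicked versions.
# The numbers are placeholders for illustration only.
from statistics import mean, stdev

scores = {
    # model: (accuracy on the original benchmark, accuracies on 4 mimicked versions)
    "example-model-a": (0.62, [0.60, 0.59, 0.63, 0.61]),
    "example-model-b": (0.48, [0.45, 0.47, 0.46, 0.44]),
}

for model, (original, mimicked) in scores.items():
    gap = abs(mean(mimicked) - original)   # original-vs-updated difference
    spread = stdev(mimicked)               # variation across the 4 mimicked versions
    print(f"{model}: |mean(mimicked) - original| = {gap:.3f}, std = {spread:.3f}")
```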