We introduce a Dual Evaluation Framework for assessing the multilingual capabilities of large language models (LLMs). The framework separately examines two key aspects: the language in which a question is asked (the linguistic medium) and the cultural context being tested. Using an extended BLEnD dataset, we evaluate a range of models and find two main results: (1) LLMs perform best on questions about English-speaking cultures, regardless of the question language, and (2) models achieve higher accuracy when questions are asked in the language that matches their cultural context. We call this effect "Cultural-Linguistic Synergy". Further analysis shows that the proportion of language- and culture-specific neurons activated in the model correlates with this synergy, offering a new lens for understanding and improving multilingual AI.
Our Dual Evaluation Framework assesses the multilingual capabilities of LLMs along two key axes: the language of the question and the cultural context being tested. As shown in the figure, each question is evaluated in four scenarios, covering both American and Chinese cultural knowledge and asked in both English and Chinese. This reveals not only how well models handle native-language, native-culture tasks, but also their ability to answer cross-cultural questions within a single language.
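To make the four-scenario setup concrete, here is a minimal Python sketch of how such a 2x2 evaluation grid could be assembled. The question texts, culture codes, and the `build_scenarios` helper are illustrative placeholders, not the paper's actual data pipeline or BLEnD field names.

```python
from itertools import product

# Hypothetical BLEnD-style items: each culture-specific question is
# available both in English and in the culture's native language.
questions = {
    ("US", "en"): "What is the most common breakfast food in the US?",
    ("US", "zh"): "美国最常见的早餐食物是什么？",
    ("CN", "en"): "What is the most common breakfast food in China?",
    ("CN", "zh"): "中国最常见的早餐食物是什么？",
}

def build_scenarios(cultures=("US", "CN"), languages=("en", "zh")):
    """Enumerate the 2x2 grid of (culture, question language) scenarios."""
    return [
        {"culture": c, "language": l, "prompt": questions[(c, l)]}
        for c, l in product(cultures, languages)
    ]

for s in build_scenarios():
    print(s["culture"], s["language"], "->", s["prompt"])
```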
Using our Dual Evaluation Framework, we compare models' adaptability across cultural contexts. Overall, the selected models answer questions about English-speaking cultures more accurately than questions about other cultures, even when each set of questions is posed in its respective language.
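A hedged sketch of how per-cell accuracy in such a grid could be tallied; the record fields and the `accuracy_by_scenario` helper are assumed for illustration and do not reflect the paper's evaluation code.

```python
from collections import defaultdict

def accuracy_by_scenario(records):
    """Aggregate accuracy per (culture, question language) cell.

    Each record is a dict with 'culture', 'language', and a boolean
    'correct' flag for one graded model answer.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["culture"], r["language"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {k: hits[k] / totals[k] for k in totals}

# Toy graded outputs for two cultures, each asked in two languages.
records = [
    {"culture": "US", "language": "en", "correct": True},
    {"culture": "US", "language": "zh", "correct": True},
    {"culture": "CN", "language": "en", "correct": False},
    {"culture": "CN", "language": "zh", "correct": True},
]
print(accuracy_by_scenario(records))
```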
Beyond comparing model behavior across cultural contexts within the same language, our Dual Evaluation Framework also reveals how models perform when answering culture-specific questions in different languages. Strikingly, as the examples above show, models consistently score higher when culture-specific questions are posed in the corresponding native language rather than in English. For instance, questions about Chinese culture yield better results when asked in Chinese than in English, and the same pattern holds for Indonesian and Iranian culture questions. We refer to this counterintuitive phenomenon as "Cultural-Linguistic Synergy": aligning the cultural context with the matching linguistic medium yields superior performance, even for models primarily trained on English data, which otherwise perform better on English benchmarks than on benchmarks in other languages.
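One way to quantify this synergy is as an accuracy gap between asking in the culture's native language and asking in English. The sketch below assumes toy accuracy numbers and a hypothetical `synergy_gap` helper, not figures or code from the paper.

```python
# Toy per-scenario accuracies keyed by (culture, question language);
# the values are illustrative, not results from the paper.
acc = {
    ("CN", "zh"): 0.71, ("CN", "en"): 0.63,
    ("ID", "id"): 0.58, ("ID", "en"): 0.52,
}

def synergy_gap(acc, culture, native_lang, baseline_lang="en"):
    """Accuracy gain from asking a culture-specific question in the
    culture's own language instead of English."""
    return acc[(culture, native_lang)] - acc[(culture, baseline_lang)]

for culture, lang in [("CN", "zh"), ("ID", "id")]:
    print(culture, f"{synergy_gap(acc, culture, lang):+.2f}")
```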
Comparing the proportion of specialized neurons activated under American cultural contexts with that under other cultures, we observe that models activate a higher proportion of neurons specialized for the target language when the cultural context matches the linguistic medium of the question; this alignment is less pronounced when no such match exists. Activating these specialized neurons lets the model access culture- and language-specific knowledge more effectively. Notably, this knowledge may remain underutilized when the question is posed in English rather than the target language, which explains why models perform better when the question language is aligned with its cultural context.
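The neuron-level observation could be probed along these lines: given a mask of neurons previously tagged as specific to a language or culture (the tagging procedure itself is not shown here), measure what fraction of them fire for a given prompt. The function name, threshold, and toy arrays are assumptions for illustration, not the paper's method.

```python
import numpy as np

def specialized_neuron_fraction(activations, specialized_mask, threshold=0.0):
    """Fraction of the specialized-neuron set that fires on one prompt.

    `activations`: 1-D array of post-activation values for one layer.
    `specialized_mask`: boolean array marking neurons previously tagged
    as specific to the target language/culture.
    """
    fired = activations > threshold
    return fired[specialized_mask].mean()

# Toy example: 8 neurons, 3 of which are tagged as Chinese-specific.
acts = np.array([0.9, -0.2, -0.1, 0.0, 1.3, -0.5, 0.7, 0.1])
mask = np.array([True, False, True, False, True, False, False, False])
print(specialized_neuron_fraction(acts, mask))  # -> 0.666...
```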
Building on these observations, we propose and empirically validate the following hypotheses:
@misc{ying2025disentanglinglanguagecultureevaluating,
  title={Disentangling Language and Culture for Evaluating Multilingual Large Language Models},
  author={Jiahao Ying and Wei Tang and Yiran Zhao and Yixin Cao and Yu Rong and Wenxuan Zhang},
  year={2025},
  eprint={2505.24635},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.24635},
}