Beyond Benchmarks: Understanding MoE Models Through Internal Mechanisms
We study how Mixture‑of‑Experts (MoE) models work on the inside — not just how well they score. By explicitly incorporating routing signals and analyzing expert‑level behavior, our internal metric MUI exposes capacity, dynamics, and specialization beyond what benchmarks show.
Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference. However, current research remains largely performance-centric, with limited understanding of their internal mechanisms, which constrains broader progress. In this work, we use an internal metric, MUI, to investigate the mechanisms of the MoE architecture by explicitly incorporating routing signals and analyzing expert-level behaviors. Through systematic analyses of a wide range of publicly available MoE models, we uncover several findings: MUI serves as an internal indicator of capability; training trajectories are dynamic; experts contribute collaboratively; and neuron-level activation patterns act as a proxy for data diversity. Together, these results demonstrate the potential of MUI as a complementary indicator to benchmark performance, offering new insights into the capacity, dynamics, and specialization of MoE models.
Key Findings
What MUI reveals across a wide range of public MoE models.
1) Neuron utilization ↓ with model evolution
As models scale and mature, we observe reduced neuron utilization - an indicator of stronger generalization and more efficient representations.
2) Training is dynamic; benchmarks alone can mislead
Benchmark scores can stay flat while MUI shows meaningful internal changes, capturing shifts between learning stages.
3) Tasks are collaborative across experts
Successful completion typically involves multiple experts. Shared experts emerge as anchors, driving concentration and reuse.
4) Neuron patterns ↔ data diversity
Fine‑grained activation patterns correlate with the diversity of data seen, offering a useful proxy when explicit labels are unavailable.
Takeaway: MUI complements benchmarks. It offers a richer view of capacity, dynamics, and expert specialization - informing model design and evaluation.
MUI - A Model-Level Indicator
What is MUI? An internal diagnostic metric that measures the proportion of neurons required for task completion.
$$ \mathrm{MUI}(\mathcal{T}) = \frac{\left|\bigcup_{s_i \in \mathcal{T}} N_{\text{activated}}(s_i)\right|}{N \times L \times \left(|E_{s}| + |E_{r}|\right)} $$
where $N$ is the number of neurons per expert, $L$ is the number of MoE layers, $|E_{s}|$ is the number of shared experts, $|E_{r}|$ is the number of routed experts per layer, and $N_{\text{activated}}(s_i)$ denotes the set of key neurons in the model that are required to process sample $s_i$ in task $\mathcal{T}$. By comparing earlier and later versions within the same model families, we examine how MUI reflects model iteration and evolution. Later-released models consistently achieve stronger performance on the same datasets while exhibiting lower MUI. These newer models indeed possess higher true capability and stronger generalization, so MUI may serve as an indicator of intrinsic capacity and generalization rather than benchmark-specific performance.
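Below is a minimal sketch of how MUI could be computed once the activated (key) neurons of each sample have been extracted. The extraction step, the `Neuron` identifier scheme, and the function name `compute_mui` are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the MUI computation over a task's samples.
# Assumes the set of key neurons per sample is already available.
from typing import Iterable, Set, Tuple

# A neuron is identified by (layer, expert, neuron_index) -- an assumed scheme.
Neuron = Tuple[int, int, int]

def compute_mui(
    activated_per_sample: Iterable[Set[Neuron]],
    neurons_per_expert: int,   # N
    num_moe_layers: int,       # L
    num_shared_experts: int,   # |E_s|
    num_routed_experts: int,   # |E_r| (routed experts per layer)
) -> float:
    """MUI(T) = |union of activated neuron sets| / (N * L * (|E_s| + |E_r|))."""
    union: Set[Neuron] = set()
    for sample_neurons in activated_per_sample:
        union |= sample_neurons
    denom = neurons_per_expert * num_moe_layers * (num_shared_experts + num_routed_experts)
    return len(union) / denom

# Toy usage with made-up numbers, purely illustrative:
samples = [{(0, 1, 5), (0, 2, 7)}, {(0, 1, 5), (1, 3, 0)}]
print(compute_mui(samples, neurons_per_expert=1024, num_moe_layers=2,
                  num_shared_experts=1, num_routed_experts=8))
```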
Training Trajectories
Does MUI decrease monotonically throughout training, or do different phases exhibit distinct trajectories? To address this, we monitor MUI for the OLMoE series of models across the entire training process, with the goal of deriving insights that can inform training strategies and model development. At earlier stages, performance improvements are accompanied by an increase in MUI, a phase we refer to as Accumulating. At later stages, further performance gains occur together with a decrease in MUI, a phase we call Evolving.
Collaborative Expert Contributions
We here extend our analysis to the expert level by computing the proportion of task-specific key experts:
$$ \text{KeyExpertProportion}(\mathcal{T}) = \frac{|E_{\text{key}}(\mathcal{T})|}{L \times \left(|E_{s}| + |E_{r}|\right)}, $$
where $E_{\text{key}}(\mathcal{T})$ is the set of experts that consistently contribute across task $\mathcal{T}$. We find that, for MoE models, activating a larger number of experts while requiring fewer neurons within each expert is often associated with stronger true capability and better generalization, with GPT-OSS showing the highest proportion of key experts among the models studied.
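As a companion to the formula above, here is a minimal sketch of the key-expert proportion. The criterion for "consistently contribute" is an assumption for illustration (an expert counts as key if it contains activated key neurons in at least a given fraction of the task's samples); the paper's exact criterion may differ.

```python
# Minimal sketch of KeyExpertProportion under an assumed consistency criterion.
from typing import Dict, List, Set, Tuple

Neuron = Tuple[int, int, int]  # (layer, expert, neuron_index) -- assumed scheme

def key_expert_proportion(
    activated_per_sample: List[Set[Neuron]],
    num_moe_layers: int,        # L
    num_shared_experts: int,    # |E_s|
    num_routed_experts: int,    # |E_r|
    consistency: float = 0.5,   # assumed: fraction of samples an expert must appear in
) -> float:
    """KeyExpertProportion(T) = |E_key(T)| / (L * (|E_s| + |E_r|))."""
    counts: Dict[Tuple[int, int], int] = {}
    for sample_neurons in activated_per_sample:
        # Count each (layer, expert) at most once per sample.
        for expert_id in {(layer, expert) for layer, expert, _ in sample_neurons}:
            counts[expert_id] = counts.get(expert_id, 0) + 1
    min_count = consistency * len(activated_per_sample)
    num_key = sum(1 for c in counts.values() if c >= min_count)
    return num_key / (num_moe_layers * (num_shared_experts + num_routed_experts))
```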
Resources
Paper
Code
BibTeX
@misc{ying2025benchmarksunderstandingmixtureofexpertsmodels,
  title={Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms},
  author={Jiahao Ying and Mingbao Lin and Qianru Sun and Yixin Cao},
  year={2025},
  eprint={2509.23933},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.23933},
}