Beyond Benchmarks: Understanding MoE Models Through Internal Mechanisms

We study how Mixture‑of‑Experts (MoE) models work on the inside — not just how well they score. By explicitly incorporating routing signals and analyzing expert‑level behavior, our internal metric MUI exposes capacity, dynamics, and specialization beyond what benchmarks show.

MoE Evaluation · Neuron Activation · Expert Specialization

Authors: Jiahao Ying, Mingbao Lin, Qianru Sun, Yixin Cao · 2025

Abstract

Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference. However, current research remains largely performance-centric, with limited understanding of their internal mechanisms, thereby constraining broader progress. In this work, we use an internal metric to investigate the mechanisms of the MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors. Through systematic analyses of a wide range of publicly available MoE models, we uncover several findings: MUI serves as an internal indicator of capability; training follows a dynamic trajectory; experts contribute to tasks collaboratively; and neuron-level patterns act as a proxy for data diversity. Together, these results demonstrate the potential of MUI as a complementary indicator to benchmark performance, offering new insights into the capacity, dynamics, and specialization of MoE models.

Key Findings

What MUI reveals across a wide range of public MoE models.

1) Neuron utilization ↓ with model evolution

As models scale and mature, we observe reduced neuron utilization - an indicator of stronger generalization and more efficient representations.

2) Training is dynamic; benchmarks alone can mislead

Benchmark scores can stay flat while MUI shows meaningful internal changes, capturing shifts in learning stages.

3) Tasks are collaborative across experts

Successful completion typically involves multiple experts. Shared experts emerge as anchors, driving concentration and reuse.

4) Neuron patterns ↔ data diversity

Fine‑grained activation patterns correlate with the diversity of data seen, offering a useful proxy when explicit labels are unavailable.

Takeaway: MUI complements benchmarks. It offers a richer view of capacity, dynamics, and expert specialization - informing model design and evaluation.

MUI - Model-Level Indicators


What is MUI? An internal diagnostic metric that measures the proportion of neurons required for task completion.

$$ \textbf{MUI}(\mathcal{T}) = \frac{\left|\bigcup_{s_i \in \mathcal{T}} N_\text{activated}(s_i)\right|}{N \times L \times \bigl(|E_{s}| + |E_{r}|\bigr)}, $$

where $N$ is the number of neurons per expert, $L$ is the number of MoE layers, $|E_{s}|$ is the number of shared experts, $|E_{r}|$ is the number of routed experts per layer, and $N_\text{activated}(s_i)$ denotes the set of key neurons required to process sample $s_i$ in task $\mathcal{T}$. By comparing earlier and later versions within the same model families, we examine how MUI reflects the impact of model iteration and evolution. Later-released models consistently achieve stronger performance on the same datasets while exhibiting lower MUI. This suggests that newer models indeed possess higher true capability and stronger generalization, and that MUI may serve as an indicator of intrinsic capacity and generalization rather than benchmark-specific performance.
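To make the definition concrete, here is a minimal Python sketch of the MUI computation, assuming the per-sample key-neuron sets have already been extracted (e.g., by thresholding expert FFN activations). The function and variable names (`compute_mui`, `activated_neurons`, the toy values) are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of the MUI computation. Assumes key-neuron sets were already
# extracted per sample; identifiers and toy values here are illustrative only.

def compute_mui(activated_neurons, n_per_expert, n_layers, n_shared, n_routed):
    """activated_neurons: list of sets, one per sample s_i in task T; each set
    holds (layer, expert, neuron) triples for the key neurons of that sample."""
    # Numerator: union of key neurons over all samples in the task.
    union = set()
    for sample_set in activated_neurons:
        union |= sample_set
    # Denominator: all neurons across L MoE layers and (shared + routed) experts.
    total = n_per_expert * n_layers * (n_shared + n_routed)
    return len(union) / total


# Toy usage with two samples that share one key neuron.
samples = [
    {(0, 3, 17), (1, 0, 5)},
    {(0, 3, 17), (1, 7, 42), (2, 1, 9)},
]
print(compute_mui(samples, n_per_expert=2048, n_layers=24, n_shared=1, n_routed=64))
```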
MUI as an Indicator
Combining performance with MUI offers an indicator of a model's underlying generalization capability, mitigating the risk of misleading evaluations caused by data leakage.

Figure: MUI versus performance across the different model families.



Training Trajectories

Does MUI decrease monotonically throughout training, or do different phases exhibit distinct trajectories? To address this, we monitor MUI for OLMoE-series models across the entire training process, with the goal of deriving insights that can inform training strategies and model development. At earlier stages, performance improvements are accompanied by an increase in MUI, which we refer to as the Accumulating phase. At later stages, however, further performance gains occur together with a decrease in MUI, which we call the Evolving phase.

Monitoring MoE Training with MUI
Monitoring performance alone is insufficient; MUI provides a perspective complementary to performance, helping detect divergent trajectories and adjust training accordingly. For example, as shown above, on coding tasks such as MBPP, OLMoE consistently remains in the Accumulating phase without entering the Evolving phase. This suggests that additional coding data, or a higher proportion of coding tasks during earlier training stages, may be required to help the model further improve its generalization ability.
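As a rough illustration of how the two phases could be read off training logs, the sketch below labels each checkpoint interval by whether a performance gain coincides with rising or falling MUI. The data and the `eps` tolerance are hypothetical, not taken from the OLMoE runs.

```python
# Illustrative sketch (not the paper's code): label each training interval as
# "Accumulating" (performance up, MUI up) or "Evolving" (performance up, MUI down),
# given per-checkpoint benchmark scores and MUI values.

def label_phases(scores, muis, eps=1e-6):
    phases = []
    for i in range(1, len(scores)):
        d_perf = scores[i] - scores[i - 1]
        d_mui = muis[i] - muis[i - 1]
        if d_perf > eps and d_mui > eps:
            phases.append("Accumulating")
        elif d_perf > eps and d_mui < -eps:
            phases.append("Evolving")
        else:
            phases.append("Other")  # flat or degrading performance
    return phases


# Hypothetical trajectory on one task.
scores = [0.21, 0.28, 0.35, 0.38, 0.41]
muis   = [0.10, 0.14, 0.17, 0.15, 0.12]
print(label_phases(scores, muis))
# ['Accumulating', 'Accumulating', 'Evolving', 'Evolving']
```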


Collaborative Expert Contributions

We extend our analysis to the expert level by computing the proportion of task-specific key experts:

$$ \text{KeyExpertProportion}(\mathcal{T})=\frac{ | E_{\text{key}}(\mathcal{T}) | }{ L \times (|{E}_{s}| + |{E}_{r}|) }, $$ where $E_{\text{key}}(\mathcal{T})$ is the set of experts that consistently contribute across task $\mathcal{T}$. We find that, for MoE models, activating a larger number of experts while requiring fewer neurons within each expert is often associated with stronger true capability and better generalization. GPT-OSS has the highest proportion of key experts among the models studied.
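A small sketch of how this proportion could be computed from per-sample routing records follows. The "consistently contribute" criterion is approximated here by a simple frequency threshold (`min_frac`), which is our assumption for illustration rather than the paper's exact definition.

```python
# Sketch of the key-expert proportion from per-sample records of which
# (layer, expert) pairs contributed; the threshold criterion is an assumption.
from collections import Counter

def key_expert_proportion(expert_hits, n_layers, n_shared, n_routed, min_frac=0.9):
    """expert_hits: one set of (layer, expert_id) pairs per sample in task T."""
    counts = Counter()
    for hits in expert_hits:
        counts.update(hits)
    n_samples = len(expert_hits)
    # Experts that contribute on at least min_frac of the task's samples.
    key_experts = {e for e, c in counts.items() if c / n_samples >= min_frac}
    return len(key_experts) / (n_layers * (n_shared + n_routed))
```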





After analyzing the overall trend of expert utilization, we further analyze the distribution of key experts, particularly in light of architectural differences between shared and routed experts. Here we depict the top-10 experts for the MMLU task. For MoE architectures that include shared experts, the findings reveal that the top-10 most frequently activated experts are exclusively shared experts. By contrast, in GPT-OSS, a routed-only MoE, the activation rate of even the most frequently activated experts is extremely low.
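The sketch below shows one way such a top-10 tally could be produced from routing traces, distinguishing shared from routed experts; the record format is an assumption for illustration, not the paper's pipeline.

```python
# Illustrative tally of the most frequently activated experts for a task;
# the (layer, expert_id, is_shared) record format is an assumption.
from collections import Counter

def top_experts(routing_records, k=10):
    """routing_records: iterable of (layer, expert_id, is_shared) triples,
    one per expert activation observed while running the task's samples."""
    counts = Counter(routing_records)
    return counts.most_common(k)

# On a shared-expert MoE, the top-10 tends to be dominated by is_shared=True
# entries; on a routed-only MoE, counts spread thinly across many experts.
```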

Expert "Collaboration" in MoE
Although the feed-forward networks in MoE architectures are referred to as "Experts," it is difficult in practice to interpret them as independent task-specific units. In models with shared experts, persistent activation concentrates responsibility within the shared pool, whereas in routed-only architectures, load-balancing losses drive a more dispersed "many-hands" collaboration among a broader set of experts.

Resources

Paper: https://arxiv.org/abs/2509.23933

BibTeX

@misc{ying2025benchmarksunderstandingmixtureofexpertsmodels,
      title={Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms}, 
      author={Jiahao Ying and Mingbao Lin and Qianru Sun and Yixin Cao},
      year={2025},
      eprint={2509.23933},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.23933}, 
}