🔬 Expert Co-activation Geometry

Research Question: Does OLMoE's router systematically co-activate geometrically complementary expert pairs (low head_sim)?

| Step | Hardware | What it does |
|------|----------|--------------|
| 1 | CPU | Precompute and cache 64×64 head_sim matrices |
| 2 | GPU (≤60s) | Forward pass on WikiText-2, capture routing indices |
| 3 | CPU | Statistical analysis: co-activated vs random |

Each layer takes about 3–5 min to precompute. Results are cached to disk, so this runs only once.
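As a rough illustration of the step 3 comparison, the sketch below contrasts the mean head_sim of expert pairs that are routed together on the same token with the mean over randomly drawn expert sets. It is a minimal sketch under stated assumptions, not the analysis this project actually runs: the function name `coactivation_gap`, the array layouts, the uniform-random null, and the one-sided permutation-style p-value are all hypothetical choices made for illustration.

```python
import numpy as np


def coactivation_gap(head_sim, routing, n_perm=1000, seed=0):
    """Compare head_sim of co-activated expert pairs against random pairs.

    Assumed inputs (hypothetical layout, not taken from the project):
      head_sim : (E, E) symmetric similarity matrix for one layer.
      routing  : (T, k) integer array, the k expert indices picked per token.

    Returns (observed_mean, null_mean, p_value), where p_value is one-sided:
    small values mean co-activated pairs have lower head_sim than chance.
    """
    rng = np.random.default_rng(seed)
    n_experts = head_sim.shape[0]
    n_tokens, k = routing.shape

    # Positions of all unordered pairs inside one token's set of k experts.
    iu, ju = np.triu_indices(k, k=1)

    def mean_pair_sim(idx):
        # idx[:, iu] and idx[:, ju] are (T, n_pairs) arrays of expert ids.
        return head_sim[idx[:, iu], idx[:, ju]].mean()

    observed = mean_pair_sim(routing)

    # Null model (an assumption): every token draws k distinct experts
    # uniformly at random, ignoring any load imbalance between experts.
    null = np.empty(n_perm)
    for p in range(n_perm):
        rand = rng.random((n_tokens, n_experts)).argsort(axis=1)[:, :k]
        null[p] = mean_pair_sim(rand)

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (1 + np.sum(null <= observed)) / (1 + n_perm)
    return observed, null.mean(), p_value
```

Calling `coactivation_gap(head_sim, routing)` with one layer's cached matrix and the routing indices captured in step 2 would give an observed mean well below the null mean, with a small p-value, if the router favors low-similarity pairs. A uniform null is the simplest baseline; if some experts are selected far more often than others, a null that resamples experts in proportion to their observed usage may be a fairer comparison.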

Projection