I first learnt about the Muon optimizer from an obscure blog post about a "geometry-aware" optimizer that supposedly halved training time for large language models (LLMs). Half the training FLOPs (i.e., half the compute) for the same perplexity initially sounded too good to be true. But several late-night experiments later, I became much more convinced. This article discusses how Muon's clever use of polar decomposition and spectral normalization speeds up LLM pre-training, how it plays very nicely with large batches and more sophisticated architectural tricks like Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE), and what all of this means in practical terms for LLM researchers and engineers.

Chasing Efficiency in Training LLMs

Training large language models is as much an art of trade-offs as it is of architectural or research novelty. Researchers often find themselves balancing memory, training time, and model quality like an impossible three-body problem. Over the past few years, two parallel approaches have emerged to push the efficiency frontier: architectural innovations (efficient attention mechanisms, better positional embeddings, MoE sparsity) and optimizer advances. The transformer architecture itself (with multi-head attention) has become standard, with research innovations around compressing the key/value dimensions by half using latent attention to cut memory usage with minimal loss[1][2], or using MoE to activate only portions of the model for each input. On the optimization side, AdamW has generally been the trusty workhorse, but new optimizers like Muon have promised faster convergence with some mathematical magic.

Muon came onto the scene in late 2024 when an independent researcher, Keller Jordan, introduced it in a blog post and immediately broke some training speed records on benchmarks (the NanoGPT and CIFAR-10 speedruns)[3]. This got a lot of researchers from frontier labs to pay attention. The premise is as follows: take a standard momentum-based optimizer and orthogonalize its gradient updates. In practice, Muon processes each weight matrix's gradient with a polar decomposition, essentially factoring the gradient update into an orthogonal (rotation) part and a symmetric (scaling) part, then using only the orthogonal part. This yields an update step in which all directions are treated more equally (spectral norm ≈ 1), preventing any single large singular value from blowing up the update[4]. Intuitively, if Adam or SGD momentum is pushing your model primarily in a couple of dominant directions (because gradients often have a few large eigenvalues), Muon makes sure the remaining directions in the weight matrix are not ignored. By orthogonalizing the update matrix (via an efficient Newton–Schulz iteration rather than an expensive SVD), Muon normalizes the update's magnitude across all directions while preserving its overall direction[4]. This lets us use a larger learning rate without instability and harvest more gain from each batch of data.

Implementing Muon's update rule in my experimental setup for the first time was tricky, and it felt like it surely must break something. But in reality, the training curves looked remarkably smooth: none of the jagged early loss spikes I'd grown accustomed to with AdamW. Across many runs, Muon consistently produces smoother convergence curves with fewer oscillations[5].
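To make the update rule concrete, here is a minimal sketch of a Muon-style step for a single weight matrix. It uses the classical cubic Newton–Schulz iteration to approximate the orthogonal polar factor of the momentum buffer; the official implementation uses a tuned quintic variant plus a shape-dependent scale factor, so treat the function names and hyperparameters below as illustrative assumptions, not the reference code.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate the orthogonal (polar) factor of a 2-D matrix G.

    Classical cubic Newton-Schulz iteration: X <- 0.5 * X @ (3I - X^T X).
    The official Muon implementation uses a tuned quintic variant, but the
    effect is the same: every singular value of the update is pushed toward 1.
    """
    X = G / (G.norm() + eps)                 # scale so all singular values are <= 1
    transposed = X.shape[0] < X.shape[1]
    if transposed:                           # iterate on the smaller Gram matrix
        X = X.T
    I = torch.eye(X.shape[1], device=X.device, dtype=X.dtype)
    for _ in range(steps):
        X = 0.5 * X @ (3.0 * I - X.T @ X)
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param: torch.Tensor, momentum: torch.Tensor, grad: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """One Muon-style update for a single weight matrix (a sketch, not the official optimizer)."""
    momentum.mul_(beta).add_(grad)                    # heavy-ball momentum accumulation
    update = newton_schulz_orthogonalize(momentum)    # keep only the "rotation" part
    param.add_(update, alpha=-lr)                     # all directions now get a similar step size
```

The key point is visible in `muon_step`: it is the momentum buffer, not the raw gradient, that gets orthogonalized, so the applied update has its singular values pushed toward 1 before it touches the weights.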
In my runs, by the time we hit a given loss target, we had often used around 48–52% of the FLOPs that AdamW would require[6]. In other words, I was getting the same quality in roughly half the compute. Figure 1 below shows an example of this effect in action: Muon (red curves) vs. AdamW (blue curves) on models ranging from 17M (XS) to 202M (XL) parameters. The Muon-trained models hit each loss milestone with roughly half the training FLOPs of AdamW, and even end up with slightly lower final loss given the same compute budget[6]. Notice how the red curves not only drop faster but also look less spiky than the blue ones: that's the spectral normalization at work, keeping the optimizer steady and on target.

Equally impressive is how these gains hold up as you scale models and batch sizes. One of my concerns was whether Muon would keep up when you throw millions of tokens into a batch or scale to a few hundred million parameters. The empirical answer: yes, it will. The relative speedup from Muon persists (and even grows) at larger model sizes; in my experiments the efficiency gap actually widened slightly going from 30M to 200M parameters[6]. And when I increased the batch size into the millions-of-tokens regime, Muon really stepped up. A neat way to analyze large-batch efficiency is the "token consumption ratio" proposed by Sam McCandlish and team at Anthropic: how many tokens does optimizer A need vs. optimizer B to reach a given loss L at batch size B? I found that Muon's advantage holds even past the critical batch size, where AdamW starts floundering in comparison[7]. In fact, the absolute token difference (the extra data AdamW needs) grew super-linearly at huge batches[8][9]. The practical lesson is that if you have the compute to throw very large batches at the problem, Muon will make sure that compute isn't wasted. It expands the Pareto frontier of the compute-vs-time trade-off: for a fixed number of GPUs, you reach your target loss faster; or, for a fixed training time, you can get away with fewer GPUs to hit the same quality[10]. As a researcher, that translates to more flexibility: whether you're in a hurry to reach a result or trying to stretch a budget, Muon gives you better options on both fronts.

As I mentioned earlier, optimizers are only one side of the efficiency coin. The other side is model architecture: making the model itself leaner or faster without sacrificing (too much) quality. This is where Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) come in. The combination of Muon + MLA + MoE yielded almost multiplicative gains. A quick primer for context: MLA is a drop-in replacement for standard multi-head attention in which you project the keys and values into a lower-dimensional latent space for the attention computation. The idea is that you don't really need the full dimension for every head: by using a smaller latent width r < d, you slash the memory (and some of the compute) required for the attention mechanism[11]. In one study, using r = d/2 saved ~45% of the attention memory at a negligible loss increase (~0.3%)[1]. In practice, this meant my 16-head model effectively stored and processed keys/values as if it had 8 heads, which is a big deal when memory is at a premium. A minimal sketch of this kind of KV compression appears below.
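The sketch below shows the latent KV compression idea under simple assumptions: the module and layer names (`kv_down`, `k_up`, `v_up`) and the exact factorization are illustrative, and real MLA implementations (e.g., DeepSeek's) also treat rotary position embeddings separately, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Sketch of MLA-style attention: keys/values come from a low-rank latent of width r < d_model."""

    def __init__(self, d_model: int = 512, n_heads: int = 16, r: int = 256):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, r, bias=False)   # compress token states to the latent
        self.k_up = nn.Linear(r, d_model, bias=False)       # expand latent back to per-head keys
        self.v_up = nn.Linear(r, d_model, bias=False)       # expand latent back to per-head values
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, T, d_model)
        B, T, d = x.shape
        latent = self.kv_down(x)                             # (B, T, r): the only thing worth caching
        q, k, v = self.q_proj(x), self.k_up(latent), self.v_up(latent)

        def split(t: torch.Tensor) -> torch.Tensor:          # (B, T, d) -> (B, H, T, d_head)
            return t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v), is_causal=True)
        return self.out_proj(out.transpose(1, 2).reshape(B, T, d))
```

With d_model = 512 and r = 256 (i.e., r = d/2), only the (B, T, r) latent needs to be cached during generation, since keys and values can be re-expanded from it; that is where the memory saving comes from.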
MoE, on the other hand, expands model capacity by having multiple expert sub-networks (for the feed-forward layers) and routing each token through only a few of them. It's like having a team of specialists where each token only wakes up the two or three experts it needs, instead of activating the whole team. Google's Switch Transformer is a famous example: it showed that you can scale to trillions of parameters using MoE while keeping the per-token compute similar to that of a much smaller model. The (relatively tolerable) downside is usually some coordination overhead and increased memory usage for storing all those experts.

But here's the really cool part: when we combined Muon with MLA and MoE, the benefits stacked up beautifully. Using MLA halved the attention memory and even gave a modest speed boost (less data per token to shuffle around). Using MoE dramatically increased throughput: in one configuration, going from a dense model to an MoE of the same overall size nearly tripled the tokens processed per second at inference, thanks to parallelizing experts and skipping unnecessary compute[12]. Quality went up too: the MoE-augmented model's perplexity was ~8–12% better than the dense baseline, because it effectively had a larger capacity (more parameters) focused on each token[13]. The trade-off was higher peak memory due to more parameters in play. However, MLA's compression counteracted some of that, and Muon's efficiency reduced the total training FLOPs needed. The net result? 68% less memory usage and a 3.2× inference speedup for the MLA+MoE+Muon model compared to the baseline (standard attention + AdamW)[13]. Even when I compare against intermediate setups, the trend holds: the bigger model (around 200M parameters) trained with plain multi-head attention + AdamW took 24.3 hours to reach the target loss, whereas with Muon it took 14.1 h; switching to MLA cut memory from ~4.7 GB to ~4.3 GB and lifted speed from 1000 to 1200 tok/s; and the full MoE+MLA+Muon variant hit the target in 12.3 h and pushed throughput to 3350 tok/s[14]. That's more than a 3× throughput speedup and roughly half the training time, all while achieving the best final perplexity of the lot (7.25 vs. 8.54 for the baseline)[14].

To be fair, MoE does introduce complexity: load-balancing the experts, ensuring stable routing, and so on can be fairly tricky. I used a fixed, small number of experts and standard gating (a top-k sketch follows below), and thanks to Muon's stable updates I didn't hit the instability that sometimes plagues MoE training. In fact, one unexpected observation was that Muon's spectral normalization seemed to complement MoE's conditional training dynamics; I saw slightly more consistent loss curves across different expert-routing configurations, perhaps because Muon prevented any single expert's gradient from dominating. This is anecdotal, but it makes sense: orthogonalized updates ensure that no one part of the model runs away with huge gradient norms, which helps when different experts receive uneven training signals.
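For reference, here is a minimal sketch of the kind of routed feed-forward block described above: a softmax gate picks the top-k experts per token and mixes their outputs with the gate weights. The sizes, names, and the readability-first loop over experts are assumptions for illustration; a production MoE stack would use batched expert-parallel dispatch and auxiliary load-balancing losses, and the exact gating used in my runs may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a top-k routed mixture-of-experts feed-forward block."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, T, d_model)
        B, T, d = x.shape
        flat = x.reshape(-1, d)                               # route each token independently
        scores = F.softmax(self.gate(flat), dim=-1)           # (N, n_experts)
        topk_w, topk_idx = scores.topk(self.k, dim=-1)        # keep only the k best experts per token
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)    # renormalize the gate weights
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topk_idx[:, slot] == e                 # tokens that routed to expert e in this slot
                if mask.any():
                    out[mask] += topk_w[mask, slot].unsqueeze(-1) * expert(flat[mask])
        return out.reshape(B, T, d)
```

Only the selected experts run for a given token, so per-token compute stays close to that of a dense FFN while total parameter count (and capacity) grows with the number of experts.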
Lessons and Looking Forward

My experience with Muon taught me a broader, more practical lesson: the Pareto frontier can always be pushed further with some ingenuity and creativity. We often assume there's a hard trade-off, but techniques like Muon, MLA, and MoE show that there are still "free lunches" on the table, or at least relatively cheap ones. By tackling the problem from multiple angles (optimizer mathematics, network architecture, distributed batch scaling), we can expand what's achievable at a fixed budget.

It's telling that highly respected research teams in industry are adopting Muon: for instance, the fintech company Nubank reported integrating Muon into the pretraining of their 330M-parameter model and consistently saw faster convergence and better final metrics compared to AdamW, reaching the same validation loss with about 52% of the training FLOPs in their experiments[15]. In their words, Muon's orthogonalized updates let the model learn more from each token it processes, translating to concrete cost savings on GPU bills[16][15].

It's also important to remember that Muon isn't a silver bullet for every scenario. Its benefits are most pronounced for matrix-shaped parameters (like the dense layers in Transformers), and you still use AdamW or similar for biases, embeddings, and other non-matrix parameters. There's also a bit of hyperparameter tuning to find the right learning rate and momentum, although thanks to Muon's design I found that hyperparameters scaled quite cleanly from smaller models to larger ones, especially when using techniques like μTransfer (maximal update parametrization)[17][18]. In my case, a learning rate that worked on a 17M model worked almost out of the box on a 200M model with Muon (it just needed a slight adjustment of weight decay), which is a huge time-saver.

Back when I first read about Muon, I wouldn't have imagined that I'd be routinely training models in roughly half the time with better final accuracy, or deploying a model that uses only a third of the memory and runs 3× faster at inference, without sacrificing anything on the user experience. The combination of Muon, latent attention, and MoE has expanded the envelope of what a small-to-medium model can do, to the point where these models punch well above their weight. It's fair to say this work expands the Pareto frontier for model training and deployment[10]. For those of us trying to deliver frontier ML systems under tight constraints, these are very serious gains.

To wrap up, while we often hear about the latest hundred-billion-parameter model grabbing headlines, there's an equally important revolution in making leaner models more efficient. Techniques like Muon remind us that there's plenty of juice to squeeze by optimizing the training process, not just the architecture. And when you stack these optimizations together, the gains can be multiplicative, opening up new design possibilities.

References:
[1], [2]: Latent Multi-Head Attention for Small Language Models
[3]: Muon: An optimizer for hidden layers in neural networks, Keller Jordan blog
[4], [15], [16]: Muon for Improved Foundation Model Pretraining Data Efficiency, Building Nubank (Nubank engineering blog)
[5]–[14], [17], [18]: https://arxiv.org/pdf/2509.24406