May 26, 2026

Nvidia servers run Moonshot AI and other models up to 10x faster

  • Nvidia shows gains for Moonshot AI while AMD and Cerebras ready rivals.
  • Labs are shifting to serving MoE models, raising pressure on Nvidia.

New data from Nvidia is drawing attention to how the company hopes to stay ahead in a market that is changing fast. This update focuses on mixture-of-experts models, or MoE models, which have become a major part of how many frontier systems are built today.

Nvidia says its newest AI server delivers a tenfold jump in performance for several well-known MoE models, including two from China. The claim comes at a time when developers are shifting their focus from training models to running them for large numbers of users, an area where Nvidia faces stronger competition.

Why MoE models are reshaping AI development

MoE has moved into the centre of model design over the past year, partly due to the rise of DeepSeek. Early in 2025, DeepSeek released an open-source model that performed well and required much less training on Nvidia chips than many expected. The move caught the industry by surprise and pushed many labs to rethink their approach.

Since then, MoE designs have been used by OpenAI, Mistral, Moonshot AI, and other builders that want to improve speed and efficiency without building ever-larger dense models.

MoE models draw on a simple idea. Instead of using the full set of parameters for every token, the model breaks the work into parts and sends each part to the most relevant “experts.” Only a small set of experts activates at any moment, which makes the model faster and easier to run. In practice, this means a system can have hundreds of billions of parameters while still keeping the cost of serving each token under control.
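
A rough way to see why serving cost stays under control is to count the parameters a single token actually touches. The sketch below does that for a simplified MoE model. Every number in it (shared parameters, expert count, expert size, experts per token) is a made-up assumption for illustration, not a figure for any model named in this article.

```python
def active_parameters(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared_params: parameters every token uses (attention, embeddings, router).
    params_per_expert: parameters in one expert, summed over all layers.
    """
    if not 0 < experts_per_token <= num_experts:
        raise ValueError("experts_per_token must be between 1 and num_experts")
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


# Hypothetical model: 20B shared parameters, 128 experts of 3B each,
# and 4 experts activated per token.
total, active = active_parameters(20e9, 3e9, 128, 4)
print(f"total:  {total / 1e9:.0f}B parameters")
print(f"active: {active / 1e9:.0f}B parameters per token ({active / total:.1%} of the model)")
```

With these assumed values the model holds 404 billion parameters, but each token uses only 32 billion of them, or about 8%.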

But this design also creates technical challenges: each expert must talk to others to reach a final answer, and that communication must happen at high speed. When experts are spread across many chips, delays can build up. Nvidia argues that this is where its new system stands out. The server, known as GB200 NVL72, houses 72 chips in one machine and connects them with fast internal links.

Nvidia says this setup acts as a single large unit rather than a group of separate devices, which helps remove many of the delays that slow down MoE systems on older platforms.

According to Nvidia, this design is what allowed Moonshot’s Kimi K2 Thinking model to run ten times faster on the NVL72 system than on the previous HGX H200 platform. The company says it saw similar results with DeepSeek-R1 and Mistral Large 3. The models appear near the top of the Artificial Analysis leaderboard, which tracks the performance of open-source systems.

The results are part of Nvidia’s push to show that even if labs need fewer of its chips to train new MoE models, its hardware still plays a central role when those models run in production.

Nvidia’s push to improve how these models run at scale

The long MoE explainer released by Nvidia offers a detailed look at why the architecture has become so widely used. Dense models were the norm for years, with most developers building larger and larger systems.

Those designs required huge amounts of compute and energy because every parameter had to take part in every step. But as the cost of training and serving grew, developers began looking for ways to make models smarter without simply scaling them up.

MoE models aim to solve that. The router inside the model sends each token to a small group of experts. A token about math may go to one set, while a token about images may go to another. This mirrors how the human brain activates different regions based on what it needs to do. Because only a few experts are active at a time, MoE systems can improve quality without using the full size of the model for every user request.
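
The routing step described above can be sketched in a few lines. This is a minimal, generic top-k gating example, not the router of any specific model: the function names, the choice of two experts per token, and the toy scalar "experts" are all assumptions made for illustration.

```python
import math


def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token and return {expert_index: weight}.

    router_logits: one score per expert, as a router would produce.
    Weights are a softmax over the selected experts only, so they sum to 1.
    """
    if not 0 < top_k <= len(router_logits):
        raise ValueError("top_k must be between 1 and the number of experts")
    # Indices of the top_k highest-scoring experts.
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the chosen scores (subtract the max for numerical stability).
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


def moe_layer(token, router_logits, experts, top_k=2):
    """Run only the selected experts and blend their outputs by router weight."""
    weights = route_token(router_logits, top_k)
    return sum(w * experts[i](token) for i, w in weights.items())


# Toy example: four "experts" that are plain scalar functions (hypothetical).
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
logits = [0.1, 2.0, 2.0, -1.0]  # hypothetical router scores for one token
print(route_token(logits))
print(moe_layer(3.0, logits, experts))
```

Only experts 1 and 2 run for this token; the other two are never called, which is where the savings come from.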

The approach has spread fast, and Nvidia says more than 60% of new open-source models released this year use MoE designs. Some of the most prominent include DeepSeek-R1, Mistral Large 3, Kimi K2 Thinking, and OpenAI’s gpt-oss-120B. These systems show how much room there is to push performance without matching the scale of dense models like GPT-4 or earlier large transformers.

But to serve MoE models at scale, systems need more than just a clever design. They need hardware that can move data quickly, hold large sets of experts in memory, and avoid delays as experts communicate. This is where Nvidia’s codesign message comes in. The company argues that the NVL72 platform brings hardware and software together in a way that clears the main bottlenecks.

One of those bottlenecks is memory pressure. Each expert has its own set of parameters, and the system needs to load those parameters on demand. When many experts share a single GPU, memory can fill up and slow the process. By spreading experts across 72 GPUs, Nvidia says the NVL72 machine reduces the amount of work each chip must do. This frees up space and makes it easier to support long inputs and large numbers of users.
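
The memory argument comes down to simple division. The sketch below estimates how much expert weight data lands on the busiest GPU when experts are spread evenly. The expert count and expert size are hypothetical, and the model ignores shared weights, expert replication, and the memory used for user context, so treat it as an illustration of the trend rather than a sizing tool.

```python
import math


def expert_memory_per_gpu(num_experts, bytes_per_expert, num_gpus):
    """Bytes of expert weights on the most heavily loaded GPU.

    Experts are spread as evenly as possible, so the busiest GPU holds
    ceil(num_experts / num_gpus) experts.
    """
    if num_experts <= 0 or num_gpus <= 0:
        raise ValueError("num_experts and num_gpus must be positive")
    experts_on_busiest_gpu = math.ceil(num_experts / num_gpus)
    return experts_on_busiest_gpu * bytes_per_expert


GIB = 1024 ** 3
# Hypothetical model: 288 experts of 2 GiB each.
for gpus in (8, 72):
    used = expert_memory_per_gpu(288, 2 * GIB, gpus)
    print(f"{gpus} GPUs: {used / GIB:.0f} GiB of expert weights per GPU")
```

Under these assumptions, each GPU holds 72 GiB of expert weights in an eight-GPU setup and 8 GiB in a 72-GPU setup, leaving far more room for long inputs and concurrent users.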

Another bottleneck is communication: experts need to exchange information at high speed to produce a complete answer. On older systems, this exchange often moves over slower network connections once the model grows beyond eight GPUs. Nvidia says the NVL72 structure, which links all 72 GPUs through NVLink Switch, allows each chip to talk to any other almost instantly. The company also notes that NVLink Switch can handle part of the work needed to combine expert outputs, which further reduces delays.
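
A back-of-the-envelope model shows why the size of the fast-link domain matters. The sketch below estimates how long one GPU needs to send expert inputs to its peers when some peers sit behind a slower network. The bandwidth values and data size are placeholders chosen for round numbers, not NVLink or network specifications, and the model assumes evenly spread traffic with no latency or overlap.

```python
def dispatch_time_seconds(bytes_to_send, num_gpus, fast_domain_size, fast_bw, slow_bw):
    """Estimate the time for one GPU to send expert inputs to its peers.

    Traffic is assumed to be spread evenly over the other GPUs. Peers inside
    the same fast domain are reached at fast_bw (bytes/s); the rest at slow_bw.
    """
    peers = num_gpus - 1
    if peers <= 0:
        return 0.0
    fast_peers = min(fast_domain_size, num_gpus) - 1
    slow_peers = peers - fast_peers
    per_peer = bytes_to_send / peers
    return fast_peers * per_peer / fast_bw + slow_peers * per_peer / slow_bw


# Placeholder values: 1 GB of traffic, 1 TB/s fast links, 50 GB/s network.
for domain in (8, 72):
    t = dispatch_time_seconds(1e9, 72, domain, fast_bw=1e12, slow_bw=5e10)
    print(f"fast domain of {domain} GPUs: {t * 1e3:.2f} ms")
```

In this toy model, almost all of the time in the eight-GPU case is spent on the slow path, which is the effect Nvidia says the larger domain removes.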

The claims link back to Nvidia’s argument about performance per watt. The company says the NVL72 system can run MoE models with ten times better efficiency than the H200 generation. A higher ratio of tokens per unit of power can translate into lower operating costs for companies that run large AI services. Some cloud providers, including AWS, Azure, Google Cloud, and CoreWeave, are already deploying the NVL72 racks.
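
The link between tokens per unit of power and operating cost is plain arithmetic, sketched below. The throughput, power draw, and electricity price are hypothetical inputs, not measurements of any Nvidia system, and the calculation covers electricity only.

```python
def energy_cost_per_million_tokens(tokens_per_second, power_watts, price_per_kwh):
    """Electricity cost of generating one million tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    seconds = 1_000_000 / tokens_per_second
    kwh = power_watts * seconds / 3_600_000  # watt-seconds (joules) to kWh
    return kwh * price_per_kwh


# Hypothetical baseline: 1,000 tokens/s at 10 kW, electricity at 0.10 per kWh.
baseline = energy_cost_per_million_tokens(1_000, 10_000, 0.10)
# Same power draw, ten times the throughput (the ratio Nvidia claims).
improved = energy_cost_per_million_tokens(10_000, 10_000, 0.10)
print(f"baseline: {baseline:.4f} per million tokens")
print(f"improved: {improved:.4f} per million tokens")
```

A tenfold gain in tokens per watt cuts the electricity cost per token by the same factor, whatever the starting values are.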

Companies building their own models are also testing the system. DeepL says it is using GB200 hardware to train MoE models and improve both training and serving. Fireworks AI has deployed Kimi K2 on the B200 platform and sees NVL72 as a path toward faster and more efficient serving. Together AI says its work with Nvidia has helped meet customer demands for large MoE inference.

All this comes at a time when Nvidia faces growing pressure from other hardware firms. AMD is working on its own server that bundles many high-end chips in a similar way. The company has said it plans to bring that system to market next year. Cerebras is also active in the inference space, offering hardware that uses a very different design from Nvidia’s.

The strong focus on serving, rather than training, means more companies are looking to show they can run the newest models at lower cost and with lower power use.

How other hardware makers are responding

Nvidia’s message is that MoE models fit well with systems that can act as a single large unit. The company is also pointing to its software stack, which includes SGLang, TensorRT-LLM, and other tools designed to support MoE workloads. The tools help split requests across GPUs and assign prefill and decode steps to different parts of the system, which can speed things up.
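
The idea of assigning prefill and decode to different parts of the system can be illustrated with a toy scheduler. This sketch does not use the SGLang or TensorRT-LLM APIs; the class names, GPU labels, and round-robin policy are all hypothetical, and it leaves out the handoff of intermediate state between the two pools that a real system needs.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int
    trace: list = field(default_factory=list)


class DisaggregatedScheduler:
    """Send the prefill and decode phases of a request to separate GPU pools."""

    def __init__(self, prefill_gpus, decode_gpus):
        if not prefill_gpus or not decode_gpus:
            raise ValueError("both pools need at least one GPU")
        # Round-robin over each pool independently.
        self._prefill = cycle(prefill_gpus)
        self._decode = cycle(decode_gpus)

    def schedule(self, request):
        # Prefill: the whole prompt is processed in one pass.
        request.trace.append(("prefill", next(self._prefill), request.prompt_tokens))
        # Decode: output tokens are generated one at a time.
        request.trace.append(("decode", next(self._decode), request.max_new_tokens))
        return request


scheduler = DisaggregatedScheduler(["gpu0", "gpu1"], ["gpu2", "gpu3", "gpu4"])
for n in range(3):
    done = scheduler.schedule(Request(f"req{n}", prompt_tokens=512, max_new_tokens=64))
    print(done.request_id, done.trace)
```

Keeping the two phases in separate pools lets each pool be sized for its own workload instead of forcing one set of GPUs to handle both.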

The company also ties its results to the future of AI. Many multimodal models already activate different parts of the network for different tasks, which is similar to how MoE works. Agent-based systems, which use different components for planning, reasoning, or tool use, follow the same pattern. Nvidia suggests that as these systems grow, the need for hardware that can route data across many chips at high speed will grow with them.

For now, Nvidia aims to show that the shift toward MoE does not weaken its position. Instead, it argues that this shift plays to the strengths of its newest systems, which combine dense GPU clusters with fast internal links. Whether this approach holds as more competitors enter the space will depend on how quickly developers adopt the next wave of models and how much performance they can gain from new designs.

But for the moment, Nvidia’s message is clear: as MoE models become more common, the hardware needed to serve them well is becoming just as important as the hardware used to train them. The company hopes its newest results show that it still has an edge in that part of the market, even as the wider industry continues to move.

AI News is powered by TechForge Media.
