Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs

New Training Recipe Enables Efficient 4-Bit Computation for Massive AI Models

A groundbreaking new method has been developed to train colossal Mixture-of-Experts (MoE) models using 4-bit floating-point (FP4) precision on current-generation Hopper GPUs, overcoming a major hardware limitation. The technique bypasses the lack of native MXFP4 or NVFP4 Tensor Core support by integrating FP4 directly into existing BF16/FP8 training pipelines, delivering significant memory and bandwidth savings without sacrificing model performance. This software-hardware co-design marks a pivotal step toward making the training of trillion-parameter-scale models more feasible and efficient.

Overcoming the Native FP4 Hardware Barrier

The primary obstacle to FP4 training on Hopper-class GPUs like the H100 has been the absence of native 4-bit computation units, which forces inefficient precision conversions. Traditionally, using FP4 would require costly round-trip conversions (e.g., FP4 ↔ BF16 ↔ FP8), negating any potential speed or memory benefits. The new recipe directly addresses this by introducing a streamlined quantization process that avoids these bottlenecks.
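To see why the round trip hurts, a back-of-envelope model of memory traffic is enough. The byte widths below are the formats' actual sizes, but the two-pass cost model is a simplification of ours; it ignores scale factors and any kernel fusion.

```python
# Simplified bytes-moved-per-element model of the two conversion paths.
# Format widths: FP4 = 0.5 B, FP8 = 1 B, BF16 = 2 B (scale factors ignored).
FP4, FP8, BF16 = 0.5, 1.0, 2.0

# Naive round trip: FP4 -> BF16 (read + write), then BF16 -> FP8 (read + write).
naive = (FP4 + BF16) + (BF16 + FP8)   # 5.5 B per element before any compute

# Direct conversion: FP4 -> FP8 in a single pass.
direct = FP4 + FP8                    # 1.5 B per element

print(f"naive: {naive} B/elem  direct: {direct} B/elem  ratio: {naive/direct:.1f}x")
```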

At its core, the method executes the main MoE computations in FP8 for numerical stability. The innovation lies in applying MXFP4 compression specifically to the model's activations and to the data exchanged between experts in an expert-parallel setup. This is achieved through direct FP8-to-FP4 quantization and de-quantization, coupled with a scaling-aware conversion between row-wise and column-wise data layouts.
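To make the MXFP4 step concrete, below is a minimal NumPy sketch of block quantization as defined by the OCP Microscaling (MX) spec: blocks of 32 elements share one power-of-two scale, and each element is rounded onto the E2M1 grid. The function names are ours and the rounding is simplified nearest-value; the paper's fused GPU kernels and its direct FP8-to-FP4 path are not reproduced here.

```python
import numpy as np

# Non-negative magnitudes representable in E2M1, the MXFP4 element format.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(x: np.ndarray, block: int = 32):
    """Quantize to MXFP4: each block of 32 values shares one power-of-two
    scale; every value is rounded to the nearest E2M1 grid point."""
    assert x.size % block == 0, "pad the tensor to a multiple of the block size"
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Shared per-block scale: 2**(floor(log2(amax)) - 2), since the largest
    # E2M1 magnitude is 6.0 = 1.5 * 2**2 (the OCP MX scale convention).
    exp = np.floor(np.log2(np.maximum(amax, 2.0**-126))) - 2
    scale = 2.0 ** exp
    mag = np.clip(np.abs(x) / scale, 0.0, 6.0)
    codes = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)  # nearest value
    q = np.sign(x) * E2M1_GRID[codes]
    return q, scale  # a real kernel would pack 4-bit codes, not float values

def mxfp4_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).ravel()

x = np.random.randn(1024)
q, s = mxfp4_quantize(x)
print("mean abs error:", np.abs(mxfp4_dequantize(q, s) - x).mean())
```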

Substantial Performance Gains at Scale

The efficiency gains are substantial, particularly for models at the frontier of AI scale. In tests with a massive 671B-parameter model, the FP4 method matched the end-to-end training quality of strong FP8 baselines, showing that the lower precision does not harm convergence.

More importantly, it delivered concrete hardware benefits: a 14.8% reduction in peak activation memory (saving 11.8 GB) and a 12.5% improvement in training throughput, which rose from 1,157 to 1,302 tokens per GPU per second. These savings directly alleviate two of the most critical bottlenecks in large-scale training: activation memory and expert-parallel communication bandwidth.
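As a quick sanity check, the reported figures are internally consistent; note that the implied baseline below is derived from them, not a number stated in the source.

```python
# Internal consistency of the reported results.
saved_gb, reduction = 11.8, 0.148
baseline_gb = saved_gb / reduction   # ~79.7 GB implied baseline peak activation memory
throughput_gain = 1302 / 1157 - 1    # ~0.1253, matching the 12.5% claim
print(f"implied baseline: {baseline_gb:.1f} GB, throughput gain: {throughput_gain:.1%}")
```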

Why This Breakthrough Matters for AI Development

This research provides a crucial bridge, enabling next-generation efficiency on current-generation hardware. It demonstrates that the industry does not need to wait for future GPUs with native FP4 support to begin leveraging 4-bit precision for training. The implications for developing even larger and more capable models are significant.

  • Enables Larger Models: By drastically reducing activation memory, researchers can train models with more parameters or larger batch sizes on existing hardware clusters.
  • Lowers Training Cost: The 12.5% throughput improvement translates directly into faster training times and lower computational costs, a key factor for organizations running million-dollar training jobs.
  • Software-Hardware Co-Design: This work is a prime example of how innovative software can extract new levels of performance from hardware, extending its useful lifecycle and capabilities.
  • Path to Trillion-Parameter Models: Memory and communication optimizations like this one are essential building blocks for the sustainable development of future trillion-parameter AI systems.
