Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training

1University of Illinois Urbana-Champaign, 2Microsoft, 3Stasosphere
UCP main results image.

UCP boosts large-scale training efficiency:
  • 🚀 Flexibly change parallelism (PP, SP, TP, ZeRO-DP) or GPU count mid-training
  • 🚀 Improve resilience by scaling down to the remaining healthy nodes
  • 🚀 Increase throughput by scaling up when elastic capacity becomes available

Abstract

Existing checkpointing approaches are ill-suited for distributed training, even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configuration of the training run, and are thus unusable on different configurations. To address this problem, we propose Universal Checkpointing, a technique that enables efficient checkpoint creation while providing the flexibility of resuming on arbitrary parallelism strategies and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training, such as improved resilience to hardware failures through continued training on the remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity.

The key insight of Universal Checkpointing is the selection of the optimal representation in each phase of the checkpointing life cycle: a distributed representation for saving, and a consolidated representation for loading. This is achieved using two key mechanisms. First, the universal checkpoint format, which consists of a consolidated representation of each model parameter and metadata for mapping parameter fragments onto the training ranks of an arbitrary model-parallelism configuration. Second, the universal checkpoint language, a simple but powerful specification language for converting distributed checkpoints into the universal checkpoint format. Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures and a wide range of parallelism techniques.
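To make this concrete, below is a minimal Python sketch, not DeepSpeed's actual implementation, of these two representations: each parameter is stored once in a consolidated, parallelism-agnostic form alongside re-sharding metadata, and a target run can slice it for its own parallelism degree. All names here (ucp_dir, fp32.pt, tp_slice_dim, and the helper functions) are illustrative assumptions.

import os
import torch

def save_consolidated_param(ucp_dir, name, full_tensor):
    # Store one parameter in its consolidated (unsharded) representation,
    # one sub-directory per parameter.
    param_dir = os.path.join(ucp_dir, name)
    os.makedirs(param_dir, exist_ok=True)
    torch.save(full_tensor, os.path.join(param_dir, "fp32.pt"))
    # Metadata records how a target run may map fragments of this tensor onto
    # its ranks, e.g. the dimension along which tensor parallelism slices it.
    torch.save({"shape": tuple(full_tensor.shape), "tp_slice_dim": 0},
               os.path.join(param_dir, "meta.pt"))

def load_param_fragment(ucp_dir, name, tp_rank, tp_size):
    # Re-shard the consolidated tensor for an arbitrary target TP degree;
    # this is what allows resuming on a configuration different from the
    # one that produced the checkpoint.
    param_dir = os.path.join(ucp_dir, name)
    full = torch.load(os.path.join(param_dir, "fp32.pt"))
    meta = torch.load(os.path.join(param_dir, "meta.pt"))
    return torch.chunk(full, tp_size, dim=meta["tp_slice_dim"])[tp_rank]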

Method

UCP method.

UCP supports flexible checkpoint transformation across any training parallelism technique (e.g., ZeRO-DP, TP, PP, SP). It enables elastic resource management, allowing training and fine-tuning to easily scale up or down with the available hardware. UCP includes a convenient language-integrated programming interface that lets users describe various parallelism patterns, and provides transformation operations to easily convert distributed checkpoints into UCP. UCP also provides cross-framework support, enabling training to resume from checkpoints produced by other popular training frameworks, such as HuggingFace Transformers/Accelerate and PyTorch Lightning, with DeepSpeed as the backend.
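As a hypothetical illustration of what such a language-integrated interface might look like, the sketch below maps parameter-name patterns to transformation operations that consolidate distributed fragments into the universal form; the rule table and operation names (union, identity) are assumptions for illustration, not DeepSpeed's actual API.

import re
import torch

def union(dim):
    # Concatenate fragments that tensor parallelism split along `dim`.
    return lambda frags: torch.cat(frags, dim=dim)

def identity():
    # Parameters that were never sharded pass through unchanged.
    return lambda frags: frags[0]

# A UCP-style "program": parameter-name patterns mapped to the operation
# that consolidates their fragments into the universal representation.
RULES = [
    (r".*attention\.query_key_value\.weight", union(dim=0)),  # column-parallel
    (r".*mlp\.dense_4h_to_h\.weight", union(dim=1)),          # row-parallel
    (r".*", identity()),                                       # unsharded default
]

def consolidate(name, fragments):
    for pattern, op in RULES:
        if re.fullmatch(pattern, name):
            return op(fragments)

At the time of writing, DeepSpeed ships a conversion utility (ds_to_universal.py) that performs this step, turning a distributed checkpoint into the UCP format before training resumes under the new configuration.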

Evaluation

We evaluate UCP on several real-world large-scale LLMs, including Megatron-LM GPT, LLaMA, and sparse MoE models. Our evaluation results show that UCP enables resuming training with a wide range of parallelism strategies, such as ZeRO-1/2/3 data parallelism, tensor-slicing parallelism, pipeline parallelism, and sequence parallelism, on elastic resources without compromising model quality. Our evaluation also shows that UCP is lightweight: it adds zero overhead when saving checkpoints, and resuming training with a different parallelism strategy incurs only the small cost of the UCP transformation. UCP has been used to train the BLOOM 176B model and several real-world large-scale models at Microsoft, such as Phi-3, greatly improving these models' resilience to hardware failures during training and reducing their training time by exploiting elastic capacity.

BibTeX

@article{ucp2024,
  author    = {Lian, Xinyu and Jacobs, Sam Ade and Kurilenko, Lev and Tanaka, Masahiro and Bekman, Stas and Ruwase, Olatunji and Zhang, Minjia},
  title     = {Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training},
  journal   = {arXiv preprint arXiv:2406.18820},
  year      = {2024},
}