Existing checkpointing approaches are ill-suited for distributed training even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configuration of the training run, and thus unusable on different configurations. To address this problem, we propose Universal Checkpointing (UCP), a technique that enables efficient checkpoint creation while providing the flexibility to resume training on arbitrary parallelism strategies and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training, such as improved resilience to hardware failures through continued training on the remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity.
UCP supports flexible checkpoint transformation across training parallelism techniques (e.g., ZeRO-DP, TP, PP, SP). It enables elastic resource management, allowing training and fine-tuning to scale up or down easily with varying hardware resources. UCP includes a convenient language-integrated programming interface that lets users describe various parallelism patterns and provides transformation operations to easily transform distributed checkpoints into UCP. UCP also provides cross-framework support, enabling the resumption of training from checkpoints produced by other popular training frameworks, such as HuggingFace Accelerate and PyTorch Lightning with DeepSpeed as a backend.
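As a concrete illustration of this workflow, the sketch below shows how a ZeRO-sharded DeepSpeed checkpoint might be transformed into the universal format offline and then loaded to resume training on a different GPU count or parallelism layout. The conversion entry point (`ds_to_universal`), the `load_universal` checkpoint option, the checkpoint paths, and the toy model are assumptions based on the DeepSpeed implementation and may vary across versions; treat this as a sketch rather than the definitive interface.

```python
# Step 1 (offline): transform the distributed checkpoint into UCP format.
# Assumed DeepSpeed entry point; script and flag names may differ by version:
#   python -m deepspeed.checkpoint.ds_to_universal \
#       --input_folder  ckpts/global_step1000 \
#       --output_folder ckpts/global_step1000_universal

# Step 2: resume training on a new hardware/parallelism configuration by
# pointing DeepSpeed at the universal checkpoint.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 1},
    # Assumed config knob telling DeepSpeed the checkpoint is in UCP form.
    "checkpoint": {"load_universal": True},
}

model = torch.nn.Linear(1024, 1024)  # stand-in for the real model

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# load_checkpoint re-shards the consolidated parameter and optimizer state
# to match whatever data/tensor/pipeline parallel layout this run uses.
engine.load_checkpoint("ckpts", tag="global_step1000_universal")
```

The same converted checkpoint can then be reused across runs with different accelerator counts or parallelism degrees, which is what enables the elastic scaling and failure recovery described above.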