graph LR
DeepSpeedEngine["DeepSpeedEngine"]
deepspeed_checkpoint["deepspeed_checkpoint"]
state_dict_factory["state_dict_factory"]
deepspeed_runtime_zero["deepspeed.runtime.zero"]
swap_tensor["swap_tensor"]
data_parallel_writer_factory["data_parallel_writer_factory"]
ds_to_universal["ds_to_universal"]
DeepSpeedEngine -- "orchestrates" --> deepspeed_checkpoint
DeepSpeedEngine -- "interacts with" --> deepspeed_runtime_zero
DeepSpeedEngine -- "integrates with" --> data_parallel_writer_factory
deepspeed_checkpoint -- "leverages" --> state_dict_factory
deepspeed_runtime_zero -- "depends on" --> swap_tensor
deepspeed_runtime_zero -- "provides partitioned state information to" --> ds_to_universal
state_dict_factory -- "may interact with" --> deepspeed_runtime_zero
This subsystem handles model weights, optimizer states, and other training/inference state within DeepSpeed. It covers saving, loading, partitioning, and converting these states, with a strong emphasis on supporting DeepSpeed's parallelism modes (TP, PP, DP, ZeRO) and optimized checkpoint formats.
DeepSpeedEngine: The central orchestrator for training and inference, responsible for initiating and managing the checkpointing process (saving and loading model/optimizer states). It acts as the high-level interface for state management within the DeepSpeed training loop.
Related Classes/Methods:
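The orchestration flow can be illustrated with a toy stand-in. Real DeepSpeed exposes `engine.save_checkpoint(save_dir, tag)` and `engine.load_checkpoint(load_dir, tag)`; the `ToyEngine` class below is a hypothetical simplification showing only the high-level pattern (gather state, delegate the write, restore on load), not the actual implementation:

```python
import json
import os
import tempfile

class ToyEngine:
    """Toy stand-in for DeepSpeedEngine's checkpoint orchestration.

    Only mimics the high-level flow: gather model/optimizer state,
    delegate the write, and restore both on load.
    """

    def __init__(self, model_state, optimizer_state):
        self.model_state = model_state
        self.optimizer_state = optimizer_state

    def save_checkpoint(self, save_dir, tag):
        # The real engine writes one file per rank; this toy writes one file.
        path = os.path.join(save_dir, f"{tag}.json")
        with open(path, "w") as f:
            json.dump({"model": self.model_state,
                       "optimizer": self.optimizer_state}, f)
        return path

    def load_checkpoint(self, load_dir, tag):
        path = os.path.join(load_dir, f"{tag}.json")
        with open(path) as f:
            state = json.load(f)
        self.model_state = state["model"]
        self.optimizer_state = state["optimizer"]
        return path
```

In real usage the tag (e.g. `global_step3`) names a checkpoint generation, and the engine delegates the actual serialization to the checkpointing subsystem described below.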
deepspeed_checkpoint: Manages the core DeepSpeed checkpointing logic, including initialization, validation, and building mappings for the different parallelism types (TP, PP, DP). It provides methods to retrieve specific parts of the model state and checkpoint metadata, and handles reshaping for Megatron-LM 2D parallelism.
Related Classes/Methods:
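Building those parallelism mappings amounts to translating flat global ranks into 3D coordinates. The sketch below assumes the common Megatron-style ordering in which the tensor-parallel index varies fastest, then pipeline, then data parallel; the actual mapping logic in deepspeed_checkpoint is more involved:

```python
def rank_to_coords(global_rank, tp_degree, pp_degree, dp_degree):
    """Map a flat global rank to (tp, pp, dp) coordinates.

    Assumes tp varies fastest, then pp, then dp -- the usual
    Megatron-style layout. A simplified sketch, not the real mapping code.
    """
    assert global_rank < tp_degree * pp_degree * dp_degree
    tp = global_rank % tp_degree
    pp = (global_rank // tp_degree) % pp_degree
    dp = global_rank // (tp_degree * pp_degree)
    return tp, pp, dp
```

Given such a map, the checkpoint layer can locate which shard files belong to a given tensor-parallel or pipeline-parallel slice when validating or reshaping a checkpoint.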
state_dict_factory: Acts as a factory for loading and manipulating state dictionaries. It provides functionality to get and set modules within a state dictionary, and to split or merge state dictionaries, which is crucial for handling different model parallelisms and checkpoint formats.
Related Classes/Methods:
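Merging tensor-parallel shards is the representative operation here: each parameter was split along some axis, and the factory must concatenate the pieces back along that same axis. The sketch below uses nested Python lists in place of torch tensors and a hypothetical `axis_by_key` map; the real merge path operates on tensors and derives the split axis from the parameter type:

```python
def merge_tp_shards(shards, axis_by_key):
    """Merge tensor-parallel state-dict shards into one state dict.

    `shards` is a list of dicts mapping parameter names to 2D nested
    lists (stand-ins for tensors); `axis_by_key` says along which axis
    (0 = rows, 1 = columns) each parameter was split.
    """
    merged = {}
    for key in shards[0]:
        parts = [s[key] for s in shards]
        if axis_by_key.get(key, 0) == 0:
            # Row-wise split: stack the row blocks.
            merged[key] = [row for p in parts for row in p]
        else:
            # Column-wise split: concatenate each row across shards.
            merged[key] = [sum((p[i] for p in parts), [])
                           for i in range(len(parts[0]))]
    return merged
```

Splitting is the inverse operation, and together they let a checkpoint saved at one tensor-parallel degree be loaded at another.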
deepspeed.runtime.zero: Manages the partitioning, communication, and reassembly of model parameters, gradients, and optimizer states for the ZeRO (Zero Redundancy Optimizer) stages. This is fundamental for enabling large-model training by distributing the memory load. The stage3 module specifically handles offloading to CPU/NVMe.
Related Classes/Methods:
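The core partitioning idea can be sketched in a few lines: flatten the parameters, pad so the total divides evenly, and give each data-parallel rank one contiguous slice, so each rank holds optimizer state for only 1/world_size of the model. This is a conceptual sketch, not the real ZeRO implementation, which works on tensor buffers and handles alignment:

```python
def zero_partition(flat_params, world_size):
    """Partition a flat parameter list across data-parallel ranks.

    Pads with zeros so every rank owns an equal-sized contiguous slice,
    mirroring how ZeRO shards the flattened parameter/optimizer state.
    """
    pad = (-len(flat_params)) % world_size
    padded = flat_params + [0.0] * pad
    n = len(padded) // world_size
    return [padded[r * n:(r + 1) * n] for r in range(world_size)]
```

At step time, each rank updates only its own slice and an all-gather reassembles the full parameters when they are needed for the forward/backward pass.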
swap_tensor: Manages the dynamic movement of tensors (parameters, gradients, optimizer states) between memory tiers (GPU, CPU, NVMe). This is a critical enabler for ZeRO-Offload and allows training models larger than available GPU memory.
Related Classes/Methods:
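The placement policy can be pictured as an eviction cache over tiers. The sketch below tracks which tier each named tensor lives on and evicts the least recently used tensor from the GPU tier when a count-based capacity is exceeded; the real swap_tensor machinery moves actual buffers asynchronously, budgets in bytes, and includes an NVMe tier:

```python
class TensorSwapper:
    """Minimal sketch of tiered tensor placement (GPU -> CPU).

    Evicts the least recently used tensor from the GPU tier when its
    capacity (here a simple tensor count) is exceeded.
    """

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = []          # LRU order, oldest first
        self.tier = {}         # name -> "gpu" | "cpu"

    def touch(self, name):
        """Bring `name` onto the GPU tier, evicting if necessary."""
        if name in self.gpu:
            self.gpu.remove(name)
        elif len(self.gpu) >= self.gpu_capacity:
            evicted = self.gpu.pop(0)
            self.tier[evicted] = "cpu"   # swap out to host memory
        self.gpu.append(name)
        self.tier[name] = "gpu"
```

The key design point is that swapping is driven by access order, so tensors needed for the current layer's compute are resident while cold optimizer state sits on slower tiers.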
data_parallel_writer_factory: Provides the logic for creating configurations for data-parallel checkpoint writers. It determines how to slice the model state and assign writer resources for efficient parallel saving across different data and model parallelism configurations (DDP, 2D, 3D).
Related Classes/Methods:
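The planning step boils down to distributing named state slices across a pool of writers so the save is spread evenly. The round-robin assignment below is a hypothetical simplification; the real factory also accounts for the DP/TP/PP topology when choosing which ranks act as writers:

```python
def assign_writers(param_slices, num_writers):
    """Round-robin parameter slices across parallel checkpoint writers.

    Returns a plan mapping each writer index to the list of slice names
    it is responsible for saving.
    """
    plan = {w: [] for w in range(num_writers)}
    for i, name in enumerate(param_slices):
        plan[i % num_writers].append(name)
    return plan
```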
ds_to_universal: Converts DeepSpeed-specific checkpoint formats, particularly those produced by ZeRO, into a more universal, consolidated format. This is crucial for interoperability and for analyzing checkpoints outside the DeepSpeed ecosystem.
Related Classes/Methods:
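Conceptually, consolidation stitches the per-rank slices back into full parameters. In the toy sketch below each shard maps a parameter name to an `(offset, values)` fragment, an assumed layout used here for illustration; the real tool reads ZeRO shard files and emits a framework-agnostic directory of full tensors:

```python
def consolidate_shards(shards):
    """Stitch per-rank ZeRO fragments back into full flat parameters.

    Each shard maps a parameter name to (offset, values); the result
    holds the fully materialized value list per name, regardless of
    the order in which shards are visited.
    """
    full = {}
    for shard in shards:
        for name, (offset, values) in shard.items():
            buf = full.setdefault(name, [])
            need = offset + len(values)
            if len(buf) < need:
                buf.extend([0.0] * (need - len(buf)))  # grow to fit
            buf[offset:need] = values
    return full
```

Because each rank only ever held its own slice (see the ZeRO partitioning above), this gather step is what makes a partitioned checkpoint loadable at a different world size or outside DeepSpeed entirely.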