graph LR
DeepSpeedEngine["DeepSpeedEngine"]
deepspeed_checkpoint["deepspeed_checkpoint"]
state_dict_factory["state_dict_factory"]
deepspeed_runtime_zero["deepspeed.runtime.zero"]
swap_tensor["swap_tensor"]
data_parallel_writer_factory["data_parallel_writer_factory"]
ds_to_universal["ds_to_universal"]
DeepSpeedEngine -- "orchestrates" --> deepspeed_checkpoint
DeepSpeedEngine -- "interacts with" --> deepspeed_runtime_zero
DeepSpeedEngine -- "integrates with" --> data_parallel_writer_factory
deepspeed_checkpoint -- "leverages" --> state_dict_factory
deepspeed_runtime_zero -- "depends on" --> swap_tensor
deepspeed_runtime_zero -- "provides partitioned state information to" --> ds_to_universal
state_dict_factory -- "may interact with" --> deepspeed_runtime_zero
This subsystem handles model weights, optimizer states, and other training/inference state within DeepSpeed. It covers saving, loading, partitioning, and converting these states, with a strong emphasis on supporting DeepSpeed's parallelism modes (TP, PP, DP, ZeRO) and optimized checkpoint formats.
DeepSpeedEngine: The central orchestrator for training and inference, responsible for initiating and managing the checkpointing process (saving and loading model/optimizer states). It acts as the high-level interface for state management within the DeepSpeed training loop.
Related Classes/Methods:
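The orchestration flow can be illustrated with a toy stand-in. Real DeepSpeed exposes `engine.save_checkpoint(save_dir, tag)` and `engine.load_checkpoint(load_dir, tag)`; the `ToyEngine` class below is a hypothetical simplification showing only the high-level pattern (gather state, delegate the write, restore on load), not the actual implementation:

```python
import json
import os
import tempfile

class ToyEngine:
    """Toy stand-in for DeepSpeedEngine's checkpoint orchestration.

    Only mimics the high-level flow: gather model/optimizer state,
    delegate the write, and restore both on load.
    """

    def __init__(self, model_state, optimizer_state):
        self.model_state = model_state
        self.optimizer_state = optimizer_state

    def save_checkpoint(self, save_dir, tag):
        # The real engine writes one file per rank; this toy writes one file.
        path = os.path.join(save_dir, f"{tag}.json")
        with open(path, "w") as f:
            json.dump({"model": self.model_state,
                       "optimizer": self.optimizer_state}, f)
        return path

    def load_checkpoint(self, load_dir, tag):
        path = os.path.join(load_dir, f"{tag}.json")
        with open(path) as f:
            state = json.load(f)
        self.model_state = state["model"]
        self.optimizer_state = state["optimizer"]
        return path
```

In real usage the tag (e.g. `global_step3`) names a checkpoint generation, and the engine delegates the actual serialization to the checkpointing subsystem described below.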
deepspeed_checkpoint: Manages the core DeepSpeed checkpointing logic, including initialization, validation, and building mappings for the different parallelism types (TP, PP, DP). It provides methods to retrieve specific parts of the model state and checkpoint metadata, and handles reshaping for Megatron-LM 2D parallelism.
Related Classes/Methods:
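Building those parallelism mappings amounts to translating flat global ranks into 3D coordinates. The sketch below assumes the common Megatron-style ordering in which the tensor-parallel index varies fastest, then pipeline, then data parallel; the actual mapping logic in deepspeed_checkpoint is more involved:

```python
def rank_to_coords(global_rank, tp_degree, pp_degree, dp_degree):
    """Map a flat global rank to (tp, pp, dp) coordinates.

    Assumes tp varies fastest, then pp, then dp -- the usual
    Megatron-style layout. A simplified sketch, not the real mapping code.
    """
    assert global_rank < tp_degree * pp_degree * dp_degree
    tp = global_rank % tp_degree
    pp = (global_rank // tp_degree) % pp_degree
    dp = global_rank // (tp_degree * pp_degree)
    return tp, pp, dp
```

Given such a map, the checkpoint layer can locate which shard files belong to a given tensor-parallel or pipeline-parallel slice when validating or reshaping a checkpoint.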
state_dict_factory: Acts as a factory for loading and manipulating state dictionaries. It provides functionality to get and set modules within a state dictionary, and to split or merge state dictionaries, which is crucial for handling different model parallelisms and checkpoint formats.
Related Classes/Methods:
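Merging tensor-parallel shards is the representative operation here: each parameter was split along some axis, and the factory must concatenate the pieces back along that same axis. The sketch below uses nested Python lists in place of torch tensors and a hypothetical `axis_by_key` map; the real merge path operates on tensors and derives the split axis from the parameter type:

```python
def merge_tp_shards(shards, axis_by_key):
    """Merge tensor-parallel state-dict shards into one state dict.

    `shards` is a list of dicts mapping parameter names to 2D nested
    lists (stand-ins for tensors); `axis_by_key` says along which axis
    (0 = rows, 1 = columns) each parameter was split.
    """
    merged = {}
    for key in shards[0]:
        parts = [s[key] for s in shards]
        if axis_by_key.get(key, 0) == 0:
            # Row-wise split: stack the row blocks.
            merged[key] = [row for p in parts for row in p]
        else:
            # Column-wise split: concatenate each row across shards.
            merged[key] = [sum((p[i] for p in parts), [])
                           for i in range(len(parts[0]))]
    return merged
```

Splitting is the inverse operation, and together they let a checkpoint saved at one tensor-parallel degree be loaded at another.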
deepspeed.runtime.zero: Manages the partitioning, communication, and reassembly of model parameters, gradients, and optimizer states for the ZeRO (Zero Redundancy Optimizer) stages. This is fundamental for enabling large-model training by distributing the memory load. The stage3 module specifically handles offloading to CPU/NVMe.
Related Classes/Methods:
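The core partitioning idea can be sketched in a few lines: flatten the parameters, pad so the total divides evenly, and give each data-parallel rank one contiguous slice, so each rank holds optimizer state for only 1/world_size of the model. This is a conceptual sketch, not the real ZeRO implementation, which works on tensor buffers and handles alignment:

```python
def zero_partition(flat_params, world_size):
    """Partition a flat parameter list across data-parallel ranks.

    Pads with zeros so every rank owns an equal-sized contiguous slice,
    mirroring how ZeRO shards the flattened parameter/optimizer state.
    """
    pad = (-len(flat_params)) % world_size
    padded = flat_params + [0.0] * pad
    n = len(padded) // world_size
    return [padded[r * n:(r + 1) * n] for r in range(world_size)]
```

At step time, each rank updates only its own slice and an all-gather reassembles the full parameters when they are needed for the forward/backward pass.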
swap_tensor: Manages the dynamic movement of tensors (parameters, gradients, optimizer states) between memory tiers (GPU, CPU, NVMe). This is a critical enabler for ZeRO-Offload and allows training models larger than available GPU memory.
Related Classes/Methods:
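The placement policy can be pictured as an eviction cache over tiers. The sketch below tracks which tier each named tensor lives on and evicts the least recently used tensor from the GPU tier when a count-based capacity is exceeded; the real swap_tensor machinery moves actual buffers asynchronously, budgets in bytes, and includes an NVMe tier:

```python
class TensorSwapper:
    """Minimal sketch of tiered tensor placement (GPU -> CPU).

    Evicts the least recently used tensor from the GPU tier when its
    capacity (here a simple tensor count) is exceeded.
    """

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = []          # LRU order, oldest first
        self.tier = {}         # name -> "gpu" | "cpu"

    def touch(self, name):
        """Bring `name` onto the GPU tier, evicting if necessary."""
        if name in self.gpu:
            self.gpu.remove(name)
        elif len(self.gpu) >= self.gpu_capacity:
            evicted = self.gpu.pop(0)
            self.tier[evicted] = "cpu"   # swap out to host memory
        self.gpu.append(name)
        self.tier[name] = "gpu"
```

The key design point is that swapping is driven by access order, so tensors needed for the current layer's compute are resident while cold optimizer state sits on slower tiers.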
data_parallel_writer_factory: Provides the logic for creating configurations for data-parallel checkpoint writers. It determines how to slice the model state and assign writer resources for efficient parallel saving across different data and model parallelism configurations (DDP, 2D, 3D).
Related Classes/Methods:
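The planning step boils down to distributing named state slices across a pool of writers so the save is spread evenly. The round-robin assignment below is a hypothetical simplification; the real factory also accounts for the DP/TP/PP topology when choosing which ranks act as writers:

```python
def assign_writers(param_slices, num_writers):
    """Round-robin parameter slices across parallel checkpoint writers.

    Returns a plan mapping each writer index to the list of slice names
    it is responsible for saving.
    """
    plan = {w: [] for w in range(num_writers)}
    for i, name in enumerate(param_slices):
        plan[i % num_writers].append(name)
    return plan
```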
ds_to_universal: Converts DeepSpeed-specific checkpoint formats, particularly those produced by ZeRO, into a more universal, consolidated format. This is crucial for interoperability and for analyzing checkpoints outside the DeepSpeed ecosystem.
Related Classes/Methods:
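Conceptually, consolidation stitches the per-rank slices back into full parameters. In the toy sketch below each shard maps a parameter name to an `(offset, values)` fragment, an assumed layout used here for illustration; the real tool reads ZeRO shard files and emits a framework-agnostic directory of full tensors:

```python
def consolidate_shards(shards):
    """Stitch per-rank ZeRO fragments back into full flat parameters.

    Each shard maps a parameter name to (offset, values); the result
    holds the fully materialized value list per name, regardless of
    the order in which shards are visited.
    """
    full = {}
    for shard in shards:
        for name, (offset, values) in shard.items():
            buf = full.setdefault(name, [])
            need = offset + len(values)
            if len(buf) < need:
                buf.extend([0.0] * (need - len(buf)))  # grow to fit
            buf[offset:need] = values
    return full
```

Because each rank only ever held its own slice (see the ZeRO partitioning above), this gather step is what makes a partitioned checkpoint loadable at a different world size or outside DeepSpeed entirely.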