```mermaid
graph LR
    ParameterServer["ParameterServer"]
    __init__["__init__"]
    start["start"]
    stop["stop"]
    run["run"]
    action_listener["action_listener"]
    get_parameters["get_parameters"]
    update_parameters["update_parameters"]
    ParameterServer -- "defines" --> __init__
    ParameterServer -- "defines" --> start
    ParameterServer -- "defines" --> stop
    ParameterServer -- "defines" --> run
    ParameterServer -- "defines" --> action_listener
    ParameterServer -- "defines" --> get_parameters
    ParameterServer -- "defines" --> update_parameters
    start -- "can call" --> stop
    run -- "delegates to" --> action_listener
    action_listener -- "calls" --> get_parameters
    action_listener -- "calls" --> update_parameters
```
Details

The Parameter Server subsystem is a critical component within Elephas, designed to manage and synchronize model parameters in a distributed deep learning environment. It embodies the central coordination point for data parallelism, ensuring all worker nodes operate with a consistent and up-to-date global model.

ParameterServer

The core entity of the subsystem, responsible for centralizing and synchronizing model parameters (weights and gradients) across worker nodes. It acts as the authoritative source for the global model state, enabling data parallelism by providing workers with the current model and integrating their computed updates.

Related Classes/Methods:
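
To make the server's role as the central synchronization point concrete, here is a minimal in-memory sketch (names and structure are illustrative, not Elephas's actual implementation, which exposes these operations over HTTP): workers pull a snapshot of the global weights and push their updates back, with a lock guaranteeing a consistent global state.

```python
import threading

# Minimal sketch of a parameter server's role in data parallelism.
# Hypothetical names; the real server serves these operations over HTTP.
class ParameterServer:
    def __init__(self, weights):
        self._weights = list(weights)   # authoritative global model state
        self._lock = threading.Lock()   # serialize concurrent worker access

    def get_parameters(self):
        with self._lock:
            return list(self._weights)  # hand workers a consistent snapshot

    def update_parameters(self, delta):
        with self._lock:                # integrate a worker's computed update
            self._weights = [w + d for w, d in zip(self._weights, delta)]

server = ParameterServer([0.0, 0.0])
# Two "workers" push updates concurrently; the lock keeps state consistent.
threads = [threading.Thread(target=server.update_parameters, args=([1.0, 2.0],))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(server.get_parameters())  # [2.0, 4.0]
```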

__init__

Initializes the ParameterServer instance. This includes setting up the underlying communication mechanism (e.g., a Flask web service) that will listen for incoming requests from worker nodes, and initializing the model parameters.

Related Classes/Methods:
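
A hedged sketch of what such a constructor sets up, under the assumption that the initial weights are kept in serialized form ready to ship to workers (field names like `pickled_weights` and the port default are illustrative, not Elephas's API):

```python
import pickle
import threading

# Hypothetical constructor sketch; the real __init__ additionally wires a
# web service (e.g. Flask routes) to the handler methods.
class ParameterServer:
    def __init__(self, model_weights, port=4000):
        self.port = port
        # Keep the authoritative state serialized, ready to send to workers.
        self.pickled_weights = pickle.dumps(model_weights)
        self.server_thread = None              # created later by start()
        self.shutdown_event = threading.Event()

ps = ParameterServer([0.1, 0.2])
print(pickle.loads(ps.pickled_weights))  # [0.1, 0.2]
```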

start

Initiates the server's operation, bringing it online to actively listen for and process parameter requests from worker nodes. It sets up the server to begin accepting connections.

Related Classes/Methods:
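
A common way to bring such a server online without blocking the caller is to launch the serve loop in a background thread; the sketch below assumes this pattern (the `run` body here is a placeholder, not the real serve loop):

```python
import threading
import time

# Sketch of start()/stop() wiring: the serve loop runs on a daemon thread so
# the driver process is not blocked. Names are illustrative assumptions.
class ParameterServer:
    def __init__(self):
        self.shutdown_event = threading.Event()
        self.server_thread = None

    def run(self):
        # Placeholder serve loop: keep listening until asked to shut down.
        while not self.shutdown_event.is_set():
            time.sleep(0.01)

    def start(self):
        if self.server_thread is None:  # make repeated start() calls harmless
            self.server_thread = threading.Thread(target=self.run, daemon=True)
            self.server_thread.start()

    def stop(self):
        self.shutdown_event.set()       # signal the loop to exit
        self.server_thread.join()       # wait for a clean shutdown

ps = ParameterServer()
ps.start()
print(ps.server_thread.is_alive())  # True
ps.stop()
```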

stop

Halts the server's operation, ensuring a graceful shutdown. This involves releasing resources and closing communication channels to prevent leaks or orphaned processes.

Related Classes/Methods:
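
One standard graceful-shutdown pattern for a socket-based listener, sketched below under the assumption of a blocking accept loop (the socket and thread names are illustrative): closing the listening socket unblocks `accept()`, and joining the thread guarantees no orphaned resources remain.

```python
import socket
import threading

# Sketch of a graceful stop(): close the listening socket to break the
# accept loop, then join the serving thread. Illustrative, not Elephas's code.
class ParameterServer:
    def __init__(self):
        self.sock = socket.socket()
        self.sock.bind(("127.0.0.1", 0))   # ephemeral port for the sketch
        self.sock.listen()
        self.thread = threading.Thread(target=self._serve, daemon=True)
        self.thread.start()

    def _serve(self):
        try:
            while True:
                conn, _ = self.sock.accept()
                conn.close()
        except OSError:        # raised once stop() closes the socket
            pass

    def stop(self):
        self.sock.close()      # release the port and unblock accept()
        self.thread.join()     # ensure the loop has fully exited

ps = ParameterServer()
ps.stop()
print(ps.thread.is_alive())  # False
```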

run

Executes the server's continuous operation loop. It is responsible for keeping the server alive, constantly listening for incoming requests, and delegating their processing to the appropriate handlers.

Related Classes/Methods:
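
The shape of such a loop can be sketched as follows, with a simple queue standing in for the HTTP layer (all names are assumptions for illustration): the loop stays alive until shutdown is requested and delegates each incoming request to the handler.

```python
import queue
import threading
import time

# Sketch of run(): a continuous loop that keeps the server alive and
# delegates each incoming request to action_listener. Illustrative only.
class ParameterServer:
    def __init__(self):
        self.requests = queue.Queue()   # stand-in for the network transport
        self.handled = []
        self.shutdown_event = threading.Event()

    def action_listener(self, request):
        self.handled.append(request)    # real handler dispatches on intent

    def run(self):
        while not self.shutdown_event.is_set():
            try:
                request = self.requests.get(timeout=0.05)
            except queue.Empty:
                continue                # nothing pending; keep listening
            self.action_listener(request)

ps = ParameterServer()
t = threading.Thread(target=ps.run)
t.start()
ps.requests.put({"action": "get"})
ps.requests.put({"action": "update"})
time.sleep(0.3)                         # give the loop time to drain the queue
ps.shutdown_event.set()
t.join()
print(ps.handled)
```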

action_listener

Acts as the primary request handler for incoming HTTP requests from worker nodes. It interprets the request's intent, determining whether it's for retrieving the current model parameters or for updating them with gradients/weights from a worker.

Related Classes/Methods:
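
The dispatch logic can be sketched like this (the request field names `action` and `payload` are assumptions for illustration, not Elephas's wire format): inspect the request's intent and route it to the matching operation.

```python
import pickle

# Sketch of request dispatch: route by intent to get or update handlers.
class ParameterServer:
    def __init__(self, weights):
        self.weights = list(weights)

    def get_parameters(self):
        return pickle.dumps(self.weights)       # serialized for transport

    def update_parameters(self, payload):
        delta = pickle.loads(payload)
        self.weights = [w + d for w, d in zip(self.weights, delta)]
        return b"ok"

    def action_listener(self, request):
        if request["action"] == "get":          # worker wants current weights
            return self.get_parameters()
        elif request["action"] == "update":     # worker is pushing an update
            return self.update_parameters(request["payload"])
        raise ValueError(f"unknown action: {request['action']}")

ps = ParameterServer([1.0])
ps.action_listener({"action": "update", "payload": pickle.dumps([0.5])})
print(pickle.loads(ps.action_listener({"action": "get"})))  # [1.5]
```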

get_parameters

Provides the current global model parameters (weights) to requesting worker nodes. This ensures that all workers are training with the most up-to-date model state, crucial for synchronized distributed training.

Related Classes/Methods:
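
A key property of this operation is that workers receive a consistent snapshot, not a reference into state that another worker may be mutating. A minimal sketch of that guarantee (illustrative names; copy-under-lock is one common way to achieve it):

```python
import copy
import threading

# Sketch of get_parameters(): copy the global weights under a lock so a
# concurrent update can never hand a worker a half-written model.
class ParameterServer:
    def __init__(self, weights):
        self.weights = weights
        self.lock = threading.Lock()

    def get_parameters(self):
        with self.lock:
            return copy.deepcopy(self.weights)

ps = ParameterServer({"dense": [0.1, 0.2]})
snapshot = ps.get_parameters()
snapshot["dense"][0] = 99.0      # mutating the snapshot...
print(ps.weights["dense"][0])    # ...leaves the global state untouched: 0.1
```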

update_parameters

Incorporates parameter updates (e.g., gradients or updated weights) received from worker nodes into the global model state. This is crucial for synchronizing the model across the distributed training process, typically by applying an aggregation strategy.

Related Classes/Methods:
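
As one example of such an aggregation strategy, the sketch below applies each incoming gradient as an SGD step on the global weights (the learning rate, names, and choice of SGD are illustrative assumptions); the lock ensures updates from different workers never interleave mid-write.

```python
import threading

# Sketch of update_parameters() with a simple aggregation strategy:
# apply each worker's gradient as one SGD step. Illustrative only.
class ParameterServer:
    def __init__(self, weights, lr=0.1):
        self.weights = list(weights)
        self.lr = lr
        self.lock = threading.Lock()

    def update_parameters(self, gradients):
        with self.lock:  # updates from different workers must not interleave
            self.weights = [w - self.lr * g
                            for w, g in zip(self.weights, gradients)]

ps = ParameterServer([1.0, 2.0])
ps.update_parameters([10.0, 10.0])   # worker A's gradients
ps.update_parameters([10.0, 10.0])   # worker B's gradients
print(ps.weights)  # [-1.0, 0.0]
```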