```mermaid
graph LR
    run_master["run_master"]
    register_slave["register_slave"]
    FutureResult["FutureResult"]
    SlavePipe["SlavePipe"]
    nr_slaves["nr_slaves"]
    run_master -- "consumes" --> FutureResult
    run_master -- "queries" --> nr_slaves
    register_slave -- "creates" --> SlavePipe
    register_slave -- "returns" --> FutureResult
    register_slave -- "increments" --> nr_slaves
```
The Inter-Device Communication Manager subsystem of the Synchronized-BatchNorm-PyTorch project handles the aggregation and distribution of batch statistics (mean, variance, counts) across multiple GPUs. Its boundary is the `sync_batchnorm/comm.py` file, which encapsulates the core logic for master-slave communication in a distributed computing environment.
`run_master` is the central control point residing on the master GPU. Its primary responsibility is to initiate the aggregation process, wait for batch statistics (mean, variance, counts) from all registered slave devices, and then distribute the aggregated results back to them. It is the coordinating step of synchronized batch normalization across all participating GPUs.
Related Classes/Methods: `run_master` in `sync_batchnorm/comm.py`
`register_slave` establishes and configures the communication channel for each slave GPU. It sets up a `SlavePipe` instance for bidirectional data exchange and returns a `FutureResult` object, which the master uses to collect the slave's statistics asynchronously. This is the entry point for dynamically adding and managing slave devices within the communication network.
Related Classes/Methods: `register_slave` in `sync_batchnorm/comm.py`
`FutureResult` is a single-slot future used to hand batch statistics (mean, variance, counts) from one side of the exchange to the other: the producer deposits a value and the consumer blocks until it is available. This common asynchronous-programming pattern lets the master continue its own operations until the contributions from the slave GPUs are actually needed.
Related Classes/Methods: `FutureResult` in `sync_batchnorm/comm.py`
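A one-slot future of this kind can be built from a condition variable. This is a sketch of the pattern, not necessarily the project's exact code:

```python
import threading

class FutureResult:
    """One-slot, thread-safe future: one producer puts, one consumer gets."""

    def __init__(self):
        self._result = None
        self._lock = threading.Lock()
        self._cond = threading.Condition(self._lock)

    def put(self, result):
        with self._lock:
            assert self._result is None, 'previous result was never fetched'
            self._result = result
            self._cond.notify()

    def get(self):
        with self._lock:
            while self._result is None:  # guard against spurious wakeups
                self._cond.wait()
            res, self._result = self._result, None  # empty the slot for reuse
            return res
```

Emptying the slot on `get` is what allows the same object to be reused on every forward pass instead of allocating a new future per step.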
`SlavePipe` is the concrete communication conduit for data exchange between a slave GPU and the master. It carries the slave's local batch statistics (mean, variance, counts) to the master and receives aggregated data or commands back, enabling the distributed nature of the batch normalization process.
Related Classes/Methods: `SlavePipe` in `sync_batchnorm/comm.py`
`nr_slaves` is a simple counter tracking the number of active slave devices participating in synchronized batch normalization. `run_master` consults this count to know how many contributions to await before performing the final aggregation of statistics, preventing deadlocks and incomplete aggregations.
Related Classes/Methods: `nr_slaves` in `sync_batchnorm/comm.py`
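Putting the pieces together, a toy end-to-end run shows how the slave count tells the master exactly how many reports to await. The `demo_sync` helper is hypothetical, and threads stand in for GPUs:

```python
import queue
import threading

def demo_sync(local_batches, aggregate):
    """Toy protocol run: the caller acts as master for local_batches[0];
    one thread per remaining batch acts as a slave."""
    task_q = queue.Queue()
    result_slots = [queue.Queue(maxsize=1) for _ in local_batches[1:]]
    nr_slaves = len(result_slots)  # master must hear from exactly this many
    received = {}

    def slave(idx, data):
        task_q.put((idx, data))                      # report local statistics
        received[idx] = result_slots[idx - 1].get()  # wait for the broadcast

    threads = [threading.Thread(target=slave, args=(i + 1, batch))
               for i, batch in enumerate(local_batches[1:])]
    for t in threads:
        t.start()
    messages = [(0, local_batches[0])] + [task_q.get() for _ in range(nr_slaves)]
    agg = aggregate([msg for _, msg in sorted(messages)])
    for slot in result_slots:
        slot.put(agg)  # broadcast the aggregate back to every slave
    for t in threads:
        t.join()
    received[0] = agg
    return received
```

Every device ends up holding the same aggregate, which is exactly the property synchronized batch normalization needs before normalizing with global statistics.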