graph LR
Training_Orchestrators["Training Orchestrators"]
Trainer["Trainer"]
Argument_Parsing["Argument Parsing"]
Distributed_Utilities["Distributed Utilities"]
Checkpoint_Management["Checkpoint Management"]
Training_Orchestrators -- "depends on" --> Argument_Parsing
Training_Orchestrators -- "leverages" --> Distributed_Utilities
Training_Orchestrators -- "delegates to" --> Trainer
Training_Orchestrators -- "interacts with" --> Checkpoint_Management
Trainer -- "depends on" --> Distributed_Utilities
Trainer -- "uses" --> Checkpoint_Management
Checkpoint_Management -- "relies on" --> Distributed_Utilities
The Training & Experiment Orchestrator subsystem manages the full lifecycle of deep learning models in the project: the setup, execution, and persistence of training and inference experiments.
Training Orchestrators are the primary entry points for starting and managing training runs for specific deep learning tasks (e.g., classification, detection, segmentation). They set up the environment, parse arguments, and hand off the main training process to the Trainer.
Related Classes/Methods:
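The orchestration flow described above (parse arguments, set up the distributed environment, then delegate to the Trainer) can be sketched roughly as follows. This is only an illustration of the call order shown in the diagram: `run_experiment` and its three callable parameters are hypothetical names, not functions from the project.

```python
from typing import Any, Callable, Dict, List

Config = Dict[str, Any]
Metrics = Dict[str, float]


def run_experiment(
    argv: List[str],
    parse_args: Callable[[List[str]], Config],
    init_distributed: Callable[[Config], None],
    trainer_factory: Callable[[Config], Callable[[], Metrics]],
) -> Metrics:
    """Hypothetical entry point: configure, set up, then delegate."""
    config = parse_args(argv)          # Argument Parsing
    init_distributed(config)           # Distributed Utilities
    trainer = trainer_factory(config)  # build the Trainer from the config
    return trainer()                   # the Trainer owns the actual loop
```

Passing the collaborators in as callables keeps the sketch independent of any particular framework; a real orchestrator would more likely import them directly.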
The Trainer encapsulates the core training and inference loop. It iterates over the data, runs the model's forward and backward passes, computes the loss, and applies optimization steps.
Related Classes/Methods:
classification.run_with_submitit.Trainer
classification.engine.train_one_epoch:19-67
classification.engine.evaluate:70-100
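As a framework-free illustration of the loop described above (iterate over data, forward pass, loss, backward pass, optimizer step), here is a minimal sketch that fits a one-parameter linear model with hand-written SGD. It is not the code in `classification.engine`: the function names mirror `train_one_epoch` and `evaluate` only for readability, and the model, loss, and learning rate are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

Batch = List[Tuple[float, float]]  # (input, target) pairs


def train_one_epoch(weight: float, batches: Iterable[Batch], lr: float) -> Tuple[float, float]:
    """Run one pass over the data; return (updated weight, mean batch loss)."""
    total_loss, num_batches = 0.0, 0
    for batch in batches:
        # Forward pass: prediction is weight * x, loss is mean squared error.
        errors = [weight * x - y for x, y in batch]
        loss = sum(e * e for e in errors) / len(batch)
        # Backward pass: d(loss)/d(weight) = mean(2 * error * x).
        grad = sum(2.0 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        # Optimization step: plain SGD update.
        weight -= lr * grad
        total_loss += loss
        num_batches += 1
    return weight, total_loss / max(num_batches, 1)


def evaluate(weight: float, batches: Iterable[Batch]) -> float:
    """Mean squared error over all samples, with no parameter updates."""
    errors = [weight * x - y for batch in batches for x, y in batch]
    return sum(e * e for e in errors) / len(errors)
```

Calling `train_one_epoch` repeatedly on data generated from `y = 3x` drives the weight toward 3, and `evaluate` reports the remaining error without touching the weight. That separation of update and measurement is the same one the two referenced functions suggest.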
Argument Parsing handles the command-line arguments that configure a training run, including model parameters, dataset paths, training hyperparameters, and distributed settings.
Related Classes/Methods:
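The four groups of settings named above could be exposed with the standard-library `argparse` module along these lines. The flag names and default values are illustrative assumptions and are not taken from the project's actual parser.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; flags and defaults are illustrative only."""
    parser = argparse.ArgumentParser(description="Training run configuration (sketch)")
    # Model parameters
    parser.add_argument("--model", type=str, default="resnet50")
    # Dataset paths
    parser.add_argument("--data-path", type=str, required=True)
    # Training hyperparameters
    parser.add_argument("--epochs", type=int, default=90)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--lr", type=float, default=1e-3)
    # Distributed settings
    parser.add_argument("--world-size", type=int, default=1)
    parser.add_argument("--dist-url", type=str, default="env://")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or sys.argv[1:] when `argv` is None."""
    return build_parser().parse_args(argv)
```

For example, `parse_args(["--data-path", "/tmp/data", "--epochs", "5"])` yields a namespace whose `epochs` is 5 and whose other fields keep their defaults. Accepting an explicit `argv` makes the parser easy to exercise from tests or from a job-submission wrapper.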
Distributed Utilities provide the foundational services for setting up and managing distributed training environments, including process initialization, synchronization primitives, and collective operations.
Related Classes/Methods:
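The process-initialization part of this component often starts by discovering each process's place in the job. The sketch below assumes a launcher that exports `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` (as `torchrun` does); `DistributedContext` and `context_from_env` are hypothetical names, and the example stops short of creating a process group or running collectives.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistributedContext:
    rank: int        # global index of this process
    world_size: int  # total number of processes in the job
    local_rank: int  # index of this process on its own node

    @property
    def is_distributed(self) -> bool:
        return self.world_size > 1

    @property
    def is_main_process(self) -> bool:
        # By convention, rank 0 handles logging and checkpoint writes.
        return self.rank == 0


def context_from_env(env: Optional[Mapping[str, str]] = None) -> DistributedContext:
    """Read the launcher's variables, falling back to a single-process setup."""
    env = os.environ if env is None else env
    ctx = DistributedContext(
        rank=int(env.get("RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
    )
    if not 0 <= ctx.rank < ctx.world_size:
        raise ValueError(f"rank {ctx.rank} is outside world size {ctx.world_size}")
    return ctx
```

Falling back to rank 0 and a world size of 1 lets the same entry point run unchanged on a single machine. In a PyTorch project this step would typically be followed by a call such as `torch.distributed.init_process_group`, though whether this project does so is not shown here.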
Checkpoint Management handles saving and loading model weights, optimizer states, and other training progress. Checkpoints make it possible to resume interrupted training, recover from failures, and deploy trained models.
Related Classes/Methods:
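To make the fault-tolerance point concrete, here is a minimal sketch of saving and restoring a training state with an atomic write, so that a crash mid-save cannot corrupt the previous checkpoint. It uses `pickle` from the standard library purely to stay self-contained; the project itself presumably serializes through its deep learning framework, and `save_checkpoint` and `load_checkpoint` are hypothetical names.

```python
import os
import pickle
import tempfile
from typing import Any, Dict


def save_checkpoint(path: str, state: Dict[str, Any]) -> None:
    """Write `state` to a temp file, then rename it over `path` atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file and leave any old checkpoint intact.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Load a state dictionary previously written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

A state dictionary such as `{"model": ..., "optimizer": ..., "epoch": 7}` round-trips through these two functions unchanged, which is what resuming a run relies on. In a distributed job, typically only the main process would call `save_checkpoint`, which is one plausible reading of the diagram's edge from Checkpoint Management to Distributed Utilities.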