graph LR
DeepSpeedDataSampler["DeepSpeedDataSampler"]
DeepSpeedDataSampler___iter__["DeepSpeedDataSampler.__iter__"]
DeepSpeedDataSampler_get_next_global_batch["DeepSpeedDataSampler.get_next_global_batch"]
DeepSpeedDataSampler_get_sample_from_cluster["DeepSpeedDataSampler.get_sample_from_cluster"]
DeepSpeedDataSampler_get_new_cluster["DeepSpeedDataSampler.get_new_cluster"]
DeepSpeedDataSampler_sample_from_clusters["DeepSpeedDataSampler.sample_from_clusters"]
get_sample_based_on_metric_value["get_sample_based_on_metric_value"]
get_start_end_idx["get_start_end_idx"]
DeepSpeedDataSampler -- "coordinates calls to" --> DeepSpeedDataSampler_get_next_global_batch
DeepSpeedDataSampler -- "utilizes" --> get_start_end_idx
DeepSpeedDataSampler___iter__ -- "calls" --> DeepSpeedDataSampler_get_next_global_batch
DeepSpeedDataSampler___iter__ -- "utilizes" --> get_start_end_idx
DeepSpeedDataSampler_get_next_global_batch -- "calls" --> DeepSpeedDataSampler_get_sample_from_cluster
DeepSpeedDataSampler_get_next_global_batch -- "calls" --> DeepSpeedDataSampler_get_new_cluster
DeepSpeedDataSampler_get_next_global_batch -- "utilizes" --> DeepSpeedDataSampler_sample_from_clusters
DeepSpeedDataSampler_get_next_global_batch -- "calls" --> get_start_end_idx
DeepSpeedDataSampler_get_new_cluster -- "calls" --> get_sample_based_on_metric_value
The Data Pipeline & Loading subsystem in DeepSpeed manages advanced data-loading strategies, including curriculum learning, dynamic batching, and efficient data sampling for distributed training. Its core functionality lives in the deepspeed.runtime.data_pipeline.data_sampling.data_sampler module, with DeepSpeedDataSampler serving as the central component.
DeepSpeedDataSampler: The primary orchestrator of the data pipeline, exposed as an iterable that yields global batches of data. It encapsulates the logic for the advanced data-loading strategies that are crucial for high-performance distributed training.
Related Classes/Methods:
DeepSpeedDataSampler.__iter__: Makes DeepSpeedDataSampler iterable, providing a continuous stream of processed data batches to the training loop. This method is the entry point for consuming the data pipeline.
Related Classes/Methods:
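To make the iterator role concrete, here is a minimal sketch of how a sampler's `__iter__`-style loop can carve micro-batches out of successive global batches. The function and parameter names are illustrative assumptions, not DeepSpeed's actual implementation.

```python
def iter_batches(get_next_global_batch, micro_batch_size, total_samples):
    """Yield micro-batches carved from successive global batches.

    `get_next_global_batch` is a hypothetical callable returning a list of
    sample indices for one global batch (standing in for the sampler's own
    batch-assembly logic).
    """
    consumed = 0
    while consumed < total_samples:
        # Fetch the indices for the whole global batch ...
        global_batch = get_next_global_batch()
        # ... then slice it into micro-batches for the training loop.
        for start in range(0, len(global_batch), micro_batch_size):
            yield global_batch[start:start + micro_batch_size]
        consumed += len(global_batch)
```

Driving this generator with a stub that always returns eight indices, a `total_samples` of 16 and a micro-batch size of 4 produces four micro-batches before the iterator is exhausted.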
DeepSpeedDataSampler.get_next_global_batch: Fetches the next global batch of data. It decides whether to retrieve samples from existing clusters or to acquire new ones, then applies the appropriate sampling strategy for distributed processing.
Related Classes/Methods:
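The "reuse existing clusters or acquire new ones" decision can be sketched as a simple buffer-refill loop. This is a hypothetical illustration of the control flow, not DeepSpeed's code; `fetch_new_cluster` stands in for the cluster-acquisition step.

```python
def next_global_batch(buffer, fetch_new_cluster, global_batch_size):
    """Assemble one global batch of sample indices.

    Drains indices already buffered from loaded clusters first, and calls
    the (hypothetical) `fetch_new_cluster` only when the buffer runs dry.
    """
    batch = []
    while len(batch) < global_batch_size:
        if not buffer:
            # No samples left from current clusters: acquire a new one.
            buffer.extend(fetch_new_cluster())
        batch.append(buffer.pop(0))
    return batch
```

Starting with two buffered indices and a cluster fetcher that supplies three more, a request for a batch of four consumes the buffer first and leaves the surplus index for the next call.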
DeepSpeedDataSampler.get_sample_from_cluster: Retrieves samples from already-loaded data clusters, reusing data to minimize redundant loading and to conserve memory in distributed environments.
Related Classes/Methods:
DeepSpeedDataSampler.get_new_cluster: Obtains new data clusters, often via metric-driven sampling that prioritizes certain data points, a key mechanism of curriculum learning for improved training efficiency.
Related Classes/Methods:
DeepSpeedDataSampler.sample_from_clusters: Samples data points from the available clusters according to the configured strategy (e.g., random or curriculum-based sampling), promoting data diversity and training stability.
Related Classes/Methods:
deepspeed.runtime.data_pipeline.data_sampling.data_sampler.DeepSpeedDataSampler:sample_from_clusters
get_sample_based_on_metric_value: A helper function supporting curriculum learning by selecting samples based on a metric value, so the model can learn from easier examples first, thereby accelerating convergence.
Related Classes/Methods:
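One common way to realize metric-based selection, sketched below under the assumption that samples are pre-sorted by ascending metric value ("difficulty"): a curriculum threshold bounds the eligible pool, and samples are drawn uniformly from it. All names here are hypothetical, not DeepSpeed's API.

```python
import bisect
import random

def samples_below_metric(sorted_metric_values, threshold):
    """Count samples whose metric value is <= threshold, i.e. the pool of
    'easy enough' samples at the current curriculum step."""
    return bisect.bisect_right(sorted_metric_values, threshold)

def pick_samples(sorted_metric_values, threshold, rng, k):
    """Uniformly pick k sample indices from the eligible (easy) pool."""
    pool = samples_below_metric(sorted_metric_values, threshold)
    return [rng.randrange(pool) for _ in range(k)]
```

As the curriculum advances, raising `threshold` widens the eligible pool until the whole dataset is in play, which is the essence of easy-to-hard scheduling.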
get_start_end_idx: Computes the start and end indices of each data partition, splitting batches across multiple devices or nodes for efficient, scalable distributed processing.
Related Classes/Methods:
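A hedged sketch of such index partitioning, assuming an even split of the global batch across ranks with any remainder spread over the lowest ranks (a common convention; the exact DeepSpeed formula may differ):

```python
def get_start_end_idx(global_batch_size, rank, world_size):
    """Return the [start, end) slice of the global batch owned by `rank`.

    Each rank gets `base` samples; the first `rem` ranks get one extra,
    so the slices tile the global batch exactly.
    """
    base, rem = divmod(global_batch_size, world_size)
    start = rank * base + min(rank, rem)
    end = start + base + (1 if rank < rem else 0)
    return start, end
```

For example, splitting a global batch of 10 across 4 ranks yields slices of sizes 3, 3, 2, 2 that cover all 10 samples without overlap.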