Skip to content

Latest commit

 

History

History
46 lines (25 loc) · 2.96 KB

File metadata and controls

46 lines (25 loc) · 2.96 KB
graph LR
    Shard_List_Manager["Shard List Manager"]
    Generalized_I_O_Handler["Generalized I/O Handler"]
    Data_Cache["Data Cache"]
    Shard_List_Manager -- "provides URLs/paths to" --> Generalized_I_O_Handler
    Generalized_I_O_Handler -- "interacts with" --> Data_Cache
    Data_Cache -- "intercepts and stores/retrieves data streams from" --> Generalized_I_O_Handler
Loading

CodeBoardingDemoContact

Details

The Data Ingestion Layer in webdataset is a critical subsystem responsible for efficiently accessing raw data shards from diverse sources. It acts as the initial gateway for data entering the processing pipeline, abstracting away the complexities of data location and access.

Shard List Manager

This component is responsible for generating, expanding, and managing the list of data shard URLs or paths. It handles various input formats, including glob patterns and brace expansion, to produce a concrete list of data sources. It also provides mechanisms for distributing these shard lists across multiple workers or nodes, which is essential for parallel and distributed data loading in deep learning.

Related Classes/Methods:

Generalized I/O Handler

This component provides a unified and abstracted interface for opening data streams from various sources, including local files, remote HTTP/HTTPS URLs, and potentially other cloud storage protocols. It encapsulates the logic for handling different I/O backends, allowing the rest of the data pipeline to interact with data sources uniformly, regardless of their origin.

Related Classes/Methods:

Data Cache

This component implements caching mechanisms, specifically an LRU (Least Recently Used) file cache, to store frequently accessed data shards locally. By intercepting I/O requests and serving data from the cache when available, it significantly reduces redundant network or disk I/O operations, thereby improving data loading performance, especially in scenarios with repeated data access or distributed training.

Related Classes/Methods: