graph LR
    TarWriter["TarWriter"]
    ShardWriter["ShardWriter"]
    Generalized_I_O_Handler["Generalized I/O Handler"]
    ShardWriter -- "uses" --> TarWriter
    TarWriter -- "uses" --> Generalized_I_O_Handler


Details

The webdataset.writer module is the core of the Data Output/Writer component. It serializes processed samples into the WebDataset format, in which each sample is stored as a group of files inside a TAR archive.

TarWriter

This component performs the low-level writing of individual data samples (represented as dictionaries) into a single .tar or compressed .tar.gz file. It encodes the values of each sample (e.g., images, audio, text) into byte streams suitable for TAR archiving and sets the metadata and file properties of the resulting archive entries. It is the core serialization engine for a single archive.
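The idea of writing one sample dictionary as a group of TAR members can be sketched with Python's standard tarfile module. This is an illustration only, not the TarWriter implementation: the helpers `encode_value` and `write_sample` are hypothetical names, only `str` and `bytes` values are handled, and the convention assumed here is that the `__key__` field supplies the shared base name while every other key becomes a file extension.

```python
import io
import tarfile
import time


def encode_value(value):
    """Turn a str or bytes value into bytes (hypothetical helper; other types are out of scope)."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode("utf-8")
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def write_sample(tar, sample, mode=0o444):
    """Write one sample dict as tar members named '<key>.<extension>'."""
    key = sample["__key__"]
    for ext, value in sample.items():
        if ext.startswith("__"):  # skip bookkeeping fields such as __key__
            continue
        data = encode_value(value)
        info = tarfile.TarInfo(name=f"{key}.{ext}")
        info.size = len(data)          # the tar header must carry the payload size
        info.mode = mode               # file property: permission bits
        info.mtime = int(time.time())  # file property: modification time
        tar.addfile(info, io.BytesIO(data))


# Write two samples into an in-memory, gzip-compressed archive.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    write_sample(tar, {"__key__": "sample000", "txt": "hello", "cls": "3"})
    write_sample(tar, {"__key__": "sample001", "txt": "world", "cls": "7"})

# Read the archive back to check which members were written.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    names = tar.getnames()
```

Because all files of a sample share one base name and are written consecutively, a reader can regroup them into a dictionary while streaming through the archive.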

Related Classes/Methods:

ShardWriter

This component orchestrates the creation of multiple sharded .tar files. It starts a new archive whenever the current shard reaches a configurable limit, either a maximum number of records (maxcount) or a maximum size in bytes (maxsize). It delegates the actual writing of each individual shard to TarWriter.
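The rollover logic described above can be sketched as follows. This is a minimal stand-in built on the standard tarfile module, not the ShardWriter source: the class `SimpleShardWriter` and its attributes are hypothetical, the size accounting counts payload bytes only, and the shard name is assumed to come from a printf-style pattern such as `shard-%03d.tar`.

```python
import io
import os
import tarfile
import tempfile


class SimpleShardWriter:
    """Hypothetical sketch: start a new tar shard when a record or size limit is hit."""

    def __init__(self, pattern, maxcount=100000, maxsize=3e9):
        self.pattern = pattern    # printf-style pattern, e.g. "shard-%03d.tar"
        self.maxcount = maxcount  # maximum records per shard
        self.maxsize = maxsize    # maximum payload bytes per shard
        self.shard = 0
        self.tar = None
        self.count = 0
        self.size = 0
        self.paths = []

    def _next_shard(self):
        """Close the current shard, if any, and open the next one."""
        self.close()
        path = self.pattern % self.shard
        self.paths.append(path)
        self.tar = tarfile.open(path, "w")
        self.shard += 1
        self.count = 0
        self.size = 0

    def write(self, key, files):
        """Write one record given as {extension: bytes}, rolling over if needed."""
        if self.tar is None or self.count >= self.maxcount or self.size >= self.maxsize:
            self._next_shard()
        for ext, data in files.items():
            info = tarfile.TarInfo(f"{key}.{ext}")
            info.size = len(data)
            self.tar.addfile(info, io.BytesIO(data))
            self.size += len(data)
        self.count += 1

    def close(self):
        if self.tar is not None:
            self.tar.close()
            self.tar = None


# Five records with at most two per shard should produce three shards.
with tempfile.TemporaryDirectory() as tmp:
    writer = SimpleShardWriter(os.path.join(tmp, "shard-%03d.tar"), maxcount=2)
    for i in range(5):
        writer.write(f"sample{i:03d}", {"txt": f"record {i}".encode("utf-8")})
    writer.close()

    shard_names = [os.path.basename(p) for p in writer.paths]
    members_per_shard = []
    for path in writer.paths:
        with tarfile.open(path) as tar:
            members_per_shard.append(len(tar.getnames()))
```

The limits are checked before a record is written, so a shard can exceed the size limit by at most one record; that keeps all files of a record in the same archive.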

Related Classes/Methods:

Generalized I/O Handler

This component (represented by webdataset.gopen) opens output file streams, allowing TarWriter to write to local files, network streams, or other supported destinations. It provides a single entry point for opening a stream, abstracting away the underlying storage mechanism.
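One common way to build such an abstraction is to dispatch on the scheme of the destination URL. The sketch below illustrates that pattern under stated assumptions and is not the gopen implementation: `open_output`, `HANDLERS`, and the in-memory `mem:` scheme are hypothetical and exist only for this example.

```python
import io
import os
import tempfile
from urllib.parse import urlparse

MEMORY = {}  # hypothetical in-memory "storage backend" used for illustration


class MemoryStream(io.BytesIO):
    """Writable stream that stores its contents in MEMORY when closed."""

    def __init__(self, key):
        super().__init__()
        self.key = key

    def close(self):
        if not self.closed:  # close() may be called more than once
            MEMORY[self.key] = self.getvalue()
        super().close()


def open_local(url):
    """Open a plain path or a file: URL for binary writing."""
    parsed = urlparse(url)
    path = parsed.path if parsed.scheme == "file" else url
    return open(path, "wb")


def open_memory(url):
    """Open a mem:<name> URL backed by the MEMORY dict."""
    return MemoryStream(urlparse(url).path)


# Map URL schemes to opener functions; the empty scheme means a plain path.
HANDLERS = {"": open_local, "file": open_local, "mem": open_memory}


def open_output(url):
    """Return a writable binary stream for the destination named by url."""
    scheme = urlparse(url).scheme
    try:
        handler = HANDLERS[scheme]
    except KeyError:
        raise ValueError(f"no handler for scheme {scheme!r}") from None
    return handler(url)


# The caller writes the same way regardless of where the bytes end up.
with tempfile.TemporaryDirectory() as tmp:
    local_path = os.path.join(tmp, "out.bin")
    with open_output(local_path) as stream:
        stream.write(b"local")
    with open(local_path, "rb") as f:
        local_bytes = f.read()

with open_output("mem:greeting") as stream:
    stream.write(b"in memory")
```

Supporting a new destination then only requires registering another opener in the table; code that writes archives does not need to change.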

Related Classes/Methods: