graph LR
Modality_Specific_Data_Loaders["Modality-Specific Data Loaders"]
Data_Splitting_Imbalance_Management["Data Splitting & Imbalance Management"]
Data_Augmentation_Engine["Data Augmentation Engine"]
Data_Collation_Batching["Data Collation & Batching"]
Data_Sampling_Strategies["Data Sampling Strategies"]
Modality_Specific_Data_Loaders -- "provides raw data to" --> Data_Splitting_Imbalance_Management
Modality_Specific_Data_Loaders -- "provides individual samples to" --> Data_Augmentation_Engine
Data_Splitting_Imbalance_Management -- "provides processed datasets to" --> Data_Collation_Batching
Data_Augmentation_Engine -- "provides augmented samples to" --> Data_Collation_Batching
Data_Sampling_Strategies -- "provides samples/weights to" --> Data_Collation_Batching
Data_Sampling_Strategies -- "influences" --> Data_Splitting_Imbalance_Management
The semilearn/datasets subsystem manages data for the semi-supervised learning pipeline. Modality-Specific Data Loaders ingest raw data from each source. From there the data takes two paths: Data Splitting & Imbalance Management partitions it into labeled and unlabeled sets and handles class imbalance, while the Data Augmentation Engine applies on-the-fly transformations to individual samples. Data Sampling Strategies influences the splitting step and supplies samples or weights to collation. Both paths end at Data Collation & Batching, which assembles individual samples into batches for model training and evaluation.
Modality-Specific Data Loaders: loads raw datasets and exposes them for three modalities: computer vision, audio, and natural language processing. Each modality package encapsulates the logic for its own dataset formats and presents a common interface for accessing raw data, often building on the base dataset functionality defined for that modality.
Related Classes/Methods:
semilearn/datasets/cv_datasets/
semilearn/datasets/audio_datasets/
semilearn/datasets/nlp_datasets/
semilearn/datasets/cv_datasets/datasetbase.py
semilearn/datasets/audio_datasets/datasetbase.py
semilearn/datasets/nlp_datasets/datasetbase.py
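The idea of a unified access interface over modality-specific loading logic can be sketched as follows. This is a minimal illustration, not the contents of the `datasetbase.py` files: the class names `RawDataset`, `TextDataset`, and `WaveformDataset`, the `decode` hook, and the returned dictionary keys are all hypothetical.

```python
from abc import ABC, abstractmethod


class RawDataset(ABC):
    """Hypothetical base class: stores raw items and optional targets."""

    def __init__(self, data, targets=None):
        self.data = list(data)
        self.targets = None if targets is None else list(targets)
        if self.targets is not None and len(self.targets) != len(self.data):
            raise ValueError("data and targets must have the same length")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Unlabeled datasets have no targets, so y is None for them.
        y = None if self.targets is None else self.targets[idx]
        return {"idx": idx, "x": self.decode(self.data[idx]), "y": y}

    @abstractmethod
    def decode(self, raw):
        """Turn one raw item into the form the rest of the pipeline expects."""


class TextDataset(RawDataset):
    # Hypothetical NLP loader: whitespace tokenization stands in for real parsing.
    def decode(self, raw):
        return raw.strip().lower().split()


class WaveformDataset(RawDataset):
    # Hypothetical audio loader: a waveform is a list of float samples.
    def decode(self, raw):
        return [float(v) for v in raw]


text_ds = TextDataset(["Hello world ", "Semi supervised learning"], targets=[0, 1])
wave_ds = WaveformDataset([[0, 1, 0], [1, 1]])  # no targets: unlabeled
print(text_ds[0])
print(wave_ds[1])
```

Downstream components only rely on `__len__` and `__getitem__`, so splitting, augmentation, and batching code does not need to know which modality it is handling.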
Data Splitting & Imbalance Management: prepares data for semi-supervised training by splitting each raw dataset into labeled and unlabeled subsets and by providing utility functions that address class imbalance. Semi-supervised algorithms train on both subsets, so every later stage depends on this split.
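A labeled/unlabeled split of this kind might look like the sketch below. It assumes a fixed number of labeled samples per class and treats every remaining sample as unlabeled; the function name `split_labeled_unlabeled` and its parameters are illustrative and are not taken from the semilearn code.

```python
import random
from collections import defaultdict


def split_labeled_unlabeled(labels, num_labeled_per_class, seed=0):
    """Return (labeled_idx, unlabeled_idx) with a fixed labeled budget per class."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    labeled = []
    for y in sorted(by_class):
        idxs = by_class[y]
        if len(idxs) < num_labeled_per_class:
            raise ValueError(f"class {y} has only {len(idxs)} samples")
        # Draw the labeled samples for this class without replacement.
        labeled.extend(rng.sample(idxs, num_labeled_per_class))

    labeled_set = set(labeled)
    # Assumption: labeled samples are excluded from the unlabeled set.
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled


labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
lb_idx, ulb_idx = split_labeled_unlabeled(labels, num_labeled_per_class=2)
print(len(lb_idx), len(ulb_idx))
```

Sampling the same count from every class keeps the labeled subset balanced even when the raw dataset is not. Some setups instead keep the labeled samples in the unlabeled set as well, which would only change the final list comprehension.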
Related Classes/Methods:
Data Augmentation Engine: implements data augmentation techniques that increase dataset diversity and improve model robustness. It operates on individual samples, transforming each one before it is batched for training.
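Per-sample augmentation can be illustrated with a small sketch. It assumes an image represented as a nested list and produces a weakly and a strongly augmented view of the same sample, a pairing that several semi-supervised methods use; the source does not say which transforms the engine provides, so `random_hflip`, `random_erase`, and `augment` are hypothetical stand-ins.

```python
import random


def random_hflip(img, rng, p=0.5):
    """Flip the image left-to-right with probability p; always returns a copy."""
    if rng.random() < p:
        return [row[::-1] for row in img]
    return [row[:] for row in img]


def random_erase(img, rng, size=2, fill=0):
    """Overwrite one randomly placed size x size patch with a fill value."""
    height, width = len(img), len(img[0])
    top = rng.randrange(0, height - size + 1)
    left = rng.randrange(0, width - size + 1)
    out = [row[:] for row in img]
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = fill
    return out


def augment(img, rng):
    """Return two views of one sample: a weak one and a strong one."""
    weak = random_hflip(img, rng)
    strong = random_erase(random_hflip(img, rng), rng)
    return {"weak": weak, "strong": strong}


# A 4x4 image with pixel values 1..16, so erased pixels (0) are easy to spot.
image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
views = augment(image, random.Random(0))
for row in views["strong"]:
    print(row)
```

Each transform returns a new object instead of modifying its input, so the stored raw sample stays intact and can be augmented differently on every epoch.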
Related Classes/Methods:
Data Collation & Batching: groups individual samples into batches for training and evaluation, padding and formatting them so that every batch has the structure the model expects.
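Padding during collation can be sketched in a few lines. The example assumes variable-length token sequences stored under hypothetical `tokens` and `label` keys, and pads every sequence to the longest one in the batch; it does not reproduce the collator classes in semilearn.

```python
def collate_with_padding(samples, pad_value=0):
    """Pad token lists to a common length and build a matching mask."""
    max_len = max(len(s["tokens"]) for s in samples)
    tokens, mask = [], []
    for s in samples:
        n_pad = max_len - len(s["tokens"])
        tokens.append(list(s["tokens"]) + [pad_value] * n_pad)
        # 1 marks a real token, 0 marks padding.
        mask.append([1] * len(s["tokens"]) + [0] * n_pad)
    return {
        "tokens": tokens,
        "attention_mask": mask,
        "labels": [s["label"] for s in samples],
    }


batch = collate_with_padding(
    [{"tokens": [5, 6, 7], "label": 1}, {"tokens": [8], "label": 0}]
)
print(batch["tokens"])          # [[5, 6, 7], [8, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Padding to the longest sequence in each batch, not to a global maximum, keeps batches as small as the data allows. The mask lets the model ignore padded positions.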
Related Classes/Methods:
Data Sampling Strategies: controls how samples are drawn during training, in particular by generating per-sample weights that counteract class imbalance or emphasize selected data points.
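One common way to derive such sample weights is inverse class frequency, sketched below. This is an assumption about the weighting scheme, since the source does not specify one, and `inverse_frequency_weights` is a hypothetical helper.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each sample by 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Imbalanced toy dataset: eight samples of class 0, two of class 1.
labels = [0] * 8 + [1] * 2
weights = inverse_frequency_weights(labels)

# Draw indices with replacement, proportionally to the weights.
rng = random.Random(0)
drawn = rng.choices(range(len(labels)), weights=weights, k=1000)
class_counts = Counter(labels[i] for i in drawn)
print(class_counts)
```

With these weights the total weight of each class is the same, so both classes are drawn with equal probability in expectation even though class 0 has four times as many samples.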
Related Classes/Methods: