awesome-architecture-mds/ai-ml/GraphGym/Data_Pipeline.md at main · CodeBoarding/awesome-architecture-mds

graph LR
    Dataset_Orchestrator["Dataset Orchestrator"]
    Raw_Data_Loader["Raw Data Loader"]
    Data_Transformation_Module["Data Transformation Module"]
    Feature_Augmentation_Module["Feature Augmentation Module"]
    Dataset_Orchestrator -- "invokes" --> Raw_Data_Loader
    Dataset_Orchestrator -- "applies operations from" --> Data_Transformation_Module
    Dataset_Orchestrator -- "applies operations from" --> Feature_Augmentation_Module

Details

The Data Pipeline subsystem of GraphGym follows the "Pipeline Architecture" and "Data Management" patterns. It handles the end-to-end processing of graph datasets, from initial loading to preparation for model consumption, so that the data is suitable for various GNN tasks. It covers data loading, transformation, and feature engineering. Its core components live primarily in graphgym/loader.py, graphgym/models/transform.py, and graphgym/models/feature_augment.py. It is the preparatory layer that runs before data reaches the model training and evaluation stages.

Dataset Orchestrator

This component is the central control point of the data preparation pipeline. It runs the steps in sequence: data loading, initial filtering, then the transformations and feature augmentations, so that the data is correctly prepared for model consumption. By coordinating these stages it embodies the "Pipeline Flow" pattern.
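The sequential flow described above can be sketched in a few lines of plain Python. This is an illustration only, not GraphGym's create_dataset: the names run_pipeline, load_toy, drop_edgeless, and add_node_count are hypothetical, and graphs are modeled as plain dictionaries rather than real graph objects.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a graph object; a real pipeline would use a graph class.
Graph = Dict[str, object]
Step = Callable[[List[Graph]], List[Graph]]


def run_pipeline(load: Callable[[], List[Graph]], steps: List[Step]) -> List[Graph]:
    """Load raw graphs, then apply each processing step in order."""
    graphs = load()
    for step in steps:
        graphs = step(graphs)
    return graphs


def load_toy() -> List[Graph]:
    """Hypothetical loader returning two tiny graphs."""
    return [
        {"nodes": [0, 1, 2], "edges": [(0, 1), (1, 2)]},
        {"nodes": [0], "edges": []},
    ]


def drop_edgeless(graphs: List[Graph]) -> List[Graph]:
    """Filtering step: discard graphs that have no edges."""
    return [g for g in graphs if g["edges"]]


def add_node_count(graphs: List[Graph]) -> List[Graph]:
    """Augmentation step: attach the node count to each graph."""
    return [{**g, "num_nodes": len(g["nodes"])} for g in graphs]


dataset = run_pipeline(load_toy, [drop_edgeless, add_node_count])
```

Here dataset holds a single graph with num_nodes equal to 3, because the edgeless graph is filtered out before augmentation runs. Keeping each stage as an independent function is one way to let the orchestrator stay a thin coordinator.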

Related Classes/Methods:

create_dataset:217-270

Raw Data Loader

Handles the initial acquisition and parsing of raw graph datasets in different formats (e.g., PyTorch Geometric, NetworkX). It is the primary interface to external data sources and hides format-specific details from the rest of the pipeline. This is the data ingestion part of the "Data Management" role of an ML toolkit.
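One common way to hide format specifics behind a single entry point is a dispatch table keyed by format name. The sketch below assumes this design for illustration; it does not reproduce graphgym/loader.py, and the names load_graph, from_edge_list, from_adjacency, and the two format keys are hypothetical. Toy text and dictionary inputs stand in for real PyTorch Geometric or NetworkX objects.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[int, int]


def from_edge_list(text: str) -> List[Edge]:
    """Parse whitespace-separated 'u v' pairs, one edge per line."""
    edges = []
    for line in text.splitlines():
        if line.strip():
            u, v = line.split()
            edges.append((int(u), int(v)))
    return edges


def from_adjacency(adj: Dict[int, List[int]]) -> List[Edge]:
    """Flatten an adjacency mapping into a list of (source, target) edges."""
    return [(u, v) for u, neighbors in sorted(adj.items()) for v in neighbors]


# Hypothetical format names mapped to their parsers.
PARSERS: Dict[str, Callable[..., List[Edge]]] = {
    "edge_list": from_edge_list,
    "adjacency": from_adjacency,
}


def load_graph(fmt: str, raw: object) -> List[Edge]:
    """Single entry point: pick the parser for the given format and run it."""
    if fmt not in PARSERS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return PARSERS[fmt](raw)
```

Callers only ever see a list of edges, so adding a new source format means adding one parser and one dictionary entry.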

Related Classes/Methods:

Data Transformation Module

Provides generic and task-specific functions that modify the structure or content of the dataset. These transformations (e.g., negative sampling for link prediction, creating link labels) are applied as part of the create_dataset pipeline so that the data conforms to the requirements of a specific model or task. This is a core "Data Management" utility.
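To make the two transformations named above concrete, here is a minimal sketch of uniform negative sampling and link-label creation. It assumes an undirected graph without self-loops and uses only the standard library; the function names and signatures are hypothetical and are not taken from graphgym/models/transform.py.

```python
import random
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def sample_negative_edges(
    num_nodes: int, pos_edges: Sequence[Edge], num_samples: int, seed: int = 0
) -> List[Edge]:
    """Draw distinct node pairs that are not existing (undirected) edges."""
    rng = random.Random(seed)
    existing = {(min(u, v), max(u, v)) for u, v in pos_edges}
    max_possible = num_nodes * (num_nodes - 1) // 2 - len(existing)
    if num_samples > max_possible:
        raise ValueError("not enough non-edges to sample from")
    negatives = set()
    while len(negatives) < num_samples:
        u, v = rng.sample(range(num_nodes), 2)  # two distinct nodes, so no self-loops
        pair = (min(u, v), max(u, v))
        if pair not in existing:
            negatives.add(pair)
    return sorted(negatives)


def make_link_labels(
    pos_edges: Sequence[Edge], neg_edges: Sequence[Edge]
) -> Tuple[List[Edge], List[int]]:
    """Concatenate positive and negative edges with labels 1 and 0."""
    edges = list(pos_edges) + list(neg_edges)
    labels = [1] * len(pos_edges) + [0] * len(neg_edges)
    return edges, labels


pos = [(0, 1), (1, 2)]
neg = sample_negative_edges(num_nodes=4, pos_edges=pos, num_samples=2)
edges, labels = make_link_labels(pos, neg)
```

Rejection sampling like this is simple but slows down on dense graphs, where most candidate pairs are already edges; a production implementation would likely use a vectorized approach.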

Related Classes/Methods:

Feature Augmentation Module

Applies techniques that enhance, modify, or generate features within the dataset and puts them in a format suitable for model input. The module also supports the "Plugin/Extension Architecture": through register_feature_fun, users can register and integrate custom augmentation functions, extending the toolkit.
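The registration mechanism can be pictured as a name-to-function registry. The sketch below borrows the name register_feature_fun from the text, but its signature and everything around it (the registry dictionary, augment, node_degree) are assumptions made for illustration; consult graphgym/models/feature_augment.py for the actual interface.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Edge = Tuple[int, int]
FeatureFun = Callable[[int, Sequence[Edge]], List[float]]

# Hypothetical registry mapping a feature name to the function that computes it.
_FEATURE_FUNS: Dict[str, FeatureFun] = {}


def register_feature_fun(name: str, fun: FeatureFun) -> None:
    """Register a custom feature function under a unique name (assumed signature)."""
    if name in _FEATURE_FUNS:
        raise ValueError(f"feature {name!r} is already registered")
    _FEATURE_FUNS[name] = fun


def node_degree(num_nodes: int, edges: Sequence[Edge]) -> List[float]:
    """Example custom feature: undirected degree of every node."""
    degree = [0.0] * num_nodes
    for u, v in edges:
        degree[u] += 1.0
        degree[v] += 1.0
    return degree


def augment(
    num_nodes: int, edges: Sequence[Edge], names: Sequence[str]
) -> Dict[str, List[float]]:
    """Compute every requested feature by looking it up in the registry."""
    return {name: _FEATURE_FUNS[name](num_nodes, edges) for name in names}


register_feature_fun("node_degree", node_degree)
features = augment(3, [(0, 1), (1, 2)], ["node_degree"])
```

With a registry of this kind, the pipeline selects features by name (for example from a configuration file) without knowing how any individual feature is computed.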

Related Classes/Methods:
