Skip to content

Latest commit

 

History

History
102 lines (56 loc) · 6.04 KB

File metadata and controls

102 lines (56 loc) · 6.04 KB
graph LR
    ML_Model_Abstraction["ML Model Abstraction"]
    Engine_Specific_ML_Algorithms["Engine-Specific ML Algorithms"]
    Data_Splitting_Utility["Data Splitting Utility"]
    Base_String_Clustering_Logic["Base String Clustering Logic"]
    String_Clustering_Orchestrator["String Clustering Orchestrator"]
    Key_Collision_Clustering_Algorithms["Key Collision Clustering Algorithms"]
    Base_Key_Collision_Function["Base Key Collision Function"]
    Spark_ML_Pipeline_Functions["Spark ML Pipeline Functions"]
    ML_Model_Abstraction -- "is base for" --> Engine_Specific_ML_Algorithms
    Engine_Specific_ML_Algorithms -- "depends on" --> Data_Splitting_Utility
    String_Clustering_Orchestrator -- "delegates to" --> Base_String_Clustering_Logic
    String_Clustering_Orchestrator -- "leverages" --> Key_Collision_Clustering_Algorithms
    Key_Collision_Clustering_Algorithms -- "relies on" --> Base_Key_Collision_Function
    Spark_ML_Pipeline_Functions -- "orchestrates" --> Engine_Specific_ML_Algorithms
    Spark_ML_Pipeline_Functions -- "orchestrates" --> Data_Splitting_Utility
    Spark_ML_Pipeline_Functions -- "orchestrates" --> Base_String_Clustering_Logic
    Spark_ML_Pipeline_Functions -- "orchestrates" --> Key_Collision_Clustering_Algorithms
Loading

CodeBoardingDemoContact

Details

The optimus ML subsystem is designed with a strong emphasis on engine-agnosticism and modularity, primarily facilitating machine learning operations across various data processing backends like Pandas, Dask, and Spark. At its core, the ML Model Abstraction provides a unified interface for diverse machine learning algorithms, which are then concretely implemented by Engine-Specific ML Algorithms tailored for performance on their respective engines. Data preparation is handled by the Data Splitting Utility, which works in conjunction with these algorithms. For specialized text processing, the String Clustering Orchestrator manages string clustering tasks, delegating to the Base String Clustering Logic for fundamental operations and leveraging Key Collision Clustering Algorithms for advanced, engine-optimized key collision detection. The Base Key Collision Function provides underlying utilities for these collision algorithms. Finally, Spark ML Pipeline Functions integrate and orchestrate these components into sequential workflows specifically for the Spark environment, ensuring a streamlined execution of complex ML tasks.

ML Model Abstraction

A foundational component providing a standardized interface for all machine learning models, ensuring engine agnosticism.

Related Classes/Methods:

Engine-Specific ML Algorithms

Implementations of various machine learning algorithms (e.g., linear regression, k-means) tailored for specific data processing engines like Pandas, Dask, or Spark.

Related Classes/Methods:

Data Splitting Utility

Manages the preparation of datasets for machine learning tasks, typically by splitting them into training and testing sets.

Related Classes/Methods:

Base String Clustering Logic

Encapsulates the fundamental logic for general string clustering operations, serving as a core reusable component for data transformation.

Related Classes/Methods:

String Clustering Orchestrator

Acts as the high-level entry point for initiating string clustering operations, delegating to the core clustering logic and specialized algorithms.

Related Classes/Methods:

Key Collision Clustering Algorithms

Specialized implementations of string clustering algorithms (e.g., fingerprint, n-gram fingerprint) optimized for different engines (cuDF, Pandas, Spark).

Related Classes/Methods:

Base Key Collision Function

Provides common, underlying functionalities and utilities for key collision-based string clustering, supporting the specialized algorithms.

Related Classes/Methods:

Spark ML Pipeline Functions

Manage the sequential flow and execution of machine learning models and transformations within a pipeline context, specifically for the Spark engine.

Related Classes/Methods: