graph LR
ML_Model_Abstraction["ML Model Abstraction"]
Engine_Specific_ML_Algorithms["Engine-Specific ML Algorithms"]
Data_Splitting_Utility["Data Splitting Utility"]
Base_String_Clustering_Logic["Base String Clustering Logic"]
String_Clustering_Orchestrator["String Clustering Orchestrator"]
Key_Collision_Clustering_Algorithms["Key Collision Clustering Algorithms"]
Base_Key_Collision_Function["Base Key Collision Function"]
Spark_ML_Pipeline_Functions["Spark ML Pipeline Functions"]
ML_Model_Abstraction -- "is base for" --> Engine_Specific_ML_Algorithms
Engine_Specific_ML_Algorithms -- "depends on" --> Data_Splitting_Utility
String_Clustering_Orchestrator -- "delegates to" --> Base_String_Clustering_Logic
String_Clustering_Orchestrator -- "leverages" --> Key_Collision_Clustering_Algorithms
Key_Collision_Clustering_Algorithms -- "relies on" --> Base_Key_Collision_Function
Spark_ML_Pipeline_Functions -- "orchestrates" --> Engine_Specific_ML_Algorithms
Spark_ML_Pipeline_Functions -- "orchestrates" --> Data_Splitting_Utility
Spark_ML_Pipeline_Functions -- "orchestrates" --> Base_String_Clustering_Logic
Spark_ML_Pipeline_Functions -- "orchestrates" --> Key_Collision_Clustering_Algorithms
The Optimus ML subsystem is designed to be engine-agnostic and modular, so that machine learning operations can run across data processing backends such as Pandas, Dask, and Spark. At its core, the ML Model Abstraction defines a unified interface for machine learning algorithms, and the Engine-Specific ML Algorithms implement that interface for each engine. The Data Splitting Utility prepares datasets for these algorithms. For text processing, the String Clustering Orchestrator manages string clustering tasks: it delegates fundamental operations to the Base String Clustering Logic and uses the Key Collision Clustering Algorithms for engine-optimized key collision detection. Those algorithms in turn rely on the Base Key Collision Function for shared utilities. Finally, the Spark ML Pipeline Functions orchestrate these components into sequential workflows for the Spark environment.
ML Model Abstraction: A foundational component that provides a standardized interface for all machine learning models, ensuring engine agnosticism.
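A standardized model interface of this kind can be sketched with an abstract base class. This is a minimal illustration, not the Optimus API: the names `BaseModel` and `MeanRegressor`, and the `fit`/`predict` method pair, are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Sequence


class BaseModel(ABC):
    """Hypothetical engine-agnostic model interface (not the actual Optimus class)."""

    @abstractmethod
    def fit(self, features: Sequence[Sequence[float]], target: Sequence[float]) -> "BaseModel":
        """Learn parameters from the training data and return self."""

    @abstractmethod
    def predict(self, features: Sequence[Sequence[float]]) -> List[float]:
        """Return one prediction per input row."""


class MeanRegressor(BaseModel):
    """Toy implementation: always predicts the mean of the training target."""

    def __init__(self) -> None:
        self.mean_: Optional[float] = None

    def fit(self, features, target):
        if len(target) == 0:
            raise ValueError("target must not be empty")
        self.mean_ = sum(target) / len(target)
        return self

    def predict(self, features):
        if self.mean_ is None:
            raise RuntimeError("call fit() before predict()")
        return [self.mean_ for _ in features]
```

Callers written against `BaseModel` never need to know which concrete class, or which engine, sits behind it.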
Engine-Specific ML Algorithms: Implementations of various machine learning algorithms (e.g., linear regression, k-means) tailored for specific data processing engines such as Pandas, Dask, or Spark.
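One way to picture per-engine implementations is a registry that maps an engine name to its version of the same algorithm. The sketch below is an assumption for illustration only: the registry, the `register` decorator, and `fit_line` are hypothetical names, and only a plain-Python "engine" is registered so the example runs without Pandas, Dask, or Spark.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical registry: engine name -> implementation of simple linear regression.
LINEAR_REGRESSION_BY_ENGINE: Dict[str, Callable[..., Tuple[float, float]]] = {}


def register(engine: str):
    """Decorator that records an implementation under the given engine name."""
    def decorator(func):
        LINEAR_REGRESSION_BY_ENGINE[engine] = func
        return func
    return decorator


@register("python")
def _fit_line_python(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for one feature, using only built-in arithmetic."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    if variance == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_line(xs, ys, engine: str = "python") -> Tuple[float, float]:
    """Dispatch to the implementation registered for the requested engine."""
    try:
        implementation = LINEAR_REGRESSION_BY_ENGINE[engine]
    except KeyError:
        raise ValueError(f"no implementation registered for engine {engine!r}") from None
    return implementation(xs, ys)
```

A real engine-specific variant would replace the loop arithmetic with that engine's vectorized or distributed operations while keeping the same inputs and outputs.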
Data Splitting Utility: Manages the preparation of datasets for machine learning tasks, typically by splitting them into training and testing sets.
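A train/test split of the kind described here can be sketched as a seeded shuffle followed by a cut. The function name `train_test_split`, its parameters, and the default fraction are assumptions for this example, not the utility's real signature.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    rows: Sequence[T], test_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle rows reproducibly, then return (train_rows, test_rows)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                      # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)      # private RNG: same seed, same order
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed makes the split repeatable, which matters when results from different engines are compared on the same data.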
Base String Clustering Logic: Encapsulates the fundamental logic for general string clustering operations, serving as a core reusable component for data transformation.
String Clustering Orchestrator: Acts as the high-level entry point for initiating string clustering operations, delegating to the core clustering logic and specialized algorithms.
Key Collision Clustering Algorithms: Specialized implementations of string clustering algorithms (e.g., fingerprint, n-gram fingerprint) optimized for different engines (cuDF, Pandas, Spark).
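The two keying methods named above can be sketched in plain Python. This follows the commonly described fingerprint and n-gram fingerprint recipes (normalize, strip punctuation, sort and de-duplicate tokens or n-grams). The function names and the exact normalization steps are assumptions and may differ from what the engine-specific implementations do.

```python
import re
import unicodedata


def _to_ascii(text: str) -> str:
    """Decompose accented characters and drop anything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def fingerprint(value: str) -> str:
    """Key built from the sorted, de-duplicated, lowercased word tokens."""
    text = _to_ascii(value.strip().lower())
    text = re.sub(r"[^\w\s]", "", text)        # remove punctuation
    tokens = sorted(set(text.split()))         # order and repeats no longer matter
    return " ".join(tokens)


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Key built from the sorted, de-duplicated character n-grams."""
    if n < 1:
        raise ValueError("n must be at least 1")
    text = _to_ascii(value.lower())
    text = re.sub(r"[^a-z0-9]", "", text)      # remove punctuation and whitespace
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))
```

Values that produce the same key "collide" and become candidates for the same cluster, so `"Smith, John"` and `"john  SMITH"` end up together under `fingerprint`.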
Base Key Collision Function: Provides common, underlying functionality and utilities for key collision-based string clustering, supporting the specialized algorithms.
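The shared step behind any key collision method is grouping values by their computed key and keeping the groups that contain more than one distinct spelling. The helper below is a hypothetical sketch of that step: `cluster_by_key` and the `word_key` function used to drive it are illustrative names, not the component's real interface.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def cluster_by_key(
    values: Iterable[str], key_function: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group values whose keys collide; keep only groups with 2+ distinct values."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for value in values:
        buckets[key_function(value)].append(value)
    return {
        key: members
        for key, members in buckets.items()
        if len(set(members)) > 1               # a lone spelling is not a cluster
    }


def word_key(value: str) -> str:
    """Illustrative key: lowercased words in sorted order."""
    return " ".join(sorted(value.lower().split()))
```

Because the keying function is passed in, the same grouping code can serve fingerprint, n-gram fingerprint, or any other key collision method.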
Spark ML Pipeline Functions: Manage the sequential flow and execution of machine learning models and transformations within a pipeline context, specifically for the Spark engine.
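The sequential flow of a pipeline can be illustrated without Spark. The sketch below is a plain-Python stand-in that only mirrors the general fit-then-transform staging idea; it does not use the Spark API, and the classes `Scaler`, `Offset`, and `SequentialPipeline` are hypothetical.

```python
from typing import List, Sequence


class Scaler:
    """Stage that learns the largest absolute value, then divides by it."""

    def fit(self, values: Sequence[float]) -> "Scaler":
        self.scale_ = max(abs(v) for v in values) or 1.0   # avoid dividing by zero
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        return [v / self.scale_ for v in values]


class Offset:
    """Stateless stage that adds a constant to every value."""

    def __init__(self, amount: float) -> None:
        self.amount = amount

    def fit(self, values: Sequence[float]) -> "Offset":
        return self                                        # nothing to learn

    def transform(self, values: Sequence[float]) -> List[float]:
        return [v + self.amount for v in values]


class SequentialPipeline:
    """Runs stages in order; each stage is fitted on the previous stage's output."""

    def __init__(self, stages) -> None:
        self.stages = list(stages)

    def fit(self, values: Sequence[float]) -> "SequentialPipeline":
        current = list(values)
        for stage in self.stages:
            current = stage.fit(current).transform(current)
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        current = list(values)
        for stage in self.stages:
            current = stage.transform(current)
        return current
```

For example, `SequentialPipeline([Scaler(), Offset(1.0)]).fit([2.0, 4.0])` learns a scale of 4.0 in the first stage and then applies both stages, in order, to any later input.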