[ENH] add CASH classification pipeline test-function #22

@SimonBlanke

Combined Algorithm Selection and Hyperparameter optimization (CASH) is the core problem behind Auto-sklearn, Auto-WEKA and FLAML. The optimizer must simultaneously choose which classifier to use and set its hyperparameters, where different algorithms have different parameter spaces. This test function exposes that joint selection problem as a single flat search space using parameter prefixing.

The search space combines algorithm selection with algorithm-specific hyperparameters:

{
    "algorithm": ["knn", "dt", "rf", "svm", "gb"],
    "knn__n_neighbors": [3, 5, 7, 11, 15, 21, 31],
    "dt__max_depth": [None, 2, 5, 10, 20],
    "dt__min_samples_split": [2, 5, 10, 20],
    "rf__n_estimators": [10, 50, 100, 200],
    "rf__max_depth": [None, 5, 10, 20],
    "svm__C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "svm__kernel": ["linear", "rbf", "poly"],
    "gb__n_estimators": [50, 100, 200],
    "gb__learning_rate": [0.01, 0.05, 0.1, 0.2],
    "gb__max_depth": [3, 5, 7],
}
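To get a feel for how much of this flat space is redundant, the following sketch compares the size of the full grid with the number of pipelines that can actually differ in score. The helper names `flat_grid_size` and `distinct_pipelines` are hypothetical and not part of the proposed API; the dictionary is copied from the listing above.

```python
from math import prod

search_space = {
    "algorithm": ["knn", "dt", "rf", "svm", "gb"],
    "knn__n_neighbors": [3, 5, 7, 11, 15, 21, 31],
    "dt__max_depth": [None, 2, 5, 10, 20],
    "dt__min_samples_split": [2, 5, 10, 20],
    "rf__n_estimators": [10, 50, 100, 200],
    "rf__max_depth": [None, 5, 10, 20],
    "svm__C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "svm__kernel": ["linear", "rbf", "poly"],
    "gb__n_estimators": [50, 100, 200],
    "gb__learning_rate": [0.01, 0.05, 0.1, 0.2],
    "gb__max_depth": [3, 5, 7],
}


def flat_grid_size(space):
    """Number of points an optimizer sees when every dimension is independent."""
    return prod(len(values) for values in space.values())


def distinct_pipelines(space):
    """Number of (algorithm, matching hyperparameters) combinations."""
    total = 0
    for algo in space["algorithm"]:
        prefix = f"{algo}__"
        # Only the dimensions carrying this algorithm's prefix matter for it.
        total += prod(len(v) for k, v in space.items() if k.startswith(prefix))
    return total


print(flat_grid_size(search_space))      # 6048000
print(distinct_pipelines(search_space))  # 94
```

Under this counting, roughly six million grid points map onto only 94 distinct pipeline configurations, so most moves in the flat space change a parameter that the selected algorithm never reads.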

The _ml_objective implementation selects the classifier named by the algorithm parameter, passes it only the hyperparameters carrying the matching prefix (parameters for other algorithms are ignored), evaluates it with cross-validation and returns the mean accuracy. Parameters for non-selected algorithms have no effect on the score, which creates large neutral regions in the landscape surrounding narrow algorithm-specific valleys.

The score is mean cross-validated accuracy, consistent with the existing classification functions. The constructor takes dataset and cv, following the same pattern as RandomForestClassifierFunction. The only dependency required is scikit-learn, under the surfaces[ml] extra.
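The objective described above could look roughly like the following. This is a minimal sketch and not the proposed implementation: the function names `select_params`, `build_classifier` and `cash_objective` are hypothetical, the mapping from algorithm keys to scikit-learn estimators is an assumption based on the abbreviations in the search space, and the real _ml_objective would take its data from the dataset argument of the constructor.

```python
def select_params(params, algorithm):
    """Keep only entries prefixed with '<algorithm>__' and strip the prefix."""
    prefix = f"{algorithm}__"
    return {
        key[len(prefix):]: value
        for key, value in params.items()
        if key.startswith(prefix)
    }


def build_classifier(algorithm, kwargs):
    """Instantiate the estimator assumed to correspond to the algorithm key."""
    # Imported here so the prefix filter above can be used without scikit-learn.
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    classifiers = {
        "knn": KNeighborsClassifier,
        "dt": DecisionTreeClassifier,
        "rf": RandomForestClassifier,
        "svm": SVC,
        "gb": GradientBoostingClassifier,
    }
    return classifiers[algorithm](**kwargs)


def cash_objective(params, X, y, cv=5):
    """Mean cross-validated accuracy of the pipeline selected by params."""
    from sklearn.model_selection import cross_val_score

    algorithm = params["algorithm"]
    # Hyperparameters of all other algorithms are dropped at this point.
    model = build_classifier(algorithm, select_params(params, algorithm))
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
```

Calling `cash_objective` with a parameter dictionary drawn from the search space and, for example, the arrays returned by `sklearn.datasets.load_iris(return_X_y=True)` yields a single accuracy value. Because no random_state is fixed in this sketch, the tree-based classifiers can return slightly different scores between calls.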
