In regression tasks the preprocessing choices often matter more than the model hyperparameters. Missing value handling, feature scaling, and feature transformation interact with each other and with the downstream model in ways that create a landscape with strong parameter dependencies. This test function optimizes a full sklearn Pipeline from imputation through prediction.
The search space parameters would be:
```python
{
    "imputer_strategy": ["mean", "median", "most_frequent"],
    "scaler": ["standard", "minmax", "robust", "none"],
    "feature_transform": ["none", "polynomial_2", "polynomial_3", "log1p"],
    "feature_selection_k": [5, 10, 15, 20, "all"],
    "model": ["ridge", "lasso", "elastic_net", "gb"],
    "model__alpha": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
}
```
The implementation builds a sklearn `Pipeline` based on the parameter values. When `feature_transform` is `"none"` the corresponding pipeline step is skipped. The `feature_selection_k` parameter controls `SelectKBest` with `f_regression` scoring, where `"all"` disables selection. The `model__alpha` parameter applies to Ridge, Lasso, and ElasticNet as regularization strength but is ignored when `model="gb"` (`GradientBoostingRegressor` uses its own defaults).
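A minimal sketch of how that assembly could look; the `build_pipeline` helper and the exact step names are illustrative assumptions, not an existing API:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    MinMaxScaler,
    PolynomialFeatures,
    RobustScaler,
    StandardScaler,
)

SCALERS = {"standard": StandardScaler, "minmax": MinMaxScaler, "robust": RobustScaler}


def build_pipeline(params: dict) -> Pipeline:
    """Hypothetical helper: assemble the Pipeline described by one search-space point."""
    steps = [("imputer", SimpleImputer(strategy=params["imputer_strategy"]))]

    if params["scaler"] != "none":
        steps.append(("scaler", SCALERS[params["scaler"]]()))

    transform = params["feature_transform"]
    if transform in ("polynomial_2", "polynomial_3"):
        degree = int(transform[-1])
        # include_bias=False avoids a constant column that f_regression cannot score
        steps.append(("transform", PolynomialFeatures(degree=degree, include_bias=False)))
    elif transform == "log1p":
        # log1p assumes inputs > -1; real code may need to clip or shift first
        steps.append(("transform", FunctionTransformer(np.log1p)))
    # "none" skips the transform step entirely

    k = params["feature_selection_k"]
    if k != "all":
        # k may still exceed the transformed feature count and need clamping
        steps.append(("select", SelectKBest(f_regression, k=k)))

    alpha = params["model__alpha"]
    models = {
        "ridge": lambda: Ridge(alpha=alpha),
        "lasso": lambda: Lasso(alpha=alpha),
        "elastic_net": lambda: ElasticNet(alpha=alpha),
        # gb ignores model__alpha and runs with GradientBoostingRegressor defaults
        "gb": lambda: GradientBoostingRegressor(),
    }
    steps.append(("model", models[params["model"]]()))
    return Pipeline(steps)
```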
The constructor accepts a `dataset` parameter with the same regression datasets as the existing regressor functions (diabetes, california, friedman1, friedman2, linear). The diabetes dataset is the most interesting choice here because it has moderate dimensionality and benefits visibly from preprocessing, while the california housing dataset provides a harder problem where feature selection has a larger effect. The score is the mean cross-validated R2, consistent with the existing regression functions. Only scikit-learn is needed under `surfaces[ml]`.
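For the scoring side, a hedged usage sketch on the diabetes dataset, reusing the hypothetical `build_pipeline` from above (the 5-fold split is an assumption):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

params = {
    "imputer_strategy": "median",
    "scaler": "standard",
    "feature_transform": "none",
    "feature_selection_k": "all",
    "model": "ridge",
    "model__alpha": 1.0,
}

# Mean cross-validated R2, matching the scoring of the existing regression functions
score = cross_val_score(build_pipeline(params), X, y, cv=5, scoring="r2").mean()
print(f"mean CV R2: {score:.3f}")
```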