import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["savefig.dpi"] = 300
plt.rcParams["savefig.bbox"] = "tight"
np.set_printoptions(precision=3, suppress=True)

import time
from scipy.stats import randint as sp_randint

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits, fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

from joblib import Memory

mem = Memory(location='/tmp')

# get some data
digits = load_digits()
X, y = digits.data, digits.target
# cached MNIST download (uncomment the call below to use MNIST instead of digits)
@mem.cache
def bla():
    mnist = fetch_openml("mnist_784")
    return mnist.data, mnist.target

# X, y = bla()
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

Parameter Tuning and AutoML

03/30/20

Andreas C. Müller

Motivation

  • Need to select among Models

  • Need to select Hyper-Parameters

  • Need to select among preprocessing methods

Conditional Hyper-Parameters

  • Kernels

  • Neural Nets

  • Pipelines

Formulating model-selection as Hyperparameter Optimization

  • One big search, many conditional Hyper-Parameters

  • Categorical, integer, continuous, conditional

  • Different distributions

CASH problem (Combined Algorithm Selection and Hyperparameter optimization)

  • Find the best configuration

  • Global optimization on complex (high-dim?) space

Black-Box Search Procedures

\[\Lambda^* = \arg\max_\Lambda f(\Lambda)\]

Parameters \(\Lambda\), model-evaluation \(f\).

General optimization of an unknown, non-differentiable, possibly non-smooth function f; NP-hard in general. On top of that, f is very slow to evaluate - think training a neural net for a week.
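To make this concrete, here is a minimal sketch (not from the slides) of what \(f\) typically looks like: train a model with a given configuration and return a cross-validated score. The function name f and the choice of a random forest are just for illustration.

# Sketch: the black box f maps a hyper-parameter configuration to a validation
# score; every call means training a model, which is what makes it expensive.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def f(lam, X, y):
    # lam is a dict of hyper-parameters, e.g. {"max_depth": 4, "max_features": 8}
    model = RandomForestClassifier(random_state=0, **lam)
    return cross_val_score(model, X, y, cv=5).mean()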

Random Search with scikit-learn

## specify parameters and distributions to sample from
from scipy.stats import randint
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 11),
              "min_samples_split": randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}
clf = RandomForestClassifier(n_estimators=20)
random_search = RandomizedSearchCV(clf,
                                   param_distributions=param_dist,
                                   n_iter=200)
  • Values can be lists or objects with an rvs method

  • Use continuous distributions for the biggest advantage (see the sketch below)
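As an illustration (assuming scipy >= 1.4 for loguniform; the parameter ranges are made up), C and gamma can be sampled from continuous log-uniform distributions instead of a fixed grid:

# Sketch: continuous log-uniform distributions for SVC parameters.
from scipy.stats import loguniform
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

param_dist_svc = {"C": loguniform(1e-3, 1e3),      # continuous, log scale
                  "gamma": loguniform(1e-4, 1e1)}  # continuous, log scale
svc_search = RandomizedSearchCV(SVC(), param_distributions=param_dist_svc,
                                n_iter=50, random_state=0)
# svc_search.fit(X_train, y_train)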

Bayesian Optimization, SMBO

  • Fit a 'cheap' probabilistic surrogate to the black-box function

  • Pick the next point using an exploration / exploitation trade-off

  • Implemented as an acquisition function (sketched below)
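A minimal SMBO sketch, not any of the implementations listed later: a Gaussian-process surrogate plus an expected-improvement acquisition function over a 1-d candidate grid. All names here (smbo, black_box, candidates) are made up for this illustration.

# Sketch of sequential model-based optimization (SMBO) with a GP surrogate and
# expected improvement (maximization). Illustrative only.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(mu, sigma, best):
    sigma = np.maximum(sigma, 1e-9)          # guard against zero predictive std
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def smbo(black_box, candidates, n_init=3, n_iter=10, random_state=0):
    rng = np.random.RandomState(random_state)
    candidates = np.asarray(candidates, dtype=float)
    X = list(rng.choice(candidates, size=n_init))   # cheap random initial design
    y = [black_box(x) for x in X]                   # expensive evaluations
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True)
        gp.fit(np.array(X)[:, None], y)
        mu, sigma = gp.predict(candidates[:, None], return_std=True)
        x_next = candidates[np.argmax(expected_improvement(mu, sigma, max(y)))]
        X.append(x_next)
        y.append(black_box(x_next))
    return X[int(np.argmax(y))]

# e.g. maximize a toy function over a log-spaced grid of a single parameter
print(smbo(lambda c: -np.log(c) ** 2, np.logspace(-2, 2, 101)))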

Surrogate functions


Evolutionary Methods: TPOT


Implementations

  • SMAC: Random Forest model (Hutter group)

  • spearmint: GP (Snoek et al)

  • hyperopt: TPE (Bergstra, not maintained)

  • scikit-optimize: GP, tree, etc

  • GPyOpt: GP based on GPy (Lawrence group)

Criticism


http://www.argmin.net/2016/06/20/hypertuning/


Beyond Black-Box

  • Hyperparameter gradient descent

  • Multi-Fidelity optimization

  • Meta-learning

  • (others…)

Multi-Fidelity Bayesian Optimization

  • Fit model to performance given parameters and budget

  • Choose parameters and budget for best exploration / exploitation

  • Related to multi-armed bandits and A/B testing

  • Differences to bandits:

    • non-stationary distributions

    • receiving loss (computing validation error) is expensive

    • possibly infinitely many arms (continuous parameters)

Successive Halving

  • Given \(n\) configurations and a budget \(B\)

  • Pick \(\eta=2\) or \(\eta=3\) (the wording below follows \(\eta=2\))

  • Each iteration, keep the best half of the configurations

  • After \(k=\log_\eta(n) + 1\) iterations, a single configuration is left.

  • Initially allocate \(\frac{B}{kn}\) to each configuration, then double the per-configuration budget each iteration (the exact budget is slightly more complicated, see the algorithm).

Successive Halving Example

  • configurations n=81

  • total budget B=20000

train 81 configurations with resources   41
train 27 configurations with resources  123
train  9 configurations with resources  370
train  3 configurations with resources 1111
train  1 configurations with resources 3333
resources total: 16638

Think of resources as, e.g., “maximum number of trees built in total” or “maximum number of data points used in total”.
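A short sketch of how the schedule above can be computed; it is essentially the same calculation as the notebook code at the very end, written with integer arithmetic. Here n_samples is the maximum per-configuration resource and rounds the number of halving iterations.

# Sketch: successive halving budget schedule for 81 configurations, eta=3,
# reproducing the numbers above.
n_configs, n_samples, eta, rounds = 81, 10000, 3, 5
total = 0
for k in range(rounds):
    n_left = n_configs // eta ** k               # configurations surviving round k
    budget = n_samples // eta ** (rounds - k)    # resources per configuration
    total += n_left * budget
    print("train {:2d} configurations with resources {:4d}".format(n_left, budget))
print("resources total:", total)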

Successive Halving (different) Example


Hyperband

BOHB / HpBandSter


In Practice

“With the exception of the LeNet experiment (Section 3.3) and the 117 Datasets experiment (Section 4.2.1), the most aggressive bracket of SuccessiveHalving outperformed Hyperband in all of our experiments.” — Li et al.

  • Soon (?) in sklearn (under discussion)

  • HpBandSter: distributed implementation, some custom code required

  • scikit-hyperband looks good, but doesn't do subsampling out of the box

  • Successive halving is really easy to implement yourself (see the sketch below).
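For instance, a minimal do-it-yourself sketch using data subsampling as the budget. ParameterSampler is real scikit-learn; everything else (the function name, the budget-doubling scheme, the cv=3 evaluation) is just one way to write it.

# Sketch: DIY successive halving (eta=2) with data subsampling as the budget.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score, ParameterSampler

def successive_halving(estimator, param_distributions, X, y,
                       n_configs=32, budget0=200, eta=2, random_state=0):
    rng = np.random.RandomState(random_state)
    configs = list(ParameterSampler(param_distributions, n_iter=n_configs,
                                    random_state=random_state))
    budget = budget0
    while len(configs) > 1:
        n_sub = min(budget, X.shape[0])
        idx = rng.choice(X.shape[0], size=n_sub, replace=False)   # subsample data
        scores = [cross_val_score(clone(estimator).set_params(**p),
                                  X[idx], y[idx], cv=3).mean() for p in configs]
        order = np.argsort(scores)[::-1]                           # best first
        configs = [configs[i] for i in order[:max(1, len(configs) // eta)]]
        budget *= eta                                              # more data for survivors
    return configs[0]

With the param_dist and random forest from the random-search slide, successive_halving(RandomForestClassifier(n_estimators=20), param_dist, X_train, y_train) returns the single surviving configuration.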

Meta-learning

Learning from experience (other datasets)

Ranking

  • Run many algorithms on large array of datasets

  • Rank by “best on average”

Portfolios

  • PoSH auto-sklearn

  • Create a diverse set so that a good configuration is among the top k

  • Submodular optimization problem

  • Greedy approximation (sketched below)
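A sketch of the greedy approximation, assuming a precomputed matrix of validation scores with one row per dataset and one column per candidate configuration; the matrix here is random stand-in data.

# Sketch: greedy portfolio construction. Repeatedly add the configuration that
# most increases the average per-dataset best score achieved so far.
import numpy as np

def greedy_portfolio(scores, k):
    n_datasets, n_configs = scores.shape
    chosen, best_so_far = [], np.full(n_datasets, -np.inf)
    for _ in range(k):
        gains = [np.maximum(best_so_far, scores[:, j]).mean() for j in range(n_configs)]
        j_best = int(np.argmax(gains))
        chosen.append(j_best)
        best_so_far = np.maximum(best_so_far, scores[:, j_best])
    return chosen

scores = np.random.RandomState(0).rand(50, 200)   # 50 datasets x 200 configurations
print(greedy_portfolio(scores, k=5))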

Discretization

Are continuous parameters actually important?

  • Anecdotal evidence that they are not.

Meta-Features and Meta-Models


Active Testing and Recommendations


Multi-Task Bayesian Optimization

  • Create an estimate over all datasets at the same time!

  • Not scalable with Gaussian Processes

  • Maybe scalable with Neural Networks?

Ensemble Models


Auto-sklearn

  • end-to-end auto-ml

  • searches over a fixed sklearn pipeline (4 steps)

  • Warm-starting with meta-features & KNN

  • Bayesian Optimization with SMAC

https://automl.github.io/auto-sklearn/stable/index.html

Playing around with auto-sklearn

import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
from sklearn.metrics import accuracy_score
X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print("Accuracy score", accuracy_score(y_test, y_hat))

“This will run for one hour and should result in an accuracy above 0.98.”

Practical Recommendations

  • Multi-Fidelity! Simple, effective!

  • Portfolios

  • BOHB / HpBandSter

  • auto-sklearn

  • TPOT?

Seems promising:

  • Transferring surrogates / ensembles

  • Collaborative filtering / active testing

Criticisms

  • Do we need 100 classifiers?

  • Do we need Complex Pipelines?

  • Creates complex models and ensembles

  • “Making it too easy”?

Although we already reduced the space of considered ML algorithms substantially compared to our previous Auto-sklearn (4 vs. 15 classifiers), we could have reduced this set even further since, in the end, only XGBoost models ended up in the final ensembles for the challenge.

— Feurer et al., PoSH auto-sklearn

dabl

https://dabl.github.io/

Implements a portfolio classifier with successive halving:

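A minimal usage sketch, assuming dabl is installed and that SimpleClassifier is the portfolio classifier referred to above:

# Sketch: dabl's SimpleClassifier on the digits data (assumed API).
from dabl import SimpleClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
sc = SimpleClassifier().fit(X_train, y_train)
print(sc.score(X_test, y_test))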

Questions?

training size in SVC

from sklearn.svm import SVC
param_grid = {'gamma': np.logspace(-3, 3, 7), 'C': np.logspace(-3, 3, 7)}
param_grid
{'gamma': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
 'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03])}
tot_time = []
results = []
train_sizes = [0.1, 0.2, 0.4, 0.8]
for train_size in train_sizes:
    X_train, X_test, y_train, y_test = train_test_split(X / 16., y, stratify=y, random_state=1, train_size=train_size)
    grid_search = GridSearchCV(SVC(), param_grid=param_grid, iid=False)
    start = time.time()
    grid_search.fit(X_train, y_train)
    tot_time.append(time.time() - start)
    res = pd.DataFrame(grid_search.cv_results_).pivot(index='param_C', columns='param_gamma', values='mean_test_score')
    results.append(res)
fig, axes = plt.subplots(1, 4, figsize=(15, 4))
for i, ax in enumerate(axes):
    ax.imshow(results[i].values, vmin=.8, vmax=.99)
    ax.set_title("Data fraction: {}% time: {:.0f}s".format(train_sizes[i] * 100, tot_time[i]))
    ax.set_xticks(np.arange(7))
    ax.set_xticklabels(param_grid['gamma'])
    ax.set_xlabel('gamma')
    ax.set_yticks(np.arange(7))
    ax.set_yticklabels(param_grid['C'])
    ax.set_ylabel('C')
plt.suptitle("RBF-SVM parameters on digits dataset")
plt.savefig("images/multi-fidelity-digits.png")
../_images/parameter_tuning_automl_43_0.png

n_estimators in Random Forest

from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_features=2).fit(digits.data, digits.target)
tree.get_depth()
19
param_grid_rf = {'max_depth': [1, 2, 4, 8, 12, 16, 19], 'max_features': [2, 4, 8, 16, 32, 64]}
tot_time_rf = []
results_rf = []
n_estimators = [10, 40, 80, 160]
X_train, X_test, y_train, y_test = train_test_split(X / 16., y, stratify=y, random_state=1)

for n_ests in n_estimators:
    grid_search = GridSearchCV(RandomForestClassifier(n_estimators=n_ests), param_grid=param_grid_rf, iid=False)
    start = time.time()
    grid_search.fit(X_train, y_train)
    tot_time_rf.append(time.time() - start)
    res = pd.DataFrame(grid_search.cv_results_).pivot(index='param_max_depth', columns='param_max_features', values='mean_test_score')
    results_rf.append(res)
fig, axes = plt.subplots(1, 4, figsize=(15, 4))

for i, ax in enumerate(axes):
    im = ax.imshow(results_rf[i].values, vmin=.85, vmax=.98)
    ax.set_title("n_estimators: {} time: {:.0f}s".format(n_estimators[i], tot_time_rf[i]))
    ax.set_xticks(np.arange(len(param_grid_rf['max_features'])))
    ax.set_xticklabels(param_grid_rf['max_features'])
    ax.set_xlabel('max_features')
    ax.set_yticks(np.arange(len(param_grid_rf['max_depth'])))
    ax.set_yticklabels(param_grid_rf['max_depth'])
    ax.set_ylabel('max_depth')
#plt.colorbar(im, ax=axes)
plt.suptitle("Random Forest parameters on digits dataset")
plt.savefig("images/multi-fidelity-digits-rf.png")
../_images/parameter_tuning_automl_47_0.png
np.log2(10000)
13.287712379549449
from civismlext import HyperbandSearchCV
# build a classifier
clf = RandomForestClassifier(n_estimators=20)


# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")


# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}



# use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}
hp = HyperbandSearchCV(clf, param_distributions=param_dist, cost_parameter_max={'n_estimators': 200}, verbose=10)
hp.fit(X_train, y_train)
from halving import GridSuccessiveHalving
sh = GridSuccessiveHalving(clf, param_grid=param_grid, random_state=0)
start = time.time()
sh.fit(X_train, y_train)

print("Successive Halving took %.2f seconds for %d candidate parameter settings."
      % (time.time() - start, len(sh.cv_results_['params'])))
report(sh.cv_results_)
7
n_samples_iter: 100
n_samples_iter: 100
/home/andy/checkout/scikit-learn/sklearn/model_selection/_split.py:643: Warning: The least populated class in y has only 4 members, which is too few. The minimum number of members in any class cannot be less than n_splits=5.
  % (min_groups, self.n_splits)), Warning)
n_samples_iter: 100
/home/andy/checkout/scikit-learn/sklearn/model_selection/_split.py:643: Warning: The least populated class in y has only 3 members, which is too few. The minimum number of members in any class cannot be less than n_splits=5.
  % (min_groups, self.n_splits)), Warning)
n_samples_iter: 100
n_samples_iter: 100
n_samples_iter: 100
n_samples_iter: 100
Successive Halving took 12.32 seconds for 145 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.885 (std: 0.035)
Parameters: {'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 10, 'min_samples_split': 3}

Model with rank: 2
Mean validation score: 0.875 (std: 0.064)
Parameters: {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 10, 'min_samples_split': 3}

Model with rank: 3
Mean validation score: 0.866 (std: 0.072)
Parameters: {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 3, 'min_samples_split': 3}

sh.score(X_test, y_test)
0.9594285714285714
# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, iid=False, verbose=10, random_state=0)

start = time.time()
random_search.fit(X_train, y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time.time() - start), n_iter_search))
report(random_search.cv_results_)
import pandas as pd
#.groupby("params").plot(x="iter", y="mean_test_score")
random_search.score(X_test, y_test)
0.9583428571428572
res = pd.DataFrame(sh.cv_results_)
res['params_str'] = res.params.apply(str)
reshape = res.pivot(index='iter', columns='params_str', values='mean_test_score')
reshape.plot(legend=False, alpha=.4, c='k')
<matplotlib.axes._subplots.AxesSubplot at 0x7f2b0164a7b8>
../_images/parameter_tuning_automl_58_1.png
import matplotlib.pyplot as plt
res.groupby("params_str").plot('iter', 'mean_test_score', ax=plt.gca(), legend=False);
plt.savefig("images/halving_curve.png")
../_images/parameter_tuning_automl_59_0.png
%matplotlib inline
res.plot('iter', 'mean_test_score', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x7f0196c080f0>
../_images/parameter_tuning_automl_60_1.png
X_train, X_test, y_train, y_test = train_test_split(X / 255., y, stratify=y, random_state=0, train_size=1000)
from sklearn.svm import SVC
param_grid = {'gamma': np.logspace(-3, 2, 6), 'C': np.logspace(-3, 2, 6)}
param_grid
{'gamma': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]),
 'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])}
# run grid search
grid_search = GridSearchCV(SVC(), param_grid=param_grid, iid=False, verbose=0)
start = time.time()
grid_search.fit(X_train, y_train)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time.time() - start, len(grid_search.cv_results_['params'])))
report(grid_search.cv_results_)
X_train, X_test, y_train, y_test = train_test_split(X / 255., y, stratify=y, random_state=0)

sh = GridSuccessiveHalving(SVC(), param_grid=param_grid, random_state=0, verbose=0)
start = time.time()
sh.fit(X_train, y_train)

print("Successive Halving took %.2f seconds for %d candidate parameter settings."
      % (time.time() - start, len(sh.cv_results_['params'])))
report(sh.cv_results_)
grid_search.score(X_test, y_test)
0.9194666666666667
sh.score(X_test, y_test)
0.9187536231884058
%matplotlib inline
import pandas as pd

res = pd.DataFrame(sh.cv_results_)
res['params_str'] = res.params.apply(str)
reshape = res.pivot(index='iter', columns='params_str', values='mean_test_score')
reshape.plot(legend=False, alpha=.4, c='k')
<matplotlib.axes._subplots.AxesSubplot at 0x7f48ffb91898>
../_images/parameter_tuning_automl_67_1.png
reshape
params_str {'C': 0.001, 'gamma': 0.001} {'C': 0.001, 'gamma': 0.01} {'C': 0.001, 'gamma': 0.1} {'C': 0.001, 'gamma': 1.0} {'C': 0.001, 'gamma': 10.0} {'C': 0.001, 'gamma': 100.0} {'C': 0.01, 'gamma': 0.001} {'C': 0.01, 'gamma': 0.01} {'C': 0.01, 'gamma': 0.1} {'C': 0.01, 'gamma': 1.0} ... {'C': 10.0, 'gamma': 0.1} {'C': 10.0, 'gamma': 1.0} {'C': 10.0, 'gamma': 10.0} {'C': 10.0, 'gamma': 100.0} {'C': 100.0, 'gamma': 0.001} {'C': 100.0, 'gamma': 0.01} {'C': 100.0, 'gamma': 0.1} {'C': 100.0, 'gamma': 1.0} {'C': 100.0, 'gamma': 10.0} {'C': 100.0, 'gamma': 100.0}
iter
0 0.164668 0.164668 0.129374 0.129374 0.164668 0.141138 0.164668 0.164668 0.129374 0.129374 ... 0.277687 0.129374 0.164668 0.141138 0.751719 0.766145 0.277687 0.129374 0.164668 0.141138
1 0.162487 NaN NaN NaN NaN NaN 0.162487 0.162487 NaN NaN ... 0.304140 NaN 0.162487 NaN 0.732745 0.733937 0.304140 NaN 0.162487 NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.293254 NaN NaN NaN 0.740079 0.776190 0.293254 NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.780840 0.800602 NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0.880104 NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0.895587 NaN NaN NaN NaN

6 rows × 36 columns

bla = res.groupby('params_str')['mean_test_score'].median()
new = res.drop_duplicates(subset='params_str', keep='last')[['param_gamma', 'param_C', 'mean_test_score']]
new.pivot(index='param_C', columns='param_gamma')
mean_test_score
param_gamma 0.001 0.010 0.100 1.000 10.000 100.000
param_C
0.001 0.162487 0.164668 0.129374 0.129374 0.164668 0.141138
0.010 0.162487 0.162487 0.129374 0.129374 0.162487 0.141138
0.100 0.162487 0.164668 0.129374 0.129374 0.162487 0.141138
1.000 0.182540 0.846617 0.293254 0.129374 0.162487 0.141138
10.000 0.785551 0.895587 0.293254 0.129374 0.162487 0.141138
100.000 0.780840 0.895587 0.293254 0.129374 0.162487 0.141138
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(new.pivot(index='param_C', columns='param_gamma'))
<matplotlib.image.AxesImage at 0x7f48feefe908>
../_images/parameter_tuning_automl_72_1.png
res.drop_duplicates(subset='params_str', keep='last')[['param_gamma', 'param_C', 'iter']].pivot(index='param_C', columns='param_gamma')
iter
param_gamma 0.001 0.010 0.100 1.000 10.000 100.000
param_C
0.001 1 0 0 0 0 0
0.010 1 1 0 0 1 0
0.100 1 0 0 0 1 0
1.000 2 4 2 0 1 0
10.000 3 5 2 0 1 0
100.000 3 5 2 0 1 0
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
start = time.time()
LogisticRegressionCV(multi_class='multinomial', solver='sag').fit(X_train, y_train)
print(time.time() - start)
np.logspace(-3, 2, 6)
array([  0.001,   0.01 ,   0.1  ,   1.   ,  10.   , 100.   ])
27 * 3
81
n_params = 81
n_samples = 10000
eta = 3.
budget = 23000
l = np.arange(15)
s = np.where(np.floor(n_params * n_samples * (l + 1) * eta ** -l) < budget)[0].min()
s
5
s = 5
resources_spent = 0
for k in range(0, s + 1):
    n_params_left = np.floor(n_params*eta ** -k)
    if n_params_left < 1:
        break
    resources =  np.floor(n_samples * eta ** (k - s))
    resources_spent += n_params_left * resources
    print("train {} configurations with resources {}".format(int(n_params_left), int(resources)))
print("resources total: {}".format(resources_spent))
train 81 configurations with resources 41
train 27 configurations with resources 123
train 9 configurations with resources 370
train 3 configurations with resources 1111
train 1 configurations with resources 3333
resources total: 16638.0
k
5
n_params
16