# W4995 Applied Machine Learning
# Parameter Tuning and AutoML
03/11/19
Andreas C. Müller

# Motivation
- Need to select among Models
- Need to select Hyper-Parameters
- Need to select among preprocessing methods

# Conditional Hyper-Parameters
- Kernels
- Neural Nets
- Pipelines

# Formulating model-selection as Hyperparameter Optimization
- One big search, many conditional Hyper-Parameters
- Categorical, integer, continuous, conditional
- Different distributions
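
A minimal sketch of "one big search" with conditional hyper-parameters, using scikit-learn's list-of-dicts `param_grid`. The pipeline layout, the step names `scaler` and `clf`, the candidate values and the iris data are all assumptions chosen for illustration. Each dict is one branch of the search space, so `gamma` is only searched when the kernel is `rbf`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])

# Each dict is a separate branch; parameters only exist inside their branch.
param_grid = [
    # RBF kernel: gamma is a conditional hyper-parameter
    {"clf": [SVC()], "clf__kernel": ["rbf"],
     "clf__C": [0.1, 1, 10], "clf__gamma": [0.01, 0.1, 1]},
    # linear kernel: no gamma
    {"clf": [SVC()], "clf__kernel": ["linear"], "clf__C": [0.1, 1, 10]},
    # different model, and the preprocessing step is switched off
    {"clf": [RandomForestClassifier(n_estimators=20, random_state=0)],
     "scaler": ["passthrough"], "clf__max_depth": [3, None]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Model choice, preprocessing choice and hyper-parameters all live in the same search object here, which is the formulation the next slides build on.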

# CASH problem
- CASH: Combined Algorithm Selection and Hyperparameter optimization
- Find the best configuration
- Global optimization on a complex (high-dimensional?) space

# Issues with Grid-Search
- Need to define Grid
- Exponential in number of dims
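
The exponential growth can be made concrete with a few lines of arithmetic. The choice of 5 values per dimension is an arbitrary assumption.

```python
from itertools import product

def grid_size(values_per_dim, n_dims):
    # a full grid evaluates every combination of values
    return values_per_dim ** n_dims

# enumerate a small 2-d grid explicitly: 3 values per dimension -> 9 points
small_grid = list(product([0.1, 1, 10], repeat=2))

# 5 values per dimension for 1, 2, 4 and 8 dimensions
sizes = {d: grid_size(5, d) for d in (1, 2, 4, 8)}
print(len(small_grid), sizes)
```

With 8 hyper-parameters the grid already has 390625 points, each of which needs a full cross-validation run.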

# Black-Box Search Procedures
`$$\Lambda^* = \arg\max_\Lambda f(\Lambda)$$`
Parameters $\Lambda$, model-evaluation $f$.

General optimization of an unknown, non-differentiable, possibly non-smooth function $f$. NP-hard in general. The function $f$ is very slow to evaluate: think of training a neural net for a week.
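
As a sketch of what $f$ can look like in code: a function that takes a configuration and returns a cross-validated score. The SVC model, the iris data and the specific configuration are assumptions for illustration. The search procedure only ever sees inputs and outputs of `f`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def f(config):
    # black box: configuration in, mean cross-validation accuracy out
    model = SVC(**config)
    return cross_val_score(model, X, y, cv=5).mean()

score = f({"C": 1.0, "gamma": 0.1})
print(score)
```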

# Random Search

# Random Search with scikit-learn
```python
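from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# assumption: clf is a random forest, matching the parameter names below
clf = RandomForestClassifier(n_estimators=20)

# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 11),
              "min_samples_split": randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# number of parameter settings that are sampled
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)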
```
- Parameter values: lists or objects with an `rvs` method (e.g. `scipy.stats` distributions)
- Use continuous distributions for biggest advantage
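
A sketch of random search over continuous distributions, assuming an SVC on the iris data and `scipy.stats.loguniform` (available in SciPy 1.4 and later). The ranges are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# sample C and gamma on a log scale instead of from a fixed list
param_dist = {"C": loguniform(1e-3, 1e3),
              "gamma": loguniform(1e-4, 1e1)}

search = RandomizedSearchCV(SVC(), param_distributions=param_dist,
                            n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Every one of the 20 iterations tries a new value of `C` and a new value of `gamma`, whereas a 20-point grid would cover only 4 or 5 distinct values per parameter.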

# Bayesian Optimization, SMBO
- fit 'cheap' probabilistic function to black-box
- pick next point using exploration / exploitation
- Implemented as acquisition function
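
The loop can be sketched with a Gaussian process surrogate and expected improvement as the acquisition function. This is a toy illustration, not the code of any of the libraries listed later: the 1-d `objective`, the candidate grid, the Matern kernel and the `xi` value are all assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # hypothetical stand-in for an expensive model evaluation, maximum at 0.7
    return -(x - 0.7) ** 2

def expected_improvement(candidates, gp, best_y, xi=0.01):
    # trades off high predicted mean (exploitation) and high uncertainty (exploration)
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y - xi) / sigma
    return (mu - best_y - xi) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.RandomState(0)
X_obs = rng.uniform(0, 1, size=(3, 1))       # a few random initial points
y_obs = objective(X_obs).ravel()
candidates = np.linspace(0, 1, 201).reshape(-1, 1)

for _ in range(10):
    # fit the cheap probabilistic model to all evaluations so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True, random_state=0)
    gp.fit(X_obs, y_obs)
    # pick the next point by maximizing the acquisition function
    ei = expected_improvement(candidates, gp, y_obs.max())
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

best_x = X_obs[np.argmax(y_obs), 0]
print(best_x)
```

Only `objective` is expensive in a real setting. Fitting the surrogate and maximizing the acquisition function are cheap by comparison.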

# Surrogate functions

# Evolutionary Methods: TPOT

# Implementations
- SMAC: Random Forest model (Hutter group)
- spearmint: GP (Snoek et al)
- hyperopt: TPE (Bergstra, not maintained)
- scikit-optimize: GP, tree, etc
- GPyOpt: GP based on GPy (Lawrence group)

# Criticism

# Beyond Black-Box
- Hyperparameter gradient descent
- Multi-Fidelity optimization
- Meta-learning
- (others...)

# Multi-Fidelity Search
Approximate the function by a similar, cheaper function.

- Top: subsample the dataset
- Bottom: use fewer trees in the forest

# Multi-Fidelity Bayesian Optimization
- Fit model to performance given parameters and budget
- Choose parameters and budget for best exploration / exploitation
- Related to multi-armed bandits and A/B testing
- Differences from bandits:
  - non-stationary distributions
  - receiving loss (computing validation error) is expensive
  - possibly infinitely many arms (continuous parameters)

# Successive Halving
- Given $n$ configurations and budget $B$
- Pick $\eta=2$ or $\eta=3$ (the wording here follows $\eta=2$)
- Each iteration, keep the best half of the configurations (the best $1/\eta$ in general)
- After $k=\log_\eta(n) + 1$ iterations, a single configuration is left
- Initially allocate $\frac{B}{kn}$ to each configuration and double it each iteration (the exact budget is slightly more complicated, see algorithm)

# Successive Halving Example
- configurations $n=81$
- total budget $B=20000$
- $\eta=3$ (each round keeps the best third)
```
train 81 configurations with resources 41
train 27 configurations with resources 123
train 9 configurations with resources 370
train 3 configurations with resources 1111
train 1 configurations with resources 3333
resources total: 16638
```

Think about resources as in "maximum number of trees built in total" or "maximum number of data points used in total".
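
A minimal sketch of successive halving under the simplified rule of spending $B/k$ per round, split evenly over the surviving configurations. Because of that simplification, the resource numbers differ from the printed schedule above. `toy_evaluate` and the list of configurations are hypothetical stand-ins for "train this configuration with this budget and return a validation score".

```python
def successive_halving(configs, evaluate, total_budget, eta=3):
    # number of rounds k = log_eta(n) + 1, counted with integers to avoid float rounding
    n_rounds, n = 1, len(configs)
    while n > 1:
        n = max(1, n // eta)
        n_rounds += 1

    survivors = list(configs)
    schedule = []  # (number of configurations, budget per configuration) per round
    for _ in range(n_rounds):
        # every round gets total_budget / k, shared by the surviving configurations
        budget = total_budget // (n_rounds * len(survivors))
        schedule.append((len(survivors), budget))
        scores = [evaluate(c, budget) for c in survivors]
        ranked = sorted(zip(scores, survivors), key=lambda pair: pair[0], reverse=True)
        # keep the best 1/eta fraction, at least one configuration
        n_keep = max(1, len(survivors) // eta)
        survivors = [c for _, c in ranked[:n_keep]]
    return survivors[0], schedule


def toy_evaluate(x, budget):
    # hypothetical score: higher is better, best at x = 0.3, small budgets are penalized
    return -(x - 0.3) ** 2 - 1.0 / budget


configs = [i / 80 for i in range(81)]
best, schedule = successive_halving(configs, toy_evaluate, total_budget=20000, eta=3)
for n_configs, budget in schedule:
    print(f"train {n_configs} configurations with resources {budget}")
```

In a real search `evaluate` would subsample the data or limit the number of trees according to `budget`, so early rounds are cheap and only promising configurations get the full budget.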

# Successive Halving (different) Example

# Hyperband

# Hyperband

# BOHB / HpBandSter

# In Practice
"With the exception of the LeNet experiment (Section 3.3) and the 117 Datasets experiment (Section 4.2.1), the most aggressive bracket of SuccessiveHalving outperformed Hyperband in all of our experiments."
.quote_author[Li et al.]

- Soon (?) in sklearn (discussion)
- HpBandSter: distributed implementation, some custom code required
- scikit-hyperband: looks good, but doesn't do out-of-the-box subsampling
- Successive halving is really easy to implement yourself.

# Meta-learning
## Learning from experience (other datasets)

# Ranking
- Run many algorithms on large array of datasets
- Rank by "best on average"
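
One reading of "best on average" is the average rank across datasets, sketched below. The algorithm names and the score matrix are made-up toy values, not results from any benchmark.

```python
import numpy as np
from scipy.stats import rankdata

algorithms = ["logreg", "random_forest", "svm"]  # hypothetical candidates
# rows: datasets, columns: algorithms (made-up accuracies)
scores = np.array([
    [0.90, 0.85, 0.80],
    [0.70, 0.75, 0.60],
    [0.95, 0.90, 0.92],
])

# rank within each dataset, rank 1 = best (hence the minus sign)
ranks = np.apply_along_axis(rankdata, 1, -scores)
mean_rank = ranks.mean(axis=0)
order = [algorithms[i] for i in np.argsort(mean_rank)]
print(order, mean_rank)
```

Averaging ranks instead of raw scores keeps easy and hard datasets on the same scale. Averaging the raw scores is the other obvious option.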

# Portfolios
- PoSH auto-sklearn
- Create a diverse set of configurations so that a good one is among the top $k$
- Submodular optimization problem
- greedy approximation
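
The greedy approximation can be sketched as follows: repeatedly add the configuration that most improves the average "best score in the portfolio" over all datasets. This is an illustration of the idea, not the PoSH auto-sklearn code, and the score matrix is made up.

```python
import numpy as np

def greedy_portfolio(scores, k):
    # scores: array of shape (n_datasets, n_configs), higher is better
    n_datasets, _ = scores.shape
    portfolio = []
    best_so_far = np.full(n_datasets, -np.inf)
    for _ in range(k):
        # value of the portfolio if each candidate were added
        value = np.maximum(best_so_far[:, None], scores).mean(axis=0)
        if portfolio:
            value[portfolio] = -np.inf  # do not pick the same configuration twice
        choice = int(np.argmax(value))
        portfolio.append(choice)
        best_so_far = np.maximum(best_so_far, scores[:, choice])
    return portfolio

# made-up scores: configs 0 and 1 are near-duplicates, config 2 covers other datasets
scores = np.array([
    [0.92, 0.90, 0.50, 0.60],
    [0.90, 0.90, 0.50, 0.60],
    [0.50, 0.50, 0.88, 0.60],
    [0.50, 0.50, 0.88, 0.60],
])
portfolio = greedy_portfolio(scores, k=2)
print(portfolio)
```

Picking the top 2 by average score would select configurations 0 and 1, which fail on the same datasets. The greedy portfolio picks 0 and 2 instead, which is the diversity the slide asks for.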

# Discretization
## Are continuous parameters actually important?
- Anecdotal evidence that they are not.

# Meta-Features and Meta-Models

# Active Testing and Recommendations
- Probabilistic Matrix Factorization for AutoML (Fusi et al.)

# Multi-Task Bayesian Optimization
- Create estimate over all datasets at the same time!
- Not scalable with Gaussian Processes
- Maybe scalable with Neural Networks?

# Ensemble Models

# Auto-sklearn
- end-to-end auto-ml
- searches a fixed sklearn pipeline (4 steps)
- Warm-starting with meta-features & KNN
- Bayesian Optimization with SMAC
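
The warm-starting idea can be sketched roughly as: describe each dataset by meta-features, find the nearest previously seen datasets, and start the search from the configurations that worked best there. This is not auto-sklearn's actual code. The meta-features, the stored configurations and the use of `NearestNeighbors` are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# hypothetical meta-features of past datasets:
# [log10(n_samples), log10(n_features), n_classes]
past_meta = np.array([
    [3.0, 1.0, 2],
    [5.0, 2.0, 10],
    [4.0, 3.0, 2],
    [5.2, 1.9, 10],
])
# hypothetical best configuration found on each past dataset
past_best_config = [{"C": 1.0}, {"C": 10.0}, {"C": 0.1}, {"C": 100.0}]

# meta-features of the new dataset
new_meta = np.array([[5.1, 2.0, 10]])

# the 2 most similar past datasets provide the initial configurations
nn = NearestNeighbors(n_neighbors=2).fit(past_meta)
_, idx = nn.kneighbors(new_meta)
warm_start = [past_best_config[i] for i in idx[0]]
print(warm_start)
```

In practice the meta-features would be scaled before computing distances, and the warm-start configurations would be evaluated first before Bayesian optimization takes over.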

# Playing around with auto-sklearn
```python
import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
from sklearn.metrics import accuracy_score
X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print("Accuracy score", accuracy_score(y_test, y_hat))
```
"This will run for one hour and should result in an accuracy above 0.98."

# Practical Recommendations
- Multi-Fidelity! Simple, effective!
- Portfolios
- BOHB / HpBandSter
- auto-sklearn
- TPOT?

## Seems promising:
- Transferring surrogates / ensembles
- Collaborative filtering / active testing

# Criticisms
- Do we need 100 classifiers?
- Do we need Complex Pipelines?
- Creates complex models and ensembles
- "Making it too easy"?

"Although we already reduced the space of considered ML algorithms substantially compared to our previous Auto-sklearn (4 vs. 15 classifiers), we could have reduced this set even further since, in the end, only XGBoost models ended up in the final ensembles for the challenge."
.quote_author[Feurer et al., PoSH auto-sklearn]