class: center, middle

### Introduction to Machine learning with scikit-learn

# Cross Validation and Grid Search

Andreas C. Müller

Columbia University, scikit-learn

.smaller[https://github.com/amueller/ml-training-intro]

---
class: center, middle

# Model evaluation and selection

---
# Threefold split

.padding-top[
]

???
The simplest way to combat this overfitting to the test set is a threefold split of the data into a training, a validation and a test set, as we just did. We use the training set for model building, the validation set for parameter selection, and the test set for a final evaluation of the model. So how many models should you try out on the test set? Only one! Ideally you use the test set exactly once; otherwise you make a multiple-hypothesis-testing error!

What are the downsides of this? We lose a lot of data for evaluation, and the results depend on the particular sampling.

--
pro: fast, simple

con: high variance, bad use of data

---
# Implementing threefold split

.smaller[
```python
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval)

val_scores = []
neighbors = np.arange(1, 15, 2)
for i in neighbors:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    val_scores.append(knn.score(X_val, y_val))

print("best validation score: {:.3f}".format(np.max(val_scores)))
best_n_neighbors = neighbors[np.argmax(val_scores)]
print("best n_neighbors:", best_n_neighbors)

knn = KNeighborsClassifier(n_neighbors=best_n_neighbors)
knn.fit(X_trainval, y_trainval)
print("test-set score: {:.3f}".format(knn.score(X_test, y_test)))
```

```
best validation score: 0.991
best n_neighbors: 11
test-set score: 0.951
```
]

???
FIXME complex code

Here is an implementation of the threefold split for selecting the number of neighbors. For each number of neighbors that we want to try, we build a model on the training set and evaluate it on the validation set. We then pick the parameter with the best validation-set score, here 99.1%, achieved with 11 neighbors. We then retrain the model with this parameter and evaluate on the test set. The retraining step is somewhat optional; we could also just use the best model. But retraining allows us to make better use of all the data.

Still, depending on the test-set size we might be using only 70% or 80% of the data, and our results depend on how exactly we split the datasets. So how can we make this more robust?

---
# Cross-validation

.padding-top[
]

???
The answer is of course cross-validation. In cross-validation, you split your data into multiple folds, usually 5 or 10, and build multiple models. You start by using fold 1 as the test data and the remaining folds as the training data. You build your model on the training data and evaluate it on the test fold. For each split of the data you get a model evaluation and a score. In the end, you can aggregate the scores, for example by taking the mean.

What are the pros and cons of this? Each data point is in the test set exactly once! It takes 5 or 10 times longer! Better use of the data (larger training sets). Does that solve all problems? No, it replaces only one of the splits, usually the inner one!
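As a minimal sketch of the basic mechanics (assuming X, y, numpy as np and KNeighborsClassifier are available as on the previous slides, and n_neighbors=5 is an arbitrary choice), cross-validating one fixed model and aggregating the fold scores could look like this:

```python
from sklearn.model_selection import cross_val_score

# five-fold cross-validation of a single model: returns one score per fold
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=5)
print(scores)
print("mean: {:.3f} +/- {:.3f}".format(np.mean(scores), np.std(scores)))
```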
--

pro: more stable, more data

con: slower

???

---
class: center, some-space

# Cross-validation + test set

???
Here is what the workflow looks like when we use five-fold cross-validation together with a test-set split for adjusting parameters. We start out by splitting off the test data, and then we perform cross-validation on the training set. Once we have found the right parameter setting, we retrain on the whole training set and evaluate on the test set.

---
# Grid-Search with Cross-Validation

.smaller[
```python
from sklearn.model_selection import cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y)

cross_val_scores = []
for i in neighbors:
    knn = KNeighborsClassifier(n_neighbors=i)
    scores = cross_val_score(knn, X_train, y_train, cv=10)
    cross_val_scores.append(np.mean(scores))

print("best cross-validation score: {:.3f}".format(np.max(cross_val_scores)))
best_n_neighbors = neighbors[np.argmax(cross_val_scores)]
print("best n_neighbors:", best_n_neighbors)

knn = KNeighborsClassifier(n_neighbors=best_n_neighbors)
knn.fit(X_train, y_train)
print("test-set score: {:.3f}".format(knn.score(X_test, y_test)))
```

```
best cross-validation score: 0.967
best n_neighbors: 9
test-set score: 0.965
```
]

???
Here is an implementation of this for k nearest neighbors. We split the data, then we iterate over all parameter values, and for each of them we do cross-validation. We had seven different values of n_neighbors, and we are running 10-fold cross-validation. How many models do we train in total? 10 * 7 + 1 = 71 (the one extra is the final model).

---
class: center, middle

???
Here is a conceptual overview of this way of tuning parameters. We start off with the dataset and a candidate set of parameters we want to try, labeled "parameter grid", for example the number of neighbors. We split the dataset into training and test sets. We use cross-validation and the parameter grid to find the best parameters. We then use the best parameters and the training set to build a model, and finally evaluate it on the test set.

---
class: center, middle

# Cross-Validation Strategies

???
So I mentioned k-fold cross-validation, where k is usually 5 or 10, but there are many other strategies. One of the most common ones is stratified k-fold cross-validation.

---
# StratifiedKFold

Stratified: Ensure relative class frequencies in each fold reflect relative class frequencies on the whole dataset.

???
The idea behind stratified k-fold cross-validation is that you want the test set to be as representative of the dataset as possible. StratifiedKFold makes the class frequencies in each fold the same as in the overall dataset. Here is an example of a dataset with three classes that are ordered. If you apply standard three-fold cross-validation to this, the first third of the data would be in the first fold, the second third in the second fold and the last third in the third fold. Because this data is sorted, that would be particularly bad. If you use stratified cross-validation, it makes sure that each fold has exactly 1/3 of the data from each class.

This is also helpful if your data is very imbalanced. If some of the classes are very rare, it could otherwise happen that a class is not present at all in a particular fold.

---
# Defaults in scikit-learn

- Three-fold is the default number of folds
- For classification, cross-validation is stratified
- train_test_split has a stratify option: train_test_split(X, y, stratify=y)
- No shuffle by default!

???
Before we go on to the other strategies, I wanted to point out the default behavior in scikit-learn. By default, all cross-validation strategies are three-fold. If you do cross-validation for classification, it will be stratified by default. Because of how the interface is designed, that's not true for train_test_split: if you want a stratified train_test_split, which is always a good idea, you should use stratify=y. Another thing that's important to keep in mind is that by default scikit-learn doesn't shuffle! So if you run cross-validation twice with the default parameters, it will yield exactly the same results.
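Here is a small sketch of these defaults (assuming X, y, train_test_split and KNeighborsClassifier as before; n_neighbors=5 is arbitrary). For a classifier, an integer cv means stratified, unshuffled folds, so repeated calls give identical scores unless you pass a shuffling splitter yourself:

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# stratified train/test split: not the default, so pass stratify=y explicitly
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5)
# cv=5 for a classifier uses StratifiedKFold without shuffling:
# running this twice prints exactly the same scores
print(cross_val_score(knn, X_train, y_train, cv=5))
print(cross_val_score(knn, X_train, y_train, cv=5))

# to shuffle, pass a splitter explicitly
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(knn, X_train, y_train, cv=shuffled_cv))
```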
---
class: spacious

# Repeated KFold and LeaveOneOut

- LeaveOneOut: KFold(n_folds=n_samples)
  High variance, takes a long time

- Better: RepeatedKFold.

  Apply KFold or StratifiedKFold multiple times with shuffled data. Reduces variance!

???
If you want even better estimates of the generalization performance, you could try to increase the number of folds, with the extreme of creating one fold per sample. That's called LeaveOneOut cross-validation. However, because the test set is so small every time, and the training sets all have very large overlap, this method has very high variance. A better way to get a robust estimate is to run 5-fold or 10-fold cross-validation multiple times while shuffling the dataset.
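Here is a rough sketch of both options, assuming X_train, y_train and KNeighborsClassifier are available as in the earlier examples (the n_neighbors value is arbitrary). RepeatedKFold works the same way as the stratified variant shown here:

```python
from sklearn.model_selection import (cross_val_score, LeaveOneOut,
                                     RepeatedStratifiedKFold)

knn = KNeighborsClassifier(n_neighbors=5)

# LeaveOneOut: one fold per sample, so one score per sample (slow!)
loo_scores = cross_val_score(knn, X_train, y_train, cv=LeaveOneOut())
print("LeaveOneOut mean: {:.3f}".format(loo_scores.mean()))

# shuffled stratified 5-fold, repeated 10 times -> 50 scores,
# a much more stable estimate of generalization performance
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
rskf_scores = cross_val_score(knn, X_train, y_train, cv=rskf)
print("RepeatedStratifiedKFold mean: {:.3f} +/- {:.3f}".format(
    rskf_scores.mean(), rskf_scores.std()))
```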
---
# GroupKFold

.padding-top[
]

???
A somewhat more complicated approach is group k-fold. This is actually for data that doesn't fulfill our IID assumption and has correlations between samples. The idea is that there are several groups in the data that each contain highly correlated samples. You could think about patient data where you have multiple samples for each patient; the group would then be which patient a measurement was taken from. If you want to know how well your model generalizes to new patients, you need to ensure that the measurements from each patient are either all in the training set or all in the test set. And that's what GroupKFold does.

In this example, there are four groups, and we want three folds. The data is divided such that each group is contained in exactly one fold. There are several other cross-validation methods in scikit-learn that use these groups.

---
# TimeSeriesSplit

.padding-top[
]

???
Another common case of data that's not independent is time series. Usually today's stock price is correlated with yesterday's and tomorrow's. If you randomly split time series, this makes prediction deceptively simple. In applications, you usually have data up to some point in time and then try to make predictions for the future; in other words, you're trying to make a forecast. The TimeSeriesSplit in scikit-learn simulates that by taking increasing chunks of data from the past and making predictions on the next chunk. This is quite different from the other ways to do cross-validation, in that the training sets are all overlapping, but it's more appropriate for time series.

---
# Using Cross-Validation Generators

.smaller[
```python
from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit

kfold = KFold(n_splits=5)
skfold = StratifiedKFold(n_splits=5, shuffle=True)
ss = ShuffleSplit(n_splits=20, train_size=.4, test_size=.3)

print("KFold:")
print(cross_val_score(KNeighborsClassifier(), X, y, cv=kfold))
print("StratifiedKFold:")
print(cross_val_score(KNeighborsClassifier(), X, y, cv=skfold))
print("ShuffleSplit:")
print(cross_val_score(KNeighborsClassifier(), X, y, cv=ss))
```

```
KFold:
[ 0.93  0.96  0.96  0.98  0.96]
StratifiedKFold:
[ 0.97  0.95  0.98  0.96  0.96]
ShuffleSplit:
[ 0.93  0.96  0.95  0.98  0.95  0.98  0.97  0.95  0.96  0.96  0.99  0.96
  0.96  0.96  0.98  0.96  0.95  0.95  0.96  0.96]
```
]

???
FIXME: repeated kfold

Ok, so how do we use these cross-validation generators? We simply pass the object to the cv parameter of the cross_val_score function, instead of passing a number; then that generator will be used. Here are some examples for the k-neighbors classifier. We instantiate a KFold object with the number of splits equal to 5 and pass it to cross_val_score. We can do the same with StratifiedKFold, and we can also shuffle if we like, or we can use ShuffleSplit.
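The group-based and time-series splitters from the previous slides work the same way. Here is a small sketch; the groups array is made up just for illustration (standing in for, say, which patient each sample came from), and TimeSeriesSplit is only meaningful if the rows are actually ordered in time:

```python
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# hypothetical group labels, one per sample
groups = np.arange(len(X)) % 4

# every group ends up entirely in either the training or the test fold
print("GroupKFold:")
print(cross_val_score(KNeighborsClassifier(), X, y, groups=groups,
                      cv=GroupKFold(n_splits=3)))

# growing window of past data, always evaluated on the next chunk
print("TimeSeriesSplit:")
print(cross_val_score(KNeighborsClassifier(), X, y,
                      cv=TimeSeriesSplit(n_splits=5)))
```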
---
class: center, middle

???
Let's come back to the general workflow for adjusting hyper-parameters, though. So we start with the hyper-parameters we want to adjust and our dataset; we split it into training and test sets, find the best parameters using cross-validation, retrain the model, and then do a final evaluation on the test set. Because this is such a common pattern, there is a helper class for this in scikit-learn, called GridSearchCV, which does most of these steps for you.

---
# GridSearchCV

.smaller[
```python
from sklearn.model_selection import GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

param_grid = {'n_neighbors': np.arange(1, 15, 2)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=10)
grid.fit(X_train, y_train)

print("best mean cross-validation score: {:.3f}".format(grid.best_score_))
print("best parameters:", grid.best_params_)
print("test-set score: {:.3f}".format(grid.score(X_test, y_test)))
```

```
best mean cross-validation score: 0.967
best parameters: {'n_neighbors': 9}
test-set score: 0.993
```
]

???
Here is an example. We still need to split our data into training and test sets. We declare the parameters we want to search over as a dictionary; in this example the parameter is just n_neighbors and the values we want to try out are a range. The keys of the dictionary are the parameter names and the values are the parameter settings we want to try. If you specify multiple parameters, all possible combinations are tried. This is where the name grid search comes from: it's an exhaustive search over all possible parameter combinations that you specify.

GridSearchCV is a class, and it behaves just like any other model in scikit-learn, with fit, predict and score methods. It's what we call a meta-estimator: you give it one estimator, here the KNeighborsClassifier, and from that GridSearchCV constructs a new estimator that does the parameter search for you. You also specify the parameters you want to search and the cross-validation strategy. Then GridSearchCV does all the other things we talked about: it does the cross-validation and parameter selection, and retrains a model with the best parameter settings that were found. We can check out the best cross-validation score and the best parameter setting with the best_score_ and best_params_ attributes. And finally we can compute the accuracy on the test set, simply by using the score method! That will use the retrained model under the hood.
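If we put more than one parameter in the dictionary, the grid is the cross-product of all the values. As a small sketch (weights is just a second KNeighborsClassifier parameter, picked here for illustration), this grid tries 7 * 2 = 14 combinations:

```python
param_grid = {'n_neighbors': np.arange(1, 15, 2),
              'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)
```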
---
# GridSearchCV Results

.smallest[
```python
import pandas as pd
results = pd.DataFrame(grid.cv_results_)
results.columns
```

```
Index(['mean_fit_time', 'mean_score_time', 'mean_test_score',
       'mean_train_score', 'param_n_neighbors', 'params',
       'rank_test_score', 'split0_test_score', 'split0_train_score',
       'split1_test_score', 'split1_train_score', 'split2_test_score',
       'split2_train_score', 'split3_test_score', 'split3_train_score',
       'split4_test_score', 'split4_train_score', 'split5_test_score',
       'split5_train_score', 'split6_test_score', 'split6_train_score',
       'split7_test_score', 'split7_train_score', 'split8_test_score',
       'split8_train_score', 'split9_test_score', 'split9_train_score',
       'std_fit_time', 'std_score_time', 'std_test_score',
       'std_train_score'],
      dtype='object')
```

```python
results.params
```

```
0     {'n_neighbors': 1}
1     {'n_neighbors': 3}
2     {'n_neighbors': 5}
3     {'n_neighbors': 7}
4     {'n_neighbors': 9}
5    {'n_neighbors': 11}
6    {'n_neighbors': 13}
Name: params, dtype: object
```
]

???
GridSearchCV also computes a lot of interesting statistics for you, which are stored in the cv_results_ attribute. That attribute is a dictionary, but it's easiest to convert it to a pandas DataFrame to look at it. Here you can see the columns: there's mean fit time, mean score time, mean test score, mean training score, the standard deviations, and the scores for each individual split of the data. There is one row for each parameter setting we tried out.

---
class: center

# n_neighbors Search Results

???
We can use this, for example, to plot the results of cross-validation over the different parameters. Here are the mean training score and mean test score, together with one standard deviation.
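A rough sketch of how such a plot can be produced from the results DataFrame above (assuming matplotlib is available; the column names are the ones printed on the previous slide):

```python
import matplotlib.pyplot as plt

# the parameter column has object dtype in cv_results_, so convert it first
n_neighbors = results.param_n_neighbors.astype(int)

for split in ["train", "test"]:
    mean = results["mean_{}_score".format(split)]
    std = results["std_{}_score".format(split)]
    plt.plot(n_neighbors, mean, label=split)
    # shade +/- one standard deviation over the cross-validation splits
    plt.fill_between(n_neighbors, mean - std, mean + std, alpha=.2)

plt.xlabel("n_neighbors")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```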