class: center, middle

### W4995 Applied Machine Learning

# Boosting, Stacking, Calibration

02/21/18

Andreas C. Müller

???
We'll continue with tree-based models, talking about boosting. Then another ensemble technique called stacking. Finally, a technique called calibration that looks somewhat similar to ensembles but has the goal of obtaining good probability estimates from any classifier.

FIXME: XGBOOST!!
FIXME early stopping
FIXME GBRT: better explain log-loss
FIXME: add example of calibrated and inaccurate vs accurate but not calibrated!
FIXME: calibration: what are the estimators fit to, and why does fitting it to 0 and 1 create probabilities in between
Partial dependence plot: how is that different from feature importances?
calibration curve plotting: bins are different for hist and calibration curve?
slide 27: maybe put probabilities on x axis?
calibration curve: blue histogram is number of total points, y axis not labeled

---
class: center, middle

# Gradient Boosting

???
Gradient boosting is one of the most successful supervised machine learning methods in practice. It's often used on Kaggle to win competitions, it's used for credit scoring, it's one of the standard tools of the trade. It's one of the best off-the-shelf models. A standard implementation that people use is XGBoost, but there's also an implementation in scikit-learn, and we'll talk about both of them.

Last time we talked about random forests, which build many trees independently, each randomized in a different way, and then average their predictions. Gradient boosting, on the other hand, builds trees one by one in a sequential manner, with each tree requiring the results of the previous trees. Often, gradient boosting is done with very small trees, or even decision stumps, which are trees of depth one, so a single split.

---
class: center, middle

# Boosting (in General)

???
- “Meta-algorithm” to create strong learners from weak learners.
- AdaBoost, GentleBoost, …
- Trees or stumps work best
- Gradient Boosting often the best of the bunch
- Many specialized algorithms (ranking etc)

This is an instance of a more general family of models, called boosting models, which all iteratively try to improve a model built up from weak learners. Gradient boosting is the particular technique where we are trying to fit the residuals, and it's been found to work very well in practice, in particular if you're using shallow trees as the weak learners. In principle, you could use any model as a weak learner, but trees just work really well.

---
# Gradient Boosting Algorithm

`$$ f_{1}(x) \approx y $$`

`$$ f_{2}(x) \approx y - f_{1}(x) $$`

`$$ f_{3}(x) \approx y - f_{1}(x) - f_{2}(x)$$`

--

$y \approx$ ![:scale 22.5%](images/grad_boost_term_1.png) + ![:scale 22.5%](images/grad_boost_term_2.png) + ![:scale 20%](images/grad_boost_term_3.png) + ...

???
Let's look at the regression case first. We start by building a single tree f1 to try to predict the output y. But we strongly restrict f1, so it will be rather bad at predicting y. Next, we look at the residual of this first model, so y - f1(x). We now train a new model f2 to try and predict this residual, in other words to correct the mistakes made by f1. Again, this will be a very simple model, so it will still not be able to fix all errors. Then we look at the residual of both models together, so y - f1(x) - f2(x), the mistakes that could not be fixed by f1 and f2, and we build f3 to fix that, and so on. This is natural for regression; a minimal code sketch of the procedure is given below. For classification this is not as clear.
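The following is only a sketch of the residual-fitting loop just described (without a learning rate yet), using depth-one stumps as the weak learners; the toy data and variable names are illustrative and not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=.3, size=200)

trees, residual = [], y.copy()
for _ in range(100):
    # fit a weak learner (a stump) to what is still unexplained
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    trees.append(tree)
    residual -= tree.predict(X)

# the ensemble prediction is the sum of all the weak learners
y_pred = np.sum([tree.predict(X) for tree in trees], axis=0)
```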
For binary classification you use the log-loss, applying the logistic function to turn the prediction into a probability; for multi-class you can use one-vs-rest. So we're sequentially building up a model using what's called "weak learners", small trees, creating a more powerful composite model.

---
# Gradient Boosting Algorithm

`$$ f_{1}(x) \approx y $$`

`$$ f_{2}(x) \approx y - \alpha f_{1}(x) $$`

`$$ f_{3}(x) \approx y - \alpha f_{1}(x) - \alpha f_{2}(x)$$`

$y \approx \alpha$ ![:scale 22.5%](images/grad_boost_term_1.png) + $\alpha$ ![:scale 22.5%](images/grad_boost_term_2.png) + $\alpha$ ![:scale 20%](images/grad_boost_term_3.png) + ...
Learning rate $\alpha$, e.g. 0.1

???
- Iteratively add regression trees to the model
- Use log loss for classification
- Discount each update by the learning rate

FIXME plot for regression models
Come back to this

---
#GradientBoostingRegressor

.center[
![:scale 50%](images/grad_boost_regression_steps.png)
]

???
Here's an illustration of doing this for regression. This is a 1D regression dataset for illustration purposes: the feature is on the x-axis, the prediction is on the y-axis. In the first step, I'm just fitting my tree to the data. I use a simple tree of depth 3. The depth-3 tree is not able to completely model the data, and the orange line is the tree that was fit. After this first step, I look at the total prediction. This is just alpha times the predictions made by the first tree, so you see everything is sort of squashed together; that's the effect of alpha. The blue points in the next panel are the data minus the predictions from step one, so this is the residual. It still looks pretty much the same, because we took only a small step. Then I fit another tree of depth three to these residuals. Then here's the total prediction, which is alpha times the first tree plus alpha times the second tree, plotted against the original data. The orange curve has more steps now because it's a combination of two trees. As I continue the same procedure until step five, you can see that the residual gets much smaller and the model has learned most of the variation in the data. The total prediction, a linear combination of all the trees, looks like this. And if I keep doing this, then at some point the residuals will become very small and the total prediction will fit the data better and better.

The question is can I extrapolate? The answer is no.

The question was how this is different from just building a deeper tree, and it's quite different. A) The combination of stumps is different from building a deeper tree because you always apply all the splits to the whole dataset. Whereas if you make a tree deeper, you will have different splits in the deeper nodes. If you already have, say, 10 nodes, and you want to grow one more level, each of these parts of the data will get a different split, whereas if I add another stump it will apply at a global level. In decision trees, decisions are very hard. Here you don't trust any hard decisions, so you only go a little bit in each direction. B) This also just works way better.

So this was for regression, because it's easier to visualize; you can do the same thing for classification. Basically, what gradient boosting for classification does is use regression trees to learn the decision function. So you're doing regression again, but in the sense of logistic regression: instead of using a linear model for logistic regression, we're now using this linear combination of trees. In other words, we're applying a log loss and we're trying to find the regression function that has a small log loss. This f is the decision function that we're learning. So inside a gradient boosting classifier, you're not actually learning classification trees, you're learning regression trees, which are trying to predict the probability, which is quite different.

The question was what happens if f1 fits y exactly, so that the residual becomes 0. One reason that doesn't happen is the alpha, which means I only take a small step. But even if I set alpha equal to one, the point is that I restricted my f1 to not completely fit the data.
I'm not trying to completely overfit the data, I'm trying to use a simple model. So if I use a tree of depth one, there will be a very large residual.

---
#GradientBoostingClassifier

.center[
![:scale 70%](images/grad_boost_depth2.png)
]

???
Here's an illustration of what this might look like for classification. This is basically the same thing going on. I'm plotting the probabilities assigned by the model. Here's a first estimator, which makes some mistakes here and makes a bunch of mistakes in the middle. This is a decision tree of depth two. Then I fit the next model to the residuals, to minimize the log loss on this dataset; that changes something in the middle, and so on. You can see that as I add more and more estimators, it fits the data better and better and gets more and more complex decision boundaries. Since it works like gradient descent on the log loss, it's a little bit harder to visualize this for classification. White means a probability of 0.5 (a tie). Red means a high probability for the red class. Blue means a high probability for the blue class. Here, clearly, it hasn't fit the data perfectly, so I could add more and more models and then, in the end, it would fit the data perfectly. For multi-class, the default thing to do is one-vs-rest.

---
class:spacious
# Gradient Boosting Advantages

- Slower to train than RF (serial), but much faster to predict
- Small model size
- Typically more accurate than Random Forests

???
It's slower to train than a random forest because the trees are built serially, but it's much faster to predict and the model size is much smaller. The trees are usually not as deep, and you don't need as many, because you're doing much more focused learning where you try to correct the mistakes of the earlier models. Prediction is usually very fast because it can happen in parallel over all the trees, while learning cannot necessarily happen in parallel over all the trees. Usually, this is more accurate than random forests.

The question was, for the same number of estimators, how does a low versus a high learning rate change things? A high learning rate allows you to fit the data more strongly, and also to overfit it more strongly.

---
class:spacious
# Tuning Gradient Boosting

- Pick n_estimators, tune learning rate
- Can also tune max_features
- Typically strong pruning via max_depth

???
There are several things you can tune about gradient boosting. A common approach is to pick the number of estimators that you have time for. Runtime is obviously linear in the number of estimators, because you need to build each tree one by one, and each tree is built with the same parameters. Then you can tune the learning rate to see how strongly you want to fit the data. You can also tune something like max_features if you want to add more randomness, but that's actually not very commonly used. You can also subsample the data if you want faster training. And typically there's a maximum depth. Traditionally, it was a maximum depth of one, two or three, though on Kaggle people are using something like 8 to 10. But typically the depth is much, much smaller than for random forests. This means the model will be small in memory, and also faster to predict, since you have fewer, shallower trees to traverse. (A minimal code sketch of this tuning strategy follows at the end of these notes.) Before we go into implementations, I want to talk a little bit about analyzing the models.
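Here is that sketch of the tuning strategy: fix n_estimators, then grid-search the learning rate and depth. The dataset and grid values are illustrative choices, not recommendations from the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# pick an n_estimators you have time for, then tune learning_rate (and max_depth)
param_grid = {'learning_rate': [0.01, 0.1, 1.0],
              'max_depth': [1, 3, 5]}
grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=100, random_state=0),
    param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```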
So again, because it's a tree-based model and the prediction is a linear combination of trees, you can look at feature importances the same way we did for random forests and single trees.

---
# Partial Dependence Plots

- Marginal dependence of prediction on one or two features

.smaller[
```python
from sklearn.datasets import load_boston
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import plot_partial_dependence
from sklearn.model_selection import train_test_split
import numpy as np

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, random_state=0)

gbrt = GradientBoostingRegressor().fit(X_train, y_train)

fig, axs = plot_partial_dependence(
    gbrt, X_train, np.argsort(gbrt.feature_importances_)[-6:],
    feature_names=boston.feature_names, n_jobs=3, grid_resolution=50)
```
]

???
But there's also something else that's quite interesting, called partial dependence plots, which is actually possible for all tree-based models. Unfortunately, in scikit-learn it's available only for gradient boosting. The idea here is to not only see which features are important, but how they influence the target. After you fit your model (here I'm fitting a gradient boosting regressor on the Boston dataset), there's a function called plot_partial_dependence, which gets the model, the training dataset, and the features for which I want to make the partial dependence plots. Here, I'm looking at the six most important features, where importance is the tree-based feature importance: I sort by it and take the six most important ones.

---
# Partial Dependence Plots

.center[
![:scale 70%](images/feat_impo_part_dep.png)
]

???
This is what the partial dependence plots look like: the most important feature, the second most important feature, and so on. This is the marginal contribution of each feature. Basically, in each tree, you're summing out the contribution of all the other features, and you look only at the contribution of this particular feature. In general, this would be hard to compute; for tree-based models you can compute it in a very efficient way, because you can basically sum up over the whole space by traversing the tree. Here, with increased room size, the price increases, and you can see how it increases; there's a big step here. You can see that for increased LSTAT, the price decreases. For the others, there aren't any drastic effects. There seems to be some threshold for NOX, and something funky going on with DIS.

The question was: I said for random forests, you can't find out in which direction a feature influences the prediction. But that was about the feature importances. The feature importance measures the impurity decrease; it's just a number, so it doesn't give you a direction. For both gradient boosting and random forests, you can look at partial dependence plots, and they give you a non-parametric picture of how a single feature influences the prediction. You could do the same thing for random forests, because in the end the way they predict is quite similar; unfortunately, that's not available in scikit-learn.

---
# Bivariate Partial Dependence Plots

.smaller[
```python
plot_partial_dependence(
    gbrt, X_train, [np.argsort(gbrt.feature_importances_)[-2:]],
    feature_names=boston.feature_names, n_jobs=3, grid_resolution=50)
```
]

.center[
![:scale 80%](images/feature_importance.png)
]

???
You can also do this bivariately.
So here, looking at the two most important features in a bivariate plot, I'm not giving it a list of features, but a list of lists of features. The two most important features are LSTAT and RM (the number of rooms). You can see the effect of these two features, taking into account all the other features; there's no very complex interaction here. This is something you can do easily for a linear model, where you would just have lines everywhere, but for more complicated models it is very hard to do, and way more tricky for neural networks.

---
# Partial Dependence for Classification

.tiny-code[
```python
from sklearn.ensemble.partial_dependence import plot_partial_dependence

for i in range(3):
    fig, axs = plot_partial_dependence(gbrt, X_train, range(4), n_cols=4,
                                       feature_names=iris.feature_names,
                                       grid_resolution=50, label=i)
```
]

.center[
![:scale 50%](images/feat_impo_part_dep_class.png)
]

???
We can do the same for classification. Here is a partial dependence plot for the iris dataset. As I said, for classification we're usually using one-vs-rest, so here we have three classifiers: one for setosa, one for versicolor, and one for virginica. You can see that for the first two features the curves are basically flat. For petal length, the decision function is high for small values for setosa, medium values for versicolor, and large values for virginica. These are all put through the logistic function and then normalized.

---
# XGBoost

`conda install -c conda-forge xgboost`

```python
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_train, y_train)
xgb.score(X_test, y_test)
```

- supports missing values
- supports multi-core

???
In terms of implementations, there are a couple of options. Scikit-learn has GradientBoostingClassifier and GradientBoostingRegressor; unfortunately, they are rather slow. One of the most commonly used packages is XGBoost, which you can install from conda-forge, for example. It's completely scikit-learn compatible: you can import XGBClassifier and, if you want, use it with grid search or with pipelines. The cool thing is it's fast. It supports missing values, so you don't need an imputation strategy, you can put them directly into XGBoost. And it also supports multi-core. The gradient boosting in scikit-learn is just sequential on a single core, which is not great on most machines; if you have a big machine, you want to use all the cores in parallel. You can't do that with scikit-learn, but you can do it with XGBoost. For random forests, on the other hand, scikit-learn can run the training in parallel, because that's much easier. XGBoost also has a random forest implementation.

XGBoost also has some enhancements to the algorithm. It has L1 and L2 penalties for the leaves, so you can apply something like an elastic net or lasso penalty to the leaf weights. I think by default they're disabled. So it's not training exactly the vanilla decision trees. One of the reasons it's faster is simply a faster implementation, but you can make it even faster if you use approximate splits in the trees. If you want to find a split, you need to search over all features and over all possible thresholds on each feature. Searching over all possible thresholds on a feature means sorting the feature, which is an n log n operation. Instead, you can bin the features; then it's a linear-time operation, but the thresholds will be approximate. (A short sketch of these options in code follows below.)
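Here is that sketch of where the options just mentioned live in XGBoost's scikit-learn wrapper. It reuses the X_train/X_test from the slide above; the parameter values are arbitrary, and `tree_method='hist'` assumes a reasonably recent XGBoost version.

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    reg_alpha=0.0,        # L1 penalty on leaf weights (off by default)
    reg_lambda=1.0,       # L2 penalty on leaf weights
    tree_method='hist',   # binned, approximate split finding
    n_jobs=-1)            # use all cores
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))
```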
Using binned, approximate splits makes the computation much faster.

- Efficient implementation of gradient boosting (5x sklearn)
- Improvements on original algorithm
- https://arxiv.org/abs/1603.02754
- Adds l1 and l2 penalty on leaf-weights
- Fast approximate split finding
- Scikit-learn compatible interface

---
class: spacious

.center[
![:scale 50%](images/xgboost_sklearn_bench.png)
]

- Exact splits ~5x faster on single core
- Approximate splits, multi-core -> more speed!
- Also check lightGBM (even faster?)
- Both have GPU support
- https://github.com/szilard/GBM-perf

???
Here is a comparison in terms of runtime for the exact version of XGBoost, for illustrative purposes. I ran this on MNIST, or subsets of MNIST. The x-axis is the number of samples, so you can see how the runtime increases as the number of samples increases. For a significant number of samples, you can see that XGBoost is much faster. If you run on a single core, XGBoost tends to be about five times faster; if you run it on a 10-core machine, it's going to be much, much faster. There's also an implementation out of Microsoft called LightGBM, which might be even faster. According to the latest benchmarks that I saw, LightGBM is actually the fastest. LightGBM and XGBoost both have GPU support, which makes them even faster. On GPU, the benchmark I saw stated XGBoost was a bit faster, while on CPU LightGBM was a bit faster. But both are pretty fast, highly parallel, optimized implementations. So if you care about speed, you want to use one of these two.

---
# Early stopping

- Adding trees can lead to overfitting
- Stop adding trees when validation accuracy stops increasing
- Optional in XGBoost and sklearn >= 0.20 (development)

???
As I said earlier, if you add more and more trees, the algorithm can overfit. So instead of searching for the learning rate, or searching for the number of trees, you can also do early stopping. Because this is a sequential algorithm that gets better and better the more trees you add, you can just use a validation set and stop adding trees once you start to overfit. You can do that both with scikit-learn and XGBoost. Basically, the idea is that you pick a large number of estimators and a separate validation set, and if the validation accuracy doesn't improve, or decreases, for a number of iterations, say five, then you stop the learning. This way, you get results faster, because you don't keep training, but you also possibly get a better model because you avoid overfitting. The only downside of early stopping is that you have less data to train your model, because you need to set aside a validation set for the early stopping. You'll still need a separate test set to see how well the model actually does; you can't use the same set for early stopping and for evaluation.

---
class:spacious
# When to use tree-based models

- Model non-linear relationships
- Doesn’t care about scaling, no need for feature engineering
- Single tree: very interpretable (if small)
- Random forests very robust, good benchmark
- Gradient boosting often best performance with careful tuning

???
To summarize, tree-based models are a very popular family of models and very commonly used. You probably want to use them if you want to model nonlinear relationships, or if you have a lot of different kinds of weird features, because they really don't care about the scaling of the features, which allows you to get rid of most preprocessing.
If you want a very interpretable model, a single small tree is a good idea, because it's one of the few models that you can write down on a blackboard and show to someone, and they'll have some idea of what's going on. That's very hard for most other models. Random forests are great because they're very robust, and you don't have to tune anything: you just run a random forest with 100 trees and the default settings and it will work well. This is one of my first benchmarks: usually I first run a logistic regression, then a random forest. Gradient boosting is often the best-performing model, a standard part of the toolbox. Sometimes you need to tune it a bit more, so you need to tune the depth of the trees or the learning rate and so on, but if you tune it, it will usually beat random forests. One case where you might not want to use trees is very high-dimensional sparse data; there, linear models might work better, but your mileage may vary. In contrast, if you have a low-dimensional space, tree-based models are probably a good bet.

The next thing we'll talk about is a different way to build ensembles, a different way to put together multiple models.

---
class: center, middle

# More ensembles: Stacking

???
That's called stacking. Last time, we talked about bagging and model averaging. Stacking goes a little beyond that, in that it tries to stack models on top of each other.

---
# Simplified Stacking

.tiny-code[
```python
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(noise=.4, random_state=16, n_samples=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

voting = VotingClassifier(
    [('logreg', LogisticRegression(C=100)),
     ('tree', DecisionTreeClassifier(max_depth=3)),
     ('knn', KNeighborsClassifier(n_neighbors=3))],
    voting='soft')
voting.fit(X_train, y_train)
lr, tree, knn = voting.estimators_
```
]

.center[
![:scale 90%](images/average_voting.png)
]

???
This is not yet simplified stacking; this is basically just averaging models, one of the simplest ensembles we saw last time. Here, I have a two-class dataset and I trained a logistic regression, a decision tree, and KNN. In this simple ensemble, I just average all the probabilities and then take the argmax as the prediction. That's implemented in the voting classifier that we saw last time. But instead of just taking an average, what we could do is take a weighted average, for example. And the way to do that would be…

---
# Simplified Stacking

.tiny-code[
```python
stacking = make_pipeline(voting, LogisticRegression(C=100))
stacking.fit(X_train, y_train)
print(stacking.score(X_train, y_train))
print(stacking.score(X_test, y_test))
```
```
0.90
0.77
```
```python
stacking.named_steps.logisticregression.coef_
```
```
array([[ 0.487,  1.519,  3.686]])
```
]

.left-column[
![:scale 90%](images/average_voting.png)
]

.right-column[
![:scale 90%](images/simple_stacking_result.png)
]

???

---
class: center, middle

## Problem: Overfitting!

???
This is simplified stacking: you take the probability estimates of all of the models as input, and you train a logistic regression on these probability estimates to predict the class. This basically gives a weight for each of the probability estimates, and then you take the weighted average. So here, I used a pipeline of the voting classifier and a logistic regression. If I use the voting classifier as a transformer, it just outputs the probabilities. Here are the coefficients, three of them, one for each of the models. You can see that it trusts the logistic regression the least, the decision tree a little bit more, and the KNN the most. So instead of just averaging, you get something like this. Instead of a logistic regression, we could use any model we want as the second stage: a decision tree, a random forest, gradient boosting, a neural network; I can train whatever model I want on the probabilities created by these 3 models. The question was why we don't just simply take the most accurate model: because averaging them is often better.

There's a problem with doing this, though, which is that you basically overfit the training data a lot. The thing is that we train the logistic regression on the probability estimates that the other models produce on their own training data. The more a model overfits, the better it will look on the training data; the KNN, for example, overfits quite a bit. If you do this on the training set, the model that overfits most looks the most informative, because it predicts the training set perfectly. So the second-stage model will learn to trust the model that overfits the most, and clearly you don't want that. There are two ways to overcome this.

---
class:spacious
# Stacking

- Use cross-validation (even LOO!) to produce probability estimates on training set.
- Train second step estimator on held-out estimates
- No overfitting of second step!
- For testing: as usual

???
You can either use a single hold-out set to train the second-stage model. On the hold-out set, the model that overfits obviously doesn't do as well, and so the second stage will not trust the overfitting model. (A small code sketch of this single hold-out-set variant is given after these notes.)

---
# Hold-out estimates of Probabilities

![:scale 100%](images/holdout_estimate.png)

- Split 1 produces probabilities for Fold 1, split2 for Fold 2 etc.
- Get a probability estimate for each data point!
- Unbiased estimates (like on the test set) for the whole training set!
- Without it: The best estimator is the one that memorized the training set.

???
The other way is to use cross-validation. Basically, now, instead of having a single hold-out set, you have multiple hold-out sets. For each data point you need to compute the features, and the way you compute the first-stage features on the training set is: you train the models on the remaining folds and then produce probability estimates for this data point. The models you're using for these probability estimates were not trained on this data point. You can do leave-one-out, for example: train a model on all of the other data and then predict probabilities for the one remaining point. Obviously, that's going to be very slow.
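Here is the promised sketch of the single hold-out-set variant. It is not code from the slides; it just reuses the make_moons setup from above, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(noise=.4, random_state=16, n_samples=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# split the training data again: one part for the first stage, one for the second
X_first, X_second, y_first, y_second = train_test_split(
    X_train, y_train, stratify=y_train, random_state=0)

estimators = [LogisticRegression(C=100),
              DecisionTreeClassifier(max_depth=3),
              KNeighborsClassifier(n_neighbors=3)]
for est in estimators:
    est.fit(X_first, y_first)

def first_stage_probs(X):
    # probability of the positive class from each first-stage model
    return np.column_stack([est.predict_proba(X)[:, 1] for est in estimators])

# the second stage only ever sees held-out probabilities,
# so it cannot reward a first-stage model for memorizing its training data
second_stage = LogisticRegression(C=100).fit(first_stage_probs(X_second), y_second)
print(second_stage.coef_)
print(second_stage.score(first_stage_probs(X_test), y_test))
```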
You can also do something like 5-fold cross-validation, where in the first stage you have to train the models five times and then get hold-out probability estimates from each split. This way, you have unbiased probability estimates for all training data points. The output of the first stage with cross-validation is just the features, which are the probability estimates of the underlying models; only now, instead of being biased, they're unbiased estimates, because they come from hold-out data. The benefit of doing this compared to a single hold-out set is that the second stage now has much more training data: its training set is the same size as the original training set. I have unbiased probability estimates from my models for all data points, so I have the same-size training set, only without overfitting now.

---
.tiny-code[
```python
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# take only probabilities of positive classes for more interpretable coefficients
first_stage = make_pipeline(voting, FunctionTransformer(lambda X: X[:, 1::2]))
transform_cv = cross_val_predict(first_stage, X_train, y_train, cv=10, method="transform")

second_stage = LogisticRegression(C=100).fit(transform_cv, y_train)
print(second_stage.coef_)
```
```
[[ 2.192  1.402  1.012]]
```
```python
print(second_stage.score(transform_cv, y_train))
print(second_stage.score(first_stage.transform(X_test), y_test))
```
```
0.813
0.800
```
]

.center[
![:scale 40%](images/simple_stacking_result.png)
![:scale 40%](images/stacking_result.png)
]

???
And so here is the result of this. There is a pull request in scikit-learn to do this directly; I do it slightly differently using cross_val_predict, which allows me to apply any method inside cross-validation. Here, I'm using the transform method of the voting classifier, and I'm using 10-fold cross-validation to always apply transform on the hold-out fold, so I'm computing the probabilities of the voting classifier on held-out data. transform_cv has the same number of samples as the training dataset, and it has 3 columns, which are the 3 probability estimates. Then again I train a logistic regression on it and get coefficients. Now you can see that in the coefficients, the first model, the logistic regression, is trusted more; the coefficient for the tree is similar; and the last one, the nearest neighbors, now has the lowest weight. So here's the model that we had from just training a logistic regression on the training dataset, and here's proper stacking where you use hold-out probability estimates. You can see there's much less overfitting going on and there's much more uncertainty here.

For prediction, this cross-validation doesn't really change anything if you apply the model to the test set: the standard way is to train the first stage on the whole training set, and then use that to estimate the probabilities on the test set. The cross-validation is only there to get probability estimates on the training set to train the second stage.

The next thing I want to talk about is similar to stacking but actually has a different goal. The goal in stacking is really to blend different models in a smart way: you take multiple models and try to build a stronger, more expressive model by blending them using some weighting.

---
class:spacious
#Calibration
- Probabilities can be much more informative than labels:
  - “The model predicted you don’t have cancer” vs “The model predicted you’re 40% likely to have cancer”

???
So the next thing I want to talk about is calibration. Calibration also builds a model on top of a model, but the goal of calibration is to get accurate probability estimates. Oftentimes, we're interested not only in the output of the classifier, but also in the uncertainty estimate of the classifier. For example, think about a model that does cancer prediction. If the model predicts you don't have cancer, you're going to be pretty happy. If it says there's a 40% chance that you have cancer, which in binary classification would give the same "no cancer" label, you might be less happy, since 40% seems pretty high for having cancer. So often we want actual probability estimates that allow us to make decisions that are finer grained than just a yes or no.

Calibration is a way to get probability estimates out of any model. For example, SVMs are not good at producing probabilities, so you can use calibration if you really want to use an SVM and still get probabilities out. Or, if you have a model that can already predict probabilities, like trees, random forests, or nearest neighbors, you can use calibration to make sure that the probabilities you get are actually good probabilities. If you use a random forest to estimate probabilities and it says "this is 70% class one", it's not really clear what that means, because the model is not trained to optimize probability estimates, so the probability estimates could be off quite a bit.

---
class: spacious
#Calibration curve (Reliability diagram)

.left-column[
![:scale 80%](images/prob_table.png)
]
.right-column[
![:scale 80%](images/calib_curve.png)
]

???
Before we talk about how to do calibration, I want to talk about how to measure calibration in binary classification. You can also do it for multi-class classification, but binary is much simpler. The cool thing is that you can measure calibration, and calibrate the classifier, without having ground-truth probabilities. So even if you only have 0/1 labels, you can still make sure that you have a model that provides reasonable probabilities.

The way you do this is: you take the probability estimates of the model and bin them. Here, for example, these are the probabilities estimated by a model and these are my true targets. I sort and bin them by the probability estimates; here I create three bins. Then I look at the actual prevalence of the positive class in each of these bins. If these probability estimates were actually accurate, then, say, in the bin for all the 90% predictions, the prevalence should actually be 90%; that's what the estimate claims: 90% of the points that are given a score of 90% should be the true class. You can plot this in the calibration curve or reliability diagram. Here I have three bins. What I want is the diagonal line, where the bin that contains the very small probabilities also has a correspondingly small prevalence. Here, for example, the predictions around 0.5 actually have zero prevalence, so there's no true class in there, which is really weird. This is not a diagonal line; this would be a very badly calibrated classifier. One thing that I want to emphasize here is that calibration doesn't imply that the model is accurate.
These are actually two different things. If I have a binary model on a balanced dataset and it always says 50% probability for all data points, it's perfectly calibrated, because it says 50% for all data points and 50% of those data points actually are the true class if the dataset is balanced. So it's a completely trivial classifier that doesn't tell you anything, but it's perfectly calibrated: you know you can trust what it says, it just doesn't say much. Calibration basically tells you how much you can trust the model.

- For binary classification only
- You can be calibrated and inaccurate!
- Given a predicted ranking or probability from a supervised classifier, bin predictions.
- Plot fraction of data that’s positive in each bin.
- Doesn’t require ground truth probabilities!

---
class: split-40
# calibration_curve with sklearn

Using a subsample of the covertype dataset

.left-column[
.tiny-code[
```python
from sklearn.linear_model import LogisticRegressionCV

print(X_train.shape)
print(np.bincount(y_train))
lr = LogisticRegressionCV().fit(X_train, y_train)
```
```
(52292, 54)
[19036 33256]
```
```python
lr.C_
```
```
array([ 2.783])
```
```python
print(lr.predict_proba(X_test)[:10])
print(y_test[:10])
```
```
[[ 0.681  0.319]
 [ 0.049  0.951]
 [ 0.706  0.294]
 [ 0.537  0.463]
 [ 0.819  0.181]
 [ 0.     1.   ]
 [ 0.794  0.206]
 [ 0.676  0.324]
 [ 0.727  0.273]
 [ 0.597  0.403]]
[0 1 0 1 1 1 0 0 0 1]
```
]
]
.right-column[
.tiny-code[
```python
from sklearn.calibration import calibration_curve

probs = lr.predict_proba(X_test)[:, 1]
prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=5)
print(prob_true)
print(prob_pred)
```
```
[ 0.2    0.303  0.458  0.709  0.934]
[ 0.138  0.306  0.498  0.701  0.926]
```
]
.center[![:scale 70%](images/predprob_positive.png) ]
]

???
I'm using a subsample of the covertype dataset because it's reasonably big and we get some nice histograms here. I train a logistic regression model. Logistic regression typically gives pretty good probability estimates, so I expect this model to be relatively well calibrated. Here are the predicted probabilities and the actual labels for the first 10 data points. What I'm going to use is the predicted probabilities of the positive class for all data points, together with the actual labels. Obviously, I need to do this on the test set: if I do it on the training set, it tells me nothing, because on the training set the model looks good. If I perfectly overfit, I would look perfect on the training set, but that's not the point. So I need to use a separate set, either the test set or some hold-out set, and then I can compute the calibration curve. The histogram shows how many points fall in each region. The orange curve is the actual calibration curve, the dotted line is perfect calibration, and I'm using five bins. The first bin goes from zero to 0.2, so I would expect that about 10% of the data in this bin is the true class, but actually it's more like 20%; so it's not perfectly calibrated there. In the second bin, from 0.2 to 0.4, I would expect that about 30% of these data points are the true class, and it actually is about 30%.

---
class:spacious
#Influence of number of bins

.center[![:scale 100%](images/influence_bins.png)]

???
How do I pick the number of bins? More bins give me more resolution, but at some point you get noise. It's the same trade-off as for a histogram; a small by-hand sketch of the binning is given below.
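As a by-hand illustration (not from the slides), this is roughly what `calibration_curve` computes with equal-width bins; the function name and bin strategy are my own simplification.

```python
import numpy as np

def calibration_curve_by_hand(y_true, probs, n_bins=5):
    """Roughly mirror sklearn.calibration.calibration_curve with uniform bins."""
    y_true, probs = np.asarray(y_true), np.asarray(probs)
    edges = np.linspace(0, 1, n_bins + 1)
    bin_ids = np.digitize(probs, edges[1:-1])  # bin index 0 .. n_bins-1 per prediction
    prob_true, prob_pred = [], []
    for i in range(n_bins):
        mask = bin_ids == i
        if mask.any():
            prob_true.append(y_true[mask].mean())  # fraction of positives in the bin
            prob_pred.append(probs[mask].mean())   # mean predicted probability in the bin
    return np.array(prob_true), np.array(prob_pred)
```

Plotting prob_pred against prob_true and comparing to the diagonal gives the reliability diagram above.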
You can use the standard heuristics for histograms, but usually something like 10 or 20 bins is enough.

- Works here because dataset is big
- Might become very noisy for smaller datasets

---
class:spacious
# Comparing Models

.center[![:scale 90%](images/calib_curve_models.png)]

???
Logistic regression is pretty well calibrated. The decision tree, since I didn't use any pruning, is way too certain: all probability estimates, even on the test set, are either 0 or 1, which is not great. The random forest classifier used here is not certain enough. Now we want to fix that, but before we talk about how to fix it, let's measure it first.

---
# Brier Score
(for binary classification)

- “mean squared error of probability estimate”

`$$ BS = \frac{\sum_{i=1}^{n} (\widehat{p} (y_i)-y_i)^{2}}{n}$$`

.center[![:scale 70%](images/models_bscore.png)]

???
The calibration curve is a nice way to look at how calibrated something is, but it's very hard to compare. A standard way to compare calibration is the Brier score; you could also use something like the log loss, which also tells you how good a probability estimate is. The Brier score is the mean squared error of the probability estimates: $y_i$ is either one or zero, and $\widehat{p}$ is the probability estimate. So if you predict 0.5, it always gives you a loss, but only a loss of 0.25. If you predict 0 when you should have predicted 1, it gives you a very large loss. Here are the Brier losses for the different models; smaller is better. But this conflates accuracy and calibration a little bit, because you can see that the random forest actually does best. It does better because it's a more accurate model: it's not as well calibrated, but it's a much better predictor, so it still has a smaller score. So we defined calibration and we can measure it; now we actually want to calibrate models.

---
# Fixing it: Calibrating a classifier

- Build another model, mapping classifier probabilities to better probabilities!
- 1d model! (or more for multi-class)

`$$ f_{callib}(s(x)) \approx p(y)$$`

- s(x) is score given by model, usually
- Can also work with models that don’t even provide probabilities!

Need model for $f_{callib}$, need to decide what data to train it on.
- Can train on training set → Overfit
- Can train using cross-validation → use data, slower

???
The way to fix this is similar to stacking, in that we build another model on top of the probability estimates. Usually that's just a 1D model, because we only have one underlying model; for binary classification it gives us a single probability output, and so we need to learn a 1D function that maps this probability to something more accurate. It also works if the model gives us a score rather than a probability: an SVM only gives us a score, and we can still learn a 1D function that tries to produce probabilities. Similar to what we did with stacking, you don't want to learn the calibration model on the training dataset, because on the training dataset the model is doing very well. So you need to use either a hold-out dataset or cross-validation again.

---
# Platt Scaling

- Use a logistic sigmoid for $f_{callib}$
- Basically learning a 1d logistic regression
- (+ some tricks)
- Works well for SVMs

`$$f_{platt} = \frac{1}{1 + \exp(-w s(x))}$$`

???
There are two main methods that people use, and both are in scikit-learn.
One is Platt scaling. Platt scaling is basically the same as a 1D logistic regression: you learn a 1D sigmoid, plus there are some tricks. This allows you to fit the particular sigmoid shape that you usually get from SVM scores, and it works well for SVMs. But you only have one parameter here, so there's not a lot that you can tune.

---
# Isotonic Regression

- Very flexible way to specify $f_{callib}$
- Learns arbitrary monotonically increasing step-functions in 1d.
- Groups data into constant parts, steps in between.
- Optimum monotone function on training data (wrt mse).

.center[![:scale 40%](images/isotonic_regression.png)]

???
The other one is isotonic regression, which is basically a non-parametric mapping. What it does is fit the monotone function that minimizes the squared error: the solution is a piecewise constant, monotone step function with minimum squared error. This is what it looks like: for the data in red, isotonic regression finds the green step function.

In Platt scaling, s(x) is the score the model outputs, the score that comes from the SVM or random forest, and w is the one parameter you learn from the data. Platt scaling is a very inflexible model that only lets you say how steep the sigmoid correction should be, while isotonic regression is a very flexible model that lets you apply any correction, as long as it's monotone.

---
class:spacious
# Building the model

- Using the training set is bad
- Either use hold-out set or cross-validation
- Cross-validation can be used as in stacking to make unbiased probability predictions; use those as the training set.

???

---
# CalibratedClassifierCV

.tiny-code[
```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

X_train_sub, X_val, y_train_sub, y_val = train_test_split(X_train, y_train,
                                                           stratify=y_train, random_state=0)
rf = RandomForestClassifier(n_estimators=100).fit(X_train_sub, y_train_sub)
scores = rf.predict_proba(X_test)[:, 1]
# plot_calibration_curve is a plotting helper from the lecture materials, not part of scikit-learn
plot_calibration_curve(y_test, scores, n_bins=20)
```
.center[![:scale 30%](images/random_forest.png)]
]

???

---
# Calibration on Random Forest

.smaller[
```python
cal_rf = CalibratedClassifierCV(rf, cv="prefit", method='sigmoid')
cal_rf.fit(X_val, y_val)
scores_sigm = cal_rf.predict_proba(X_test)[:, 1]

cal_rf_iso = CalibratedClassifierCV(rf, cv="prefit", method='isotonic')
cal_rf_iso.fit(X_val, y_val)
scores_iso = cal_rf_iso.predict_proba(X_test)[:, 1]
```
]

.center[![:scale 90%](images/types_callib.png)]

???
This is for a random forest again. You can use CalibratedClassifierCV, which does the calibration; you can either do it with a single hold-out dataset or with cross-validation. Here, I'm using a single hold-out set. I split my training set into X_train_sub and X_val: X_train_sub is what I use to train the random forest, and X_val is what I'm going to use to calibrate it. I create a CalibratedClassifierCV with the random forest that I've already fit on the training data, set cv to "prefit" and the method to sigmoid, and then fit this calibration model on the validation set. Here is what this looks like: the random forest model with no calibration, with sigmoid calibration, and with isotonic calibration. In this case, sigmoid and isotonic don't look very different; they both look reasonably well calibrated. That said, if you're using a hold-out set like this, you're throwing away a bunch of data that could have been used for training your first model.
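As an aside, here is a rough by-hand version of what the prefit sigmoid calibration above is doing, minus the tricks the real Platt-scaling implementation adds. It is not code from the slides and reuses `rf`, `X_val`, `y_val`, and `X_test` from the block above.

```python
from sklearn.linear_model import LogisticRegression

# scores of the already-fit random forest on the hold-out (validation) set
scores_val = rf.predict_proba(X_val)[:, 1].reshape(-1, 1)

# Platt scaling by hand: a 1D logistic regression mapping score -> probability
platt = LogisticRegression().fit(scores_val, y_val)

# calibrated probabilities on the test set
scores_test = rf.predict_proba(X_test)[:, 1].reshape(-1, 1)
probs_platt = platt.predict_proba(scores_test)[:, 1]
```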
---
# Cross-validated Calibration

```python
cal_rf_iso_cv = CalibratedClassifierCV(rf, method='isotonic')
cal_rf_iso_cv.fit(X_train, y_train)
scores_iso_cv = cal_rf_iso_cv.predict_proba(X_test)[:, 1]
```

.center[![:scale 90%](images/types_callib_cv.png)]

???
We can also not specify cv; then it uses cross-validation internally, 3-fold by default. It does the whole cross-validation thing internally, and I get an even better calibration. What it does is train a separate model for each cross-validation fold, and then, if I want to predict on the test set, it uses all of these models and averages them. In a sense, it's not very surprising that I get a better result here, because I built 3 random forest models on subsets of the data, calibrated them, and then averaged the calibrated models. So I basically just built a bigger random forest, and it's not very surprising that it does better. Also, we were able to use more data. If you do this, you keep as many models as there are cross-validation folds, and you use all of them in prediction.

- kinda cheating, we have more trees now lol
- we use all the data, get good probabilities. just time-consuming

---
# Multi-Class Calibration

.center[![:scale 100%](images/multi_class_calibration.png)]

???
You can also do multi-class calibration, which is basically the same thing: you do it for each class individually, then you normalize again, and you can make pretty pictures.

- per-class calibration
- renormalization

---
class: center, middle

# Questions ?