class: center, middle ### W4995 Applied Machine Learning # (Gradient) Boosting, Calibration 02/20/19 Andreas C. Müller ??? We'll continue with tree-based models, talking about boosting. Finally, we'll cover a technique called calibration that looks somewhat similar to ensembles but has the goal of obtaining good probability estimates from any classifier. FIXME: Binning FIXME: talk about gradient boosting as gradient descent more? - requires gradient descent :-/ FIXME GBRT: better explain log-loss FIXME: add example of calibrated and inaccurate vs accurate but not calibrated! (always saying 50% is calibrated) FIXME: brier score decomposition FIXME calibration scores FIXME maybe saying "sort" before binning is confusing? FIXME move calibration after metrics? Partial dependence plot: how is that different from feature importances? calibration curve plotting: bins are different for hist and calibration curve? slide 27: maybe put probabilities on x axis? calibration curve: blue histogram is number of total points, y axis not labeled --- class: center, middle # Gradient Boosting ??? Gradient boosting is one of the most successful supervised machine learning methods in practice. It's often used on Kaggle to win competitions, it's used for credit scoring, and it's one of the standard tools of the trade, one of the best off-the-shelf models. A standard implementation that people use is XGBoost, but there's also an implementation in scikit-learn, and we'll talk about both of them. Last time we talked about random forests, which build many trees independently, each randomized in a different way, and then average their predictions. Gradient boosting, on the other hand, builds trees one by one in a sequential manner, with each tree requiring the results of the previous trees. Often, gradient boosting is done with very small trees, or even decision stumps, which are trees of depth one, so a single split. --- class: center, middle # Boosting (in General) ??? - “Meta-algorithm” to create strong learners from weak learners. - AdaBoost, GentleBoost, … - Trees or stumps work best - Gradient Boosting often the best of the bunch - Many specialized algorithms (ranking etc) This is an instance of a more general family of models, called boosting models, which all iteratively try to improve a model built up from weak learners. Gradient boosting is the particular technique where we try to fit the residuals, and it's been found to work very well in practice, in particular if you're using shallow trees as the weak learners. In principle, you could use any model as a weak learner, but trees just work really well. --- # Gradient Boosting Algorithm `$$ f_{1}(x) \approx y $$` `$$ f_{2}(x) \approx y - f_{1}(x) $$` `$$ f_{3}(x) \approx y - f_{1}(x) - f_{2}(x)$$` -- $y \approx$ ![:scale 22.5%](images/grad_boost_term_1.png) + ![:scale 22.5%](images/grad_boost_term_2.png) + ![:scale 20%](images/grad_boost_term_3.png) + ... ??? Let's look at the regression case first. We start by building a single tree f1 to try to predict the output y. But we strongly restrict f1, so it will be rather bad at predicting y. Next, we look at the residual of this first model, y - f1(x). We train a new model f2 to predict this residual, in other words to correct the mistakes made by f1. Again, this will be a very simple model, so it will still not be able to fix all errors. Then we look at the residual of both models together, y - f1(x) - f2(x), the mistakes that f1 and f2 together could not fix, and we build f3 to fix those, and so on.
This is natural for regression. For classification this is not as clear. For binary classification you use the log loss: the trees learn a decision function, and applying the logistic sigmoid to it gives the predicted probability; for multi-class you can use one-vs-rest. So we're sequentially building up a model out of what are called "weak learners", small trees, to create a more powerful composite model. --- # Gradient Boosting Algorithm `$$ f_{1}(x) \approx y $$` `$$ f_{2}(x) \approx y - \gamma f_{1}(x) $$` `$$ f_{3}(x) \approx y - \gamma f_{1}(x) - \gamma f_{2}(x)$$` $y \approx \gamma$ ![:scale 22.5%](images/grad_boost_term_1.png) + $\gamma$ ![:scale 22.5%](images/grad_boost_term_2.png) + $\gamma$ ![:scale 20%](images/grad_boost_term_3.png) + ...
Learning rate $\gamma$, e.g. 0.1 ??? - Iteratively add regression trees to model - Use log loss for classification - Discount update by learning rate FIXME plot for regression models Come back to this --- #GradientBoostingRegressor .center[ ![:scale 50%](images/grad_boost_regression_steps.png) ] ??? Here's an illustration of doing this for regression. This is a 1D regression dataset for illustration purposes: the feature is on the x-axis, the prediction on the y-axis. In the first step, I just fit a tree to the data, a simple tree of depth 3. The depth-3 tree is not able to completely model the data; the orange line is the tree that was fit. After this first step, I look at the total prediction, which is just gamma times the predictions made by the first tree, so everything is squashed together: that's the effect of gamma. The blue points in the next panel are the data minus the predictions from step one, so this is the residual. It still looks pretty much the same, because we took only a small step. Then I fit another depth-3 tree to this residual. The next panel shows the total prediction, gamma times the first tree plus gamma times the second tree, plotted against the original data. The orange curve has more steps now because it's a combination of two trees. As I continue the same procedure until step five, you can see that the residual gets much smaller and the linear combination of all the trees has learned most of the variation in the data. If I keep doing this, at some point the residuals become very small and the total prediction fits the data better and better. Question: can I extrapolate? The answer is no. Question: how is this different from just growing one deeper tree? A) The combination of stumps is different from building a deeper tree because each new stump applies its split to the whole dataset, whereas if you make a tree deeper, each of the deeper nodes gets a different split on its own part of the data: if you already have 10 nodes and grow one more level, each of these parts of the data gets a different split, while a new stump acts at a global level. Also, in a single decision tree the decisions are very hard; here you don't fully trust any hard decision, you only go a little bit in each direction. B) This works much better in practice. This was for regression because it's easier to visualize, but you can do the same thing for classification. Gradient boosting for classification uses regression trees to learn a decision function: essentially you're doing logistic regression, but instead of a linear model the decision function is this linear combination of trees, and you're trying to find the function that has a small log loss. So inside a gradient boosting classifier you're not actually learning classification trees, you're learning regression trees that drive the predicted probability, which is quite different. Another question: if f1 fits y exactly, doesn't the residual become zero? One reason it doesn't is gamma, which means we only take a small step. But even if I set gamma equal to one, the point is that f1 is restricted so it cannot completely fit the data. I'm not trying to completely overfit the data, I'm trying to use a simple model; if I use a tree of depth one, there will be a very large residual.
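To make the residual-fitting loop concrete, here is a minimal from-scratch sketch for squared-error regression. The synthetic 1D data, the learning rate of 0.1 and the depth-3 trees are illustrative assumptions, not the exact setup behind the figure:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=200)

learning_rate = 0.1              # gamma: how much of each tree to trust
prediction = np.zeros(len(y))    # current prediction of the ensemble
trees = []

for step in range(100):
    residual = y - prediction                       # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)   # take a small step

def predict(X_new):
    # the final model is the (discounted) sum of all the trees
    return sum(learning_rate * t.predict(X_new) for t in trees)
```

For squared error, fitting the residual is exactly fitting the negative gradient of the loss, which is where the name gradient boosting comes from; GradientBoostingRegressor does essentially this, plus an initial constant prediction and many more options.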
--- #GradientBoostingClassifier .center[ ![:scale 70%](images/grad_boost_depth2.png) ] ??? Here's an illustration of what this might look like for classification; it's basically the same thing going on. I'm plotting the probabilities assigned by the model. The first estimator is a decision tree of depth two; it makes some mistakes here and a bunch of mistakes in the middle. Then I fit the next tree to the residuals, which, to decrease the log loss on this dataset, changes something in the middle, and so on. As I add more and more estimators, the model fits the data better and better and gets more and more complex decision boundaries. Because this works like gradient descent on the log loss, it's a bit harder to visualize for classification. White means a probability of 0.5 (a tie), red means a high probability for the red class, and blue a high probability for the blue class. Clearly it hasn't fit the data perfectly here; I could keep adding models and in the end it would fit the data perfectly. For multi-class, the default is one-vs-rest. --- class:spacious # Gradient Boosting Advantages - Slower to train than RF (if using "old" GradientBoostingRegressor), but much faster to predict - Very fast using XGBoost, LightGBM, pygbm, new scikit-learn implementation [#12807](https://github.com/scikit-learn/scikit-learn/pull/12807) - Small model size - Typically more accurate than Random Forests ??? Gradient boosting is slower to train than a random forest if you train serially, but parallelized implementations are often faster, and the model is much smaller: the trees are usually not as deep and you don't need as many, because the learning is much more focused, with each tree trying to correct the mistakes of the previous ones. Prediction is usually very fast because it can happen in parallel over all the trees, while learning cannot happen in parallel over the trees, since each tree depends on the previous ones. Usually gradient boosting is more accurate than random forests. Question: for the same number of estimators, what does a low versus a high learning rate change? A high learning rate allows you to fit the data more strongly, but also to overfit more strongly. --- class:spacious # Tuning Gradient Boosting - Pick n_estimators, tune learning rate - Can also tune max_features - Typically strong pruning via max_depth ??? There are several things you can tune about gradient boosting. A common approach is to pick the number of estimators that you have time for (runtime is linear in the number of estimators, because you build the trees one by one, each with the same parameters) and then tune the learning rate to control how strongly you want to fit the data. You can also tune max_features if you want to add more randomness, but that's not very commonly used, and you can subsample the data if you want faster training. Typically there's a strong limit on the maximum depth: traditionally a depth of one, two or three, though on Kaggle people use something like 8 to 10. Either way the depth is much, much smaller than for random forests, so the model is small in memory and fast to predict, since the trees to traverse are shallow. A minimal sketch of this tuning strategy follows below.
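As a sketch of that strategy (the dataset and the parameter grid here are illustrative assumptions, not values from the lecture), you could fix n_estimators and grid-search the learning rate and depth:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'learning_rate': [0.01, 0.1, 1.0],  # how strongly each tree counts
              'max_depth': [1, 2, 3]}             # strong pruning via shallow trees
grid = GridSearchCV(GradientBoostingClassifier(n_estimators=100),
                    param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```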
Before we go into implementations, I want to say a little bit about analyzing these models. Because it's a tree-based model and the prediction is a linear combination of trees, you can look at feature importances in the same way we did for random forests and single trees. --- class: middle # Improvements: Binning and "extreme" gradient boosting --- class: spacious # A better split criterion (?) - Optimize "weights" in leaves (not just counting) - Include second order (hessian) and regularizer ![:scale 100%](images/xgboost-criterion.png) --- # Binning .left-column[ ![:scale 100%](images/xgboost-exact.png) ] -- .right-column[ ![:scale 100%](images/xgboost-binning.png) ] --- # Aggressive sub-sampling .padding-top[ According to user feedback, using column sub-sampling prevents over-fitting even more so than the traditional row sub-sampling. (XGBoost: A Scalable Tree Boosting System, 2016) ] --- # XGBoost `conda install -c conda-forge xgboost` ```python from xgboost import XGBClassifier xgb = XGBClassifier() xgb.fit(X_train, y_train) xgb.score(X_test, y_test) ``` - supports missing values - supports multi-core ??? In terms of implementations, there are a couple of options. Scikit-learn has GradientBoostingClassifier and GradientBoostingRegressor; unfortunately, they are rather slow. One of the most commonly used packages is XGBoost, which you can install from conda-forge, for example. It's completely scikit-learn compatible: you can import XGBClassifier and use it with grid search or with pipelines. The nice thing is it's fast. It supports missing values, so you don't need an imputation strategy; you can feed them directly into XGBoost. It also supports multi-core training. The gradient boosting in scikit-learn is sequential on a single core, which is not great on most machines: if you have a big machine you want to use all the cores in parallel, which you can't do with scikit-learn but can with XGBoost. (Random forests, on the other hand, scikit-learn can run in parallel, because that's much easier; XGBoost also has a random forest.) XGBoost also has some enhancements to the algorithm. It has L1 and L2 penalties on the leaf weights, so you can apply something like a lasso or elastic net penalty to the leaves; I think by default it's disabled. So the trees are not trained quite like vanilla decision trees. One of the reasons XGBoost is faster is simply a faster implementation, but you can make it even faster by using approximate splits in the trees. To find a split, you need to search over all features and over all possible thresholds on each feature; searching over all possible thresholds means sorting the feature, which is an O(n log n) operation. Instead, you can bin the features, which makes split finding a linear-time operation, at the cost of the thresholds being approximate. That makes the computation much faster. - Efficient implementation of gradient boosting (5x sklearn) - Improvements on original algorithm - https://arxiv.org/abs/1603.02754 - Adds l1 and l2 penalty on leaf-weights - Fast approximate split finding - Scikit-learn compatible interface --- class: spacious .center[ ![:scale 50%](images/xgboost_sklearn_bench.png) ] - Exact splits ~5x faster on single core - Approximate splits, multi-core -> more speed!
- Also check lightGBM - Both have GPU support - https://github.com/szilard/GBM-perf - Nicolas' PR for sklearn [#12807](https://github.com/scikit-learn/scikit-learn/pull/12807) ??? Here is a comparison of the exact version of XGBoost against scikit-learn in terms of time and accuracy. This is for illustrative purposes; I ran it on subsets of MNIST. The x-axis is the number of samples, so you can see how the runtime increases as the number of samples increases. For a significant number of samples, XGBoost is much faster: on a single core it tends to be about five times faster, and on a 10-core machine it's going to be much, much faster. There's also an implementation out of Microsoft called LightGBM, which might be even faster; according to the latest benchmarks I saw, LightGBM is actually the fastest. LightGBM and XGBoost both have GPU support, which makes them even faster. On GPU, the benchmark I saw had XGBoost a bit faster, while on CPU LightGBM was a bit faster. Both are fast, highly parallel, optimized implementations, so if you care about speed, you want to use one of these two. --- # Early stopping - Adding trees can lead to overfitting - Stop adding trees when validation accuracy stops increasing - Optional in XGBoost and sklearn >= 0.20 ??? As I said earlier, if you add more and more trees, the algorithm can overfit. So instead of searching over the learning rate or the number of trees, you can also do early stopping. Because this is a sequential algorithm that gets better and better the more trees you add, you can use a validation set and stop adding trees once you start to overfit. You can do that both with scikit-learn and with XGBoost. The idea is that you pick a large number of estimators, keep a separate validation set, and if the validation accuracy doesn't improve (or decreases) for some number of iterations, say five, you stop learning. This way you get results faster, because you don't keep training, and you possibly get a better model, because you avoid overfitting. The only downside is that you have less data to train your model, because you need to hold out a validation set for early stopping. You'll still need a separate test set to see how well the model actually does; you can't use the same set for early stopping and for evaluation. A minimal sketch with the scikit-learn estimator follows below.
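As an illustrative sketch (the dataset and the particular parameter values are assumptions, not from the lecture), scikit-learn >= 0.20 exposes early stopping on GradientBoostingClassifier through validation_fraction and n_iter_no_change, carving the validation set out of the training data internally:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingClassifier(
    n_estimators=1000,        # generous upper bound; we expect to stop much earlier
    validation_fraction=0.1,  # held out internally for early stopping
    n_iter_no_change=5,       # stop if validation score doesn't improve for 5 rounds
    random_state=0)
gbrt.fit(X_train, y_train)
print(gbrt.n_estimators_, gbrt.score(X_test, y_test))  # trees actually built, test accuracy
```

XGBoost has a similar mechanism where you pass an explicit evaluation set to fit together with an early-stopping patience.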
--- class:spacious # More Tuning of Gradient Boosting - Tune max_features - Tune column subsampling, row subsampling - Typically strong pruning via max_depth - Regularization - Pick learning rate and do early stopping? --- class: middle # Concluding tree-based models --- class:spacious # When to use tree-based models - Model non-linear relationships - Doesn’t care about scaling, no need for feature engineering - Single tree: very interpretable (if small) - Random forests very robust, good benchmark - Gradient boosting often best performance with careful tuning ??? To summarize, tree-based models are a very popular family of models and they're very commonly used. You probably want to use them if you need to model nonlinear relationships, or if you have a lot of differently scaled or otherwise awkward features, because they really don't care about the scaling of the features, which lets you skip most of the preprocessing. If you want a very interpretable model, a single small tree is a good idea: it's one of the few models you can write down on a blackboard, show to someone, and they'll have some idea of what's going on, which is hard to do with most other models. Random forests are great because they're very robust and you don't have to tune much: run a random forest with 100 trees at the default settings and it will usually work well. That makes them one of my first benchmarks; usually I first run a logistic regression, then a random forest. Gradient boosting is often the best-performing model and a standard part of the toolbox. You typically need to tune it a little more, the depth of the trees, the learning rate and so on, but if you do, it will usually beat random forests on these kinds of datasets. One case where you might not want to use trees is very high-dimensional sparse data, where linear models might work better, though your mileage may vary; in a low-dimensional space, tree-based models are probably a good bet. The next thing we'll talk about is a different way to put multiple models together. --- class:spacious #Calibration .center[
Source
] - Probabilities can be much more informative than labels: - “The model predicted you don’t have cancer” vs “The model predicted you’re 40% likely to have cancer” ??? The next thing I want to talk about is calibration. Calibration also builds a model on top of a model, but the goal of calibration is to get accurate probability estimates. Often we're interested not only in the predicted label, but also in the classifier's uncertainty. For example, think about a model that does cancer prediction. If the model predicts you don't have cancer, you're going to be pretty happy. If it says there's a 40% chance that you have cancer, which for binary classification is the same predicted label, you might be much less happy, since 40% seems pretty high for having cancer. So we often want actual probability estimates that allow us to make decisions that are finer grained than a plain yes or no. Calibration is a way to get probability estimates out of any model. For example, SVMs are not good at producing probabilities, so you can use calibration if you really want to use an SVM and still get probabilities out. Or, if you have a model that can already predict probabilities, like trees, random forests or nearest neighbors, you can use calibration to make sure that the probabilities you get are actually good probabilities. If a random forest says "this is 70% class one", it's not really clear what that means, because the model was not trained to optimize probability estimates, so they could be off by quite a bit. --- class: spacious #Calibration curve (Reliability diagram) .left-column[ ![:scale 100%](images/prob_table.png) ] .right-column[ ![:scale 100%](images/calib_curve.png) ] ??? Before we talk about how to do calibration, I want to talk about how to measure calibration in binary classification. You can also do this for multi-class classification, but binary is much simpler. The nice thing is that you can measure calibration, and calibrate a classifier, without ever having ground-truth probabilities: even if you only have 0/1 labels, you can still make sure your model provides reasonable probabilities. The way you do this is you take the probability estimates of the model and bin them. Here, for example, these are the probabilities estimated by a model and these are the true targets. I sort the samples by the predicted probability and group them into three bins, and then I look at the actual prevalence of the positive class in each bin. If the probability estimates were accurate, then in the bin that contains all the roughly-90% predictions, the prevalence should actually be about 90%: 90% of the points that are given a score of 90% should be the true class. You can plot this as the calibration curve, or reliability diagram. Here I have three bins, and what I want is the diagonal line, where the predicted probability in each bin matches the actual prevalence. Here, for example, the points with predictions around 0.5 actually have zero prevalence, so there is no true class in that bin at all, which is really weird; this curve is far from the diagonal, so this would be a very badly calibrated classifier. One thing I want to emphasize is that being calibrated does not imply that the model is accurate; these are two different things. If I have a binary model on a balanced dataset that always says 50% probability for every data point, it is perfectly calibrated: it says 50%, and 50% of those data points really are the positive class. But it's a completely trivial classifier that tells you nothing. Calibration tells you how much you can trust the model's probability estimates, not how accurate the model is.
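To make the binning concrete, here is a minimal sketch of computing a reliability diagram by hand; the toy scores and labels are made up for illustration, in practice you would use held-out predictions from a real model:

```python
import numpy as np

# toy predicted probabilities and true 0/1 labels (illustrative only)
probs  = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.95])
y_true = np.array([0,    0,    0,    1,    0,    1,    0,    1,    1,    1])

edges = np.linspace(0, 1, 4)                # three equal-width bins
bin_ids = np.digitize(probs, edges[1:-1])   # bin index for each prediction

for b in range(3):
    mask = bin_ids == b
    print("bin %d: mean predicted %.2f, fraction positive %.2f"
          % (b, probs[mask].mean(), y_true[mask].mean()))
```

For a well-calibrated model the two numbers in each bin should roughly match; plotting the fraction of positives against the mean predicted probability gives the reliability diagram.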
- For binary classification only - you can be calibrated and inaccurate! - Given a predicted ranking or probability from a supervised classifier, bin predictions. - Plot fraction of data that’s positive in each bin. - Doesn’t require ground truth probabilities! --- class: split-40 # calibration_curve with sklearn Using subsample of covertype dataset .left-column[ .tiny-code[ ```python from sklearn.linear_model import LogisticRegressionCV print(X_train.shape) print(np.bincount(y_train)) lr = LogisticRegressionCV().fit(X_train, y_train) (52292, 54) [19036 33256]``` ```python lr.C_ array([ 2.783]) ``` ```python print(lr.predict_proba(X_test)[:10]) print(y_test[:10]) [[ 0.681 0.319] [ 0.049 0.951] [ 0.706 0.294] [ 0.537 0.463] [ 0.819 0.181] [ 0. 1. ] [ 0.794 0.206] [ 0.676 0.324] [ 0.727 0.273] [ 0.597 0.403]] [0 1 0 1 1 1 0 0 0 1] ``` ] ] .right-column[ .tiny-code[ ```python from sklearn.calibration import calibration_curve probs = lr.predict_proba(X_test)[:, 1] prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=5) print(prob_true) print(prob_pred) [ 0.2 0.303 0.458 0.709 0.934] [ 0.138 0.306 0.498 0.701 0.926] ``` ] .center[![:scale 70%](images/predprob_positive.png) ] ] ??? I'm using a subsample of the covertype dataset because it's reasonably big and we get nice histograms here. I train a logistic regression model; logistic regression typically gives pretty good probability estimates, so I expect this model to be relatively well calibrated. Here are the predicted probabilities and the true labels for the first 10 data points. What I take is the predicted probability of the positive class for all data points, together with the true labels. Obviously, I need to do this on the test set or on some hold-out set: on the training set this would tell me nothing, because a model that perfectly overfits the training set would look perfect there. So I use a separate set and compute the calibration curve on it. The histogram shows how many points fall into each region, the orange curve is the actual calibration curve, and the dotted line is perfect calibration. Here I'm using five bins. The first bin goes from 0 to 0.2, so for a calibrated model I'd expect around 10-15% of the points in this bin to be the positive class (matching the mean predicted probability of about 0.14), but it's actually 20%, so the model is not perfectly calibrated there. In the second bin, from 0.2 to 0.4, I'd expect about 30% of the points to be the positive class, and it actually is about 30%. --- class:spacious #Influence of number of bins .center[![:scale 100%](images/influence_bins.png)] ??? How do I pick the number of bins? More bins give more resolution, but at some point you get noise; it's the same trade-off as for a histogram. You can use the standard heuristics for histograms, but usually something like 10 or 20 bins is enough; a quick way to check is sketched below. - Works here because dataset is big - Might become very noisy for smaller datasets
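A quick sketch of checking this yourself, reusing y_test and probs from the snippet above (the particular bin counts are just illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# y_test and probs as computed in the calibration_curve snippet above
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, n_bins in zip(axes, [5, 10, 50]):
    prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=n_bins)
    ax.plot([0, 1], [0, 1], ':', c='gray')     # perfect calibration
    ax.plot(prob_pred, prob_true, marker='o')  # reliability curve
    ax.set(title="n_bins=%d" % n_bins,
           xlabel="mean predicted probability", ylabel="fraction of positives")
plt.tight_layout()
plt.show()
```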
--- class:spacious # Comparing Models .center[![:scale 90%](images/calib_curve_models.png)] ??? Logistic regression is pretty well calibrated. The decision tree, since I didn't use any pruning, is way too certain: all of its probability estimates, even on the test set, are either 0 or 1, which is not great. The random forest classifier used here is not certain enough. We want to fix that, but before we talk about how, let's measure calibration with a single number. --- # Brier Score (for binary classification) - “mean squared error of probability estimate” `$$ BS = \frac{\sum_{i=1}^{n} (\widehat{p} (y_i)-y_i)^{2}}{n}$$` .center[![:scale 70%](images/models_bscore.png)] ??? The calibration curve is a nice way to look at how calibrated a model is, but it's hard to compare models with it. A standard way to compare calibration is the Brier score; you could also use something like the log loss, which also tells you how good a probability estimate is. The Brier score is the mean squared error of the probability estimates: y_i is either one or zero, and p̂ is the predicted probability. If you always predict 0.5, you always incur a loss, but only a moderate one of 0.25 per sample; if you predict 0 when you should have predicted 1, you get the maximum loss of 1. Here are the Brier losses for the different models; smaller is better. This conflates accuracy and calibration a little bit: you can see that the random forest actually does best, presumably just because it's a more accurate model. It's not well calibrated, but it's a much better predictor, so it still gets a smaller score. Now that we've defined and measured calibration, we actually want to fix it. --- # Fixing it: Calibrating a classifier - Build another model, mapping classifier probabilities to better probabilities! - 1d model! (or more for multi-class) `$$ f_{calib}(s(x)) \approx p(y)$$` - s(x) is score given by model, usually - Can also work with models that don’t even provide probabilities! Need model for $f_{calib}$, need to decide what data to train it on. - Can train on training set → Overfit - Can train using cross-validation → use data, slower ??? The way to fix this is similar to stacking: we build another model on top of the probability estimates. Usually that's just a 1D model, because for binary classification the underlying model gives us a single probability output, and we need to learn a 1D function that maps this probability to something more accurate. It also works if the model gives us a score rather than a probability: an SVM only gives us a score, but we can still learn a 1D function that tries to turn it into probabilities. As with stacking, you don't want to fit the calibration model on the training set, because there the model is doing very well; you need to use either a hold-out set or cross-validation again. --- # Platt Scaling - Use a logistic sigmoid for $f_{calib}$ - Basically learning a 1d logistic regression - (+ some tricks) - Works well for SVMs `$$f_{platt} = \frac{1}{1 + \exp(-ws(x) - b)}$$` ??? There are two main methods that people use, and both are in scikit-learn.
One is Platt scaling. Platt scaling is basically 1D logistic regression: you learn a 1D sigmoid (plus a few tricks). It fixes the particular sigmoid shape that you usually get from SVM scores, so it works well for SVMs, but there is only one slope parameter, so there's not a lot you can tune. --- # Isotonic Regression - Very flexible way to specify $f_{calib}$ - Learns arbitrary monotonically increasing step-functions in 1d. - Groups data into constant parts, steps in between. - Optimum monotone function on training data (wrt mse). .center[![:scale 40%](images/isotonic_regression.png)] ??? The other one is isotonic regression, which is a non-parametric mapping: it fits the monotone function that minimizes the squared error, in other words the piecewise-constant, monotonically increasing function with minimum squared error. That's what the figure shows: given the data in red, isotonic regression finds the green step function. In Platt scaling, s(x) is the score that comes from the model (the SVM or the random forest), and w is the one parameter you learn from the data; it's a very inflexible model that only lets you say how steep the sigmoid correction should be. Isotonic regression is a very flexible model that can apply any correction, as long as it's monotone. --- class:spacious # Building the model - Using the training set is bad - Either use hold-out set or cross-validation - Cross-validation can be used to make unbiased probability predictions, use that as training set. ??? --- class: center # Fitting the calibration model ![:scale 70%](images/calibration_val_scores.png) --- class: center # Fitting the calibration model ![:scale 70%](images/calibration_val_scores_fitted.png) --- # CalibratedClassifierCV .tiny-code[ ```python from sklearn.calibration import CalibratedClassifierCV X_train_sub, X_val, y_train_sub, y_val = train_test_split(X_train, y_train, stratify=y_train, random_state=0) rf = RandomForestClassifier(n_estimators=100).fit(X_train_sub, y_train_sub) scores = rf.predict_proba(X_test)[:, 1] plot_calibration_curve(y_test, scores, n_bins=20) ``` .center[![:scale 30%](images/random_forest.png)] ] ??? --- # Calibration on Random Forest .smaller[ ```python cal_rf = CalibratedClassifierCV(rf, cv="prefit", method='sigmoid') cal_rf.fit(X_val, y_val) scores_sigm = cal_rf.predict_proba(X_test)[:, 1] cal_rf_iso = CalibratedClassifierCV(rf, cv="prefit", method='isotonic') cal_rf_iso.fit(X_val, y_val) scores_iso = cal_rf_iso.predict_proba(X_test)[:, 1] ```] .center[![:scale 90%](images/types_callib.png)] ??? This is a random forest again. You can use CalibratedClassifierCV, which does the calibration either with a single hold-out set or with cross-validation; here I'm using a single hold-out set. I split the training set into X_train_sub and X_val: X_train_sub is what I use to train the random forest, and X_val is what I use to calibrate it. I create a CalibratedClassifierCV with the random forest that was already fit on the training subset, set cv to "prefit" and the method to sigmoid, and then fit this calibration model on the validation set. Here is what it looks like: the random forest with no calibration, with sigmoid calibration, and with isotonic calibration. In this case, sigmoid and isotonic don't look very different; both look reasonably well calibrated.
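To see that the sigmoid method really is just a 1D logistic regression on the classifier's scores (up to the label-smoothing tricks in Platt's original method), here is a hand-rolled sketch of the prefit case, reusing rf, X_val, y_val and X_test from the snippets above:

```python
from sklearn.linear_model import LogisticRegression

# 1D "feature": the uncalibrated probability of the positive class on the validation set
val_scores = rf.predict_proba(X_val)[:, 1].reshape(-1, 1)
platt = LogisticRegression().fit(val_scores, y_val)   # learns the slope w and offset b

# map the test-set scores through the learned sigmoid
test_scores = rf.predict_proba(X_test)[:, 1].reshape(-1, 1)
calibrated_probs = platt.predict_proba(test_scores)[:, 1]
```

This is roughly what CalibratedClassifierCV(rf, cv="prefit", method='sigmoid') does internally, minus Platt's target adjustments and differences in regularization.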
That said, if you're using a hold-out set like this, you're throwing away a chunk of data that you could have used for training your first model. --- # Cross-validated Calibration ```python cal_rf_iso_cv = CalibratedClassifierCV(rf, method='isotonic') cal_rf_iso_cv.fit(X_train, y_train) scores_iso_cv = cal_rf_iso_cv.predict_proba(X_test)[:, 1] ``` .center[![:scale 90%](images/types_callib_cv.png)] ??? We can also not specify cv, in which case it uses cross-validation internally, 3-fold by default, and I get an even better calibration. What it does is train a separate model for each cross-validation fold; when I predict on the test set, it uses all of these models and averages them. In a sense it's not very surprising that this does better: I built three random forests on subsets of the data, calibrated each of them, and averaged the calibrated models, so I basically built a bigger random forest, and I was also able to use more data. If you do this, you keep as many models as there are cross-validation folds and use all of them for prediction. - kinda cheating, we have more trees now - we use all the data, get good probabilities. just time-consuming --- # Multi-Class Calibration .center[![:scale 100%](images/multi_class_calibration.png)] ??? You can also do multi-class calibration, which is basically the same thing: you calibrate each class individually, then renormalize, and you can make pretty pictures. - per-class calibration - renormalization --- class: center, middle # Questions ?