class: center, middle

### W4995 Applied Machine Learning

# Calibration, Imbalanced Data

03/02/20

Andreas C. Müller

???
Today we'll expand on the model evaluation topic we started last week and talk more about how to build better models for imbalanced data. We already talked about why accuracy is a bad measure and what other measures we can use.

FIXME in calibration: make sure we have p(y=1) not p(y)?
FIXME what are the bins for calibration in first slide
FIXME write down definition for calibration
FIXME alpha value for curves
FIXME definition of balanced class weight
FIXME bullet points!!
FIXME balanced bagging is terrible now?
FIXME add more datasets for some benchmark?
FIXME: add example of calibrated and inaccurate vs accurate but not calibrated! (always saying 50% is calibrated)
FIXME: brier score decomposition
FIXME calibration scores
FIXME maybe saying "sort" before binning is confusing?
FIXME maybe add why I don't like log loss here?
FIXME show that AUC is not changed by calibration?
FIXME show example of overconfident vs underconfident predictions more clearly
FIXME update metrics slides from metrics lecture
FIXME better illustration for sampling
FIXME better motivation for sampling / why we want to change things
FIXME Add estimators minimizing loss directly?
FIXME add more on imbalanced forest
FIXME better benchmarking for SMOTE
FIXME show grid-search results
FIXME grid-search SMOTE show uncertainty
FIXME research comparison papers more
FIXME smote: clarify to only sample from same class
FIXME smote: paper vs implementation: do along coordinates in rectangle?

---
class: spacious

# Calibration

.center[
Source
]

- Probabilities can be much more informative than labels:
- “The model predicted you don’t have cancer” vs “The model predicted you’re 40% likely to have cancer”

???
So the next thing I want to talk about is calibration. Calibration also builds a model on top of a model, but the goal of calibration is to get accurate probability estimates. Often we're interested not only in the output of the classifier, but also in its uncertainty estimate. For example, think about a model that does cancer prediction. If the model predicts you don't have cancer, you're going to be pretty happy. If it says there's a 40% chance that you have cancer — which is the same prediction in a binary classification — you might be less happy, since 40% seems pretty high for having cancer. So we often want actual probability estimates that allow us to make decisions that are finer grained than just a yes or no.

Calibration is a way to get probability estimates out of any model. For example, SVMs are not good at predicting probabilities, so you can use calibration if you really want to use an SVM and still get probabilities out of it. Or, if you have a model that can already predict probabilities, like trees, random forests, or nearest neighbors, you can use calibration to make sure that the probabilities you get are actually good probabilities. If you use a random forest to estimate probabilities and it says "this is 70% class one", it's not really clear what that means, because the model is not trained to optimize probability estimates, so they could be off by quite a bit.

---
class: spacious

# Calibration curve (Reliability diagram)

.left-column[
![:scale 100%](images/prob_table.png)
]
.right-column[
![:scale 100%](images/calib_curve.png)
]

???
Before we talk about how to do calibration, I want to talk about how to measure calibration in binary classification. You can also do it for multiclass classification, but binary is much simpler. The nice thing is that you can measure calibration, and calibrate a classifier, without ever having ground-truth probabilities. So even if you only have 0/1 labels, you can still make sure that your model provides reasonable probabilities.

The way you do this: take the probability estimates of the model and bin them. Here, for example, these are the probabilities estimated by a model, and these are my true targets. I sort the samples by the probability estimates and create three bins. Then I look at the actual prevalence of the positive class in each bin. If the probability estimates were accurate, then in a bin containing the points scored around 90%, the prevalence should actually be about 90% — that's what the score claims: 90% of the points given a score of 90% should be the true class. You can plot this as a calibration curve, or reliability diagram. Here I have three bins, and what I would want is the diagonal line, where each bin's predicted probability matches its actual prevalence. But here, for example, the points scored around 0.5 actually have zero prevalence — there's no true positive in that bin, which is really weird. This curve is far from the diagonal, so this would be a very badly calibrated classifier.
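To make the binning concrete, here is a minimal sketch of computing such a reliability diagram by hand with numpy; the array values are made up for illustration:

```python
import numpy as np

# made-up predicted probabilities and true 0/1 labels for illustration
probs = np.array([0.1, 0.2, 0.25, 0.4, 0.5, 0.55, 0.7, 0.85, 0.9, 0.95])
y_true = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1, 1])

bins = np.linspace(0, 1, 4)                  # three equal-width bins
bin_ids = np.digitize(probs, bins[1:-1])     # assign each prediction to a bin

for b in range(3):
    mask = bin_ids == b
    if mask.any():
        # mean predicted probability vs. actual fraction of positives in the bin
        print(f"bin {b}: mean predicted={probs[mask].mean():.2f}, "
              f"fraction positive={y_true[mask].mean():.2f}")
```

For a well-calibrated model the two numbers match in every bin; sklearn's `calibration_curve`, used on a later slide, does essentially this computation for you.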
One thing I want to emphasize is that calibration doesn't imply that the model is accurate — these are two different things. If I have a binary model on a balanced dataset that always predicts 50% probability for every data point, it's perfectly calibrated: it says 50% for everything, and 50% of those data points actually are the positive class. It's a completely trivial classifier that tells you nothing, but it's perfectly calibrated. So calibration tells you how much you can trust the predicted probabilities, not how accurate or useful the model is.

- For binary classification only
- you can be calibrated and inaccurate!
- Given a predicted ranking or probability from a supervised classifier, bin predictions.
- Plot fraction of data that’s positive in each bin.
- Doesn’t require ground truth probabilities!

---
class: split-40

# calibration_curve with sklearn

Using subsample of covertype dataset

.left-column[
.tiny-code[
```python
from sklearn.linear_model import LogisticRegressionCV
print(X_train.shape)
print(np.bincount(y_train))
lr = LogisticRegressionCV().fit(X_train, y_train)

(52292, 54)
[19036 33256]
```
```python
lr.C_
array([ 2.783])
```
```python
print(lr.predict_proba(X_test)[:10])
print(y_test[:10])

[[ 0.681  0.319]
 [ 0.049  0.951]
 [ 0.706  0.294]
 [ 0.537  0.463]
 [ 0.819  0.181]
 [ 0.     1.   ]
 [ 0.794  0.206]
 [ 0.676  0.324]
 [ 0.727  0.273]
 [ 0.597  0.403]]
[0 1 0 1 1 1 0 0 0 1]
```
]
]
.right-column[
.tiny-code[
```python
from sklearn.calibration import calibration_curve
probs = lr.predict_proba(X_test)[:, 1]
prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=5)
print(prob_true)
print(prob_pred)

[ 0.2    0.303  0.458  0.709  0.934]
[ 0.138  0.306  0.498  0.701  0.926]
```
]
.center[![:scale 70%](images/predprob_positive.png)
]
]

???
I'm using a subsample of the covertype dataset because it's fairly big and we get some nice histograms here. I train a logistic regression model; logistic regression typically gives pretty good probability estimates, so I expect this model to be relatively well calibrated. Here are the predicted probabilities and the true labels for the first ten data points. What I take are the predicted probabilities of the positive class, together with the actual labels. Obviously, I need to do this on the test set: if I did it on the training set, it would tell me nothing — a model that perfectly overfits the training set would look perfect there. So I need a separate test set or hold-out set, and on that I can compute the calibration curve.

The histogram shows how many points fall into each region. The orange curve is the actual calibration curve, and the dotted line is perfect calibration. I'm using five bins here. In the first bin, the mean predicted probability is about 0.14, so I'd expect roughly 14% of the points in that bin to be the positive class, but actually it's about 20% — so it's not perfectly calibrated there. In the second bin, from 0.2 to 0.4, the mean predicted probability is about 0.31 and the actual fraction of positives is about 0.30, so that bin is well calibrated.

---
class: spacious

# Influence of number of bins

.center[![:scale 100%](images/influence_bins.png)]

???
How do I pick the number of bins? More bins give more resolution, but at some point you just get noise. It's basically the same trade-off as choosing the number of bins for a histogram.
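As a rough sketch of how you might eyeball this trade-off yourself — reusing `probs` and `y_test` from the previous slide, so this assumes that example has been run — you could compare a few bin counts:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# compare reliability diagrams for a few bin counts (probs, y_test from above)
for n_bins in [5, 10, 20, 50]:
    prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=n_bins)
    # with many bins, individual bins hold few samples and the curve gets noisy
    print(n_bins, np.round(np.abs(prob_true - prob_pred).mean(), 3))
```

The printed number is just the average gap between predicted and observed probability per bin — a quick-and-dirty summary, not an official calibration score.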
You can use the standard heuristics for histograms, but usually something like 10 or 20 bins is enough.

- Works here because dataset is big
- Might become very noisy for smaller datasets

---
class: spacious

# Comparing Models

.center[![:scale 90%](images/calib_curve_models.png)]

???
Logistic regression is pretty well calibrated. The decision tree, since I didn't use any pruning, is way too certain: all its probability estimates, even on the test set, are either 0 or 1. That's not great. The random forest classifier used here is the opposite — not certain enough. We want to fix that, but before we talk about how to fix it, let's measure it first.

---
# Brier Score (for binary classification)

- “mean squared error of probability estimate”

`$$ BS = \frac{\sum_{i=1}^{n} (\widehat{p} (y_i)-y_i)^{2}}{n}$$`

.center[![:scale 70%](images/models_bscore.png)]

???
The calibration curve is a nice way to look at how calibrated a model is, but it's hard to compare models with it. A standard way to compare calibration is the Brier score; you could also use something like the log loss, which also tells you how good the probability estimates are. The Brier score is the mean squared error of the probability estimates: $y_i$ is either one or zero, and $\widehat{p}$ is the predicted probability. If you predict 0.5 you always incur a loss, but only a moderate one (0.25); if you predict 0 when you should have predicted 1, you get a much larger loss. Here are the Brier losses for the different models — smaller is better. This conflates accuracy and calibration a little bit: the random forest actually does best, probably simply because it's a more accurate model. It's not particularly well calibrated, but it is a much better predictor, so it still gets a smaller score. Now that we've defined calibration and can measure it, we actually want to calibrate the models.

---
# Fixing it: Calibrating a classifier

- Build another model, mapping classifier probabilities to better probabilities!
- 1d model! (or more for multi-class)

`$$ f_{calib}(s(x)) \approx p(y=1)$$`

- s(x) is score given by model, usually
- Can also work with models that don’t even provide probabilities!

Need model for $f_{calib}$, need to decide what data to train it on.
- Can train on training set → Overfit
- Can train using cross-validation → use data, slower

???
The way to fix this is similar to stacking, in that we build another model on top of the probability estimates. Usually that's just a 1d model, because we only have one underlying model: for binary classification it gives us a single probability (or score) output, and we need to learn a 1d function that maps this score to something more accurate. It also works if the model only gives us a score rather than a probability — an SVM only gives us a score, and we can still learn a 1d function that turns it into probabilities. As with stacking, you don't want to learn the calibration model on the training dataset, because on the training data the model looks too good. So you either need a hold-out dataset or you need to use cross-validation again.

---
# Platt Scaling

- Use a logistic sigmoid for $f_{calib}$
- Basically learning a 1d logistic regression
- (+ some tricks)
- Works well for SVMs

`$$f_{platt} = \frac{1}{1 + \exp(-ws(x) - b)}$$`

???
There are two main methods that people use, and both are in scikit-learn. One is Platt scaling, which is basically a 1d logistic regression on the model's score, plus a few tricks. It corrects the particular sigmoid-shaped distortion you usually get from SVM scores, so it works well for SVMs. But there is only one slope parameter here, so there's not a lot that you can tune.
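Conceptually, Platt scaling is just a 1d logistic regression fit on the classifier's scores. A minimal sketch of the idea — ignoring the extra tricks of the original method, and assuming a separate validation set `X_val`, `y_val` is available:

```python
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# train the base model on the training set
svm = LinearSVC().fit(X_train, y_train)

# fit a 1d logistic regression on validation scores -> learns the w and b of the sigmoid
platt = LogisticRegression()
platt.fit(svm.decision_function(X_val).reshape(-1, 1), y_val)

# calibrated probabilities for new data
proba_test = platt.predict_proba(svm.decision_function(X_test).reshape(-1, 1))[:, 1]
```

In practice you would use CalibratedClassifierCV (shown on the next slides) instead of rolling this by hand; the sketch is only meant to show that the sigmoid on this slide is literally a logistic regression in one feature.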
---
# Isotonic Regression

- Very flexible way to specify $f_{calib}$
- Learns arbitrary monotonically increasing step-functions in 1d.
- Groups data into constant parts, steps in between.
- Optimum monotone function on training data (wrt mse).

.center[![:scale 40%](images/isotonic_regression.png)]

???
The other one is isotonic regression, which is basically a non-parametric mapping. It fits the monotone function that minimizes the squared error: the problem it solves is finding the piecewise-constant, monotonically increasing function with minimum squared error. This is what it looks like — if the data are the red points, isotonic regression finds the green step function.

In Platt scaling, s(x) is the score that comes from the model, for example the SVM or the random forest, and w is the one parameter you learn from the data. So Platt scaling is a very inflexible model that only lets you say how steep the sigmoid correction should be, while isotonic regression is a very flexible model that lets you apply any correction, as long as it's monotonic.

---
class: spacious

# Building the model

- Using the training set is bad
- Either use hold-out set or cross-validation
- Cross-validation can be used to make unbiased probability predictions, use that as training set.

???

---
class: center

# Fitting the calibration model

![:scale 70%](images/calibration_val_scores.png)

---
class: center

# Fitting the calibration model

![:scale 70%](images/calibration_val_scores_fitted.png)

---
# CalibratedClassifierCV

.tiny-code[
```python
from sklearn.calibration import CalibratedClassifierCV
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, stratify=y_train, random_state=0)
rf = RandomForestClassifier().fit(X_train_sub, y_train_sub)
scores = rf.predict_proba(X_test)[:, 1]
plot_calibration_curve(y_test, scores, n_bins=20)
```
.center[![:scale 30%](images/random_forest.png)]
]

???

---
# Calibration on Random Forest

.smaller[
```python
cal_rf = CalibratedClassifierCV(rf, cv="prefit", method='sigmoid')
cal_rf.fit(X_val, y_val)
scores_sigm = cal_rf.predict_proba(X_test)[:, 1]

cal_rf_iso = CalibratedClassifierCV(rf, cv="prefit", method='isotonic')
cal_rf_iso.fit(X_val, y_val)
scores_iso = cal_rf_iso.predict_proba(X_test)[:, 1]
```]

.center[![:scale 90%](images/types_calib.png)]

???
This is for a random forest again, using CalibratedClassifierCV. It does the calibration either with a single hold-out dataset or with cross-validation. Here, I'm using a single hold-out set: I split the training set into X_train_sub and X_val. X_train_sub is what I use to train the random forest, and X_val is what I use to calibrate it. I create a CalibratedClassifierCV with the random forest that I've already fit on the training data, set cv="prefit" and the method to sigmoid, and then fit the calibration model on the validation set. Here is what this looks like: the random forest with no calibration, with sigmoid calibration, and with isotonic calibration. In this case, sigmoid and isotonic don't look very different — both look reasonably calibrated.
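To go beyond eyeballing the curves, a quick sanity check might be to compare Brier scores before and after calibration, reusing `y_test`, `scores`, `scores_sigm` and `scores_iso` from the slide above:

```python
from sklearn.metrics import brier_score_loss

# lower is better; calibrated scores should usually not be worse than the raw ones
for name, s in [("no calibration", scores),
                ("sigmoid", scores_sigm),
                ("isotonic", scores_iso)]:
    print(name, brier_score_loss(y_test, s))
```

Ranking metrics like ROC AUC are largely unaffected by a monotone calibration map, so a probability-based score such as the Brier score or log loss is the right thing to look at here.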
That said, if you use a single hold-out set, you're throwing away a bunch of data when training your first model.

---
# Cross-validated Calibration

```python
cal_rf_iso_cv = CalibratedClassifierCV(rf, method='isotonic')
cal_rf_iso_cv.fit(X_train, y_train)
scores_iso_cv = cal_rf_iso_cv.predict_proba(X_test)[:, 1]
```

.center[![:scale 90%](images/types_calib_cv.png)]

???
We can also not specify cv; then it uses cross-validation internally, I think 3-fold by default. And then I get an even better calibration. What it does is train a separate model for each cross-validation split, and when I predict on the test set, it uses all of these models and averages them. In a sense it's not very surprising that I get a better result: I built three random forest models on subsets of the data, calibrated each of them, and averaged the calibrated models — basically I built a bigger random forest, so it's not very surprising that it does better. We were also able to use all the data. If you do this, you keep as many models as there are cross-validation folds and use all of them for prediction.

- kinda cheating, we have more trees now lol
- we use all the data, get good probabilities. just time-consuming

---
# Multi-Class Calibration

.center[![:scale 100%](images/multi_class_calibration.png)]

???
You can also do multi-class calibration, which is basically the same thing, but you do it for each class individually and then renormalize — and you can make pretty pictures.

- per-class calibration
- renormalization

---
class: center, middle

## Recap on imbalanced data

???

---
class: spacious

# Two sources of imbalance

- Asymmetric cost
- Asymmetric data

???
In general, there are two ways in which a classification task can be imbalanced. The first is asymmetric costs: even if class 0 and class 1 are equally likely, the business cost, health cost, or whatever cost or benefit is associated with the different kinds of mistakes might be very different. The second is asymmetric data, meaning that one class is much more common than the other.

---
class: spacious

# Why do we care?

- Why should cost be symmetric?
- All data is imbalanced
- Detect rare events

???
One of these two is true in basically all real-world applications — usually both. There's no reason why a false positive and a false negative should have the same business cost; whether you do ad-click prediction or healthcare, the two kinds of mistakes are usually quite different and have quite different real-world consequences. Also, data is almost always imbalanced, and often drastically so — in particular in diagnosis, ad clicks or marketing. For ad clicks, I think something like 0.01% of ads are clicked on, depending on how good your targeting is. So very often we have very few positives, and this topic really concerns almost all of classification. Balanced classification with balanced costs is not something that happens a lot in the real world.
---
# Changing Thresholds

.tiny-code[
```python
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
classification_report(y_test, y_pred)
```
```
           precision  recall  f1-score  support

0               0.91    0.92      0.92       53
1               0.96    0.94      0.95       90

avg/total       0.94    0.94      0.94      143
```

```python
y_pred = lr.predict_proba(X_test)[:, 1] > .85
classification_report(y_test, y_pred)
```
```
           precision  recall  f1-score  support

0               0.84    1.00      0.91       53
1               1.00    0.89      0.94       90

avg/total       0.94    0.93      0.93      143
```
]

???
Apart from evaluation, we already talked about one way to change the outcome: changing the threshold on the predicted probability instead of only using the predicted class. Say I have a logistic regression model. I can use the predict method, which puts the cutoff at 0.5 probability for the positive class, and look at the classification report, which gives me precision and recall for both the positive and the negative class. But if I want to increase recall for class 0, or increase precision for class 1, I can decide to only predict class 1 where the estimated probability of class 1 is at least 0.85. Then only the points I'm very certain about are predicted as class 1. If you're given an actual cost function that says how much each kind of mistake costs, you can optimize this threshold.
FIXME new classification report!!

---
# Roc Curve

.center[
![:scale 85%](images/roc_svc_rf_curve.png)
]

???
We also looked at ROC curves, which consider all possible thresholds you could apply, either to a probabilistic prediction or to any other continuous uncertainty estimate.

---
class: center, middle

## Remedies for the model

???
Today I want to talk about how we can change the model itself, beyond just changing the threshold: how can we change the way we build the model so that it takes the asymmetric costs, or the asymmetric data, into account?

---
# Mammography Data

.smallest[
.left-column[
```python
from sklearn.datasets import fetch_openml
# mammography https://www.openml.org/d/310
data = fetch_openml('mammography', as_frame=True)
X, y = data.data, data.target
X.shape
```
(11183, 6)
```python
y.value_counts()
```
```
-1    10923
 1      260
```
]
.right-column[
.center[
![:scale 100%](images/mammography_data.png)
]
]
.reset-column[
```python
# make y boolean
# this allows sklearn to determine the positive class more easily
X_train, X_test, y_train, y_test = train_test_split(X, y == '1', random_state=0)
```
]
]

???
I'm using this mammography dataset, which is very imbalanced: many samples, only six features. The data is about mammography scans and whether there are calcium deposits in the breast, which are often mistaken for cancer — which is why it's useful to detect them. Since it's relatively low-dimensional, we can do a scatter plot, and we can see that the feature distributions are very skewed and there really is a lot more of one class than the other.
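Connecting this back to the threshold slide: if you actually had explicit costs for false positives and false negatives, you could pick the threshold that minimizes expected cost. A minimal sketch — the cost values are made up, and `lr` stands for any fitted probabilistic classifier with matching `X_test`, `y_test`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical costs: a missed positive is 10x as bad as a false alarm
COST_FP, COST_FN = 1.0, 10.0

probs = lr.predict_proba(X_test)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
costs = []
for t in thresholds:
    tn, fp, fn, tp = confusion_matrix(y_test, probs > t).ravel()
    costs.append(COST_FP * fp + COST_FN * fn)

print("best threshold:", thresholds[np.argmin(costs)])
```

Ideally the threshold would be chosen on a validation set, not the test set; this is only meant to show the mechanics.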
---
# Mammography Data

.smaller[
```python
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

scores = cross_validate(LogisticRegression(),
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
scores['test_roc_auc'].mean(), scores['test_average_precision'].mean()
```
0.920, 0.630

```python
from sklearn.ensemble import RandomForestClassifier

scores = cross_validate(RandomForestClassifier(),
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
scores['test_roc_auc'].mean(), scores['test_average_precision'].mean()
```
0.939, 0.722
]

???
As a baseline, here is the evaluation of logistic regression and a random forest on this dataset, using area under the ROC curve and average precision. I use the cross_validate function from sklearn.model_selection, which lets me specify multiple metrics, so I only need to train each model once but can look at several metrics. I do a 10-fold cross-validation on the training part of the data. The result is a dictionary with training and test scores for all the metrics I specified, so I can look at the mean test ROC AUC and the mean test average precision. Logistic regression gives a high AUC and a fairly low average precision. The second baseline, the random forest, gets a slightly higher AUC and a quite a bit higher average precision.

---
# Basic Approaches

.left-column[
.center[
![:scale 100%](images/basic_approaches.png)
]
]
.right-column[
Change the training procedure
]

???
Now we want to change these basic training methods to be better adapted to this imbalanced dataset. There are generally two approaches: change the data, or change the training procedure, i.e. how the model is built. The easier one is to change the data — we can add samples, remove samples, or both. Resampling inside a pipeline is not possible in scikit-learn itself because of API limitations.

---
class: center, spacious

# Scikit-learn vs resampling

![:scale 55%](images/pipeline.png)

???
The problem with pipelines as they are in scikit-learn right now: if you create a pipeline and call fit, it always uses the original y, and the output of a transformer is always a transformed X. We can't change y, so we can't resample the data within a pipeline.

- The transform method only transforms X
- Pipelines work by chaining transforms
- To resample the data, we need to also change y

---
class: spacious

# Imbalance-Learn

http://imbalanced-learn.org

```
pip install -U imbalanced-learn
```

Extends `sklearn` API

???
So we're going to use imbalanced-learn, which is an extension of the scikit-learn API that allows us to resample.

---
### Sampler

To resample a data set, each sampler implements:

```python
data_resampled, targets_resampled = obj.sample(data, targets)
```

Fitting and sampling can also be done in one step:

```python
data_resampled, targets_resampled = obj.fit_sample(data, targets)
```

--
In Pipelines: Sampling only done in `fit`!

???
Imbalanced-learn adds sampler objects to the scikit-learn interface. The sampler objects have a sample method which returns resampled data and resampled targets. There are also pipelines in imbalanced-learn. The important part is that the sampling is only done in fit, so only your training data is resampled. When you make predictions, you want them to be made on the whole, original test set — you don't want to mess with the test set in your evaluation. So resampling only happens when you're building the model.

- Imbalance-learn extends scikit-learn interface with a “sample” method.
- Imbalance-learn has a custom pipeline that allows resampling.
- Imbalance-learn: resampling is only performed during fitting
- Warning: not everything in imbalance-learn is multiclass!

---
# Random Undersampling

```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(replacement=False)
X_train_subsample, y_train_subsample = rus.fit_sample(
    X_train, y_train)
print(X_train.shape)
print(X_train_subsample.shape)
print(np.bincount(y_train_subsample))
```
```
(8387, 6)
(390, 6)
[195 195]
```

???
The easiest strategy is random undersampling. The default is to undersample the majority class so that it has the same size as the minority class; that's implemented in RandomUnderSampler. Here I instantiate the RandomUnderSampler with replacement=False, which means sampling without replacement, and call fit_sample on the training set. The original training set had 8387 samples; the subsampled dataset has only 390. You can see in the bincount that the dataset is now balanced: I randomly reduced the majority class to the same size as the minority class. The dataset is about 20 times smaller because it was so imbalanced, so building anything on it will be much, much faster. But keep in mind that we threw away about 98% of our data to do that.

- Drop data from the majority class randomly
- Often until balanced
- Very fast training (data shrinks to 2x minority)
- Loses data!

---
# Random Undersampling

.smaller[
```python
from imblearn.pipeline import make_pipeline as make_imb_pipeline

undersample_pipe = make_imb_pipeline(RandomUnderSampler(), LogisticRegressionCV())
scores = cross_validate(undersample_pipe,
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
scores['test_roc_auc'].mean(), scores['test_average_precision'].mean()
# baseline was 0.920, 0.630
```
```
0.927, 0.527
```
]

--

.smaller[
```python
undersample_pipe_rf = make_imb_pipeline(RandomUnderSampler(), RandomForestClassifier())
scores = cross_validate(undersample_pipe_rf,
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
scores['test_roc_auc'].mean(), scores['test_average_precision'].mean()
# baseline was 0.939, 0.722
```
```
0.951, 0.629
```
]

???
Looking at the results for logistic regression, ROC AUC actually improved a little while average precision decreased a little. Given that we threw away 98% of our data, the fact that ROC AUC improved at all is quite remarkable. So this is clearly still a reasonable model — by one measure even a slightly better one — even though we discarded most of the data. Simplistic as it is, it's a viable strategy that people often use in practice. In particular, if you have a very big dataset, you might not have enough compute to do anything on the whole data, or only something simple — but after undersampling, you can maybe train a much more complicated model. By default the random under-sampler balances the classes completely, but of course you can be less extreme, so that you throw away a bit less data.
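As a sketch of the "less extreme" option: in recent imbalanced-learn versions the `sampling_strategy` parameter controls the target ratio (the parameter and method names have changed between versions, so check the one you have):

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# sampling_strategy=0.5 asks for a minority:majority ratio of 1:2 after resampling
rus = RandomUnderSampler(sampling_strategy=0.5)
# fit_resample is the newer name for fit_sample
X_sub, y_sub = rus.fit_resample(X_train, y_train)
print(np.bincount(y_sub))
```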
We can do the same thing with a random forest. Here the area under the ROC curve actually went up substantially — again quite surprising given that we used much less data — but the average precision went down quite a bit.

- As accurate with fraction of samples!
- Really good for large datasets

---
# Random Oversampling

```python
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()
X_train_oversample, y_train_oversample = ros.fit_sample(
    X_train, y_train)
print(X_train.shape)
print(X_train_oversample.shape)
print(np.bincount(y_train_oversample))
```
```
(8387, 6)
(16384, 6)
[8192 8192]
```

???
- Repeat samples from the minority class randomly
- Often until balanced
- Much slower (dataset grows to 2x majority)

---
# Random Oversampling

.smaller[
```python
oversample_pipe = make_imb_pipeline(RandomOverSampler(), LogisticRegression())
scores = cross_validate(oversample_pipe,
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
scores['test_roc_auc'].mean(), scores['test_average_precision'].mean()
# baseline was 0.920, 0.630
```
0.917, 0.585
]

--

.smaller[
```python
oversample_pipe_rf = make_imb_pipeline(RandomOverSampler(), RandomForestClassifier())
scores = cross_validate(oversample_pipe_rf,
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
scores['test_roc_auc'].mean(), scores['test_average_precision'].mean()
# baseline was 0.939, 0.722
```
0.926, 0.715
]

???
The complement of random undersampling is random oversampling: we resample the training dataset so that the minority class has the same number of samples as the majority class. Given how imbalanced this dataset was, we nearly doubled the size of the training set, so everything will now be quite a bit slower — we have many more samples, most of them copies of the minority class. We sample with replacement: here about 8000 draws from a pool of 195 minority samples.

Q: Does that mean there are repeated records? Yes, most of them are repeated, roughly 40 times on average.

Q: What distribution are we sampling from? We're not sampling i.i.d. from the original distribution, because the original distribution is strongly imbalanced. Instead we effectively sample the label first, each class equally often, and then sample a point from that class. It is a bit odd because we end up with many copies of the same samples.

Logreg the same, Random Forest much worse than undersampling (about same as doing nothing)

---
# Curves for LogReg

.center[
![:scale 100%](images/curves_logreg.png)
]

???
With oversampling and logistic regression, the area under the ROC curve is actually a bit lower, while the average precision is higher than with undersampling but slightly lower than on the original dataset.
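Figures like these can be produced roughly as follows — a sketch, assuming each pipeline has been fit on the training data first, and where `rf_original` stands for a hypothetical random forest fit on the unmodified training set:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve

# `variants` maps a label to a fitted model or pipeline (names here are illustrative)
variants = {'original': rf_original,
            'undersampled': undersample_pipe_rf,
            'oversampled': oversample_pipe_rf}

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
for name, model in variants.items():
    probs = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, probs)
    precision, recall, _ = precision_recall_curve(y_test, probs)
    ax_roc.plot(fpr, tpr, label=name)
    ax_pr.plot(recall, precision, label=name)
ax_roc.set(xlabel="FPR", ylabel="TPR (recall)")
ax_pr.set(xlabel="recall", ylabel="precision")
ax_roc.legend()
```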
---
# Curves for Random Forest

.center[
![:scale 100%](images/curves_rf.png)
]

???
We can also look at the curves. On the left we have the ROC curves for the original, the oversampled and the undersampled dataset; on the right-hand side we have the precision-recall curves — here for the random forest, and on the previous slide for logistic regression. If you look at the two panels from afar, they give you opposite impressions: on the ROC curve, undersampling seems to do best and oversampling worst, while on the precision-recall curve undersampling is clearly the worst and the original dataset does best. The TPR axis is the same as the recall axis, but precision and false positive rate measure quite different things. The idea is that if we have a cost function — the area under one of these curves, a particular recall or precision value, or an explicit cost matrix — we want to optimize it, and we hope that by taking the class imbalance into account we can do better than just using the data as is.

---
class: spacious

# Class-weights

- Instead of repeating samples, re-weight the loss function.
- Works for most models!
- Same effect as over-sampling (though not random), but not as expensive (dataset size the same).

???
One way to make resampling more efficient is to use class weights instead of actually resampling. We change the loss function to have the same effect as resampling, but unlike undersampling we don't throw away any data, and unlike oversampling we don't make the computational problem harder by repeating samples. This works for most models and is pretty simple to do in scikit-learn. In effect it's the same as oversampling, just without growing the dataset.

---
class: spacious

# Class-weights in linear models

`$$\min_{w \in \mathbb{R}^{p}, b \in \mathbb{R}} C \sum_{i=1}^n \log(\exp(-y_i(w^T \textbf{x}_i + b )) + 1) + ||w||_2^2$$`

`$$\min_{w \in \mathbb{R}^{p}, b \in \mathbb{R}} C \sum_{i=1}^n c_{y_i} \log(\exp(-y_i(w^T \textbf{x}_i + b )) + 1) + ||w||_2^2$$`

Similar for linear and non-linear SVM

???
For a linear model, say logistic regression, instead of minimizing the first objective we give each class a weight $c_{y_i}$, so the loss of each sample gets multiplied by the weight of its class. This is similar to having a different penalty C for each class. You can see that this is the same as repeating each sample $x_i$ of a class $c_{y_i}$ many times: if I set the weight of one class to 2, that's the same as repeating each sample of that class twice, only now I don't actually have to duplicate any samples. So this is cheaper than oversampling but has the same effect. You can do the same for the hinge loss and SVMs by just changing the loss.

---
# Class weights in trees

Gini Index:

`$$H_\text{gini}(X_m) = \sum_{k\in\mathcal{Y}} p_{mk} (1 - p_{mk})$$`

`$$H_\text{gini}(X_m) = \sum_{k\in\mathcal{Y}} c_k p_{mk} (1 - p_{mk})$$`

Prediction: Weighted vote

???
For trees and all tree-based models, you can change the splitting criterion instead. For example, with the Gini index (or the cross-entropy), the contribution of each class gets multiplied by its class weight $c_k$. Again, this is the same as replicating the data points of class k $c_k$ many times. For predictions, you do a weighted vote.
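To make the "class weights are like repeating samples" point concrete, here is a minimal sketch on a toy dataset; the two fits should agree up to solver tolerance:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)

# weight class 1 twice as much via class_weight ...
lr_weighted = LogisticRegression(class_weight={0: 1, 1: 2}).fit(X, y)

# ... versus literally duplicating every class-1 sample
X_dup = np.vstack([X, X[y == 1]])
y_dup = np.hstack([y, y[y == 1]])
lr_dup = LogisticRegression().fit(X_dup, y_dup)

# the coefficient difference should be tiny
print(np.abs(lr_weighted.coef_ - lr_dup.coef_).max())
```

The `class_weight='balanced'` setting on the next slide does the same thing with weights n_samples / (n_classes * np.bincount(y)).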
---
# Using Class-Weights

.smaller[
```python
scores = cross_validate(LogisticRegression(class_weight='balanced'),
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
scores['test_roc_auc'].mean(), scores['test_average_precision'].mean()
# baseline was 0.920, 0.630
```
0.918, 0.587

```python
scores = cross_validate(RandomForestClassifier(n_estimators=100, class_weight='balanced'),
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
scores['test_roc_auc'].mean(), scores['test_average_precision'].mean()
# baseline was 0.939, 0.722
```
0.917, 0.701
]

???
In scikit-learn, classifiers have a class_weight parameter. You can assign arbitrary weights to the individual classes, and the 'balanced' setting reweights the classes so that they all contribute the same total weight, like oversampling until balanced. You can see this has a somewhat similar effect to oversampling, at roughly half the computational price. So it's a pretty simple way to push the model toward the positive class.

---
class: spacious

# Ensemble Resampling

- Random resampling separate for each instance in an ensemble!
- Chen, Liaw, Breiman: “Using random forest to learn imbalanced data.”
- Paper: “Exploratory Undersampling for Class Imbalance Learning”
- Not in sklearn (yet)
- Easy with imblearn

???
There's something a little better, called easy ensembles, or resampling within an ensemble. The idea is that you build an ensemble like a bagging classifier, but instead of taking a bootstrap sample, you do a random undersampling to a balanced dataset separately for each classifier in the ensemble. Right now you can only do this with imbalanced-learn.

---
# Easy Ensemble with imblearn

.smaller[
```python
from sklearn.tree import DecisionTreeClassifier
from imblearn.ensemble import BalancedBaggingClassifier
# from imblearn.ensemble import BalancedRandomForestClassifier
# resampled_rf = BalancedRandomForestClassifier()

tree = DecisionTreeClassifier(max_features='auto')
resampled_rf = BalancedBaggingClassifier(base_estimator=tree, random_state=0)
scores = cross_validate(resampled_rf,
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
scores['test_roc_auc'].mean(), scores['test_average_precision'].mean()
# baseline was 0.939, 0.722
```
0.957, 0.654
]

???
Here I build a balanced bagging ensemble, using a decision tree with max_features='auto' as the base classifier — for classification that's the square root of the number of features. Each estimator is trained on a separate random undersample of the dataset: the data is undersampled once per tree, and a tree is built on each undersampled set. This is exactly as expensive as building a random forest on a single undersampled dataset, which is cheap, because undersampling throws away most of the data. But in effect we're not throwing away nearly as much: we keep a lot of the data, just spread across the different trees of the ensemble. Of all the models we've looked at today, this is the best one in terms of area under the ROC curve. It's cheap to do, it lets you use a lot of your data, and it has been shown to be pretty competitive in a number of benchmarks.

- As cheap as undersampling, but much better results than anything else!
- Didn't do anything for Logistic Regression.
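To see what BalancedBaggingClassifier does conceptually, here is a rough, simplified sketch of the same idea by hand — undersample separately for each tree and average the trees' predicted probabilities (this is the concept, not the library's actual implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import RandomUnderSampler

trees = []
for seed in range(10):
    # a fresh balanced undersample for every tree (fit_resample = newer fit_sample)
    X_sub, y_sub = RandomUnderSampler(random_state=seed).fit_resample(X_train, y_train)
    trees.append(DecisionTreeClassifier(max_features='sqrt',
                                        random_state=seed).fit(X_sub, y_sub))

# average the per-tree probabilities, like a (balanced) random forest would
probs = np.mean([t.predict_proba(X_test)[:, 1] for t in trees], axis=0)
```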
---
class: center, middle

![:scale 100%](images/roc_vs_pr.png)

???
Looking at the curves again, now including the easy ensemble: in the higher-recall region it does quite well, and in the high-precision region it does better than plain undersampling. Remember that this is much, much cheaper than oversampling, so anywhere in this region it seems like a pretty decent solution.

To explain the difference between the easy ensemble and plain undersampling: with undersampling, I randomly undersample the dataset once, get a balanced dataset of 195 vs 195, and build one model on it — here a random forest. With the easy ensemble, for each tree in the forest I do a separate undersampling to 195 vs 195. All trees see the same minority samples, but each sees a different subset of the majority samples, so in total I'm using many more than 195 majority samples. That makes it quite a bit better. These are all the random sampling methods I wanted to talk about, and you can see they make quite a big difference: depending on where you want to be on the precision-recall curve, you would choose quite different methods.

---
class: center, middle

## Synthetic Sample Generation

???
There are many methods, but the only one your interviewer will expect you to know is SMOTE.

---
# Synthetic Minority Oversampling Technique (SMOTE)

- Adds synthetic interpolated data to smaller class
- For each sample in minority class:
  – Pick random neighbor from k neighbors (within the same class).
  – Pick point on line connecting the two uniformly (or within rectangle)
  – Repeat.

???
I don't think it's actually used as commonly in practice as random oversampling and random undersampling — undersampling in particular is great because it makes everything much faster. The idea is to add synthetic samples that look like the smaller class and this way hopefully bias the classifier more towards the smaller class. It's based on nearest neighbors, so if you have a very big dataset or very high-dimensional data it might be slow, and in very high dimensions it might not even work that well. What it does: for each sample of the minority class, pick a random neighbor among its k nearest neighbors from the same class, then pick a point uniformly on the line connecting the two.

- Leads to very large datasets (oversampling)
- Can be combined with undersampling strategies

---
class: center, middle

.center[
![:scale 100%](images/smote_mammography.png)
]

???
These are features three and four of the mammography dataset. Here k is three, so for each minority point a few neighbors were picked, and points were then sampled repeatedly on the lines between them. You basically get something that connects the minority data points. This might make more sense than just repeating the same samples over and over again, but in high dimensions it's not entirely clear how well this will work.
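A minimal sketch of the core interpolation step, just to make the algorithm concrete (a simplification of SMOTE, not the imbalanced-learn implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_min, n_new, k=5, random_state=0):
    """Generate n_new synthetic points by interpolating between minority samples."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: each point is its own neighbor
    _, neighbors = nn.kneighbors(X_min)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))            # random minority sample
        j = rng.choice(neighbors[i][1:])        # one of its k minority-class neighbors
        lam = rng.random()                      # uniform point on the connecting segment
        new_points.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new_points)
```

In practice you would just use `imblearn.over_sampling.SMOTE`, as on the next slide.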
---
.smaller[
```python
smote_pipe = make_imb_pipeline(SMOTE(), LogisticRegression())
scores = cross_validate(smote_pipe,
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
pd.DataFrame(scores)[['test_roc_auc', 'test_average_precision']].mean()
# baseline was 0.920, 0.630
```
0.919, 0.585

```python
smote_pipe_rf = make_imb_pipeline(SMOTE(), RandomForestClassifier())
scores = cross_validate(smote_pipe_rf,
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
pd.DataFrame(scores)[['test_roc_auc', 'test_average_precision']].mean()
# baseline was 0.939, 0.722
```
0.946, 0.688
]

???
The results are pretty similar to the original dataset and to random oversampling. Tuning the number of nearest neighbors, I found 11 to work best here.

---
.smaller[
```python
param_grid = {'smote__k_neighbors': [3, 5, 7, 9, 11, 15, 31]}
search = GridSearchCV(smote_pipe_rf, param_grid, cv=10,
                      scoring="average_precision")
search.fit(X_train, y_train)
results = pd.DataFrame(search.cv_results_)
results.plot("param_smote__k_neighbors",
             ["mean_test_score", "mean_train_score"])
```
]

.center[
![:scale 60%](images/param_smote_k_neighbors.png)
]

???
In the scatter plots it looks like there are more yellow points than purple points; that's because I plotted the purple points first and the yellow points on top — otherwise you wouldn't see any yellow points among the original data. And again, if you look at the metrics, tuning k doesn't really make a big difference.

---
.center[
![:scale 100%](images/smote_k_neighbors.png)
]

???

---
class: center, middle

.center[
![:scale 100%](images/roc_vs_pr_smote.png)
]

???

---
class: spacious

# Summary

- Always check roc_auc and average_precision, look at curves
- Undersampling is very fast and can help!
- Undersampling + Ensembles is very powerful!
- Can add synthetic samples with SMOTE

???
Always check ROC AUC and average precision, and look at the curves: the different strategies behave quite differently in different parts of the curves. Undersampling is really fast and can sometimes even make things better, so I would always try it — even if the model ends up slightly worse, a 100x speedup can be well worth it. Undersampling combined with easy ensembles is as fast as undersampling but often gives better results. People in machine learning research like balanced datasets, but in the real world datasets are almost never balanced — and unfortunately, most of the datasets we have come from machine learning researchers, so there aren't many interesting, realistically imbalanced ones. One thing I didn't cover is that you can also directly optimize a particular metric: train a model to optimize precision at k, or area under the curve, or something like that. At least for linear models that's relatively straightforward; I'm not sure how well it works for trees. But I think the most common approach is just reweighting the samples.