class: center, middle

### W4995 Applied Machine Learning

# Linear Models for Regression

01/31/18

Andreas C. Müller

???
So today we'll talk about linear models for regression. You're probably familiar with some of the basics, and that's where we'll start. But we'll also look into how to tune these models, what the trade-offs are, and some more complex techniques.

TODO: SVR to robust models? Move path algorithms into this lecture instead of robust regression? Add more preprocessing and scaling to this lecture? dimensionality -> model complexity?

---
class: center

# Linear Models for Regression

![:scale 50%](images/linear_regression_1d.png)

$$\hat{y} = w^T \mathbf{x} + b = \sum_{i=1}^p w_i x_i + b$$

???
Predictions in all linear models for regression are of the form shown here: an inner product of the features with some coefficient or weight vector w, plus some bias or intercept b. In other words, the output is a weighted sum of the inputs, possibly with a shift. Here i runs over the features, and x_i is one feature of the data point x.
These models are called linear models because they are linear in the parameters w. The way I wrote it down here, they are also linear in the features x_i. However, you can replace the features by any non-linear function of the inputs, and it will still be a linear model.
There are many different linear models for regression, and they all share this formula for making predictions. The difference between them is in how they find w and b based on the training data.

---
# Ordinary Least Squares

$$\hat{y} = w^T \mathbf{x} + b = \sum_{i=1}^p w_i x_i + b$$

`$$\min_{w \in \mathbb{R}^p} \sum_{i=1}^n ||w^T\mathbf{x}_i - y_i||^2$$`

Unique solution if $\mathbf{X} = (\mathbf{x}_1, ... \mathbf{x}_n)^T$ has full column rank.

???
The most straightforward solution, which goes back to Gauss, is ordinary least squares. In ordinary least squares we find w and b such that the predictions on the training set are as accurate as possible according to the squared error. That intuitively makes sense: we want the predictions to be good on the training set.
If there are more samples than features (and the samples span the whole feature space), then there is a unique solution. The problem is what's called a least squares problem, which is particularly easy to optimize and has a unique solution.
However, if there are more features than samples, there are usually many perfect solutions that lead to 0 error on the training set, and then it's not clear which solution to pick. Even if there are more samples than features, if there are strong correlations among the features the results might be unstable, and we'll see some examples of that soon.
Before we look at examples, I want to introduce a popular alternative.

---
# Ridge Regression

`$$ \min_{w \in \mathbb{R}^p} \sum_{i=1}^n ||w^T\mathbf{x}_i - y_i||^2 + \alpha ||w||^2 $$`

Always has a unique solution.

Tuning parameter alpha.

???
In ridge regression we add another term to the optimization problem. Not only do we want to fit the training data well, we also want w to have a small squared L2 norm, or squared Euclidean norm. The idea here is that we're decreasing the "slope" along each feature by pushing the coefficients towards zero. This constrains the model to be simpler.
So there are two terms in this optimization problem, which is also called the objective function of the model: the data fitting term, which wants the predictions to be close to the training targets according to the squared error, and the penalty or regularization term, which wants w to have a small norm and doesn't depend on the data.
Usually these two goals are somewhat opposing. If we made w zero, the second term would be zero, but the predictions would be bad. So we need to trade off between the two. That trade-off is problem specific and is specified by the user. If we set alpha to zero, we get linear regression; if we set alpha to infinity, we get a constant model. Usually we want something in between.
This is a very typical example of a general principle in machine learning, called regularized empirical risk minimization.
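As a side note, ridge regression also has a simple closed-form solution. Here is a minimal numpy sketch of it (ignoring the intercept, with made-up toy data; this is not how scikit-learn actually solves it) that shows how increasing alpha shrinks the coefficients:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))                      # toy data: 50 samples, 3 features
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=50)

def ridge_closed_form(X, y, alpha):
    # minimizer of ||Xw - y||^2 + alpha ||w||^2 is w = (X^T X + alpha I)^-1 X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

for alpha in [0, 1, 10, 100]:
    print(alpha, ridge_closed_form(X, y, alpha).round(3))
# the coefficients shrink towards zero as alpha grows; alpha=0 recovers least squares
```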
---
# (regularized) Empirical Risk Minimization

`$$ \min_{f \in F} \sum_{i=1}^n L(f(\mathbf{x}_i), y_i) + \alpha R(f) $$`

???
FIXME pointers data fitting / regularization!
Many models in machine learning, like linear models, SVMs and neural networks, follow the general framework of empirical risk minimization, which you can see here. We formulate the machine learning problem as an optimization problem over a family of functions. In our case that was the family of linear functions parametrized by w and b.
The minimization problem consists of two parts: the data fitting part and the model complexity part. The data fitting part says that the predictions made by our function should be accurate according to some loss L; for our regression problems that was the squared loss. The model complexity part says that we prefer simple models, and it penalizes complicated f.
Most machine learning algorithms can be cast into this framework, with a particular choice of family of functions F, loss function L and regularizer R. And most of machine learning theory is built around this framework: people prove, for different choices of F, L and R, that if you minimize this objective, you'll be able to generalize well.
And that makes intuitive sense. To do well on the test set, we definitely want to do reasonably well on the training set; we don't expect to do better on a test set than on the training set. But we also want to minimize the performance difference between training and test set. If we restrict our model to be simple via the regularizer R, we have a better chance of the model generalizing.

---
# Reminder on model complexity

![:scale 80%](images/overfitting_underfitting_cartoon_full.png)

???
I hope this sounds familiar from what we talked about last time. This is a particular way of dealing with overfitting and underfitting. For this framework in general, and for ridge regression in particular, trading off the data fitting against the regularization changes the model complexity. If we set alpha high, we restrict the model and we are on the left side of the graph. If we make alpha small, we allow the model to fit the data more and we are on the right side of the graph.

---
# Boston Housing Dataset

![:scale 70%](images/boston_housing_scatter.png)

.smaller[
```python
print(X.shape)
print(y.shape)
```
```
(506, 13)
(506,)
```
]

???
Okay, after all this pretty abstract talk, let's make it concrete and do some regression on the Boston housing dataset. After the last homework you're hopefully familiar with it. The idea is to predict prices of property in different neighborhoods of the Boston area. This is a dataset from the 70s, I think, so everything is pretty cheap.
Most of the features are continuous, with the exception of the Charles River variable, which says whether the neighborhood borders the river. Keep in mind that this data lives in a 13-dimensional space and that these univariate plots only show 13 different projections of the data; they can't capture any of the interactions. But we can still see that the price clearly depends on some of these variables, and it's also pretty clear that the dependency is non-linear for some of them.
We'll still start with a linear model, because it's a very simple class of models, and I'd always start approaching a problem from the simplest baseline; in this case that's linear regression. We have 506 samples and 13 features, so many more samples than features, and linear regression should work just fine. It's also a tiny dataset, so basically anything we try will run instantaneously, which is good to keep in mind.
Another thing you can see in this graph is that the features have very different scales. Here's a box plot that shows that even more clearly.
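A box plot of the feature scales, like the one on the next slide, could be produced roughly as follows. This is only a sketch: it assumes the data comes from sklearn's load_boston, which was the standard loader at the time (it has since been removed from scikit-learn), and the styling of the original figure may differ.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston   # available in scikit-learn versions of that era

boston = load_boston()
X, y = boston.data, boston.target

plt.boxplot(X, labels=boston.feature_names)   # one box per feature
plt.xticks(rotation=90)
plt.yscale("symlog")                          # log-ish scale makes the scale differences visible
plt.ylabel("feature value")
plt.show()
```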
---
![:scale 80%](images/boston_scaling.png)

???
That's something that will trip up the distance-based models we talked about last time, as well as the linear models we're talking about today. For the penalized models, the different scales mean that different features are penalized differently, which you usually want to avoid. Usually there are no particular semantics attached to the fact that one feature has a different magnitude than another: we could measure something in inches instead of miles, and that would change the outcome of the model. That's certainly not something we want, so a good idea is to scale the data to get rid of this effect. We'll talk about that and other preprocessing methods in depth on Wednesday next week; today I'm mostly going to ignore it.
But let's get started with linear regression.

---
class: middle, spacious

```python
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0)
np.mean(cross_val_score(LinearRegression(), X_train, y_train, cv=10))
```
```
0.717
```

```python
np.mean(cross_val_score(Ridge(), X_train, y_train, cv=10))
```
```
0.715
```

???
Let's look at two simple models: linear regression and ridge regression. I've split the data into a training and a test set and used 10-fold cross-validation on the training set to evaluate them. Here I use cross_val_score together with the model, the training data, the training labels, and 10-fold cross-validation. This returns 10 scores, and I compute their mean. I do this for both linear regression and ridge regression; Ridge here uses the default value of alpha, which is 1.
The two scores are quite similar.

---
# Coefficient of determination R^2

`$$ R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n - 1} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{n - 1} (y_i - \bar{y})^2} $$`

`$$ \bar{y} = \frac{1}{n} \sum_{i=0}^{n - 1} y_i$$`

Can be negative for biased estimators - or the test set!

???
The scores are R squared, the coefficient of determination. This is a score that's usually between zero and one, where one means perfect prediction (or perfect correlation) and zero means you do no better than always predicting the mean of the targets. It computes the mean of the targets over the data you're evaluating on, and then compares the squared error of your predictions to the squared error of that constant mean prediction.
If the score is negative, it means you do a worse job than just predicting the mean. That can happen if your model is really bad and biased, or simply because you evaluate on a test set: R squared is only guaranteed to be non-negative on the data the model was fit on, and only for an unbiased linear model, which will nearly never apply to what we're doing.
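To make the definition concrete, here is a tiny sketch (with made-up numbers) that computes R squared by hand and checks it against scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.8, 3.0, 4.2, 5.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the predictions
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of predicting the mean
print(1 - ss_res / ss_tot)       # R^2 computed by hand
print(r2_score(y_true, y_pred))  # same value from scikit-learn
```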
---
# Scaling (if you want)

```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

ridge = Ridge().fit(X_train_scaled, y_train)
X_test_scaled = scaler.transform(X_test)
ridge.score(X_test_scaled, y_test)
```

???
If you want to scale the data, you can use StandardScaler for that. It makes the mean zero and the standard deviation one. It's an unsupervised model, so it only needs X_train to fit, and fitting just means computing the mean and standard deviation. Then we can scale the data using the transform method. Make sure you fit the scaler only on the training set and then transform both the training and the test set. If you want to use scaling inside cross-validation, it's a little more tricky; as I said, more details next Wednesday. I'm just mentioning this here because it might come in handy for the homework.
Okay, but let's come back to the Ridge model. Above we just used the default setting for the parameter alpha, which is one. That's a reasonable default, but there is no reason why it should give us the optimal generalization performance on this problem, so it's a good idea to adjust the parameter. As we saw on Monday, we can easily do that with GridSearchCV.

---
.smallest[
```python
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': np.logspace(-3, 3, 13)}
print(param_grid)
```
```
{'alpha': array([   0.001,    0.003,    0.01 ,    0.032,    0.1  ,    0.316,
          1.   ,    3.162,   10.   ,   31.623,  100.   ,  316.228, 1000.   ])}
```

```python
grid = GridSearchCV(Ridge(), param_grid, cv=10)
grid.fit(X_train, y_train)
```
]

.center[
![:scale 60%](images/ridge_alpha_search.png)
]

???
Coming back to ridge regression: we used the standard alpha of one, which is a reasonable default but by no means guaranteed to make sense for this particular problem. So here I've done a grid search. As we talked about on Monday, I define a parameter grid where the key is the parameter I want to search (alpha in Ridge) and the values are the settings I want to try. For regularization parameters like alpha, it's usually good to search on a logarithmic grid. I use a relatively fine grid with 13 points here, mostly because I wanted a nice plot; in practice I would use maybe three or six points. Then I instantiate GridSearchCV with Ridge(), the parameter grid and 10-fold cross-validation, and call grid.fit.
The plot shows the mean training score and mean validation score over the 10 cross-validation folds for each parameter setting. You can see a couple of things here: a) there is a lot of uncertainty; b) the training score is always better than the validation score; and c), most importantly, regularization didn't help on this dataset - making alpha as small as possible is best.
What I'm going to do next is modify this dataset a little bit so that we can see the effect of the regularization. I'm going to modify it using a polynomial feature expansion; we'll talk about that a bit more on Wednesday.
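A plot like the one above can be produced roughly like this. This is a sketch, not the code behind the original figure, and it assumes GridSearchCV was run with return_train_score=True so that the training scores are available in cv_results_.

```python
import matplotlib.pyplot as plt
import pandas as pd

res = pd.DataFrame(grid.cv_results_)        # one row per value of alpha
alphas = res['param_alpha'].astype(float)

plt.semilogx(alphas, res['mean_train_score'], label="train")
plt.semilogx(alphas, res['mean_test_score'], label="validation")
plt.fill_between(alphas,
                 res['mean_test_score'] - res['std_test_score'],
                 res['mean_test_score'] + res['std_test_score'], alpha=.2)
plt.xlabel("alpha")
plt.ylabel("mean cross-validation R^2")
plt.legend()
plt.show()
```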
---
# Adding features

```python
from sklearn.preprocessing import PolynomialFeatures, scale

poly = PolynomialFeatures(include_bias=False)
X_poly = poly.fit_transform(scale(X))
print(X_poly.shape)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y)
```
```
(506, 104)
```

```python
np.mean(cross_val_score(LinearRegression(), X_train, y_train, cv=10))
```
```
0.74
```
```python
np.mean(cross_val_score(Ridge(), X_train, y_train, cv=10))
```
```
0.76
```

???
Basically, I'm adding a whole bunch of new features: the interactions between all pairs of features and the square of each feature. This allows me to learn not only a linear function but any polynomial of degree 2 of the original features, which gives me 104 features. Now there are many more features relative to the training set size - still fewer features than data points, but unlike before they are of the same order of magnitude.
This is a problem that's much harder for linear regression without regularization. Keep in mind that the power of a linear model depends a lot on the dimensionality of the input space. Previously I had 13 degrees of freedom and now I have 104, so I can learn a much more complex model. You can also think about this in terms of a 2D problem: a linear line looks very restrictive, but in a very high-dimensional space, if you have more features than samples, you can fit any training set perfectly, so it's not restrictive at all. How powerful a linear model is depends a lot on the dimensionality of the feature space.

---
.center[
![:scale 60%](images/ridge_alpha_search_poly.png)
]

.smaller[
```python
print(grid.best_params_)
print(grid.best_score_)
```
```
{'alpha': 31.6}
0.83
```
]

???
Let's look at what happens if we do the grid search now. You can see a clear trend; hopefully it looks a lot like the cartoon curve from before. If we increase alpha, the model gets less complex and the training performance goes down. The validation score goes up for a while as we increase regularization, and if we make it too strong, it goes down again. You can also see that the standard deviation is much higher with less regularization than with more regularization: the more regularized model is more robust and generalizes better.
Using regularization, I found the best alpha and the best R squared, which actually improved quite a bit. The value of alpha itself doesn't really mean anything - it depends on how you scaled your data - so don't read too much into it.

---
# Plotting coefficient values (LR)

```python
lr = LinearRegression().fit(X_train, y_train)
plt.scatter(range(X_poly.shape[1]),
            lr.coef_, c=np.sign(lr.coef_), cmap="bwr_r")
```

.center[
![:scale 55%](images/lr_coefficients_large.png)
]

???
On the previous slide, linear regression did nearly as well as ridge regression with default parameters, but let's look at the coefficients of the linear model trained on the polynomial features. I plot them with the color indicating positive versus negative sign. The magnitude on the axis is 1 followed by 13 zeros: there are two features, one with a coefficient in the trillions and another one that is just as extremely negative. What I conclude from this is that these two features are very highly correlated: the model makes both of them really, really big, and then they cancel each other out.
This is not a very nice model, because it points to numerical instability, and it also suggests that these two features are extremely important when they might not be important at all. The other features might matter more, but they are drowned out by this cancellation effect. The two huge coefficients have to cancel each other out almost exactly, because the predictions are still reasonable and the houses only cost on the order of tens of thousands of dollars.
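The cancellation effect is easy to reproduce on a made-up toy example with two nearly identical features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
x = rng.normal(size=(100, 1))
X_corr = np.hstack([x, x + 1e-6 * rng.normal(size=(100, 1))])  # two almost identical columns
y_toy = x.ravel() + rng.normal(scale=0.1, size=100)

print(LinearRegression().fit(X_corr, y_toy).coef_)  # typically huge coefficients of opposite sign
print(Ridge(alpha=1).fit(X_corr, y_toy).coef_)      # small coefficients of similar size
```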
---
# Ridge Coefficients

```python
ridge = grid.best_estimator_
plt.scatter(range(X_poly.shape[1]),
            ridge.coef_, c=np.sign(ridge.coef_), cmap="bwr_r")
```

.center[
![:scale 60%](images/ridge_coefficients.png)
]

???
Now let's look at the Ridge model. This is the best estimator, i.e. the model found in the grid search with the best parameter setting. This looks much more reasonable: the feature that was very negative before is still very negative, but now the values are around three and minus three, which is a much more reasonable range.
We can also look at the coefficients for different values of alpha; the next slide shows Ridge with three different values of alpha. With the previous random seed, alpha=14 was the best; now the best value is around thirty. The middle value is more or less the best setting, and then there is a smaller and a bigger one. You can see that what alpha does, on average, is push all the coefficients towards zero: this coefficient shrinks towards zero going from alpha=1 to alpha=14, and the same happens over here. If you look at this long enough, you can also see something interesting: there are coefficients that are positive with alpha=1 and negative with alpha=100. That means that depending on how much you regularize, the direction of the effect flips, which tells me: don't over-interpret your models. Clearly a feature either has a positive or a negative effect - it can't have both.

---
```python
ridge100 = Ridge(alpha=100).fit(X_train, y_train)
ridge1 = Ridge(alpha=1).fit(X_train, y_train)
plt.figure(figsize=(8, 4))
plt.plot(ridge1.coef_, 'o', label="alpha=1")
plt.plot(ridge.coef_, 'o', label="alpha=14")
plt.plot(ridge100.coef_, 'o', label="alpha=100")
plt.legend()
```
.center[
![:scale 60%](images/ridge_coefficients_alpha.png)
]

???
Another way to visualize the coefficients is the coefficient path, or regularization path, shown on the next slide. On the x-axis is alpha, and on the y-axis the coefficient magnitude. Basically, I looped over all the different alphas, and you can see how the coefficients shrink towards zero as alpha increases. Some very big coefficients go to zero very quickly, and some coefficients stay roughly the same for a long time.

---
# Coefficient Paths

![:scale 90%](images/ridge_coefficient_paths.png)

???
Another interesting way to see the effect of regularization is to look at learning curves, shown on the next slide. A learning curve takes subsets of the training set of increasing size and shows how the model performs on each of them. Here I hold the number of features constant and look at what happens as I vary the amount of training data, again for different strengths of regularization: alpha equal to 1, 14 and 100, and plain linear regression, which is basically alpha equal to zero. You can see that with alpha=100 you get a pretty reasonable score for small training sets, and alpha=14 is even better there. If you use the whole dataset, then plain linear regression does well.
If you had only a smaller subset of the data, regularization would help.
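A plot like the coefficient paths on the next slide can be produced with a simple loop over alphas. This is a sketch; it assumes X_train and y_train are the polynomial training data from the previous slides.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

alphas = np.logspace(-3, 3, 30)
coefs = [Ridge(alpha=alpha).fit(X_train, y_train).coef_ for alpha in alphas]

plt.plot(alphas, np.array(coefs), alpha=.5)   # one line per coefficient
plt.xscale("log")
plt.xlabel("alpha")
plt.ylabel("coefficient magnitude")
plt.show()
```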
---
# Learning Curves

![:scale 90%](images/ridge_learning_curve.png)

???
The thing I want to illustrate is that the less data you have, the more regularization helps. You can see that the differences between the orange and the red curve are very drastic at the beginning. The green one probably has too much regularization; still, if you have very little data, even regularizing way too much is okay. But there is a sweet spot, which here is the orange curve, and the orange one does best. The more data you add, the less important regularization becomes: the blue curve starts out much lower than the red curve, but in the end they overlap.

---
# Lasso Regression

`$$ \min_{w \in \mathbb{R}^p} \sum_{i=1}^n ||w^T\mathbf{x}_i - y_i||^2 + \alpha ||w||_1 $$`

- Shrinks w towards zero like Ridge
- Sets some w exactly to zero - automatic feature selection!

???
Lasso regression looks very similar to ridge regression. The only thing that changes is that we use the L1 norm instead of the L2 norm: the L2 norm is the sum of squares, the L1 norm is the sum of absolute values. So again we shrink w towards 0, but in a different way: the L2 norm penalizes very large coefficients more strongly, while the L1 norm penalizes all coefficients at the same rate. What this does in practice is set some entries of w to exactly 0. So the lasso does automatic feature selection: a coefficient of zero means the feature doesn't influence the prediction, and you can drop it out of the model. This model does feature selection together with prediction.
Ideally, if you wanted a model that does feature selection, you would penalize the number of features it uses - that would be the L0 norm.

---
# Understanding L1 and L2 Penalties

.wide-left-column[
![:scale 100%](images/l2_l1_l0.png)
]
.narrow-right-column[
.smaller[
`$$ \ell_2(w) = \sum_i w_i ^ 2$$`
`$$ \ell_1(w) = \sum_i |w_i|$$`
`$$ \ell_0(w) = \sum_i 1_{w_i \neq 0}$$`
]]

???
The L0 norm counts the number of coefficients that are not zero, the L1 norm penalizes absolute values, and the (squared) L2 norm penalizes squared values. The problem with the L0 norm is that it's one everywhere except at zero: it's not differentiable, it's not even continuous, so it's very hard to optimize. So we give up on it and use the next best thing we can optimize, which is the L1 norm. The L1 norm is not differentiable either - there is a kink at zero - but we can still optimize it pretty well. The L2 norm penalizes large values much more than small values.

---
# Understanding L1 and L2 Penalties

.wide-left-column[
![:scale 100%](images/l1_kink.png)
]
.narrow-right-column[
.padding-top[
.smaller[
`$ f(x) = (2 x - 1)^2 $`

`$ f(x) + L2 = (2 x - 1)^2 + \alpha x^2$`

`$ f(x) + L1 = (2 x - 1)^2 + \alpha |x|$`
]]]

???
Let's look at a one-dimensional regression problem, where x is the single coefficient we are looking for and f is the data fitting term, some quadratic function. The plain loss function is shown in blue, the loss plus the L2 (ridge) penalty in orange, and the data fitting term plus the L1 penalty in green.
The blue curve is a quadratic; the orange one is a quadratic whose minimum has moved a little towards zero; the green one has also moved towards zero, but it has a kink at zero. The kink comes from the absolute value, and it makes it much more likely that the optimum is exactly at zero. The optimum of the orange curve is somewhere near zero - pushed towards zero, but not exactly onto it.
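You can check this numerically with a tiny grid search over x, using the functions from the slide (a sketch; the alpha values are arbitrary):

```python
import numpy as np

x = np.linspace(-1, 2, 20001)
f = (2 * x - 1) ** 2                  # the data fitting term from the slide

for alpha in [0.5, 2, 8]:
    x_l2 = x[np.argmin(f + alpha * x ** 2)]      # ridge-style penalty
    x_l1 = x[np.argmin(f + alpha * np.abs(x))]   # lasso-style penalty
    print(f"alpha={alpha}: L2 optimum {x_l2:.3f}, L1 optimum {x_l1:.3f}")
# the L2 optimum shrinks towards zero but never reaches it;
# the L1 optimum lands at (numerically) zero once alpha is large enough
```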
---
class: center
# Understanding L1 and L2 Penalties

![:scale 70%](images/l1l2ball.png)

???
A different way to visualize this is in two dimensions, with two coefficients. Instead of minimizing the norm, we can think of constraining it to a fixed value, which is more or less equivalent. These are the L1 and L2 unit balls: the circle is all the 2D vectors with Euclidean norm 1, and the diamond is all the vectors with L1 norm 1. Let's say we restrict ourselves to a solution that lies on one of these balls.

---
class: center
# Understanding L1 and L2 Penalties

![:scale 70%](images/l1l2ball_intersect.png)

???
Now add a quadratic data fitting term, drawn here as contour lines. If we look for the best solution constrained to the L2 ball, we get the point where the contour lines touch the circle. For the diamond, we are very likely to hit one of the corners, simply because of the geometry of the space. A corner means one coefficient is exactly zero and the other is not. So this is a geometric way of understanding why the L1 penalty leads to exact zeros.

---
# Grid-Search for Lasso

```python
param_grid = {'alpha': np.logspace(-3, 0, 13)}
print(param_grid)
```
```
{'alpha': array([ 0.001,  0.002,  0.003,  0.006,  0.01 ,  0.018,  0.032,  0.056,
        0.1  ,  0.178,  0.316,  0.562,  1.   ])}
```

```python
grid = GridSearchCV(Lasso(normalize=True), param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
```
```
{'alpha': 0.001}
0.837
```

???
Now we can do a grid search again; the default parameters usually don't work very well for the lasso. I search alpha on a logarithmic grid, fit the grid search, and look at the best score.

---
![:scale 90%](images/lasso_alpha_search.png)

???
Looking at the training and validation performance, you can see that if you increase the regularization too much, this model gets really bad - ridge regression didn't degrade this badly. If you set the regularization to one, all coefficients become zero. Other than that, the performance is reasonable, about as good as the ridge performance.

---
.center[
![:scale 60%](images/lasso_coefficients.png)
]

```python
lasso = grid.best_estimator_
print(X_poly.shape)
np.sum(lasso.coef_ != 0)
```
```
(506, 104)
64
```

???
These are the coefficients of that model. Out of the 104 features, only 64 have non-zero coefficients; the others are exactly zero. You can see this visualized here: the white ones are exactly zero and the others are non-zero. If I wanted, I could prune the feature space a lot, which might make the model more interpretable.
There's a slight caveat: if two features are very correlated, the lasso will pick one of them more or less at random and set the other one to zero. So just because a coefficient is zero doesn't mean the feature is unimportant - it only means you can drop it out of this particular model. If you have two features that are identical, one of them will get a zero coefficient and one of them won't, and which one is essentially random.
That makes interpretation a little bit harder.
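The caveat about correlated features is easy to see on a made-up example with a duplicated column:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
x = rng.normal(size=(100, 1))
X_dup = np.hstack([x, x])                     # the same feature twice
y_toy = 3 * x.ravel() + rng.normal(scale=0.1, size=100)

lasso_toy = Lasso(alpha=0.1).fit(X_dup, y_toy)
print(lasso_toy.coef_)   # the weight ends up on one (arbitrary) copy, the other is zero
```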
---
.smaller[
```python
from sklearn.linear_model import lars_path

# lars_path computes the exact regularization path, which is piecewise linear.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
alphas, active, coefs = lars_path(X_train, y_train, eps=0.00001, method="lasso")

plt.plot(alphas, coefs.T, alpha=.5)
plt.xscale("log")
```
]

.center[
![:scale 55%](images/lars_path.png)
]

???
For this model you can also look at the path of all the coefficients. Here it's computed on the original dataset, so there are a bit fewer curves. There's a way to compute this path exactly and efficiently using the lars_path function.
The lasso is pretty nice because it allows you to select features, but it really only improves performance when the model should indeed be sparse - by sparse I mean that many of the coefficients should be zero. Think of it as prior knowledge: if you assume most of the coefficients are zero, then this model is good. You can actually formalize both ridge and lasso as Bayesian priors if you want.

---
# Elastic Net

- Combines benefits of Ridge and Lasso
- two parameters to tune.

`$$\min_{w \in \mathbb{R}^p} \sum_{i=1}^n ||w^T\mathbf{x}_i - y_i||^2 + \alpha_1 ||w||_1 + \alpha_2 ||w||^2_2 $$`

???
You can also combine the two penalties, and that is often what works best in practice. This is called the elastic net. The elastic net has both terms, the L1 norm and the L2 norm, with separate regularization strengths, so it generalizes both: depending on how you choose the two alphas, it can become ridge, it can become lasso, or anything in between. Generally the ridge penalty helps generalization, so it's a good idea to have it in there; but if some features are really not useful, the L1 penalty helps set their coefficients to exactly zero.

---
class: center
# Comparing unit balls

![:scale 70%](images/l1l2_elasticnet.png)

???
Looking at the unit balls: if you zoom in, you can see that the elastic net ball is rounded, but it still has corners. The corners are the important part, because they are what allows the solution to hit exact zeros.

---
# Parametrization in scikit-learn

`$$\min_{w \in \mathbb{R}^p} \sum_{i=1}^n ||w^T\mathbf{x}_i - y_i||^2 + \alpha \eta ||w||_1 + \alpha (1 - \eta) ||w||^2_2 $$`

Where $\eta$ is the relative amount of l1 penalty (`l1_ratio` in the code).

???
The way this is parametrized in scikit-learn is slightly different: there is a parameter alpha, which is the overall amount of regularization, and a parameter called l1_ratio, which says how much of the penalty should be L1 versus L2. If you set it to one, you get the lasso; if you set it to zero, you get ridge. Don't use these models with alpha set to zero, because the solver won't handle that well; if you actually want alpha equal to zero, use LinearRegression. Now we have more parameters to tune, but also a more general model, and this often works pretty well.
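A quick sanity check of this parametrization on made-up data (a sketch): with l1_ratio=1 the elastic net penalty reduces to a pure L1 penalty, so it should find the same coefficients as the lasso.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 5))
y_toy = X_toy @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_toy, y_toy)   # pure L1 penalty
lasso = Lasso(alpha=0.1).fit(X_toy, y_toy)
print(np.allclose(enet.coef_, lasso.coef_))   # True: same objective, same solver
```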
---
# Grid-searching ElasticNet

```python
from sklearn.linear_model import ElasticNet

param_grid = {'alpha': np.logspace(-4, -1, 10),
              'l1_ratio': [0.01, .1, .5, .9, .98, 1]}

grid = GridSearchCV(ElasticNet(), param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
```
```
{'alpha': 0.0001, 'l1_ratio': 0.01}
0.718
```

???
Here I'm doing a grid search over both parameters. If you give the grid search two parameters, it will try all possible combinations. I use a logarithmic grid for alpha, and for l1_ratio I use values very close to zero, values very close to one, and some in between. Analyzing the output of a 2D grid search is a little bit harder, though - we can't draw the nice one-dimensional curve anymore.

---
class: smaller
# Analyzing grid-search results

```python
import pandas as pd
res = pd.pivot_table(pd.DataFrame(grid.cv_results_),
                     values='mean_test_score', index='param_alpha',
                     columns='param_l1_ratio')
```

.center[
![:scale 60%](images/elasticnet_search.png)
]

???
The way I like to do it is to put grid.cv_results_ into a DataFrame and then build a pivot table where the values are the mean test score, the index is one parameter and the columns are the other. That lets me visualize the grid search nicely. Here the rows are alpha and the columns are l1_ratio, and you can see that if the l1_ratio is pretty high there are some pretty good results, provided you set alpha accordingly - there is a diagonal of pretty good settings. The highlighted cell is the model that did best.
There's a slight caveat: this is the cross-validation score, and as I said last time, that's not really a good measure of generalization performance once you've searched this many settings. I searched far more parameter combinations here - five or ten times as many models - so it's more likely that I get a better score just by chance. I didn't address it here, because the dataset is small and very noisy, but in practice, if you want to compare models, you should evaluate them on a test set and see which models are actually better there.
One more reason this kind of plot is helpful: if the best value sits on the edge of the grid, your ranges were too small.
A question came up about why we use R squared instead of the squared loss. One answer is that it's the default in scikit-learn; the other is that it's nice to know the range, so you know that perfect prediction is one and you have some idea of what 0.5 means. The RMSE - the other metric you usually use - depends on the scale of the output. For the housing prices, for example, it might be interesting to see the typical error in dollars: if you want something in the units of the output, RMSE is good, or mean absolute error might be even better. If you want something independent of the units, R squared is pretty good, because it's usually between zero and one and measures something like correlation, so a value of 0.9 tells you it's a pretty good model. If my RMSE is 10,000, I don't know whether the model is good or bad without knowing the range of the outputs.
So far this was all about changing the regularization part. The two most common regularizers we looked at are ridge, which is the L2 penalty, lasso, which is the L1 penalty, and their combination, the elastic net. The last thing I want to talk about today is changing the first part, the squared loss on the predictions.

---
# Robust Regression

![:scale 85%](images/robust_regression.png)

???
Robust regression is regression that is robust to outliers. Let's say we have a single feature on the x-axis and the response on the y-axis, so it's a one-dimensional regression problem, and the relationship is pretty linear. It looks pretty easy, but nearly all of the data is in one group, and then some of the data is off in a separate group.
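The figures on the following slides can be reproduced in spirit with a toy dataset like this (a sketch; the data generation behind the original figures is not shown in the slides):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x_out = rng.uniform(0, 10, size=80)
y_out = 2 * x_out + 1 + rng.normal(scale=1.0, size=80)
# add a small group of outliers far below the linear trend
x_out = np.concatenate([x_out, rng.uniform(8, 10, size=10)])
y_out = np.concatenate([y_out, rng.normal(loc=-20, scale=1.0, size=10)])

lr = LinearRegression().fit(x_out.reshape(-1, 1), y_out)
line = np.linspace(0, 10, 2)
plt.scatter(x_out, y_out, s=10)
plt.plot(line, lr.predict(line.reshape(-1, 1)), c='k', label="least squares fit")
plt.legend()
plt.show()
# the few outliers pull the least-squares line away from the bulk of the data
```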
---
# Least Squares Fit to Outlier Data

![:scale 80%](images/outliers_least_squares.png)

???
If you fit a linear model to this with least squares, this is what you get. How bad it gets obviously depends on how many data points are in the main group and how many are outliers. Here there are many points in the main group and only a few outliers, but the line still gets tilted towards the outliers, because with the squared loss large errors are penalized very heavily - that's the bowl shape of the loss.

---
# Robust Fit

![:scale 80%](images/outliers_robust_fit.png)

???
Ideally, what we would like from a model is a fit like the orange or the green line. The models we're going to talk about are Huber regression and RANSAC regression; the blue line is ordinary linear regression. Ideally we would also like the model to tell us which points are the inliers and which are the outliers.
These techniques are helpful if you think there are inliers and outliers in your data. Sometimes you can see that visually, but in high-dimensional spaces it might be hard to visualize. If you have a feeling that there might be outliers, these are techniques you might want to try. Now let's talk about how exactly they work.

---
class: smaller
# Huber loss

.padding-top[
.left-column[
![:scale 100%](images/huber_loss.png)
]
.right-column[
`$$\min_{w, \sigma} \sum_{i=1}^n \left (\sigma + H\left ( \frac{X_i w - y_i}{\sigma}\right ) \sigma \right ) + \alpha ||w||_2^2$$`
`$$H(z) = \begin{cases} z^2 & \text{if } |z| < \epsilon \\ 2\epsilon|z| - \epsilon^2 & \text{otherwise}\end{cases}$$`
]
]

???
Let's first talk about the Huber regressor. The Huber regressor replaces the squared loss with the Huber loss. There is a sigma, a scale of the data that is learned from the data, and an epsilon, which is specified by the user. The main point is that the loss is quadratic if the error is smaller than epsilon and linear (absolute value) if it's larger. So around zero it's quadratic, and further out it's linear, which means very large errors are not penalized that much more - large outliers will not pull the line away a lot.
The reason we can't just use the absolute error everywhere is that we want a differentiable problem. With the Huber loss everything is differentiable, so it's nice and easy to optimize and we get a stable optimum. Epsilon defines how close a point needs to be to count as an inlier; outside of that range you're an outlier. You still get penalized for mistakes there, but not nearly as much as with least squares.
This leads to a pretty big difference in the fit. You can see that the Huber regression line is still slightly tilted: the orange line is probably what we actually want, and the green line is what the Huber regression gives us. It's slightly tilted, but still way better than what linear regression gives us.
Question: does Huber have a closed-form solution like OLS or ridge? All of these problems are pretty easy to solve. Linear regression does have a closed-form solution, but that involves inverting a matrix, and it's a really bad idea to actually invert a matrix to solve this; in practice a QR solver or something similar, implemented in Fortran, does the work behind the scenes, and in the end it's an iterative process either way. Similarly, for ridge you can use an SVD, and sometimes the SVD is a good way to solve it, sometimes it's not. Next time we'll talk about logistic regression, which has nothing like a closed-form solution but can still be solved just fine. So don't worry too much about whether something has a closed-form solution; what matters is how hard the optimization problem is, and these are all closely related, fairly simple least-squares-like problems. So the answer is essentially no, but it doesn't matter. Note that for the Huber regressor you also need to estimate sigma.
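Using the Huber regressor in scikit-learn looks like this. This is a sketch reusing the toy outlier data from the earlier sketch (an assumption, not the data behind the original figures):

```python
from sklearn.linear_model import HuberRegressor, LinearRegression

X_1d = x_out.reshape(-1, 1)                         # toy outlier data from the earlier sketch
huber = HuberRegressor(epsilon=1.35).fit(X_1d, y_out)   # epsilon=1.35 is the default
lr = LinearRegression().fit(X_1d, y_out)

print("least squares slope:", lr.coef_[0])
print("Huber slope:        ", huber.coef_[0])
print("points flagged as outliers:", huber.outliers_.sum())
```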
---
# RANSAC

![:scale 80%](images/ransac_algorithm.png)

???
The way RANSAC works is that it subsamples the data, using just enough points to fit whatever model you want. Here we want to use it with a linear model, and since there are very few features, we only need to sample a handful of data points to fit a line. So we repeatedly sample a minimal set of points and fit a linear regression; three of the resulting candidate lines are drawn here. Now we have multiple candidate solutions, and we can evaluate how good each of them is. We do that by looking at how well they predict the data - more precisely, for how large a fraction of the data they predict well.

---
class: middle

![:scale 100%](images/ransac_algorithm2.png)

???
Here we can see the points that are determined to be inliers based on each fit. Clearly the blue line has the most inliers, so it is selected as the solution.

---
???
According to the default parameters of RANSAC, these are the inliers of the blue model, the orange model and the green model. The blue model has far more inliers, which means it explains the data much better than the other models. You can iterate this again and refit on the selected inliers, but the main idea is simply: subsample the data, fit the model, and find a fit that explains a large enough fraction of the data well. The chosen model then also tells you which points are inliers and which are outliers. You can get that from Huber regression as well.
That's all I have for today. Any questions?
To repeat the question: to measure how well a model fits, I count how many data points are within a particular threshold of its predictions.

---
class: middle

# Questions ?