class: center, middle

### W4995 Applied Machine Learning

# Linear models for Regression

02/11/19

Andreas C. Müller

???
So today we'll talk about linear models for regression. You're probably familiar with some of the basics, and that's where we'll start. But we'll also look into how to tune these models, what the trade-offs are, and some more complex techniques. First, though, we'll cover some more preprocessing that we didn't get to last week, in particular dealing with missing values, which you'll need for the homework.

FIXME polynomial features are used in linear model part. Move explanation there? Or hold off on poly features at that place and use different data!!!!
FIXME remove 3d plots!
FIXME add missing value indicator
FIXME better explanations for imputation, in particular iterative!
FIXME L1 explanation unclear
FIXME alpha=14 vs alpha=31
FIXME adding features: use pipeline
FIXME should we do empirical risk minimization in this lecture at underfitting vs overfitting?
risk minimization: two kinds of error (estimation and generalization)

---
class: center, middle

# Sidebar: Dealing with Missing Values

???
This is for the homework. I didn't realize there were missing values in the continuous data.

---
class: spacious

# Dealing with missing values

- Missing values can be encoded in many ways
- Numpy has no standard format for it (often np.NaN)
- pandas does
- Sometimes: 999, ???, ?, np.inf, “N/A”, “Unknown“ …
- Not discussing “missing output” - that’s semi-supervised learning.
- Often missingness is informative (Use `MissingIndicator`)

???
So first we're going to talk about imputation, which basically means dealing with missing values; before the imputation methods themselves, we'll look at different ways to handle missing values in general. The first thing about missing values is that you need to figure out whether your dataset has them and how they are encoded. In Python, numpy has no standard way to represent missing values, while pandas does, usually np.NaN. The problem is usually more in the data source: depending on where your data comes from, missing values might be encoded as anything, and they might be encoded differently in different parts of the dataset. So if you see some question marks somewhere, it doesn't mean that all missing values are encoded as question marks.

There can be different reasons why data is missing. Theoretical analyses of missingness often assume "missing at random" or "missing completely at random", meaning values were removed from the dataset at random. That's usually not what happens. Usually, if data is missing, it's missing because something went differently in a process: someone didn't fill out a form correctly, or someone didn't reply to a survey. And very often the fact that someone didn't reply, or that something was not measured, is itself informative. So whenever you have missing values, it's often a good idea to keep the information about whether a value was missing or not. While we're going to do imputation and try to fill in the missing values, the fact that something was missing is often really useful information, and you should record it and represent it in the dataset somehow.

We're only going to talk about missing inputs today. You can also have missing values in the outputs; that is usually called semi-supervised learning, where you have the true target or the true class only for some data points, but not for all of them.
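One minimal sketch of recording missingness alongside imputation, using the `MissingIndicator` transformer from `sklearn.impute` on a made-up toy matrix (not code from the lecture):

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

# hypothetical toy data with NaNs in both columns
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])

mask = MissingIndicator(features="all").fit_transform(X)    # boolean "was missing" flags
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # fill in column means
X_augmented = np.hstack([X_imputed, mask])                   # imputed values + flags
```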
So here's the first, very obvious method. Let's say your data looks like this; all my illustrations today use the iris dataset with missing values added at random.

FIXME: Better diagram for feature selection. What are the hypotheses for the tests?

---
.center[
![:scale 80%](images/row_nan_col_nan.png)
]

???
If my dataset looks like the one on the left-hand side here, you can see that there are only missing values in the first feature, and it's mostly missing. One way to deal with this is to drop the first feature completely, and that might be a reasonable thing to do: if there are so few observed values that you don't think there's any information in the column, just drop it. I always like to compare everything to a baseline approach, and your baseline here should be: if missing values appear only in some of the columns, drop those columns and see what happens. You can always iterate and improve on that, but that should be the baseline.

A slightly trickier situation is when there are missing values in only a few rows (the rows are data points). You can drop those data points, train the model on the rest, and that might be fine. There's a problem with this, though: if you want to make predictions on new data and the data that arrives has missing values, you won't be able to make predictions, because you don't have a way to deal with missing values. If you're guaranteed that any new test point that comes in will not have missing values, then doing this might make sense. Another issue with dropping the rows with missing values is that if the missingness is related to the actual outcome, you might bias your estimate of how well you're doing. Maybe all the hard data points are the ones with missing values, and by excluding them from your training data, you're also excluding them from the validation, so you think you're doing very well because you discarded all the hard data points. Discarding data points is therefore a bit trickier, and whether it's appropriate depends on your situation.

---
.center[
![:scale 100%](images/imputation-schema.png)
]

???
The other solution is to impute the missing values. The idea is that you have your training dataset, you build some model of it, and then you fill in the missing values using the information from the other rows and columns. You can then apply the same imputation model to the test dataset when you want to make predictions.

What if a row has all values missing? Then you have no choice but to drop it. If that happens in your dataset, you need to figure out what you're going to do: if you have a production system and something comes in with all values missing, you need to decide how to handle it, but you probably cannot use that data point for training the model. You could use the outputs and predict the mean outcome over the rows whose features are all missing.

---
class: spacious

# Imputation Methods

- Mean / Median
- kNN
- Regression models
- Matrix factorization (not in this lecture)

???
So let's talk about the general methods for data imputation; these are the ones we're going to walk through. The easiest one is imputing a constant value per column, typically the mean or the median of the column we're trying to fill in. kNN imputation means taking the average of the k nearest neighbors. Regression imputation means building a regression model on the other features to predict the missing feature. And finally, there are more elaborate probabilistic models, such as matrix factorization.
These try to build a probabilistic model of the dataset and complete the missing values based on that model.

---
# Baseline: Dropping Columns

.smaller[
```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X_, y, stratify=y)
nan_columns = np.any(np.isnan(X_train), axis=0)
X_drop_columns = X_train[:, ~nan_columns]
scores = cross_val_score(LogisticRegressionCV(cv=5), X_drop_columns,
                         y_train, cv=10)
np.mean(scores)
```
0.772
]

???
Here's my baseline, which is dropping the columns. I used the iris dataset, into which I put some missing values in the second and third column. `X_` has some missing values; I split it into training and test sets, drop all columns that have missing values, and then train a logistic regression, which is the model I'm going to use to tell me how well an imputation strategy works for classification. For my baseline of dropping columns, using a logistic regression model and 10-fold cross-validation, I get 77.2% accuracy.

---
# Mean and Median

.center[
![:scale 100%](images/imputation-median-schema.png)
]

???
The simplest imputation is the mean or median. At the moment, only this simple per-column imputation is available in a scikit-learn release; kNN and iterative imputation are still works in progress (more on that below). If this is our dataset, the imputation will look like this: within each column, every missing value is replaced by the mean of that column. The imputer is the only transformer in scikit-learn that can handle missing data. Using the imputer you can specify the strategy - mean, median or constant - and then call transform, which fills in the missing values.

---
.center[
![:scale 90%](images/median_imputation.png)
]

???
Here is a graphical illustration of what this does. The iris dataset is four-dimensional, and I'm plotting the two dimensions in which I added missing values; the two other dimensions had no missing values, and you can't see them here. In the original dataset the blue, orange and green points are relatively easy to separate. But you can see that the green points had a lot of missing values, so many of them were replaced by the mean of the whole column. Two things to note: first, I lost a lot of the class information; second, the data is now in places where there was no data before, which is not so great. You could do smarter things like imputing the mean per class, but that's not something I've actually seen used a lot.

---
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer

nan_columns = np.any(np.isnan(X_train), axis=0)
X_drop_columns = X_train[:, ~nan_columns]
logreg = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(logreg, X_drop_columns, y_train, cv=10)
np.mean(scores)
```
0.794

```python
mean_pipe = make_pipeline(SimpleImputer(strategy='median'),
                          StandardScaler(), LogisticRegression())
scores = cross_val_score(mean_pipe, X_train, y_train, cv=10)
np.mean(scores)
```
0.729

???
Here's the comparison of dropping the columns versus doing the mean imputation. We actually see that the mean imputation is worse. In general, if you have very few missing values, it might not be so bad, and imputing the mean might work. Here I basically designed the dataset so that mean imputation fails; this is not very realistic.

---
class: spacious

# KNN Imputation

- Find k nearest neighbors that have non-missing values.
- Fill in all missing values using the average of the neighbors.
- PR in scikit-learn: https://github.com/scikit-learn/scikit-learn/pull/9212

???
FIXME illustration
The next step up in terms of complexity is using nearest neighbors. The idea is that for each data point we find the k nearest neighbors and then fill in the missing values with the average of those neighbors' features. The difficulty here is measuring distances when there are missing values: with missing values, I can't just compute Euclidean distances.

---
.smaller[
```python
# not merged yet at the time of this lecture;
# available as sklearn.impute.KNNImputer in recent releases
knn_pipe = make_pipeline(KNNImputer(), StandardScaler(), LogisticRegression())
scores = cross_val_score(knn_pipe, X_train, y_train, cv=10)
np.mean(scores)
```
```
0.849
```
]

.center[
![:scale 60%](images/knn_imputation.png)
]

???

---
# Model-Driven Imputation

- Train regression model for missing values
- Iterate: retrain after filling in
- IterativeImputer in next sklearn release
  https://github.com/scikit-learn/scikit-learn/issues/11259

.smaller[
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa (still experimental)
from sklearn.impute import IterativeImputer

# released API uses `estimator=` for the regression model
rf_imp = IterativeImputer(estimator=RandomForestRegressor(n_estimators=100))
rf_pipe = make_pipeline(rf_imp, StandardScaler(), LogisticRegression())
scores = cross_val_score(rf_pipe, X_train, y_train, cv=10)
np.mean(scores)
```
```
0.845
```
]

???
The next step up in complexity is using an arbitrary regression model for imputation. Arguably, kNN is also a regression model, but there are some intricacies that make it a little different. The idea with using a regression model is that you do a first pass and impute the data using the mean, then you try to predict the missing features using a regression model trained on the non-missing features, and you iterate this until nothing changes anymore. You can use any model you like; this is very flexible, and it can be fast if you have a fast model.

---
class: spacious

# Comparison of Imputation Methods

.center[
![:scale 100%](images/med_knn_rf_comparison.png)
]

???
FIXME better illustration / graph
Here's a comparison on the iris dataset again. This is a very artificial example and is only meant to illustrate the ideas. In practice, using the mean or median is actually quite decent.

---
class: left, middle

# Feature Engineering

???
I want to talk a little bit about feature engineering, and in particular about interaction features. For linear models, interaction features are actually quite interesting.

---
class: center

# Interaction Features

![:scale 70%](images/img_33.png)

???
So let's look at this two-dimensional dataset with two classes.

---
class: center

# Interaction Features

![:scale 50%](images/img_34.png)
![:scale 70%](images/img_35.png)

???
If I fit a logistic regression, I get something like this, which does very badly. But I can do feature engineering, meaning I add new features to the dataset that make my linear model more powerful.

---
class: center, middle

![:scale 50%](images/img_36.png)
![:scale 70%](images/img_37.png)

???
One idea would be to just add the product of the two existing features: add a new feature x0 * x1, so that I'm now in a three-dimensional space. In this three-dimensional space, I can linearly separate the data.

---
.center[
![:scale 60%](images/img_38.png)
]

.smaller[
```python
X_i_train, X_i_test, y_train, y_test = train_test_split(
    X_interaction, y, random_state=0)
logreg3 = LogisticRegressionCV().fit(X_i_train, y_train)
logreg3.score(X_i_test, y_test)
```
]
```
0.960
```

???
Projected down to the original space, the decision boundary of this logistic regression is the black curve.
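A minimal sketch of how `X_interaction` might be built for this toy example (assuming `X` is the two-feature array shown above; not necessarily the exact code used in the lecture):

```python
import numpy as np

# append the product x0 * x1 as a third feature
X_interaction = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
```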
So it's a linear classifier only in the expanded three-dimensional feature space where we added the interaction. Interactions are one way to blow up the feature space; adding more features, in particular for linear models, makes the model much more powerful. You can do this for continuous features.

---
.center[![:scale 80%](images/img_40.png)]

- One model per gender!
- Keep original: common model + model for each gender to adjust.
- Product of multiple categoricals: common model + multiple models to adjust for combinations

???
For discrete features, interactions do something slightly different. Assume I have a database of users with an age, how many articles they bought, a gender, how much money they spend on my website, and how much time they spend on my website. I can one-hot encode the gender, so I get a gender variable that is 0 or 1. If I now build interaction features of all the other features with the gender feature, it looks like this: age times male, articles_bought times male, spend$ times male, time_online times male, and likewise for female. These are just interaction features, but since the gender indicators are either zero or one, the "male" features are only active for men and the "female" features are only active for women. So learning a model on this feature space is basically the same as learning two separate models, one for men and one for women.

What you might want to do instead is keep the original features and add the interactions with male and with female. Then you have a shared model, plus terms that describe how men and women deviate from it. If you do that, you have three times as many degrees of freedom in your linear model.

---
# More interactions

.padding-top[
```
age  articles_bought  gender  spend$  time_online
+ Male   * (age  articles_bought  spend$  time_online)
+ Female * (age  articles_bought  spend$  time_online)
+ (age > 20)  * (age  articles_bought  gender  spend$  time_online)
+ (age <= 20) * (age  articles_bought  gender  spend$  time_online)
+ (age <= 20) * Male * (age  articles_bought  gender  spend$  time_online)
```
]

???
You can obviously do this with more discrete variables. Say you have the original features; then you add a set of features that is only active for men, a set only active for women, a set only for people above 20, a set only for people 20 or below, a set only for men who are 20 or below, and so on. By adding these interaction features you create a bigger and bigger feature space, which allows the linear model to adjust its fit for subpopulations of the overall dataset. Instead of six features I now have 36 features and can fit a much more powerful model.

---
class: center

# Polynomial Features

![:scale 60%](images/img_41.png)

???
N/A

---
class: center

# Polynomial Features

![:scale 60%](images/img_42.png)

???
You can also add polynomials. PolynomialFeatures in scikit-learn builds interactions between all possible features and also adds polynomial terms. If you set `interaction_only=True`, it only adds interactions, and the degree determines how many features go into each product; degree three means all products of up to three features. By default `interaction_only` is False, so it also adds the powers of each single feature, which only makes sense for continuous features: for each x it adds x squared, x cubed, and so on.
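As a concrete illustration of these options on a toy two-feature array (a small sketch, not code from the lecture):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(1, 7).reshape(3, 2)   # toy data with two features

interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)
print(interactions.fit_transform(X))   # columns: x0, x1, x0*x1

full_poly = PolynomialFeatures(degree=2, include_bias=False)
print(full_poly.fit_transform(X))      # columns: x0, x1, x0^2, x0*x1, x1^2
```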
---
class: spacious

# Polynomial Features

.small-padding-top[
- PolynomialFeatures() adds polynomials and interactions.
- Transformer interface like scalers etc.
- Create polynomial algorithms with make_pipeline!
]

???
N/A

---
class: smaller

# Polynomial Features

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import RidgeCV

poly = PolynomialFeatures()
X_bc_poly = poly.fit_transform(X_bc_scaled)
print(X_bc_scaled.shape)
print(X_bc_poly.shape)
```
```
(379, 13)
(379, 105)
```
```python
scores = cross_val_score(RidgeCV(), X_bc_scaled, y_train, cv=10)
np.mean(scores), np.std(scores)
```
```
(0.759, 0.081)
```
```python
scores = cross_val_score(RidgeCV(), X_bc_poly, y_train, cv=10)
np.mean(scores), np.std(scores)
```
```
(0.865, 0.080)
```

???
N/A

---
class: center, middle

# Linear Models

---
class: center

# Linear Models for Regression

![:scale 50%](images/linear_regression_1d.png)

$$\hat{y} = w^T \mathbf{x} + b = \sum_{i=1}^p w_i x_i + b$$

???
Predictions in all linear models for regression have the form shown here: an inner product of the features with some coefficient or weight vector w, plus some bias or intercept b. In other words, the output is a weighted sum of the inputs, possibly with a shift. Here i runs over the features, and x_i is one feature of the data point x. These models are called linear models because they are linear in the parameters w. The way I wrote it down here, they are also linear in the features x_i; however, you can replace the features by any non-linear function of the inputs and it will still be a linear model. There are many different linear models for regression, and they all share this formula for making predictions. The difference between them is in how they find w and b based on the training data.

---
# Ordinary Least Squares

$$\hat{y} = w^T \mathbf{x} + b = \sum_{i=1}^p w_i x_i + b $$

`$$\min_{w \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^n (w^T\mathbf{x}_i + b - y_i)^2$$`

Unique solution if $\mathbf{X} = (\mathbf{x}_1, ... \mathbf{x}_n)^T$ has full column rank.

???
The most straightforward solution, which goes back to Gauss, is ordinary least squares. In ordinary least squares we find w and b such that the predictions on the training set are as accurate as possible according to the squared error. That intuitively makes sense: we want the predictions to be good on the training set. If there are more samples than features (and the samples span the whole feature space), then there is a unique solution. The problem is what's called a least squares problem, which is particularly easy to optimize, and we get the unique solution. However, if there are more features than samples, there are usually many perfect solutions that lead to 0 error on the training set, and then it's not clear which solution to pick. Even if there are more samples than features, strong correlations among features can make the results unstable, and we'll see some examples of that soon. Before we look at examples, I want to introduce a popular alternative.

---
# Ridge Regression

`$$ \min_{w \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^n (w^T\mathbf{x}_i + b - y_i)^2 + \alpha ||w||^2 $$`

Always has a unique solution. Tuning parameter alpha.

???
In Ridge regression we add another term to the optimization problem. Not only do we want to fit the training data well, we also want w to have a small squared L2 norm, i.e. a small squared Euclidean norm. The idea here is that we're decreasing the "slope" along each of the features by pushing the coefficients towards zero.
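To make the two competing terms concrete, here is a minimal sketch of the ridge objective for a given w and b (function and variable names are hypothetical, not from the lecture):

```python
import numpy as np

def ridge_objective(w, b, X, y, alpha):
    # data-fitting term: sum of squared residuals on the training data
    residuals = X @ w + b - y
    fit = np.sum(residuals ** 2)
    # penalty term: squared L2 norm of w (the intercept b is not penalized)
    penalty = alpha * np.sum(w ** 2)
    return fit + penalty
```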
This constrains the model to be simpler. So there are two terms in this optimization problem, which is also called the objective function of the model: the data-fitting term, which wants the predictions to be close to the training data according to the squared error, and the penalty or regularization term, which wants w to have a small norm and doesn't depend on the data. Usually these two goals are somewhat opposed: if we made w zero, the second term would be zero, but the predictions would be bad. So we need to trade off between the two. The trade-off is problem-specific and is specified by the user. If we set alpha to zero, we get linear regression; if we set alpha to infinity, we get a constant model. Usually we want something in between. This is a very typical example of a general principle in machine learning called regularized empirical risk minimization.

---
# (regularized) Empirical Risk Minimization

`$$ \min_{f \in F} \sum_{i=1}^n L(f(\mathbf{x}_i), y_i) + \alpha R(f) $$`

???
FIXME pointers data fitting / regularization!
Many models in machine learning, like linear models, SVMs and neural networks, follow the general framework of empirical risk minimization, which you can see here. We formulate the machine learning problem as an optimization problem over a family of functions; in our case that was the family of linear functions parametrized by w and b. The minimization problem consists of two parts, the data-fitting part and the model-complexity part. The data-fitting part says that the predictions made by our function should be accurate according to some loss L; for our regression problems that was the squared loss. The model-complexity part says that we prefer simple models and penalizes complicated f. Most machine learning algorithms can be cast into this framework, with a particular choice of family of functions F, loss function L, and regularizer R, and most of machine learning theory is built around it. People prove, for different choices of F, L and R, that if you minimize this objective, you'll be able to generalize well. That makes intuitive sense: to do well on the test set, we definitely want to do reasonably well on the training set (we don't expect to do better on the test set than on the training set), but we also want to minimize the performance difference between training and test set. If we restrict the model to be simple via the regularizer R, we have better chances of it generalizing.

---
# Reminder on model complexity

![:scale 80%](images/overfitting_underfitting_cartoon_full.png)

???
I hope this sounds familiar from what we talked about last time. This is a particular way of dealing with overfitting and underfitting. For this framework in general, and for ridge regression in particular, trading off the data fitting and the regularization changes the model complexity. If we set alpha high, we restrict the model, and we will be on the left side of the graph. If we make alpha small, we allow the model to fit the data more, and we're on the right side of the graph.

---
# Boston Housing Dataset

![:scale 70%](images/boston_housing_scatter.png)

.smaller[
```python
print(X.shape)
print(y.shape)
```
```
(506, 13)
(506,)
```
]

???
OK, after all this pretty abstract talk, let's make it concrete and do some regression on the Boston Housing dataset. After the last homework you're hopefully familiar with it: the idea is to predict prices of property in the Boston area in different neighborhoods.
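For reference, a minimal sketch of how X and y are presumably loaded here (`load_boston` was the loader in scikit-learn at the time; it has since been deprecated and removed in recent releases):

```python
from sklearn.datasets import load_boston  # removed in recent scikit-learn versions

boston = load_boston()
X, y = boston.data, boston.target
print(X.shape, y.shape)   # (506, 13) (506,)
```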
This is a dataset from the 70s, I think, so everything is pretty cheap. Most of the features are continuous, with the exception of the Charles River variable, which says whether the neighborhood borders the river. Keep in mind that this data lives in a 13-dimensional space; these univariate plots only show 13 different projections of the data and can't capture any of the interactions. Still, we can see that the price clearly depends on some of these variables, and it's also pretty clear that the dependency is non-linear for some of them. We'll still start with a linear model, because it's a very simple class of models, and I always like to start with the simplest baseline; in this case that's linear regression. We have 506 samples and 13 features, so many more samples than features, and linear regression should work just fine. It's also a tiny dataset, so basically anything we try will run instantaneously, which is good to keep in mind. Another thing you can see in this graph is that the features have very different scales. Here's a box plot that shows that even more clearly.

---
![:scale 80%](images/boston_scaling.png)

???
That's something that will trip up the distance-based models we talked about last time, as well as the linear models we're talking about today. For the penalized models, the different scales mean that different features are penalized differently, which you usually want to avoid. Usually there are no particular semantics attached to the fact that one feature has a different magnitude than another: we could measure something in inches instead of miles, and that would change the outcome of the model, which is certainly not something we want. A good idea is to scale the data to get rid of this effect. We'll talk about that and other preprocessing methods in depth on Wednesday next week; today I'm mostly going to ignore it. But let's get started with linear regression.

---
class: middle, spacious

```python
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0)
np.mean(cross_val_score(LinearRegression(), X_train, y_train, cv=10))
```
```
0.717
```
```python
np.mean(cross_val_score(Ridge(), X_train, y_train, cv=10))
```
```
0.715
```

???
Let's look at two simple models: linear regression and ridge regression. I've split the data into training and test sets and used 10-fold cross-validation to evaluate the models. Here I use cross_val_score together with the model, the training data, the training labels, and 10-fold cross-validation; this returns 10 scores, and I compute their mean. I do this for both linear regression and ridge regression, where ridge uses the default value alpha=1. The two scores are quite similar.

---
# Coefficient of determination R^2

`$$ R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n - 1} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{n - 1} (y_i - \bar{y})^2} $$`

`$$ \bar{y} = \frac{1}{n} \sum_{i=0}^{n - 1} y_i$$`

Can be negative for biased estimators - or the test set!

???
The scores are R squared, the coefficient of determination. This is a score that's usually between zero and one, where one means perfect prediction and zero means you do no better than predicting the mean of the targets.
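A tiny worked example of the formula, with made-up numbers and `r2_score` used only as a cross-check (a sketch, not code from the lecture):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 2.0, 5.0, 7.0])
y_pred = np.array([2.5, 2.0, 5.5, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the predictions
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of always predicting the mean
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))  # both ~0.90
```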
What R² does is compute the mean of the targets over the data you're evaluating on, and then compare the squared error of the predictions to the squared error of always predicting that mean. If it's negative, it means you do a worse job than just predicting the mean. That can happen if your model is really bad or biased; it can also happen on a test set. R² is only guaranteed to be non-negative on the data the model was fit on, with an unbiased linear model, which will nearly never apply to what we're doing. R² can be misleading if there are outliers in the training data, and some consider it a bad metric; Max Kuhn, author of APM, thinks it's a bad metric. It's not clear to me that MSE is much better in general, though. Reducing anything to a single number is tricky.

---
# Scaling

```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
ridge = Ridge().fit(X_train_scaled, y_train)
X_test_scaled = scaler.transform(X_test)
ridge.score(X_test_scaled, y_test)
```

???
If you want to scale the data, you can use StandardScaler. It makes the mean zero and the standard deviation one. It's an unsupervised model, so it only takes the data X_train to fit; fitting just means computing the mean and standard deviation. Then we can scale the data using the transform method. Make sure you fit the scaler only on the training set, and then transform both the training and the test set. If you want to use scaling inside cross-validation, it's a little bit more tricky; as I said, more details next Wednesday. I'm mentioning this here because it might come in handy for the homework.

OK, let's come back to the Ridge model. Above we just used the default setting for the parameter alpha, which is one. This is a reasonable default, but there is no reason why it should give us the optimal generalization performance on this problem, so it's a good idea to tune the parameter. As we saw on Monday, we can easily do that with GridSearchCV.

---
.smallest[
```python
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': np.logspace(-3, 3, 13)}
print(param_grid)
```
```
{'alpha': array([ 0.001, 0.003, 0.01, 0.032, 0.1, 0.316, 1., 3.162, 10., 31.623, 100., 316.228, 1000.])}
```
```python
grid = GridSearchCV(Ridge(), param_grid, cv=10)
grid.fit(X_train, y_train)
```
]

.center[
![:scale 50%](images/ridge_alpha_search.png)
]

???
Coming back to ridge regression: we used the default alpha of one, which is a reasonable default, but by no means guaranteed to make sense for this particular problem. So here I've done a grid search. As we talked about on Monday, I define a parameter grid where the key is the parameter I want to search (alpha in Ridge) and the values are the settings I want to try. For regularization parameters like alpha, it's usually good to search on a logarithmic grid. I use a relatively fine grid with 13 points here, mostly because I wanted a nice plot; in practice I would use three or six values or so. I instantiate GridSearchCV with Ridge(), the parameter grid and 10-fold cross-validation, and then call grid.fit. The plot shows the mean training score and mean validation score over the 10 cross-validation folds for each parameter setting. OK, you can see a couple of things here.
First, there's a lot of uncertainty. Second, the training score is always better than the validation score. Third, and most importantly, regularization didn't help here: making alpha as small as possible is best. What I'm going to do next is modify this dataset a little bit so that we can actually see the effect of regularization. I'll modify it using a polynomial expansion, which we'll talk about a bit more on Wednesday.

---
# Adding features

```python
from sklearn.preprocessing import PolynomialFeatures, scale

poly = PolynomialFeatures(include_bias=False)
X_poly = poly.fit_transform(scale(X))
print(X_poly.shape)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y)
```
```
(506, 104)
```
```python
np.mean(cross_val_score(LinearRegression(), X_train, y_train, cv=10))
```
```
0.74
```
```python
np.mean(cross_val_score(Ridge(), X_train, y_train, cv=10))
```
```
0.76
```

???
Basically, I'm adding a whole bunch of new features: the interactions between all pairs of features and the square of each feature. This allows me to learn not only a linear function but any polynomial of degree 2 in the original features, which gives me 104 features. Now there are far more features relative to the training set size; still fewer features than data points, but unlike before they're in the same order of magnitude. This is a problem that's much harder for linear regression without regularization. Keep in mind that the power of a linear model depends a lot on the dimensionality of the input space: previously I had 13 coefficients to fit and now I have 104, so I can learn a much more complex model. You can also think about this in terms of a 2D problem: a linear decision function seems very restrictive there, but in a very high-dimensional space, if you have more features than samples, you can fit any training set perfectly, so it's not restrictive at all. The power of a linear model depends a lot on the dimensionality of the feature space.

---
.center[
![:scale 60%](images/ridge_alpha_search_poly.png)
]

.smaller[
```python
print(grid.best_params_)
print(grid.best_score_)
```
```
{'alpha': 31.6}
0.83
```
]

???
Let's look at what happens if we redo the grid search. Now you can see a clear pattern; hopefully it looks a lot like the cartoon curve from before. If we increase alpha, the model gets less complex and the training performance goes down; the validation score goes up for a while, and if we make the regularization too strong, it goes down again. You can also see that the standard deviation is much higher with less regularization than with more regularization: the more regularized model is more robust and generalizes better. Using regularization, I found the best alpha, and the best R² actually improved quite a bit. The particular value of alpha doesn't mean much by itself (it depends on how you scale your data), so don't read too much into it.

---
# Plotting coefficient values (LR)

```python
import matplotlib.pyplot as plt

lr = LinearRegression().fit(X_train, y_train)
plt.scatter(range(X_poly.shape[1]), lr.coef_,
            c=np.sign(lr.coef_), cmap="bwr_r")
```

.center[
![:scale 55%](images/lr_coefficients_large.png)
]

???
On the previous slides, linear regression did nearly as well as ridge regression with the default parameters, but let's look at the coefficients of the linear model. These are the coefficients of the linear model trained on the polynomial features.
I plot them with a color indicating positive or negative sign. The magnitudes here are on the order of 10^13: one coefficient is hugely positive and another is hugely negative. What I conclude from this is that these two features are very highly correlated; the model makes both of them really big, and then they cancel each other out. This is not a very nice model, because it points to numerical instability, and it also suggests that these two features are extremely important when they might not be important at all: other features might matter more but be drowned out by this cancellation effect. The two huge coefficients must cancel each other out, because the predictions are reasonable and the houses only cost on the order of $70,000.

---
# Ridge Coefficients

```python
ridge = grid.best_estimator_
plt.scatter(range(X_poly.shape[1]), ridge.coef_,
            c=np.sign(ridge.coef_), cmap="bwr_r")
```

.center[
![:scale 60%](images/ridge_coefficients.png)
]

???
Now let's look at the Ridge model; this is the best estimator, the model found in the grid search with the best parameter settings. This looks much more reasonable. The feature that was very negative before is still very negative, but now the values are around three and minus three, which is a much more sensible range.

We can also look at the effect of different values of alpha on the coefficients. Here is Ridge with three different values of alpha (with a previous random seed, alpha=14 was best; now it's around 30). The middle one is more or less the best setting, and then there's a smaller and a bigger one. You can see that what alpha does is, on average, push all the coefficients towards zero: coefficients shrink as alpha goes from 1 to 14 to 100. If you look at this long enough you can find interesting things; for example, one coefficient is positive with alpha equal to one and negative with alpha equal to 100. So depending on how much you regularize, the direction of the effect flips. What that tells me is: don't over-interpret your models, because clearly a feature either has a positive or a negative effect, it can't have both.

---
```python
ridge100 = Ridge(alpha=100).fit(X_train, y_train)
ridge1 = Ridge(alpha=1).fit(X_train, y_train)
plt.figure(figsize=(8, 4))
plt.plot(ridge1.coef_, 'o', label="alpha=1")
plt.plot(ridge.coef_, 'o', label="alpha=14")
plt.plot(ridge100.coef_, 'o', label="alpha=100")
plt.legend()
```

.center[
![:scale 60%](images/ridge_coefficients_alpha.png)
]

???
Another way to visualize the coefficients is to look at the coefficient path, or regularization path: alpha on the x-axis and coefficient magnitude on the y-axis. I loop over the different alphas, and you can see how the coefficients shrink towards zero as alpha increases. Some very big coefficients go to zero very quickly, while others stay roughly the same for a long time.

---
# Learning Curves

![:scale 90%](images/ridge_learning_curve.png)

???
What I want to illustrate here is that the less data you have, the more regularization helps. The difference between the orange and red curves at small training sizes is drastic. The green one probably has too much regularization.
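A hedged sketch of how a plot like this could be produced with `sklearn.model_selection.learning_curve` (the alpha values here are hypothetical, and this is not necessarily the exact code behind the figure):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# cross-validated scores as a function of training-set size, for a few alphas
for alpha in [0.1, 10, 100]:
    sizes, train_scores, val_scores = learning_curve(
        Ridge(alpha=alpha), X_train, y_train,
        train_sizes=np.linspace(0.1, 1.0, 10), cv=10)
    print(alpha, val_scores.mean(axis=1).round(2))
```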
If you have very little data, even regularizing way too much is okay, but there's a sweet spot (here the orange curve) that does best. The more data you add, the less important regularization becomes: the blue curve starts out much higher than the red curve, but in the end they overlap.

---
# Lasso Regression

`$$ \min_{w \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^n (w^T\mathbf{x}_i + b - y_i)^2 + \alpha ||w||_1 $$`

- Shrinks w towards zero like Ridge
- Sets some w exactly to zero - automatic feature selection!

???
Lasso regression looks very similar to ridge regression; the only change is that we use the L1 norm instead of the L2 norm. The L2 norm is the sum of squares, the L1 norm is the sum of absolute values. So again we are shrinking w towards zero, but in a different way: the L2 norm penalizes very large coefficients more strongly, while the L1 norm penalizes all coefficients equally. What this does in practice is set some entries of w to exactly zero. That gives automatic feature selection: a coefficient of zero means the feature doesn't influence the prediction, so you can drop it from the model. This model does feature selection together with prediction. Ideally, if you wanted a model that automatically selects the good features, you would penalize the number of features it uses; that would be the L0 norm.

---
# Understanding L1 and L2 Penalties

.wide-left-column[
![:scale 100%](images/l2_l1_l0.png)
]
.narrow-right-column[
.smaller[
`$$ \ell_2(w) = \sqrt{\sum_i w_i ^ 2}$$`
`$$ \ell_1(w) = \sum_i |w_i|$$`
`$$ \ell_0(w) = \sum_i \mathbf{1}_{w_i \neq 0}$$`
]]

???
The L0 norm penalizes the number of non-zero coefficients, the L1 norm penalizes absolute values, and the L2 norm penalizes squared values. The problem with the L0 norm is that, as a function of a single coefficient, it's one everywhere except at zero: it's not differentiable, not even continuous, so it's very hard to optimize. So we give up on it and use the next best thing, the L1 norm. The L1 norm is not differentiable either (there's a kink at zero), but we can still optimize it pretty well. The L2 norm penalizes large values much more than small values.

---
# Understanding L1 and L2 Penalties

.wide-left-column[
![:scale 100%](images/l1_kink.png)
]
.narrow-right-column[
.padding-top[
.smaller[
`$ f(x) = (2 x - 1)^2 $`
`$ f(x) + L2 = (2 x - 1)^2 + \alpha x^2$`
`$ f(x) + L1= (2 x - 1)^2 + \alpha |x|$`
]]]

???
Say I have a one-dimensional linear regression problem, so x is the single coefficient I'm looking for and f is the data-fitting term, some quadratic. The plain loss function is in blue; the loss plus the L2 (ridge) penalty is in orange; the loss plus the L1 penalty is in green. The blue curve is a quadratic; the orange one is a quadratic whose minimum has moved a little towards zero; the green one has also moved towards zero, but it has a kink at zero, coming from the kink of the absolute value, and that makes it much more likely that the optimum is exactly at zero. Here the optimum of the orange curve was pushed towards zero but not exactly to zero.
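A tiny numeric check of this picture, on a grid and with a hypothetical alpha of 4 (a sketch, not code from the lecture):

```python
import numpy as np

xs = np.linspace(-1, 1, 2001)
alpha = 4.0
f = (2 * xs - 1) ** 2
print(xs[np.argmin(f + alpha * xs ** 2)])     # L2 penalty: optimum shrunk to 0.25, not zero
print(xs[np.argmin(f + alpha * np.abs(xs))])  # L1 penalty: optimum exactly at 0.0
```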
---
class: center

# Understanding L1 and L2 Penalties

![:scale 70%](images/l1l2ball.png)

???
A different way to visualize this is in two dimensions, with two coefficients. Instead of minimizing the norm, we can fix it, which is more or less equivalent; that gives the L1 and L2 unit balls. The circle is the set of all 2D vectors with Euclidean norm 1, and the diamond is the set of all vectors with L1 norm 1. Let's say we restrict ourselves to solutions that lie on one of these balls.

---
class: center

# Understanding L1 and L2 Penalties

![:scale 70%](images/l1l2ball_intersect.png)

???
Then we have a quadratic data-fitting term, shown as contour lines. If we want the best solution with L2 norm one, we get the point where the contours of the data-fitting term touch the circle. For the diamond, however, we are very likely to hit one of the corners, just because of the geometry of the space. A corner means one coefficient is exactly zero and the other is one. So this is a geometric way of understanding why the L1 penalty leads to exact zeros.

---
# Grid-Search for Lasso

```python
from sklearn.linear_model import Lasso

param_grid = {'alpha': np.logspace(-3, 0, 13)}
print(param_grid)
```
```
{'alpha': array([ 0.001, 0.002, 0.003, 0.006, 0.01, 0.018, 0.032, 0.056, 0.1, 0.178, 0.316, 0.562, 1.])}
```
```python
grid = GridSearchCV(Lasso(normalize=True), param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
```
```
{'alpha': 0.001}
0.837
```

???
Now we can do a grid search again; the default parameters usually don't work very well for Lasso. I search alpha on a logarithmic grid, fit, and look at the best score.

---
![:scale 90%](images/lasso_alpha_search.png)

???
Looking at training and validation performance, you can see that if you increase the regularization too much, the model gets really bad; ridge regression didn't degrade that badly. If you set the regularization to one, all coefficients become zero. Other than that, the performance is reasonable, about as good as ridge.

---
.center[
![:scale 60%](images/lasso_coefficients.png)
]

```python
print(X_poly.shape)
np.sum(lasso.coef_ != 0)
```
```
(506, 104)
64
```

???
These are the coefficients of the model. Out of the 104 features, only 64 have non-zero coefficients; the others are exactly zero. You can see this visualized here: the white ones are exactly zero, the others are non-zero. If I wanted, I could prune the feature space a lot, which possibly makes the model more interpretable. There's a slight caveat: if two features are very correlated, Lasso will pick one of them more or less at random and set the other one to zero. So just because a coefficient is zero doesn't mean the feature is unimportant; it means you can drop it out of this particular model. If you have two identical features, one of them will be zero and the other non-zero, chosen essentially at random, which makes interpretation a little harder.

---
# Elastic Net

- Combines benefits of Ridge and Lasso
- two parameters to tune.

`$$\min_{w \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^n (w^T\mathbf{x}_i + b - y_i)^2 + \alpha_1 ||w||_1 + \alpha_2 ||w||^2_2 $$`

???
You can also combine the two penalties, and that's actually what often works best in practice: this is what's called the Elastic Net.
Elastic Net combines both of these penalties. You now have both terms, the L1 norm and the L2 norm, each with its own amount of regularization, so it generalizes both: depending on how you choose the two alphas, it can become ridge, it can become lasso, or anything in between. Generally, ridge helps generalization, so it's a good idea to have the ridge penalty in there; but if there are some features that are really not useful, the L1 penalty helps set them exactly to zero.

---
class: center

# Comparing unit balls

![:scale 70%](images/l1l2_elasticnet.png)

???
Looking at the unit balls, if you zoom in you can see that the Elastic Net ball is rounded, but it still has corners. The corners are the important part, because they are what allows you to hit exact zeros.

---
# Parametrization in scikit-learn

`$$\min_{w \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^n (w^T\mathbf{x}_i + b - y_i)^2 + \alpha \eta ||w||_1 + \alpha (1 - \eta) ||w||^2_2 $$`

Where $\eta$ is the relative amount of l1 penalty (`l1_ratio` in the code).

???
The way this is parametrized in scikit-learn is slightly different: there is a parameter alpha, which is the overall amount of regularization, and a parameter l1_ratio, which says how much of the penalty should be L1 versus L2. If you make l1_ratio one, you have Lasso; if you make it zero, you have Ridge. Don't use Lasso or Ridge with alpha set to zero, because the solver will not handle it well; if you actually want alpha equal to zero, use LinearRegression. We now have more parameters to tune, but also a more general model, and this often works pretty well.

---
# Grid-searching ElasticNet

```python
from sklearn.linear_model import ElasticNet

param_grid = {'alpha': np.logspace(-4, -1, 10),
              'l1_ratio': [0.01, .1, .5, .9, .98, 1]}

grid = GridSearchCV(ElasticNet(), param_grid, cv=10)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)
```
```
{'alpha': 0.0001, 'l1_ratio': 0.01}
0.718
```

???
Here I'm doing a grid search over two parameters; GridSearchCV then tries all possible combinations. I use a logarithmic grid for alpha, and for l1_ratio I use values close to zero, values close to one, and some in between. With a 2D grid search, analyzing the output gets a bit harder: we can't draw the nice one-dimensional curve anymore.

---
class: smaller

# Analyzing grid-search results

```python
import pandas as pd
res = pd.pivot_table(pd.DataFrame(grid.cv_results_),
                     values='mean_test_score',
                     index='param_alpha', columns='param_l1_ratio')
```

.center[
![:scale 60%](images/elasticnet_search.png)
]

???
The way I like to do it is to take grid.cv_results_, put it in a DataFrame, and build a pivot table where the values are the mean test score, the index is one parameter, and the columns are the other parameter. That lets me visualize the grid search nicely: alpha on one axis, l1_ratio on the other. You can see that if the l1_ratio is fairly high there are some pretty good results, provided you set alpha accordingly; there's a diagonal of pretty good settings. This is the model that did best. One caveat: this is cross-validation performance, and as I said last time, that's not really a good measure of generalization performance. Here I searched way more parameters, five or ten times as many models.
So it's likely that by chance I'll get better results. I didn't account for that here, partly because the dataset is small and very noisy, but in practice, if you want to compare models, you should evaluate them on a test set and see which models are actually better there. One more reason this visualization is helpful: if the best value sits on the edge of the graph, your parameter ranges were too small.

Why are we using R² instead of the squared loss? One answer is that it's the default in scikit-learn; the other is that it's nice to know the range, so you know that perfect prediction is one and you have some idea of what 0.5 means. The RMSE (the other metric you usually use) depends on the scale of the output. For the housing prices, for example, it might be interesting to see what the typical error is in dollars; if you want something in the units of the output, RMSE is good, or mean absolute error might be even better. If you want something independent of the units of the output, R² is pretty good, because it's usually between zero and one and measures something like correlation, so if it's 0.9 you know you have a pretty good model. If my RMSE is 10,000, I don't know whether I have a good or a bad model; it depends on the range of the outputs.

The last thing to note today: so far we've only been changing the regularization part. The two most common regularizers we looked at are Ridge, which is an L2 penalty, and Lasso, which is an L1 penalty, plus their combination, the Elastic Net. Next we'll talk about changing the first part, the squared loss on the predictions.

---
class: middle

# Questions ?