class: center, middle

### W4995 Applied Machine Learning

# Preprocessing and Feature Transformations

02/05/20

Andreas C. Müller

???
Today we’ll talk about preprocessing and feature engineering. What we’re talking about today mostly applies to linear models, and not to tree-based models, but it also applies to neural nets and kernel SVMs.

FIXME making column encoding consistent in pandas the way kaggle says
FIXME target encoder be more clear on how to transform test set
FIXME add column transformer full syntax
FIXME add grid-search column-transformer example
FIXME add Max number of columns

---
class: center, middle

Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering.

.quote_author[Andrew Ng]

---
class: middle

![:scale 90%](images/house_price_scatter.png)

???

---
class: center, middle

# Scaling

???
N/A

---
class: center, middle

.center[
![:scale 70%](images/house_price_boxplot.png)
]

???
Let’s start with the different scales. Many models want data that is on the same scale.

KNearestNeighbors: If the distance in TAX is between 300 and 400 then the distance difference in CHAS doesn’t matter!

Linear models: the different scales mean different penalty. L2 is the same for all!

We can also see non-gaussian distributions here btw!

---
# Scaling and Distances

![:scale 90%](images/knn_scaling.png)

???
Here is an example of the importance of scaling using a distance-based algorithm, K nearest neighbors. This is my favorite toy dataset with two classes in two dimensions. The scatter plots look identical, but on the left hand side, the two axes have very different scales. The x axis has much larger values than the y axis. On the right hand side, I used StandardScaler and so both features have zero mean and unit variance. So what do you think will happen if I use k nearest neighbors here? Let's see.

---
# Scaling and Distances

![:scale 90%](images/knn_scaling2.png)

???
As you can see, the difference is quite dramatic. Because the x axis has a much larger magnitude on the left-hand side, only distances along the x axis matter. However, the important feature for this task is the y axis. So the important feature gets entirely ignored because of the different scales. And usually the scales don't have any meaning - it could be a matter of changing meters to kilometers.

Linear models: the different scales mean different penalty. L2 is the same for all!

---
class: center

# Ways to Scale Data

![:scale 80%](images/scaler_comparison_scatter.png)

???
Here's an illustration of four of the most common ways. One of the most common ones that we already saw before is the StandardScaler. StandardScaler subtracts the mean and divides by the standard deviation, so that all features have zero mean and a standard deviation of one. One thing about this is that it won't guarantee any minimum or maximum values. The range can be arbitrarily large.

MinMaxScaler subtracts the minimum and divides by the range, scaling between 0 and 1. So all the features will have an exact minimum at zero and an exact maximum at one. This mostly makes sense if there are actually minimum and maximum values in your data set. If the data is actually Gaussian distributed, the scaling might not make a lot of sense, because if you add one more data point that is very far away, it will cram all the other data points closer together. This will make sense if you have a grayscale value between 0 and 255 or something else that has clearly defined boundaries.

Another alternative is the RobustScaler. It’s the robust version of the StandardScaler. It computes the median and the quartiles, which cannot be skewed by outliers. The StandardScaler uses mean and standard deviation, so if you have a point that’s very far away, it can have unlimited influence on the mean. The RobustScaler uses robust statistics, so it’s not skewed by outliers.

The final one is the Normalizer. This rescales each sample (row) so that the vector has length one, in either the L1 norm or the L2 norm. If you do this for the L2 norm, it means you don't care about the length; you project onto a circle. It's more commonly used with the L1 norm, where it projects onto the diamond. What that means is basically that you make sure the sum of all the entries is one (for non-negative entries). That's often used if you have histograms or if you have counts of things. If you want to make sure that you have frequency features instead of count features, you can use the L1 normalizer.

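The four scalers described above can be compared side by side on a small array. This is only a sketch on made-up toy data (the array `X_demo` and its values are not from the lecture); it uses the scikit-learn classes named in the notes.

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler, Normalizer, RobustScaler, StandardScaler)

# Toy data (made up for illustration): two positive features on very different scales
rng = np.random.RandomState(0)
X_demo = np.c_[rng.uniform(200, 700, size=100),
               rng.uniform(0, 1, size=100)]

X_std = StandardScaler().fit_transform(X_demo)      # zero mean, unit variance per feature
X_minmax = MinMaxScaler().fit_transform(X_demo)     # each feature mapped to [0, 1]
X_robust = RobustScaler().fit_transform(X_demo)     # median 0, interquartile range 1 per feature
X_l1 = Normalizer(norm="l1").fit_transform(X_demo)  # each row rescaled to L1 norm 1

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(np.median(X_robust, axis=0))
print(np.abs(X_l1).sum(axis=1)[:5])
```

Note that the first three scalers work per feature (column), while the Normalizer works per sample (row).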
---
class: spacious

# Sparse Data

- Data with many zeros: only store non-zero entries.
- Subtracting anything will make the data “dense” (no more zeros) and blow the RAM.
- Only scale, don’t center (use MaxAbsScaler)

???
FIXME ugly slide

There’s another one that's important for sparse data. It’s called the MaxAbsScaler in scikit-learn. Sparse data is the kind of data, quite frequent in practice, where most of the features are zero most of the time. You might have tens of thousands of features, but nearly all of them are zero for any given sample. For example, you can think about these being particular user actions: at any time, any user makes only very few actions, but there are many, many possibilities. If you have data like that, you only want to store the non-zero values, you don't want to store all the zeros. If you stored all the zeros, the data often wouldn't even fit into RAM.

So you have this very large sparse data set and you want to scale it. But if you want to make it zero mean, then all the entries will become non-zero, because the chance that the mean is actually zero is small. If you subtract something from everything, you're going to move all those zeros away from zero, and then you can't store the data in sparse format and your RAM will blow up. Basically, you can't subtract anything from a sparse matrix because it won't be sparse anymore. But we can still scale it, because any number times zero is still zero. So if we scale it, it will have the same sparsity structure as before. The MaxAbsScaler sets the maximum absolute value to one. It looks at the maximum absolute value of each feature and makes sure that this is one, just by multiplying by one divided by the maximum absolute value.

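As a rough sketch of the point about sparse data, the snippet below scales a random SciPy sparse matrix (made up for illustration; `X_sparse` is not lecture data) with MaxAbsScaler and shows that StandardScaler refuses to center it unless centering is turned off.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Random sparse matrix, about 1% non-zero entries (illustrative data only)
X_sparse = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)

# Scaling only multiplies each column, so zeros stay zero and the result stays sparse
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)
print(sparse.issparse(X_maxabs), X_sparse.nnz, X_maxabs.nnz)

# Centering would densify the matrix, so scikit-learn raises an error instead
try:
    StandardScaler().fit(X_sparse)
    centering_failed = False
except ValueError as err:
    centering_failed = True
    print("StandardScaler with centering:", err)

# Scaling to unit variance without centering is allowed
X_unit_var = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(sparse.issparse(X_unit_var))
```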
---
# Standard Scaler Example

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Back to King County house prices
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

ridge = Ridge().fit(X_train_scaled, y_train)
X_test_scaled = scaler.transform(X_test)
ridge.score(X_test_scaled, y_test)
```

```
0.684
```

???
Here’s how you do the scaling with StandardScaler in scikit-learn. Similar interface to models, but “transform” instead of “predict”. “transform” is always used when you want a new representation of the data.

Fit on training set, transform training set, fit ridge on scaled data, transform test data, score scaled test data. The fit computes mean and standard deviation on the training set; transform subtracts the mean and divides by the standard deviation.

We fit on the training set and apply transform on both the training and the test set. That means the training set mean gets subtracted from the test set, not the test-set mean. That’s quite important.

---
class: center, middle

.center[
![:scale 100%](images/no_separate_scaling.png)
]

???
Here’s an illustration why this is important using the min-max scaler. Left is the original data. Center is what happens when we fit on the training set and then transform the training and test set using this transformer. The data looks exactly the same, but the ticks changed. Now the data has a minimum of zero and a maximum of one on the training set. That’s not true for the test set, though. No particular range is ensured for the test set. It could even be outside of 0 and 1. But the transformation is consistent with the transformation on the training set, so the data looks the same.

On the right you see what happens when you use the test-set minimum and maximum for scaling the test set. That’s what would happen if you’d fit again on the test set.
Now the test set also has minimum at 0 and maximum at 1, but the data is totally distorted from what it was before. So don’t do that.

---
class: center, middle

# Scikit-Learn API Summary

![:scale 90%](images/api-table.png)

Efficient shortcuts:

```python
est.fit_transform(X) == est.fit(X).transform(X) # mostly
est.fit_predict(X) == est.fit(X).predict(X) # mostly
```

???
Here’s a summary of the scikit-learn methods. All models have a fit method which takes the training data X_train. If the model is supervised, such as our classification and regression models, it also takes a y_train parameter. The scalers don’t use y_train because they don’t use the labels at all; you could say they are unsupervised methods, but arguably they are not really learning methods at all.

To make a prediction of a target variable with a model (also known as an estimator in scikit-learn), you use the predict method, as in classification and regression.

If you want to create a new representation of the data, a new kind of X, then you use the transform method, as we did with scaling. The transform method is also used for preprocessing, feature extraction and feature selection, which we’ll see later. All of these change X into some new form.

There are two important shortcuts. To fit an estimator and immediately transform the training data, you can use fit_transform. That’s often more efficient than using first fit and then transform. The same goes for fit_predict.

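To make the fit / transform / fit_transform relationship concrete, here is a small sketch on random toy arrays (`X_fit` and `X_new` are made up for illustration). It shows that `fit` stores the learned statistics on the estimator, that `transform` reuses them on new data, and that `fit_transform` gives the same result as `fit` followed by `transform` for StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only): one array to fit on, one unseen array
rng = np.random.RandomState(0)
X_fit = rng.normal(loc=5, scale=2, size=(50, 3))
X_new = rng.normal(loc=5, scale=2, size=(10, 3))

scaler = StandardScaler().fit(X_fit)
print(scaler.mean_)   # per-feature means learned from X_fit
print(scaler.scale_)  # per-feature standard deviations learned from X_fit

# transform reuses the statistics learned in fit, also for unseen data
X_new_scaled = scaler.transform(X_new)
manual = (X_new - X_fit.mean(axis=0)) / X_fit.std(axis=0)

# fit_transform is a shortcut for fit followed by transform on the same data
two_steps = StandardScaler().fit(X_fit).transform(X_fit)
one_step = StandardScaler().fit_transform(X_fit)
print(np.allclose(two_steps, one_step))
```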
---
class: smaller, compact

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import RidgeCV

scores = cross_val_score(RidgeCV(), X_train, y_train, cv=10)
np.mean(scores), np.std(scores)
```

```
(0.694, 0.027)
```

```python
scores = cross_val_score(RidgeCV(), X_train_scaled, y_train, cv=10)
np.mean(scores), np.std(scores)
```

```
(0.694, 0.027)
```

```python
from sklearn.neighbors import KNeighborsRegressor

scores = cross_val_score(KNeighborsRegressor(), X_train, y_train, cv=10)
np.mean(scores), np.std(scores)
```

```
(0.500, 0.039)
```

```python
scores = cross_val_score(KNeighborsRegressor(), X_train_scaled, y_train, cv=10)
np.mean(scores), np.std(scores)
```

```
(0.786, 0.030)
```

???
Let’s apply the scaler to the King County housing data. First I used the StandardScaler to scale the training data. Then I applied ten-fold cross-validation to evaluate the Ridge model on the data with and without scaling. I used RidgeCV which automatically picks alpha for me. With and without scaling we get an R^2 of about .69, so no difference. Often there is a difference for Ridge, but not in this case.

If we use KNeighborsRegressor instead, we see a big difference. Without scaling R^2 is about .5, and with scaling it’s about .79. That makes sense since we saw that for distance calculations basically all features are dominated by the TAX feature.

However, there is a bit of a problem with the analysis we did here. Can you see it?

---
class: center, middle

# A note on preprocessing
# (and pipelines)

???
I want to talk a bit more about preprocessing and cross-validation here, and introduce pipelines.

---
class: smallest, compact

# A common error

```python
print(X.shape)
```

```
(100, 10000)
```

```python
# select most informative 5% of features
from sklearn.feature_selection import SelectPercentile, f_regression
select = SelectPercentile(score_func=f_regression, percentile=5)
select.fit(X, y)
X_selected = select.transform(X)
print(X_selected.shape)
```

```
(100, 500)
```

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
np.mean(cross_val_score(Ridge(), X_selected, y))
```

```
0.90
```

```python
ridge = Ridge().fit(X_selected, y)
X_test_selected = select.transform(X_test)
ridge.score(X_test_selected, y_test)
```

```
-0.18
```

---
class: smallest

# Leaking Information

.left-column[
```python
# BAD!
select.fit(X, y)  # includes the cv test parts!
X_sel = select.transform(X)
scores = []
for train, test in cv.split(X, y):
    ridge = Ridge().fit(X_sel[train], y[train])
    score = ridge.score(X_sel[test], y[test])
    scores.append(score)
```
]

.right-column[
```python
# GOOD!
scores = []
for train, test in cv.split(X, y):
    select.fit(X[train], y[train])
    X_sel_train = select.transform(X[train])
    ridge = Ridge().fit(X_sel_train, y[train])
    X_sel_test = select.transform(X[test])
    score = ridge.score(X_sel_test, y[test])
    scores.append(score)
```
]

.reset-column[
Need to include preprocessing in cross-validation!
]

???
What we did was we trained the scaler on the training data, and then applied cross-validation to the scaled data. That’s what’s shown on the left. The problem is that we use the information of all of the training data for scaling, so in particular the information in the test fold. This is also known as information leakage. If we apply our model to new data, this data will not have been used to do the scaling, so our cross-validation will give us a biased result that might be too optimistic.

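To see how large the optimism from leakage can get, here is a minimal sketch on pure noise (the arrays `X_noise` and `y_noise` are made up for illustration, with the same shape as the feature selection slide). The target is independent of every feature, so an honest estimate of R^2 should not be clearly above zero.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: y has no relationship to X (illustrative data only)
rng = np.random.RandomState(0)
X_noise = rng.normal(size=(100, 10000))
y_noise = rng.normal(size=100)

# Leaky: feature selection has seen all samples, including the later test folds
select = SelectPercentile(score_func=f_regression, percentile=5)
X_selected = select.fit_transform(X_noise, y_noise)
leaky_score = np.mean(cross_val_score(Ridge(), X_selected, y_noise, cv=5))

# Honest: selection is refit inside each cross-validation split
pipe = make_pipeline(
    SelectPercentile(score_func=f_regression, percentile=5), Ridge())
honest_score = np.mean(cross_val_score(pipe, X_noise, y_noise, cv=5))

print("leaky R^2:  %.2f" % leaky_score)
print("honest R^2: %.2f" % honest_score)
```

The only difference between the two estimates is whether the selection step sees the held-out fold.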
On the right you can see how we should do it: we should only use the training part of the data to find the mean and standard deviation, even in cross-validation. That means that for each split in the cross-validation, we need to scale the data a bit differently. This basically means the scaling should happen inside the cross-validation loop, not outside.

In practice, estimating mean and standard deviation is quite robust and you will not see a big difference between the two methods. But for other preprocessing steps that we’ll see later, this might make a huge difference. So we should get it right from the start.

---
class: smaller

```python
# Housing data example
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
X, y = df, target

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
ridge = Ridge().fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)
ridge.score(X_test_scaled, y_test)
```

```
0.684
```

```python
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
```

```
0.684
```

???
Now I want to show you how to do preprocessing and cross-validation right with scikit-learn. At the top here you see the workflow for scaling the data and then applying ridge again. Fit the scaler on the training set, transform the training set, fit ridge on the training set, transform the test set, and evaluate the model.

Because this is such a common pattern, scikit-learn has a tool to make this easier, the pipeline. The pipeline is an estimator that allows you to chain multiple transformations of the data before you apply a final model. You can build a pipeline using the make_pipeline function. Just provide all the estimators as parameters. All but the last one need to have a transform method. Here we only have two steps: the standard scaler and ridge.

make_pipeline returns an estimator that does both steps at once.
We can call fit on it to fit first the scaler and then ridge on the scaled data, and when we call score, it transforms the data and then evaluates the model. The code below is exactly equivalent to the code above, only shorter.

---
class: left, middle

.center[
![:scale 70%](images/pipeline.png)
]

???
Let’s dive a bit more into the pipeline. Here is an illustration of what happens with three steps, T1, T2 and Classifier. Imagine T1 to be a scaler and T2 to be any other transformation of the data.

If we call fit on this pipeline, it will first call fit on the first step with the input X. Then it will transform the input X to X1, and use X1 to fit the second step, T2. Then it will use T2 to transform the data from X1 to X2. Then it will fit the classifier on X2.

If we call predict on some data X’, say the test set, it will call transform on T1, creating X’1. Then it will use T2 to transform X’1 into X’2, and call the predict method of the classifier on X’2.

This sounds a bit complicated, but it’s really just doing “the right thing” to apply multiple transformation steps.

---
class: smallest

# Undoing our feature selection mistake

.left-column[
```python
# BAD!
select.fit(X, y)  # includes the cv test parts!
X_sel = select.transform(X)
scores = []
for train, test in cv.split(X, y):
    ridge = Ridge().fit(X_sel[train], y[train])
    score = ridge.score(X_sel[test], y[test])
    scores.append(score)
```

Same as:

```python
select.fit(X, y)
X_selected = select.transform(X)
np.mean(cross_val_score(Ridge(), X_selected, y))
```

```
0.90
```
]

.right-column[
```python
# GOOD!
scores = []
for train, test in cv.split(X, y):
    select.fit(X[train], y[train])
    X_sel_train = select.transform(X[train])
    ridge = Ridge().fit(X_sel_train, y[train])
    X_sel_test = select.transform(X[test])
    score = ridge.score(X_sel_test, y[test])
    scores.append(score)
```

Same as:

```python
pipe = make_pipeline(select, Ridge())
np.mean(cross_val_score(pipe, X, y))
```

```
-0.079
```
]

???
How does that help with the cross-validation problem? Because now all steps are contained in the pipeline, we can simply pass the whole pipeline to cross-validation, and all processing will happen inside the cross-validation loop. That solves the data leakage problem. Here you can see how we can build a pipeline using a standard scaler and KNeighborsRegressor and pass it to cross-validation.

---
# Naming Steps

```python
from sklearn.pipeline import make_pipeline
knn_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
print(knn_pipe.steps)
```

```
[('standardscaler', StandardScaler()),
 ('kneighborsregressor', KNeighborsRegressor())]
```

```python
from sklearn.pipeline import Pipeline
pipe = Pipeline([("scaler", StandardScaler()),
                 ("regressor", KNeighborsRegressor())])
```

???
But let’s talk a bit more about pipelines, because they are great. The pipeline has an attribute called steps, which contains its steps. Steps is a list of tuples, where the first entry is a string and the second is an estimator (model). The string is the “name” that is assigned to this step in the pipeline. You can see here that our first step is called “standardscaler” in all lower case letters, and the second is called “kneighborsregressor”, also all lower case letters. By default, step names are just lowercased class names.

You can also name the steps yourself using the Pipeline class directly. Then you can specify the steps as tuples of name and estimator. make_pipeline is just a shortcut to generate the names automatically.

---
# Pipeline and GridSearchCV

.small-padding-top[
```python
from sklearn.model_selection import GridSearchCV

knn_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
param_grid = {'kneighborsregressor__n_neighbors': range(1, 10)}
grid = GridSearchCV(knn_pipe, param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))
```

```
{'kneighborsregressor__n_neighbors': 7}
0.60
```
]

???
These names are important for using pipelines with grid search. Recall that for using GridSearchCV you need to specify a parameter grid as a dictionary, where the keys are the parameter names. If you are using a pipeline inside GridSearchCV, you need to specify not only the parameter name, but also the step name, because multiple steps could have a parameter with the same name. The way to do this is to use the step name, then two underscores, and then the parameter name, as the key for the param_grid dictionary. You can see that best_params_ will have this same format.

This way you can tune the parameters of all steps in a pipeline at once! And you don’t have to worry about leaking information, since all transformations are contained in the pipeline. You should always use pipelines for preprocessing. Not only does it make your code shorter, it also makes it less likely that you have bugs.

---
# Going wild with Pipelines

```python
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, random_state=0)

from sklearn.preprocessing import PolynomialFeatures
pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(),
    Ridge())

param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid=param_grid,
                    n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)
```

---
# Going wilder with Pipelines

```python
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([('scaler', StandardScaler()),
                 ('regressor', Ridge())])

param_grid = {'scaler': [StandardScaler(), MinMaxScaler(),
                         'passthrough'],
              'regressor': [Ridge(), Lasso()],
              'regressor__alpha': np.logspace(-3, 3, 7)}

grid = GridSearchCV(pipe, param_grid)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)
```

---
# Going wildest with Pipelines

```python
from sklearn.tree import DecisionTreeRegressor

pipe = Pipeline([('scaler', StandardScaler()),
                 ('regressor', Ridge())])

# check out searchgrid for more convenience
param_grid = [{'regressor': [DecisionTreeRegressor()],
               'regressor__max_depth': [2, 3, 4],
               'scaler': ['passthrough']},
              {'regressor': [Ridge()],
               'regressor__alpha': [0.1, 1],
               'scaler': [StandardScaler(), MinMaxScaler(),
                          'passthrough']}
              ]
grid = GridSearchCV(pipe, param_grid)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)
```

---
class: left, middle

# Categorical Variables

---
# Categorical Variables

.smaller[
```python
import pandas as pd
df = pd.DataFrame({
    'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
    'salary': [103, 89, 142, 54, 63, 219],
    'vegan': ['No', 'No','No','Yes', 'Yes', 'No']})
```
]

| | boro | salary | vegan |
|---|---|---|---|
| 0 | Manhattan | 103 | No |
| 1 | Queens | 89 | No |
| 2 | Manhattan | 142 | No |
| 3 | Brooklyn | 54 | Yes |
| 4 | Brooklyn | 63 | Yes |
| 5 | Bronx | 219 | No |

??? Before we can apply a machine learning algorithm, we first need to think about how we represent our data. Earlier, I said x \in R^n. That’s not how you usually get data. Often data has units, possibly different units for different sensors, it has a mixture of continuous values and discrete values, and different measurements might be on totally different scales. First, let me explain how to deal with discrete input variables, also known as categorical features. They come up in nearly all applications. Scikit-learn requires you to explicitly handle these, and assumes in general that all your input is continuous numbers. This is different from how many libraries in R do it, which deal with categorical variables implicitly. --- # Ordinal encoding .smaller[ ```python df['boro_ordinal'] = df.boro.astype("category").cat.codes df ``` ] .left-column[

| | boro | salary | vegan |
|---|---|---|---|
| 0 | 2 | 103 | No |
| 1 | 3 | 89 | No |
| 2 | 2 | 142 | No |
| 3 | 1 | 54 | Yes |
| 4 | 1 | 63 | Yes |
| 5 | 0 | 219 | No |

] -- .right-column[ ![:scale 100%](images/boro_ordinal.png) ] ??? If you encode all three values using the same feature, then you are imposing a linear relation between them, and in particular you define an order between the categories. Usually, there is no semantic ordering of the categories, and so we shouldn’t introduce one in our representation of the data. --- # Ordinal encoding .smaller[ ```python df['boro_ordinal'] = df.boro.astype("category").cat.codes df ``` ] .left-column[

| | boro_ordinal | salary | vegan |
|---|---|---|---|
| 0 | 2 | 103 | No |
| 1 | 3 | 89 | No |
| 2 | 2 | 142 | No |
| 3 | 1 | 54 | Yes |
| 4 | 1 | 63 | Yes |
| 5 | 0 | 219 | No |

] .right-column[ ![:scale 100%](images/boro_ordinal_classification.png) ] --- # One-Hot (Dummy) Encoding .narrow-left-column[

| | boro | salary | vegan |
|---|---|---|---|
| 0 | Manhattan | 103 | No |
| 1 | Queens | 89 | No |
| 2 | Manhattan | 142 | No |
| 3 | Brooklyn | 54 | Yes |
| 4 | Brooklyn | 63 | Yes |
| 5 | Bronx | 219 | No |

] .wide-right-column[ .tiny[ ```python pd.get_dummies(df) ```
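
| | salary | boro_Bronx | boro_Brooklyn | boro_Manhattan | boro_Queens | vegan_No | vegan_Yes |
|---|---|---|---|---|---|---|---|
| 0 | 103 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 89 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 142 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 54 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 63 | 0 | 1 | 0 | 0 | 0 | 1 |
| 5 | 219 | 1 | 0 | 0 | 0 | 1 | 0 |
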
]
]

???
Instead, we add one new feature for each category, and that feature encodes whether a sample belongs to this category or not. That’s called a one-hot encoding, because only one of the four boro features in this example is active at a time. You could actually get away with n-1 features, but in machine learning that usually doesn’t matter.

One way to do this is with Pandas. Here I have an example of a data frame where I have the boroughs of New York as a categorical variable and a variable saying whether they are vegan. One way to get the dummies is to call get_dummies on this data frame. This will create new columns; it will actually replace the boro column by four columns that correspond to the four different values. get_dummies applies the transformation to all columns that have a dtype that's either object or categorical. In this case we didn't actually want to transform the target variable vegan.

---
# One-Hot (Dummy) Encoding

.narrow-left-column[

| | boro | salary | vegan |
|---|---|---|---|
| 0 | Manhattan | 103 | No |
| 1 | Queens | 89 | No |
| 2 | Manhattan | 142 | No |
| 3 | Brooklyn | 54 | Yes |
| 4 | Brooklyn | 63 | Yes |
| 5 | Bronx | 219 | No |

] .wide-right-column[ .tiny[ ```python pd.get_dummies(df, columns=['boro']) ```
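
| | salary | vegan | boro_Bronx | boro_Brooklyn | boro_Manhattan | boro_Queens |
|---|---|---|---|---|---|---|
| 0 | 103 | No | 0 | 0 | 1 | 0 |
| 1 | 89 | No | 0 | 0 | 0 | 1 |
| 2 | 142 | No | 0 | 0 | 1 | 0 |
| 3 | 54 | Yes | 0 | 1 | 0 | 0 |
| 4 | 63 | Yes | 0 | 1 | 0 | 0 |
| 5 | 219 | No | 1 | 0 | 0 | 0 |
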
] ] ??? We can specify selectively which columns to apply the encoding to. --- # One-Hot (Dummy) Encoding .narrow-left-column[

| | boro | salary | vegan |
|---|---|---|---|
| 0 | 2 | 103 | No |
| 1 | 3 | 89 | No |
| 2 | 2 | 142 | No |
| 3 | 1 | 54 | Yes |
| 4 | 1 | 63 | Yes |
| 5 | 0 | 219 | No |

] .wide-right-column[ .tiny[ ```python pd.get_dummies(df_ordinal, columns=['boro']) ```
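
| | salary | vegan | boro_0 | boro_1 | boro_2 | boro_3 |
|---|---|---|---|---|---|---|
| 0 | 103 | No | 0 | 0 | 1 | 0 |
| 1 | 89 | No | 0 | 0 | 0 | 1 |
| 2 | 142 | No | 0 | 0 | 1 | 0 |
| 3 | 54 | Yes | 0 | 1 | 0 | 0 |
| 4 | 63 | Yes | 0 | 1 | 0 | 0 |
| 5 | 219 | No | 1 | 0 | 0 | 0 |
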
]
]

???
This also helps if the variable was already encoded using integers. Sometimes, someone has already encoded the categorical variables to integers like here. So this is exactly the same information, except that instead of strings you have them numbered. If you call get_dummies on this, the boro column is left untouched because it is not an object or categorical dtype. If you want the one-hot encoding, you can explicitly pass columns=['boro'] and this will transform it into boro_0, boro_1, boro_2 and boro_3. So by default get_dummies wouldn't do anything to this column, but we can tell it which variables are categorical and it will dummy encode those for us.

---
.tiny[
.left-column[
```python
df = pd.DataFrame({
    'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
    'salary': [103, 89, 142, 54, 63, 219],
    'vegan': ['No', 'No','No','Yes', 'Yes', 'No']})

df_dummies = pd.get_dummies(df, columns=['boro'])
```

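| | salary | vegan | boro_Bronx | boro_Brooklyn | boro_Manhattan | boro_Queens |
|---|---|---|---|---|---|---|
| 0 | 103 | No | 0 | 0 | 1 | 0 |
| 1 | 89 | No | 0 | 0 | 0 | 1 |
| 2 | 142 | No | 0 | 0 | 1 | 0 |
| 3 | 54 | Yes | 0 | 1 | 0 | 0 |
| 4 | 63 | Yes | 0 | 1 | 0 | 0 |
| 5 | 219 | No | 1 | 0 | 0 | 0 |
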
] .right-column[ ```python df = pd.DataFrame({ 'boro': ['Brooklyn', 'Manhattan', 'Brooklyn', 'Queens', 'Brooklyn', 'Staten Island'], 'salary': [61, 146, 142, 212, 98, 47], 'vegan': ['Yes', 'No','Yes','No', 'Yes', 'No']}) df_dummies = pd.get_dummies(df, columns=['boro']) ```
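
| | salary | vegan | boro_Brooklyn | boro_Manhattan | boro_Queens | boro_Staten Island |
|---|---|---|---|---|---|---|
| 0 | 61 | Yes | 1 | 0 | 0 | 0 |
| 1 | 146 | No | 0 | 1 | 0 | 0 |
| 2 | 142 | Yes | 1 | 0 | 0 | 0 |
| 3 | 212 | No | 0 | 0 | 1 | 0 |
| 4 | 98 | Yes | 1 | 0 | 0 | 0 |
| 5 | 47 | No | 0 | 0 | 0 | 1 |
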
]]

???
Suppose someone else gives you a new data set, and in this new data set there are Staten Island, Manhattan, Queens and Brooklyn, but nobody from the Bronx. If you transform this with get_dummies, you get something that has the same shape as the original data, but the columns mean something completely different. For example, the last column is now Staten Island, not Queens. If someone gives you separate training and test data sets and you call get_dummies on each, you don't know that the columns actually correspond to the same thing. You have to take care of the names yourself, because unfortunately scikit-learn completely ignores column names.

---
class: smaller

# Pandas Categorical Columns

.smaller[
```python
df = pd.DataFrame({
    'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
    'salary': [103, 89, 142, 54, 63, 219],
    'vegan': ['No', 'No','No','Yes', 'Yes', 'No']})

df['boro'] = pd.Categorical(df.boro,
                            categories=['Manhattan', 'Queens', 'Brooklyn',
                                        'Bronx', 'Staten Island'])
pd.get_dummies(df, columns=['boro'])
```
]

.tiny[
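
| | salary | vegan | boro_Manhattan | boro_Queens | boro_Brooklyn | boro_Bronx | boro_Staten Island |
|---|---|---|---|---|---|---|---|
| 0 | 103 | No | 1 | 0 | 0 | 0 | 0 |
| 1 | 89 | No | 0 | 1 | 0 | 0 | 0 |
| 2 | 142 | No | 1 | 0 | 0 | 0 | 0 |
| 3 | 54 | Yes | 0 | 0 | 1 | 0 | 0 |
| 4 | 63 | Yes | 0 | 0 | 1 | 0 | 0 |
| 5 | 219 | No | 0 | 0 | 0 | 1 | 0 |
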
]

???
The way to fix this is by using Pandas categorical types. Since we know what the boroughs of New York are, we can create a Pandas categorical dtype with the categories Manhattan, Queens, Brooklyn, Bronx, and Staten Island. So now I have my column here and I'm going to convert it to a categorical dtype. It will then not actually store the strings. It will just internally store zero to four, and it will also store what the possible values are. If I call get_dummies, it will use all the possible values and create a column for each of them. Even though Staten Island has not appeared in my dataset, it will still make a column for Staten Island. If I fix this categorical dtype, I can apply it to the training and test data sets, and that'll make sure all the columns are always the same no matter what values are actually in the data set.

---
# OneHotEncoder

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'salary': [103, 89, 142, 54, 63, 219],
                   'boro': ['Manhattan', 'Queens', 'Manhattan',
                            'Brooklyn', 'Brooklyn', 'Bronx']})

ce = OneHotEncoder().fit(df)
ce.transform(df).toarray()
```

```
array([[ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])
```

- Always transforms all columns

???
- Fit-transform paradigm ensures train and test-set categories correspond.

---
# OneHotEncoder + ColumnTransformer

```python
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = df.dtypes == object

preprocess = make_column_transformer(
    (StandardScaler(), ~categorical),
    (OneHotEncoder(), categorical))

model = make_pipeline(preprocess, LogisticRegression())
```

???
The way to use this with mixed type data is the column transformer, which allows you to transform only some of the columns.
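
As a runnable end-to-end sketch of this idea, the snippet below builds the toy borough data frame, selects columns by name rather than by boolean mask, and fits the whole pipeline. The `handle_unknown='ignore'` setting and the unseen 'Staten Island' row are assumptions added here for illustration; they are not part of the slide.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df_toy = pd.DataFrame({
    'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
    'salary': [103, 89, 142, 54, 63, 219],
    'vegan': ['No', 'No', 'No', 'Yes', 'Yes', 'No']})
X_toy = df_toy[['boro', 'salary']]
y_toy = df_toy['vegan']

# Scale the continuous column, one-hot encode the categorical column
preprocess = make_column_transformer(
    (StandardScaler(), ['salary']),
    (OneHotEncoder(handle_unknown='ignore'), ['boro']))  # assumption: ignore unseen categories

X_trans = preprocess.fit_transform(X_toy)
print(X_trans.shape)  # one scaled column plus one column per borough seen in fit

# The categories are learned in fit, so new data is encoded with the same columns
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X_toy, y_toy)
X_new = pd.DataFrame({'boro': ['Staten Island'], 'salary': [80]})
pred = model.predict(X_new)
print(pred)
```

Because the encoder's categories are fixed at fit time, the unseen borough is encoded as all zeros instead of shifting the columns, which is the failure mode get_dummies had on separate data sets.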
For example, you can apply the OneHotEncoder only to the categorical columns and the StandardScaler to the non-categorical columns, and then use that to preprocess your data. So when using pandas, either make sure your column names match up and everything is converted to integers, or use the column transformer and everything is awesome. In contrast to basically all other estimators in sklearn, this uses the column information in pandas and allows you to slice out different columns based on column names, integer indices or boolean masks. In this example I'm constructing a boolean mask.

---
class: center, middle

![:scale 100%](images/column_transformer_schematic.png)

???
Here's a schematic of the column transformer. Most commonly you might want to separate continuous and categorical columns, but you can select any subsets of columns you like. They can also overlap. Or you can apply multiple transformations to the same set of columns. Let's say I want a scaled version of the data, but I also want to extract principal components. I can use the same columns as inputs to multiple transformers, and the results will be concatenated.

FIXME add code

---
class: some-space

# Dummy variables and collinearity

- One-hot is redundant (last one is 1 - sum of others)
- Can introduce collinearity
- Can drop one
- Choice which one matters for penalized models
- Keeping all can make the model more interpretable

???
N/A

---
class: some-space

# Models Supporting Discrete Features

- In principle:
  - All tree-based models, naive Bayes
- In scikit-learn:
  - Some Naive Bayes classifiers.
- In scikit-learn "soon":
  - Decision trees, random forests, gradient boosting

???
In principle all tree-based models support categorical features; in scikit-learn none of them do, but hopefully soon they will. So what you can do is either use the OneHotEncoder or just encode the variable as integers and treat it as continuous.
If you have categorical variables with very many levels, keeping them as integers might make more sense.

---
# Target Encoding (Impact Encoding)

![:scale 100%](images/zip_code_prices.png)

---
class: some-space

# Target Encoding (Impact Encoding)

- For high cardinality categorical features
- Instead of 70 one-hot variables, one “response encoded” variable.
- For regression:
  - “average price in zip code”
- Binary classification:
  - “buildings in this zip code have a likelihood p for class 1”
- Multiclass:
  - One feature per class: probability distribution

???
There's another way to encode categorical variables that is often used; I like to call it target-based encoding. It's meant for very high cardinality categorical features. For example, if you have a categorical feature that's all US states and you don't have a lot of samples, or a categorical feature that's all US zip codes, you don't want to do one-hot encoding. With states you'd get 50 new features, which is a lot of features if you don't have much data.

Instead, you can use one single variable that basically encodes the response. For regression, it would be "people in this state have an average response of that". Obviously you don't want to compute this on the test set: for each level of the categorical variable, you find the mean response on the training data and use that as the feature value. So you get one single feature.

For binary classification, you can use the fraction of people that are classified as class one. For multi-class, you usually use the fraction of people in each of the classes. So in multi-class you get one new feature per class, and for each state you count how many people in this state are classified as each of them.
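
To make the regression case concrete, here is a minimal sketch of target encoding done by hand with pandas. The toy zip codes and prices are made up for illustration, and falling back to the global mean for unseen categories is an assumption of this sketch; library encoders usually also smooth the per-category means, which is left out here.

```python
import pandas as pd

# Hypothetical toy data, not the house_sales dataset used on the slides.
train = pd.DataFrame({
    "zipcode": ["98004", "98004", "98029", "98029", "98105"],
    "price": [1_300_000, 1_400_000, 600_000, 640_000, 850_000],
})
# The last test zip code never appears in the training data.
test = pd.DataFrame({"zipcode": ["98029", "98004", "98999"]})

# Learn the mean response per category from the training data only.
zip_means = train.groupby("zipcode")["price"].mean()
global_mean = train["price"].mean()

# Replace each category by its learned mean.
train_encoded = train["zipcode"].map(zip_means)
# Unseen categories map to NaN; fall back to the global training mean
# (an assumption of this sketch, other fallbacks are possible).
test_encoded = test["zipcode"].map(zip_means).fillna(global_mean)

print(test_encoded.tolist())
```

The point of the sketch is that the mapping is learned once on the training set and then only looked up for the test set, the same way any other fit/transform step works.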
---
class: center, middle

# More encodings for categorical features:

## http://contrib.scikit-learn.org/categorical-encoding/

---
# Load data, include ZIP code

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

data = fetch_openml("house_sales", as_frame=True)
X = data.frame.drop(['date', 'price'], axis=1)
target = data.frame['price']  # the sale price is the regression target
X_train, X_test, y_train, y_test = train_test_split(X, target)
X_train.columns
```
```
Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')
```

---
class: compact

.tiny[
```python
X_train.head()
```

| | bedrooms | bathrooms | sqft_living | sqft_lot | floors | ... | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10666 | 4.0 | 2.50 | 2160.0 | 7000.0 | 2.0 | ... | 98029.0 | 47.566 | -122.013 | 2300.0 | 7440.0 |
| 19108 | 4.0 | 4.25 | 3250.0 | 11780.0 | 2.0 | ... | 98004.0 | 47.632 | -122.203 | 1800.0 | 9000.0 |
| 20132 | 3.0 | 2.50 | 1280.0 | 1920.0 | 3.0 | ... | 98105.0 | 47.662 | -122.324 | 1450.0 | 1900.0 |
| 16169 | 4.0 | 1.50 | 1220.0 | 9600.0 | 1.0 | ... | 98014.0 | 47.646 | -121.909 | 1180.0 | 9000.0 |
| 16890 | 3.0 | 1.50 | 2120.0 | 6290.0 | 1.0 | ... | 98108.0 | 47.566 | -122.318 | 1620.0 | 5400.0 |

```python
from category_encoders import TargetEncoder

te = TargetEncoder(cols='zipcode').fit(X_train, y_train)
te.transform(X_train).head()
```

| | bedrooms | bathrooms | sqft_living | sqft_lot | floors | ... | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10666 | 4.0 | 2.50 | 2160.0 | 7000.0 | 2.0 | ... | 6.164e+05 | 47.566 | -122.013 | 2300.0 | 7440.0 |
| 19108 | 4.0 | 4.25 | 3250.0 | 11780.0 | 2.0 | ... | 1.357e+06 | 47.632 | -122.203 | 1800.0 | 9000.0 |
| 20132 | 3.0 | 2.50 | 1280.0 | 1920.0 | 3.0 | ... | 8.503e+05 | 47.662 | -122.324 | 1450.0 | 1900.0 |
| 16169 | 4.0 | 1.50 | 1220.0 | 9600.0 | 1.0 | ... | 4.464e+05 | 47.646 | -121.909 | 1180.0 | 9000.0 |
| 16890 | 3.0 | 1.50 | 2120.0 | 6290.0 | 1.0 | ... | 3.604e+05 | 47.566 | -122.318 | 1620.0 | 5400.0 |

```python
y_train.groupby(X_train.zipcode).mean()[X_train.head().zipcode]
```

| zipcode | price |
|---|---|
| 98029.0 | 616356.941 |
| 98004.0 | 1.357e+06 |
| 98105.0 | 850306.816 |
| 98014.0 | 446448.065 |
| 98108.0 | 360416.811 |

]

---
class: smallest

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Baseline: drop the zip code entirely
X = data.frame.drop(['date', 'price', 'zipcode'], axis=1)
scores = cross_val_score(Ridge(), X, target)
np.mean(scores)
```
```
0.69
```
--
```python
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the zip code, pass the other columns through unchanged
X = data.frame.drop(['date', 'price'], axis=1)
ct = make_column_transformer((OneHotEncoder(), ['zipcode']),
                             remainder='passthrough')
pipe_ohe = make_pipeline(ct, Ridge())
scores = cross_val_score(pipe_ohe, X, target)
np.mean(scores)
```
```
0.52
```
--
```python
from category_encoders import TargetEncoder

# Target-encode the zip code inside the pipeline, so it is fit per CV fold
X = data.frame.drop(['date', 'price'], axis=1)
pipe_target = make_pipeline(TargetEncoder(cols='zipcode'), Ridge())
scores = cross_val_score(pipe_target, X, target)
np.mean(scores)
```
```
0.78
```

---
class: center, middle

# Questions?