Categorical Variables

In the last section we talked about scaling for continuous features, which is what all the features in the cancer dataset were. In the lending club data, we also saw another type of feature: categorical or discrete features. These are features that take one of several distinct values, which are usually not numerical and often not even ordered. In the lending club data, examples were the grade, which could be a letter from 'A' to 'G', and home ownership, which could be 'RENT', 'MORTGAGE', 'OWN' or 'ANY'.

import pandas as pd
loans = pd.read_csv("C:/Users/t3kci/Downloads/loan.csv/loan.csv", nrows=100000)
loans.grade.unique()
array(['C', 'D', 'B', 'A', 'E', 'F', 'G'], dtype=object)
loans.home_ownership.unique()
array(['RENT', 'MORTGAGE', 'OWN', 'ANY'], dtype=object)

Most machine learning algorithms require that we preprocess these features into some numeric encoding before we can use them. For our old friend KNeighborsClassifier, we could in theory define a distance metric directly on these categorical features (many options exist), though the simpler and more common approach is to transform the data so we can use the standard Euclidean distance on all the features. Let's take a small subset of the data to investigate the possible preprocessing methods:

# look only at the two loan statuses we discussed in Chapter TODO
loans_paid = loans[loans.loan_status.isin(['Fully Paid', 'Charged Off'])]
# For this example, we want some paid and some charged off loans
# we group them to make sure we get some of both. We then take the first 10 of each group and remove the grouping index
some_loans = loans_paid.groupby('loan_status').apply(lambda x: x.head(10)).reset_index(drop=True)
# we only consider three relatively well-behaved features and the loan status
small_data = some_loans[['loan_amnt', 'home_ownership', 'grade', 'loan_status']]
small_data
loan_amnt home_ownership grade loan_status
0 8000 MORTGAGE A Charged Off
1 6000 MORTGAGE C Charged Off
2 10000 MORTGAGE A Charged Off
3 10000 RENT E Charged Off
4 35000 MORTGAGE C Charged Off
5 4800 MORTGAGE C Charged Off
6 35000 RENT C Charged Off
7 15000 OWN B Charged Off
8 16000 RENT B Charged Off
9 25000 MORTGAGE A Charged Off
10 30000 MORTGAGE D Fully Paid
11 40000 MORTGAGE C Fully Paid
12 20000 MORTGAGE A Fully Paid
13 4500 RENT B Fully Paid
14 8425 MORTGAGE E Fully Paid
15 20000 RENT D Fully Paid
16 6600 RENT B Fully Paid
17 2500 RENT C Fully Paid
18 4000 MORTGAGE D Fully Paid
19 2700 OWN A Fully Paid

The loan amount is a continuous feature (even though it is stored as an integer), while grade and home ownership are categorical. The loan_status is our classification target (which means it's also a discrete variable, but we don't consider it a feature and won't process it as such). Unlike some other libraries and frameworks, scikit-learn requires you to explicitly handle categorical features in most cases. However, that gives you more control over the processing, and a better idea of what happens to your data.

The encoding that is most appropriate depends on the model you’re using, but there are some general encoding schemes that are frequently used.

Ordinal encoding

One of the simplest ways to encode categorical data is to assign an integer to each category. You could do this with the OrdinalEncoder in scikit-learn, or with pandas by using pandas categorical data:

# extract a column and convert it to categorical data (it was represented as strings before)
home_ownership_cat = small_data.home_ownership.astype('category')
home_ownership_cat
0     MORTGAGE
1     MORTGAGE
2     MORTGAGE
3         RENT
4     MORTGAGE
5     MORTGAGE
6         RENT
7          OWN
8         RENT
9     MORTGAGE
10    MORTGAGE
11    MORTGAGE
12    MORTGAGE
13        RENT
14    MORTGAGE
15        RENT
16        RENT
17        RENT
18    MORTGAGE
19         OWN
Name: home_ownership, dtype: category
Categories (3, object): [MORTGAGE, OWN, RENT]
# get integer codes from the categorical data
# all categorical operations are accessible through the cat attribute:
home_ownership_cat.cat.codes
0     0
1     0
2     0
3     2
4     0
5     0
6     2
7     1
8     2
9     0
10    0
11    0
12    0
13    2
14    0
15    2
16    2
17    2
18    0
19    1
dtype: int8

We could create a new dataframe using these integer codes, which could then be interpreted by a machine learning model. However, this is often problematic, as it imposes an order and a distance between the different categories that might not accurately reflect the semantics of the data. Both pandas and scikit-learn by default use the lexical ordering of the categories, so MORTGAGE corresponds to 0, OWN to 1 and RENT to 2. This order makes little sense. We could specify our own ordering, say RENT, MORTGAGE, OWN (describing increasing degrees of ownership), but this is also not entirely satisfactory: by encoding the categories as integers, we postulate that the difference between RENT and MORTGAGE is the same as the difference between MORTGAGE and OWN, and that the difference between RENT and OWN is twice the distance between MORTGAGE and OWN. Making these assumptions seems questionable, and in many cases even ordering the categories is hard - imagine working on a dataset of cars that includes the color; imposing any ordering there seems very arbitrary.

For the grades, using an integer encoding might be reasonable, as there is a clear ordering and arguably a distance. Whether this is appropriate depends on the model you are using. If there are only a few categories, as here, it's probably a safer bet to forego the ordinal encoding and use a different scheme instead.
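
That said, if you do want an ordinal encoding with scikit-learn, here is a minimal sketch using OrdinalEncoder with an explicit category order (the RENT < MORTGAGE < OWN ordering is just the illustrative choice discussed above):

from sklearn.preprocessing import OrdinalEncoder
# one list of categories per column, in the order we want them encoded (0, 1, 2)
oe = OrdinalEncoder(categories=[['RENT', 'MORTGAGE', 'OWN']])
# OrdinalEncoder expects a 2d input, so we pass a one-column dataframe
oe.fit_transform(small_data[['home_ownership']])[:5]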

One-Hot Encoding

The most commonly used encoding scheme for categorical variables by far is the so-called one-hot encoding or dummy encoding. The idea behind one-hot encoding is to add a new column for each value of a categorical variable, and set the column to 1 for the category that applies to the row, and 0 for all the other categories. An easy way to compute this encoding is the get_dummies function in pandas:

small_data[['grade']].head()
grade
0 A
1 C
2 A
3 E
4 C
pd.get_dummies(small_data[['grade']]).head()
grade_A grade_B grade_C grade_D grade_E
0 1 0 0 0 0
1 0 0 1 0 0
2 1 0 0 0 0
3 0 0 0 0 1
4 0 0 1 0 0

As you can see, get_dummies replaced the single grade column with five new columns, one for each value that appears in the data. The grade in the first row was A, so the new column grade_A contains a 1 while all the other columns contain a 0. We can also call get_dummies on the whole dataframe; in this case, it will apply dummy encoding to all columns that have a categorical or object (usually string) dtype:

pd.get_dummies(small_data).head()
loan_amnt home_ownership_MORTGAGE home_ownership_OWN home_ownership_RENT grade_A grade_B grade_C grade_D grade_E loan_status_Charged Off loan_status_Fully Paid
0 8000 1 0 0 1 0 0 0 0 1 0
1 6000 1 0 0 0 0 1 0 0 1 0
2 10000 1 0 0 1 0 0 0 0 1 0
3 10000 0 0 1 0 0 0 0 1 1 0
4 35000 1 0 0 0 0 1 0 0 1 0

As you can see, loan_amnt wasn’t changed, while the dummy encoding was applied to all the other columns, including the target loan_status. As we don’t want to encode this column, we can explicitly provide the columns that we want to encode:

pd.get_dummies(small_data, columns=['home_ownership', 'grade']).head()
loan_amnt loan_status home_ownership_MORTGAGE home_ownership_OWN home_ownership_RENT grade_A grade_B grade_C grade_D grade_E
0 8000 Charged Off 1 0 0 1 0 0 0 0
1 6000 Charged Off 1 0 0 0 0 1 0 0
2 10000 Charged Off 1 0 0 1 0 0 0 0
3 10000 Charged Off 0 0 1 0 0 0 0 1
4 35000 Charged Off 1 0 0 0 0 1 0 0

Note

If a categorical variable is already represented as an integer, you can force pandas to apply dummy encoding by passing the column name to the columns parameter of pd.get_dummies.
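
A minimal illustration with a made-up integer column (color_code is hypothetical, not part of the lending club data):

# by default, integer columns are left untouched
df_int = pd.DataFrame({'color_code': [0, 2, 1, 0]})
pd.get_dummies(df_int)
# passing the column name forces the dummy encoding: color_code_0, color_code_1, color_code_2
pd.get_dummies(df_int, columns=['color_code'])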

Aligning dataframes with pandas

A common problem in using get_dummies is that if you have multiple datasets or files, and you call get_dummies on each of them, you might get inconsistent encodings. Let’s split our toy data into training and test set and apply get_dummies on them separately:

from sklearn.model_selection import train_test_split
small_train, small_test = train_test_split(small_data, random_state=21)
# To avoid setting with copy warnings we copy the data after splitting
# this is probably not necessary
small_train = small_train.copy()
small_test = small_test.copy()
pd.get_dummies(small_train, columns=['home_ownership', 'grade'])
loan_amnt loan_status home_ownership_MORTGAGE home_ownership_RENT grade_A grade_B grade_C grade_D grade_E
17 2500 Fully Paid 0 1 0 0 1 0 0
18 4000 Fully Paid 1 0 0 0 0 1 0
11 40000 Fully Paid 1 0 0 0 1 0 0
6 35000 Charged Off 0 1 0 0 1 0 0
14 8425 Fully Paid 1 0 0 0 0 0 1
1 6000 Charged Off 1 0 0 0 1 0 0
2 10000 Charged Off 1 0 1 0 0 0 0
12 20000 Fully Paid 1 0 1 0 0 0 0
3 10000 Charged Off 0 1 0 0 0 0 1
8 16000 Charged Off 0 1 0 1 0 0 0
0 8000 Charged Off 1 0 1 0 0 0 0
16 6600 Fully Paid 0 1 0 1 0 0 0
4 35000 Charged Off 1 0 0 0 1 0 0
15 20000 Fully Paid 0 1 0 0 0 1 0
9 25000 Charged Off 1 0 1 0 0 0 0
pd.get_dummies(small_test, columns=['home_ownership', 'grade'])
loan_amnt loan_status home_ownership_MORTGAGE home_ownership_OWN home_ownership_RENT grade_A grade_B grade_C grade_D
7 15000 Charged Off 0 1 0 0 1 0 0
10 30000 Fully Paid 1 0 0 0 0 0 1
19 2700 Fully Paid 0 1 0 1 0 0 0
13 4500 Fully Paid 0 0 1 0 1 0 0
5 4800 Charged Off 1 0 0 0 0 1 0

Both of the dataframes have nine columns, but the meaning of the columns is quite different: the training set doesn't have a column for home_ownership=OWN, while the test set doesn't have a column for grade=E. Remember that scikit-learn doesn't know about column names in dataframes, so if you passed this data directly into scikit-learn, you would get meaningless predictions without ever noticing! TODO callout?
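
To see the mismatch concretely, we can compare the columns produced on each split (a quick check using the same get_dummies calls as above):

train_cols = pd.get_dummies(small_train, columns=['home_ownership', 'grade']).columns
test_cols = pd.get_dummies(small_test, columns=['home_ownership', 'grade']).columns
# columns that only exist in one of the two dataframes
train_cols.difference(test_cols)  # grade_E appears only in the training set
test_cols.difference(train_cols)  # home_ownership_OWN appears only in the test set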

There are several ways to avoid this. The easiest is to call get_dummies before splitting the data, but that might not be possible if new data arrives later and you want to apply an existing model. Another way is to encode all categorical variables using the pandas categorical type, specifying all the known categories:

ownership_cats = ['MORTGAGE', 'OWN', 'RENT']
grade_cats = ['A', 'B', 'C', 'D', 'E']

small_test_explicit_cats = small_test.copy()
small_test_explicit_cats['home_ownership'] = pd.Categorical(small_test_explicit_cats['home_ownership'], categories=ownership_cats)
small_test_explicit_cats['grade'] = pd.Categorical(small_test_explicit_cats['grade'], categories=grade_cats)

Now the columns are aware of all possible values, and each of them receives a column, whether the value is present in the data or not (note the all-zero column grade_E):

pd.get_dummies(small_test_explicit_cats, columns=['home_ownership', 'grade'])
loan_amnt loan_status home_ownership_MORTGAGE home_ownership_OWN home_ownership_RENT grade_A grade_B grade_C grade_D grade_E
7 15000 Charged Off 0 1 0 0 1 0 0 0
10 30000 Fully Paid 1 0 0 0 0 0 1 0
19 2700 Fully Paid 0 1 0 1 0 0 0 0
13 4500 Fully Paid 0 0 1 0 1 0 0 0
5 4800 Charged Off 1 0 0 0 0 1 0 0

This method is very explicit and safe, but it can be a bit cumbersome if there are many categorical columns, or if not all of the categories are known beforehand. A somewhat simpler but less explicit method is to use the align method of the dataframe after calling get_dummies:

# compute dummy encoding (the columns will not match afterwards)
small_train_dummies = pd.get_dummies(small_train, columns=['home_ownership', 'grade'])
small_test_dummies = pd.get_dummies(small_test, columns=['home_ownership', 'grade'])

# align dataframes
# join='right' aligns test (left=self) to train (right=other) keeping only the columns in train
# axis=1 means we align only the columns, and don't try joining the row indices
# align returns two aligned frames; because we did a right join, train is unchanged
# so we can discard the second return value (and assign it to _)
# fill value specifies what to put into previously non-existing columns
test_aligned, _ = small_test_dummies.align(small_train_dummies, join='right', axis=1, fill_value=0)
test_aligned
loan_amnt loan_status home_ownership_MORTGAGE home_ownership_RENT grade_A grade_B grade_C grade_D grade_E
7 15000 Charged Off 0 0 0 1 0 0 0
10 30000 Fully Paid 1 0 0 0 0 1 0
19 2700 Fully Paid 0 0 1 0 0 0 0
13 4500 Fully Paid 0 1 0 1 0 0 0
5 4800 Charged Off 1 0 0 0 1 0 0

The result is an aligned dataframe test_aligned that has the same columns as small_train_dummies, the dataframe it was aligned with. This ensures that the shapes are compatible between training and test set and that we will get meaningful results from scikit-learn. Note that the 'OWN' column that was previously present in the test set was dropped, as it is not present in the training set. We could also perform an outer join when aligning the dataframes (which is the default if you don't specify join), so that the aligned dataframes contain all the columns present in either of them:

# if we don't specify join, an outer join is performed and we retain all the columns
# in this case we also want to store the new aligned training set
test_aligned, train_aligned = small_test_dummies.align(small_train_dummies, axis=1, fill_value=0)
test_aligned
grade_A grade_B grade_C grade_D grade_E home_ownership_MORTGAGE home_ownership_OWN home_ownership_RENT loan_amnt loan_status
7 0 1 0 0 0 0 1 0 15000 Charged Off
10 0 0 0 1 0 1 0 0 30000 Fully Paid
19 1 0 0 0 0 0 1 0 2700 Fully Paid
13 0 1 0 0 0 0 0 1 4500 Fully Paid
5 0 0 1 0 0 1 0 0 4800 Charged Off

However, that’s not very useful if you already created a model using the original training dataset.

One-hot with sklearn

Another option (and the one we will use for most of the book) is to do our encoding with scikit-learn. Because scikit-learn transformers are fit on the training set and then applied to both training and test set, the issue of aligning the data is handled automatically. The dummy or one-hot encoding in scikit-learn is implemented in the OneHotEncoder, which is a transformer, just like the other preprocessing methods.

from sklearn.preprocessing import OneHotEncoder
# By default, OneHotEncoder errors if it sees unknown categories in the test data
# we will overwrite this behavior by specifying handle_unknown='ignore'
# also OneHotEncoder by default outputs scipy sparse matrices, which are more efficient but cumbersome
# we disable that with sparse=False
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
# fit on the training set; stores the categories for each column
ohe.fit(small_train)
# apply one-hot encoding to both training and test set
X_train_ohe = ohe.transform(small_train)
X_test_ohe = ohe.transform(small_test)
X_train_ohe
array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        1., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 1., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
        1., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
        1., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        1., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
        0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0.,
        0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 1., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1.,
        0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
        0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1.,
        0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0.,
        1., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,
        0., 1., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0.,
        0., 0., 0., 1., 0.]])

There are two things you might notice in this output. First, as with all scikit-learn transformations, the output is a numpy array without column names, which can be a bit inconvenient. Second, all the columns have been one-hot-encoded, so the loan amount is no longer present as a single numeric column. We can get the feature names corresponding to the new columns by using the get_feature_names method of the OneHotEncoder:

ohe.get_feature_names()
array(['x0_2500', 'x0_4000', 'x0_6000', 'x0_6600', 'x0_8000', 'x0_8425',
       'x0_10000', 'x0_16000', 'x0_20000', 'x0_25000', 'x0_35000',
       'x0_40000', 'x1_MORTGAGE', 'x1_RENT', 'x2_A', 'x2_B', 'x2_C',
       'x2_D', 'x2_E', 'x3_Charged Off', 'x3_Fully Paid'], dtype=object)

By default, the input columns in scikit-learn are named x0, x1 and so on. What this tells us is that for the first feature, the loan amount, a new column was added for each observed integer value, which is not what we had in mind. We can get more informative feature names by passing the original dataframe column names to get_feature_names:

ohe.get_feature_names(small_train.columns)
array(['loan_amnt_2500', 'loan_amnt_4000', 'loan_amnt_6000',
       'loan_amnt_6600', 'loan_amnt_8000', 'loan_amnt_8425',
       'loan_amnt_10000', 'loan_amnt_16000', 'loan_amnt_20000',
       'loan_amnt_25000', 'loan_amnt_35000', 'loan_amnt_40000',
       'home_ownership_MORTGAGE', 'home_ownership_RENT', 'grade_A',
       'grade_B', 'grade_C', 'grade_D', 'grade_E',
       'loan_status_Charged Off', 'loan_status_Fully Paid'], dtype=object)

OneHotEncoder, like all other estimators in scikit-learn, always works on all input columns, so we need to pass it only the columns that we want to transform. One possible way to do this is to slice off the categorical columns and transform them with OneHotEncoder, like so:

small_train.columns
Index(['loan_amnt', 'home_ownership', 'grade', 'loan_status'], dtype='object')
train_cat = small_train[['home_ownership', 'grade']]
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
train_cat_ohe = ohe.fit_transform(train_cat)
train_cat_ohe
array([[0., 1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 1., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 1., 0., 0., 0., 0.]])

We can now turn this one-hot-encoded array back into a dataframe, using get_feature_names to create the column names:

train_df_ohe = pd.DataFrame(train_cat_ohe,
                            # create column names
                            columns=ohe.get_feature_names(train_cat.columns),
                            # keep the old index.
                            # this is optional but will make joining with the other features easier
                            index=train_cat.index)
train_df_ohe.head()
home_ownership_MORTGAGE home_ownership_RENT grade_A grade_B grade_C grade_D grade_E
17 0.0 1.0 0.0 0.0 1.0 0.0 0.0
18 1.0 0.0 0.0 0.0 0.0 1.0 0.0
11 1.0 0.0 0.0 0.0 1.0 0.0 0.0
6 0.0 1.0 0.0 0.0 1.0 0.0 0.0
14 1.0 0.0 0.0 0.0 0.0 0.0 1.0

Then we can concatenate it again with the remaining loan_amnt feature:

train_df_all = pd.concat([train_df_ohe, small_train[['loan_amnt']]], axis=1)
train_df_all.head()
home_ownership_MORTGAGE home_ownership_RENT grade_A grade_B grade_C grade_D grade_E loan_amnt
17 0.0 1.0 0.0 0.0 1.0 0.0 0.0 2500
18 1.0 0.0 0.0 0.0 0.0 1.0 0.0 4000
11 1.0 0.0 0.0 0.0 1.0 0.0 0.0 40000
6 0.0 1.0 0.0 0.0 1.0 0.0 0.0 35000
14 1.0 0.0 0.0 0.0 0.0 0.0 1.0 8425

While this is the result we wanted, getting there was pretty complicated compared to using pd.get_dummies. Luckily, scikit-learn has another tool that makes this much easier: the ColumnTransformer.

Selecting Columns in sklearn

The ColumnTransformer is another meta-estimator, similar to the Pipeline, for combining multiple transformations. In particular, the ColumnTransformer allows you to apply different transformations to different subsets of the columns. It's the only part of scikit-learn that explicitly uses pandas column names, and it is made specifically to ease the use of pandas dataframes as input to scikit-learn models.

In contrast to Pipeline, which applies several transformations in sequence, the ColumnTransformer applies several transformations in parallel, each to a subset of the columns, and then concatenates the results, similar to what we did manually above. This is illustrated in Figure TODO.

TODO new image


As with the pipeline, there is a make_column_transformer helper function to easily create a new column transformer. The function takes tuples, where each tuple consists of a transformation and the columns the transformation should be applied to. So to apply the OneHotEncoder to only the home_ownership and grade columns, we could do:

from sklearn.compose import make_column_transformer
# a single transformation, OneHotEncoder, applied to two columns
ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore', sparse=False), ['home_ownership', 'grade']))
# now we can pass the full dataset, as the column transformer will do the subsetting for us:
ct.fit(small_train)
ColumnTransformer(transformers=[('onehotencoder',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse=False),
                                 ['home_ownership', 'grade'])])

The make_column_transformer function has created a ColumnTransformer object for us with a single transformer. It has also automatically generated a name for the transformer using the lowercased class name, 'onehotencoder'. We can now use the ColumnTransformer to transform our dataset:

ct.transform(small_train)
array([[0., 1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 1., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 1., 0., 0., 0., 0.]])

Again, we can get the column names for the output using get_feature_names: TODO this doesn’t take the columns?!

ct.get_feature_names()
['onehotencoder__x0_MORTGAGE',
 'onehotencoder__x0_RENT',
 'onehotencoder__x1_A',
 'onehotencoder__x1_B',
 'onehotencoder__x1_C',
 'onehotencoder__x1_D',
 'onehotencoder__x1_E']

As you can see, by default the ColumnTransformer only keeps the columns that we mentioned. If we want to keep the remaining columns, we can specify remainder='passthrough':

ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore', sparse=False),
                              ['home_ownership', 'grade']),
                             remainder='passthrough'
                            )
ct.fit_transform(small_train)
array([[0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2500, 'Fully Paid'],
       [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 4000, 'Fully Paid'],
       [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 40000, 'Fully Paid'],
       [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 35000, 'Charged Off'],
       [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 8425, 'Fully Paid'],
       [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 6000, 'Charged Off'],
       [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 10000, 'Charged Off'],
       [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 20000, 'Fully Paid'],
       [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 10000, 'Charged Off'],
       [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 16000, 'Charged Off'],
       [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 8000, 'Charged Off'],
       [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 6600, 'Fully Paid'],
       [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 35000, 'Charged Off'],
       [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 20000, 'Fully Paid'],
       [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 25000, 'Charged Off']],
      dtype=object)

Typically we would not want the target in the data we pass to scikit-learn, though, so we might want to drop it beforehand, or specify that only the 'loan_amnt' column should be passed through. We can pass through only some columns by adding another transformation to the ColumnTransformer, where instead of a scikit-learn transformer we pass the string 'passthrough':

ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore', sparse=False),
                              ['home_ownership', 'grade']),
                             ('passthrough', ['loan_amnt'])
                            )
# we set numpy printoptions to not use scientific notation for nicer output
# and suppress all but three decimal places
import numpy as np
np.set_printoptions(suppress=True, precision=3)
ct.fit_transform(small_train)
array([[    0.,     1.,     0.,     0.,     1.,     0.,     0.,  2500.],
       [    1.,     0.,     0.,     0.,     0.,     1.,     0.,  4000.],
       [    1.,     0.,     0.,     0.,     1.,     0.,     0., 40000.],
       [    0.,     1.,     0.,     0.,     1.,     0.,     0., 35000.],
       [    1.,     0.,     0.,     0.,     0.,     0.,     1.,  8425.],
       [    1.,     0.,     0.,     0.,     1.,     0.,     0.,  6000.],
       [    1.,     0.,     1.,     0.,     0.,     0.,     0., 10000.],
       [    1.,     0.,     1.,     0.,     0.,     0.,     0., 20000.],
       [    0.,     1.,     0.,     0.,     0.,     0.,     1., 10000.],
       [    0.,     1.,     0.,     1.,     0.,     0.,     0., 16000.],
       [    1.,     0.,     1.,     0.,     0.,     0.,     0.,  8000.],
       [    0.,     1.,     0.,     1.,     0.,     0.,     0.,  6600.],
       [    1.,     0.,     0.,     0.,     1.,     0.,     0., 35000.],
       [    0.,     1.,     0.,     0.,     0.,     1.,     0., 20000.],
       [    1.,     0.,     1.,     0.,     0.,     0.,     0., 25000.]])
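
A sketch of the other variant mentioned above: if we drop the target column first, remainder='passthrough' has only loan_amnt left to pass through, which gives the same result:

ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore', sparse=False),
                              ['home_ownership', 'grade']),
                             remainder='passthrough')
# with loan_status removed, only loan_amnt remains for the remainder to pass through
ct.fit_transform(small_train.drop(columns='loan_status'))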

If instead, we want to scale the ‘loan_amnt’ column, we can pass StandardScaler instead of 'passthrough':

from sklearn.preprocessing import StandardScaler
ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore', sparse=False),
                              ['home_ownership', 'grade']),
                             (StandardScaler(), ['loan_amnt'])
                            )
ct.fit_transform(small_train)
array([[ 0.   ,  1.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   , -1.173],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   , -1.047],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  1.984],
       [ 0.   ,  1.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  1.563],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.   , -0.674],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   , -0.879],
       [ 1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   , -0.542],
       [ 1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.3  ],
       [ 0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.   , -0.542],
       [ 0.   ,  1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   , -0.037],
       [ 1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   , -0.71 ],
       [ 0.   ,  1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   , -0.828],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  1.563],
       [ 0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.3  ],
       [ 1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.721]])

Now we can finally transform our test dataset without much work:

ct.transform(small_test)
array([[ 0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   , -0.121],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  1.142],
       [ 0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   , -1.156],
       [ 0.   ,  1.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   , -1.005],
       [ 1.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   , -0.98 ]])

Instead of using the make_column_transformer function, we can also directly use the ColumnTransformer class. As with the Pipeline, this allows us to explicitly give names to the different transformers:

from sklearn.compose import ColumnTransformer
# this is equivalent to the column transformer created above
# each transformation is a tuple (name, transformer, columns)
ct = ColumnTransformer([('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False),
                         ['home_ownership', 'grade']),
                        ('scaler', StandardScaler(), ['loan_amnt'])])
ct
ColumnTransformer(transformers=[('ohe',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse=False),
                                 ['home_ownership', 'grade']),
                                ('scaler', StandardScaler(), ['loan_amnt'])])
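
One benefit of naming the steps is that we can retrieve the fitted transformers by the names we chose (a small sketch; named_transformers_ is only available after fitting):

ct.fit(small_train)
# the fitted OneHotEncoder, accessed by the name we gave it
ct.named_transformers_['ohe'].categories_
# the fitted scaler
ct.named_transformers_['scaler'].mean_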

Note

Diagram representations of estimators: starting with version 0.23, scikit-learn can show diagram representations of estimators when running in Jupyter. These can be enabled using sklearn.set_config:

import sklearn
sklearn.set_config(display='diagram')
ct
(with the diagram display enabled, ct is rendered as an interactive diagram, with one expandable box per transformer)

In particular for more complex workflows they can come in really handy. Clicking on any of the estimators in the diagram will provide you with more details.

Combining ColumnTransformer and Pipeline

While ColumnTransformer by itself is already quite awesome, the real power comes from combining it with Pipeline to encapsulate the whole preprocessing and model training. Let’s apply KNeighborsClassifier to the small subset of the lending club data we have been looking at. This is more for illustrative purposes, as we’re using a tiny subset, and KNeighborsClassifier is potentially not a good model for this dataset, but we will see this combination in many of our later examples.

Let's start from the beginning and create a small dataset with 200 'Charged Off' and 200 'Fully Paid' loans.

loans = pd.read_csv("C:/Users/t3kci/Downloads/loan.csv/loan.csv", nrows=1000000, low_memory=False)
loans_paid = loans[loans.loan_status.isin(['Fully Paid', 'Charged Off'])]
some_loans = loans_paid.groupby('loan_status').apply(lambda x: x.head(200)).reset_index(drop=True)
X = some_loans[['loan_amnt', 'home_ownership', 'grade', 'loan_status']]
X.head()
loan_amnt home_ownership grade loan_status
0 8000 MORTGAGE A Charged Off
1 6000 MORTGAGE C Charged Off
2 10000 MORTGAGE A Charged Off
3 10000 RENT E Charged Off
4 35000 MORTGAGE C Charged Off
X.loan_status.value_counts()
Fully Paid     200
Charged Off    200
Name: loan_status, dtype: int64

First, we split off the target column as usual, and split the data into training and test set:

X_train, X_test, y_train, y_test = train_test_split(X.drop(columns='loan_status'),
                                                    X.loan_status, random_state=23)

X_train.head()
loan_amnt home_ownership grade
332 12000 MORTGAGE E
61 7125 RENT E
185 10500 OWN D
75 15000 OWN C
135 40000 OWN A

Next, we assemble our ColumnTransformer as we did above, applying scaling to the loan_amnt and applying one-hot encoding to home_ownership and grade:

ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore', sparse=False),
                              ['home_ownership', 'grade']),
                             (StandardScaler(), ['loan_amnt']))
ct
ColumnTransformer(transformers=[('onehotencoder',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse=False),
                                 ['home_ownership', 'grade']),
                                ('standardscaler', StandardScaler(),
                                 ['loan_amnt'])])

Now, we use this ColumnTransformer as a preprocessing step in a Pipeline with KNeighborsClassifier:

from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

pipe = make_pipeline(ct, KNeighborsClassifier(n_neighbors=1))
pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  ['home_ownership', 'grade']),
                                                 ('standardscaler',
                                                  StandardScaler(),
                                                  ['loan_amnt'])])),
                ('kneighborsclassifier', KNeighborsClassifier(n_neighbors=1))])

Now, everything we need to do is embedded in the pipe object, and we can just run fit on the training set and score on the test set:

pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
0.59

The result here is not exactly overwhelming, but it is a bit better than chance (which would be 50%, given that we constructed a balanced dataset). However, the overall code for building and evaluating the model was quite minimal, and it leaves little opportunity for casual mistakes.
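
Because the preprocessing is part of the pipeline, we can also cross-validate it or tune its parameters without leaking information from the validation folds. A minimal sketch (the grid of neighbor values is just an illustration):

from sklearn.model_selection import cross_val_score, GridSearchCV

# cross-validate the whole pipeline; the preprocessing is re-fit on each training fold
cross_val_score(pipe, X_train, y_train, cv=5)

# tune n_neighbors; nested parameters are addressed with the step name and a double underscore
param_grid = {'kneighborsclassifier__n_neighbors': [1, 5, 15]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
grid.best_params_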

Selecting columns by type

If there are many columns, such as in the full lending club data, it can be a challenge to determine the types and the correct preprocessing for all of them. Usually it pays off to understand each individual column and to invest some time in understanding the meaning of the values, plotting them, and so on. For example, for integer values it might not be obvious whether they should be one-hot-encoded or not. The annual income in the lending club data is continuous, but the easiest way to be certain of that is to know what the data represents: we know that a salary, or really any amount of money, will be continuous. Luckily, in this dataset all the categorical variables are represented as strings, so it's relatively easy to understand what they mean. But without additional information, a column containing small integers could either correspond to a continuous quantity or be an encoding of categories.
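
A quick (and imperfect) heuristic is to look at the number of distinct values per column; integer columns with only a handful of distinct values are more likely to be encodings of categories:

# number of distinct values per column, smallest first
loans.nunique().sort_values().head(10)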

This somewhat tricky case aside, for some quick plotting or prototyping, it might sometimes be enough to look at just the dtype of the columns. Let’s look again at the dtypes of the full lending club data:

loans.dtypes.value_counts()
float64    70
int64      39
object     36
dtype: int64

We can select only the columns of a certain type by using boolean masks:

# only select 'object' dtype columns, which are usually strings
loans.loc[:, loans.dtypes == 'object'].head()
term grade sub_grade emp_title emp_length home_ownership verification_status issue_d loan_status pymnt_plan ... hardship_status hardship_start_date hardship_end_date payment_plan_start_date hardship_loan_status disbursement_method debt_settlement_flag debt_settlement_flag_date settlement_status settlement_date
0 36 months C C1 Chef 10+ years RENT Not Verified Dec-2018 Current n ... NaN NaN NaN NaN NaN Cash N NaN NaN NaN
1 60 months D D2 Postmaster 10+ years MORTGAGE Source Verified Dec-2018 Current n ... NaN NaN NaN NaN NaN Cash N NaN NaN NaN
2 36 months D D1 Administrative 6 years MORTGAGE Source Verified Dec-2018 Current n ... NaN NaN NaN NaN NaN Cash N NaN NaN NaN
3 36 months D D2 IT Supervisor 10+ years MORTGAGE Source Verified Dec-2018 Current n ... NaN NaN NaN NaN NaN Cash N NaN NaN NaN
4 60 months C C4 Mechanic 10+ years MORTGAGE Not Verified Dec-2018 Current n ... NaN NaN NaN NaN NaN Cash N NaN NaN NaN

5 rows × 36 columns

loans.loc[:, loans.dtypes == 'float64'].columns
Index(['id', 'member_id', 'funded_amnt_inv', 'int_rate', 'installment',
       'annual_inc', 'url', 'dti', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'revol_util', 'out_prncp', 'out_prncp_inv',
       'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_amnt', 'mths_since_last_major_derog', 'annual_inc_joint',
       'dti_joint', 'open_acc_6m', 'open_act_il', 'open_il_12m', 'open_il_24m',
       'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m',
       'open_rv_24m', 'max_bal_bc', 'all_util', 'inq_fi', 'total_cu_tl',
       'inq_last_12m', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util',
       'mo_sin_old_il_acct', 'mths_since_recent_bc',
       'mths_since_recent_bc_dlq', 'mths_since_recent_inq',
       'mths_since_recent_revol_delinq', 'num_tl_120dpd_2m', 'pct_tl_nvr_dlq',
       'percent_bc_gt_75', 'revol_bal_joint', 'sec_app_inq_last_6mths',
       'sec_app_mort_acc', 'sec_app_open_acc', 'sec_app_revol_util',
       'sec_app_open_act_il', 'sec_app_num_rev_accts',
       'sec_app_chargeoff_within_12_mths',
       'sec_app_collections_12_mths_ex_med',
       'sec_app_mths_since_last_major_derog', 'deferral_term',
       'hardship_amount', 'hardship_length', 'hardship_dpd',
       'orig_projected_additional_accrued_interest',
       'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
       'settlement_amount', 'settlement_percentage', 'settlement_term'],
      dtype='object')

Looking at the object columns, some are categorical, some are dates (which we might choose to drop or to parse as dates), and some even have continuous aspects, such as the employment length:

loans.emp_length.value_counts()
10+ years    333964
2 years       90476
< 1 year      81758
3 years       81013
1 year        66927
5 years       61623
4 years       61101
6 years       43380
8 years       38349
7 years       34708
9 years       32224
Name: emp_length, dtype: int64
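
If we decided to treat emp_length as roughly continuous, one possible sketch of a conversion is to extract the leading number from the string, which lumps '< 1 year' together with '1 year' and treats '10+ years' as 10, a simplification we'd have to be comfortable with:

# extract the first number in each string as a float; missing values stay missing
emp_length_years = loans.emp_length.str.extract(r'(\d+)', expand=False).astype(float)
emp_length_years.value_counts()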

If you feel that dtypes are a good way to select features in your dataset, you can do so within a ColumnTransformer even more easily using make_column_selector:

from sklearn.compose import make_column_selector
ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore', sparse=False),
                              # pass object dtype columns to OneHotEncoder
                             make_column_selector(dtype_include='object')),
                             # pass float and int dtypes to StandardScaler
                             (StandardScaler(), make_column_selector(dtype_include=['float', 'int'])))
# we're only using some of the columns as the whole dataset is quite messy
X = some_loans[['loan_amnt', 'int_rate', 'home_ownership', 'open_acc', 'grade']]
X.dtypes
loan_amnt           int64
int_rate          float64
home_ownership     object
open_acc            int64
grade              object
dtype: object
ct.fit(X)
# we can look at the fitted transformers to see which columns were passed where
ct.transformers_
[('onehotencoder',
  OneHotEncoder(handle_unknown='ignore', sparse=False),
  ['home_ownership', 'grade']),
 ('standardscaler', StandardScaler(), ['int_rate']),
 ('remainder', 'drop', [0, 3])]

The benefit of using make_column_selector instead of passing a boolean mask is that the correct columns are determined when fitting the ColumnTransformer which avoids mistakes when creating the mask. The make_column_selector function can also match column names or exclude certain dtypes.
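
For instance, here is a small sketch of selecting by a regular expression on the column names and of excluding a dtype (the pattern is just an illustration):

# select all columns whose name contains '_amnt'
amount_selector = make_column_selector(pattern='_amnt')
# select all columns that are not object columns
numeric_selector = make_column_selector(dtype_exclude='object')
# the selectors are callables that return the matching column names
amount_selector(X), numeric_selector(X)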

Collinearity & Redundancy

Coming back to one-hot encoding, there is one more important thing to know: the one-hot encoding is redundant. We can drop one of the columns for each categorical variable and still retain all the information: if all of the other columns are zero, the dropped one was 1, and if one of them is 1, the dropped one was 0. In other words, any one column can be re-computed as one minus the sum of the other columns. Such a relationship is known as collinearity (or linear dependence). For some statistical models, the presence of such a relationship between the features can lead to numerical and statistical issues, so it is common in statistics to drop one of the columns. You can do that in pandas by using pd.get_dummies(X, drop_first=True), which will drop the first category for each feature:

pd.get_dummies(X).head()
loan_amnt int_rate open_acc home_ownership_ANY home_ownership_MORTGAGE home_ownership_OWN home_ownership_RENT grade_A grade_B grade_C grade_D grade_E grade_F grade_G
0 8000 6.46 12 0 1 0 0 1 0 0 0 0 0 0
1 6000 14.47 8 0 1 0 0 0 0 1 0 0 0 0
2 10000 8.81 7 0 1 0 0 1 0 0 0 0 0 0
3 10000 27.27 4 0 0 0 1 0 0 0 0 1 0 0
4 35000 16.14 22 0 1 0 0 0 0 1 0 0 0 0
pd.get_dummies(X, drop_first=True).head()
loan_amnt int_rate open_acc home_ownership_MORTGAGE home_ownership_OWN home_ownership_RENT grade_B grade_C grade_D grade_E grade_F grade_G
0 8000 6.46 12 1 0 0 0 0 0 0 0 0
1 6000 14.47 8 1 0 0 0 1 0 0 0 0
2 10000 8.81 7 1 0 0 0 0 0 0 0 0
3 10000 27.27 4 0 0 1 0 0 0 1 0 0
4 35000 16.14 22 1 0 0 0 1 0 0 0 0

In OneHotEncoder, you can specify drop='first' if you want to drop the first category, or you can specify which category to drop for each of the features. The only model in scikit-learn for which retaining all of the features is an issue is LinearRegression (which we'll discuss in Chapter TODO), but otherwise you don't have to worry about it. For many of the models in sklearn that we'll see in Chapter TODO, it can matter which feature you drop, and it might be harder to interpret the model once you do. Therefore I usually recommend keeping all features in the one-hot encoding, even though the representation is redundant.
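
If you do want to drop a specific category for each feature, you can pass a list with one entry per column; a small sketch (dropping 'ANY' and 'G' is an arbitrary choice here):

# drop the 'ANY' category from home_ownership and 'G' from grade
ohe_drop = OneHotEncoder(drop=['ANY', 'G'], sparse=False)
ohe_drop.fit(X[['home_ownership', 'grade']])
ohe_drop.get_feature_names(['home_ownership', 'grade'])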

Dropping redundant binary features

There is one exception though, which is binary features. If you have a binary categorical feature, i.e. a feature with two categories, it is obviously redundant to encode it using OneHotEncoder. Let’s look at the ‘term’ variable which has two values, ‘36 months’ and ‘60 months’:

loans['term'].value_counts()
 36 months    715939
 60 months    284061
Name: term, dtype: int64

If we encode this feature using OneHotEncoder, we get two new features, one corresponding to the 36 month term and the other to the 60 month term:

ohe = OneHotEncoder().fit(loans[['term', 'grade']])
# get feature names, passing in the original feature names
ohe.get_feature_names(['term', 'grade'])
array(['term_ 36 months', 'term_ 60 months', 'grade_A', 'grade_B',
       'grade_C', 'grade_D', 'grade_E', 'grade_F', 'grade_G'],
      dtype=object)

However, these two features encode exactly the same information: the 36 month term feature is just 1 minus the 60 month term feature. This redundancy is not useful and makes the model harder to interpret, so we can drop one of the two columns using drop='if_binary':

ohe = OneHotEncoder(drop='if_binary').fit(loans[['term', 'grade']])
ohe.get_feature_names(['term', 'grade'])
array(['term_ 60 months', 'grade_A', 'grade_B', 'grade_C', 'grade_D',
       'grade_E', 'grade_F', 'grade_G'], dtype=object)

Now there’s only one feature to represent this information, while the ‘grade’ feature with seven categories has retained all columns.

Impact Encoding & Dealing with many categories

Dealing with categorical features with many categories can be a bit tricky, as directly applying one-hot encoding is often not effective. Let’s look at the emp_title feature from the lending club data, which provides a job description of the borrower:

loans_paid.shape
(383181, 145)
title_counts = loans_paid.emp_title.value_counts()
title_counts
Teacher                                 6819
Manager                                 6609
Owner                                   3977
Driver                                  2982
Registered Nurse                        2859
                                        ... 
Corporate Delivery Specialist              1
telecom                                    1
Sewer Manitenance supervisor 1             1
Regulatory Compliance Survey Manager       1
Lead payroll admin                         1
Name: emp_title, Length: 108751, dtype: int64

We loaded 1,000,000 rows of data; the subset of fully paid and charged-off loans has 383,181 rows and contains 108,751 different employment titles. Several of the titles are very common, such as Teacher and Manager, while some appear only once, like 'Corporate Delivery Specialist'. Let's see how many of the values appear only once:

title_once = title_counts == 1
title_once.sum()
85905

That's most of them! Clearly it will be hard to learn from a category for which we have seen only one example. This is likely another example of a power-law distribution: most of the job titles appear only once, some appear several times, and a few, such as Teacher and Manager, appear much more frequently than all the rest. We can validate this by looking at the counts of the counts:

# a power-law distribution looks like a straight line in a log-log plot. This seems to hold reasonably well here.
title_counts.value_counts().plot(logy=True, logx=True)
[Figure: log-log plot of the counts of the job-title counts]

Using a one-hot encoding on over 100,000 categories would produce a feature space that will likely make learning hard. Even if we restrict ourselves to the categories that appear at least twice, we would still have over 20,000 features. One option is to define a cut-off and only keep the top k, say the top 50 or top 100, most common values, moving everything else into an "other" category. Let's have a look at what the common values are:

import matplotlib.pyplot as plt
title_counts[:50].plot(kind='barh', figsize=(10, 8))
plt.tight_layout()
plt.title("50 most common job titles")
[Figure: bar chart of the 50 most common job titles]

Before we start encoding the feature, we might want to check whether these titles are informative at all. For example, we can look at how frequently each title is associated with a fully paid loan.

# look at only the loans associated with a common job title
loans_50_jobs = loans_paid[loans_paid.emp_title.isin(title_counts[:50].index)]
loans_50_jobs.shape
(70672, 145)
# ensure our subsetting did what we wanted it to:
loans_50_jobs.emp_title.unique()
array(['Supervisor ', 'Teacher', 'Manager', 'Police Officer',
       'Truck Driver', 'Project Manager', 'Driver', 'supervisor',
       'Registered Nurse', 'Operations Manager', 'Vice President',
       'Program Manager', 'Executive Assistant', 'Sales', 'Supervisor',
       'RN', 'Sales Manager', 'Software Engineer', 'Controller',
       'Assistant Manager', 'Account Manager', 'manager',
       'General Manager', 'President', 'Paralegal', 'Administrator',
       'Director', 'Branch Manager', 'Mechanic', 'Technician',
       'Accountant', 'Owner', 'Analyst', 'driver', 'teacher', 'Engineer',
       'Store Manager', 'owner', 'Nurse', 'Administrative Assistant',
       'Office Manager', 'Attorney', 'Server', 'sales', 'Foreman',
       'Truck driver', 'Consultant', 'Electrician', 'Registered nurse',
       'CEO'], dtype=object)
# group by the employment title and take the loan status column
loan_status_by_title = loans_50_jobs.groupby('emp_title').loan_status
# some pandas kung-fu: for each title, compute the ratios of fully paid and charged off loans, then keep only the fully paid ratio
paid_fraction = loan_status_by_title.value_counts(normalize=True).unstack('loan_status')['Fully Paid']
# TODO seaborn will make this better!
paid_fraction.sort_values().plot(kind='barh', figsize=(10, 8))
overall_payment_ratio = loans_50_jobs.loan_status.value_counts(normalize=True)['Fully Paid']
plt.vlines(overall_payment_ratio, 0, 50, linestyle=':')
plt.xlabel("Fraction fully paid")
[Figure: fraction of fully paid loans per job title; the dotted line marks the overall fraction]

This plot suggests that there is some information in the employment title. You might also note that some of the titles have capitalized and uncapitalized variants. In principle, the spelling difference could contain additional information; however, at least when looking directly at the outcomes, the different capitalization variants largely behave the same (we could do a more rigorous statistical analysis if we were interested). Consolidating the capitalization can be done simply by lowercasing everything:

loans_paid['emp_title_lower'] = loans_paid['emp_title'].str.lower()

If you explore the dataset more, you'll find that there are also misspellings, which are a bit harder to correct and which we'll leave alone for now. After consolidating the capitalization, we can replace all but the 50 most common titles by "other":

title_counts_lower = loans_paid.emp_title_lower.value_counts()
common_titles = title_counts_lower.index[:50]
loans_paid['emp_title_50'] = loans_paid['emp_title_lower'].map(lambda x: x if x in common_titles else 'other')
loans_paid['emp_title_50'].value_counts()
other                       297838
manager                       8162
teacher                       8071
owner                         5769
driver                        4104
registered nurse              3850
supervisor                    3569
sales                         3486
rn                            3169
project manager               2444
general manager               2268
office manager                2265
truck driver                  2037
director                      1788
president                     1682
engineer                      1630
sales manager                 1542
operations manager            1388
police officer                1319
nurse                         1268
accountant                    1229
vice president                1220
store manager                 1215
technician                    1153
mechanic                      1108
account manager               1051
administrative assistant      1040
attorney                       984
analyst                        962
server                         923
branch manager                 871
assistant manager              868
executive assistant            832
paralegal                      779
foreman                        763
electrician                    756
software engineer              747
customer service               745
operator                       745
supervisor                     722
clerk                          682
controller                     669
machine operator               667
ceo                            666
consultant                     663
program manager                642
administrator                  634
it manager                     551
manager                        549
business analyst               548
principal                      548
Name: emp_title_50, dtype: int64

Now, we have a relatively clean categorical variable, with only 51 values, which we can easily one-hot encode. TODO better transition

Impact encoding

There is one more encoding scheme for categorical data that is commonly used and that recently has increased in popularity, known as impact encoding or target encoding. This method of encoding categorical variables uses information about the classification target and is particularly useful for categorical features with many categories.

The idea behind impact encoding is to replace the categorical feature by the average outcome of the respective category. In our case of the employment title, the feature would become something like "likelihood of fully paying the loan based on the job title alone", i.e. the value for the job title in Figure TODO. In other words, if we want to encode a value of, say, 'CEO', we look at the fraction of loans with an 'emp_title' of CEO that were fully paid:

# should we recompute with lower-cased here? Or get rid of lower-casing?
paid_fraction['CEO']
0.7534013605442177

So any time we see an 'emp_title' of 'CEO', our new feature (which will take the place of 'emp_title') will be 0.75…

In principle we could compute this new feature using pandas alone by first computing the paid fraction (redoing it for all titles, not just the top 50, and taking lower-casing into account):

# Fill in emp_title_lower with "missing" where it's missing (otherwise these will be ignored)
# inplace means we're overwriting the existing column
loans_paid.emp_title_lower.fillna('missing', inplace=True)
# redoing the pandas trickery from above
paid_fraction_full = loans_paid.groupby('emp_title_lower').loan_status.value_counts(normalize=True).unstack('loan_status')['Fully Paid']
# NaN means none were paid
paid_fraction_full.fillna(0, inplace=True)

And then adding the new feature based on emp_title to each row:

paid_fraction_full
emp_title_lower
 \toffice manager/medical assistant         0.0
 \tsecurity guard                           1.0
  ag                                        1.0
  assembler                                 0.0
  diversion investigator                    1.0
                                           ... 
zoo keeper                                  0.0
zoo lab coordinator                         0.0
zookeeper                                   1.0
zpic coordinator                            1.0
| principal business solution architect|    1.0
Name: Fully Paid, Length: 92041, dtype: float64
paid_fraction_full[loans_paid.emp_title_lower]
emp_title_lower
supervisor                              0.722992
assistant to the treasurer (payroll)    1.000000
teacher                                 0.784661
accounts examiner iii                   1.000000
senior director risk management         1.000000
                                          ...   
systems engineer                        0.843284
accounting                              0.782178
account rep                             0.833333
lpn                                     0.715953
equipment maint supervisor              1.000000
Name: Fully Paid, Length: 383181, dtype: float64
loans_paid.emp_title_lower
100                                supervisor 
152       assistant to the treasurer (payroll)
170                                    teacher
186                      accounts examiner iii
215            senior director risk management
                          ...                 
999995                        systems engineer
999996                              accounting
999997                             account rep
999998                                     lpn
999999              equipment maint supervisor
Name: emp_title_lower, Length: 383181, dtype: object
# we use join to map 'emp_title_lower' to the faction using the content of paid_fraction_full
loans_paid_new = loans_paid.join(paid_fraction_full, 'emp_title_lower')
# show the result; the new column is somewhat confusingly called 'Fully Paid' and we should probably rename it
loans_paid_new[['emp_title_lower', 'Fully Paid', 'loan_status']].head(10)
emp_title_lower Fully Paid loan_status
100 supervisor 0.722992 Fully Paid
152 assistant to the treasurer (payroll) 1.000000 Fully Paid
170 teacher 0.784661 Fully Paid
186 accounts examiner iii 1.000000 Fully Paid
215 senior director risk management 1.000000 Fully Paid
269 front office lead 0.600000 Fully Paid
271 sewell collision center 1.000000 Fully Paid
296 manager 0.759863 Fully Paid
369 service advisor 0.735294 Fully Paid
379 stylist 0.738739 Fully Paid

However, there are two caveats to this (relatively) simple approach. First, for a classification task like this, it might be beneficial to encode using the log-odds of the target (a simple non-linear transformation of the frequency) instead of the frequency itself; it will become clear why in Chapter TODO. Second, if a category appears only rarely, or just once, the feature becomes deceptively informative: if there is just one sample, the feature encodes exactly the target of that row. This is a typical case of information leakage. The resulting model is likely to learn that this column is particularly useful, but this knowledge will not translate to unseen data.

There are a couple of work-arounds for this, which we won't discuss in detail here, that either use additional hold-out sets or smoothing. Many of them are implemented in the category_encoders package, from which I'd particularly recommend the TargetEncoder (which implements todo) and the GLMMEncoder (which implements TODO), which was shown to work well in practice todo reference masters thesis. The category_encoders package broadly follows the API of scikit-learn, but it also allows you to specify which columns the encoding should be applied to, so you can either use that feature or the ColumnTransformer.

# TargetEncoder implements a smoothing variant from 
from category_encoders import TargetEncoder
te = TargetEncoder().fit(loans_paid[['emp_title_lower']], loans_paid.loan_status == 'Fully Paid')
te.transform(loans_paid[['emp_title_lower']].head(10))
emp_title_lower
100 0.722992
152 0.775412
170 0.784661
186 0.939599
215 0.775412
269 0.603155
271 0.775412
296 0.759863
369 0.735294
379 0.738739

You can see that for categories with many samples, such as supervisor and teacher, the feature is identical to what we computed above, whereas for categories with only a few samples, such as 'sewell collision center', the estimate is shrunk towards the overall fraction of fully paid loans.
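
To give a rough idea of what such smoothing can look like (this is only an illustration of the idea, not exactly what TargetEncoder computes, and the smoothing strength m is an arbitrary choice), here is a minimal pandas sketch that blends each title's mean outcome with the overall mean:

y = (loans_paid.loan_status == 'Fully Paid').astype(int)
global_mean = y.mean()
m = 20  # smoothing strength: acts like m pseudo-observations of the global mean per title

stats = y.groupby(loans_paid.emp_title_lower).agg(['mean', 'count'])
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
# map each row's title to its smoothed estimate
loans_paid.emp_title_lower.map(smoothed).head()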

Summary

Categorical variables are quite common in many different settings, and while some models can deal with them natively, preprocessing is necessary for most machine learning models. One-hot encoding is the most common method and usually a good place to start. If there are many categories, it can be useful to encode only the most common ones and combine all the infrequent ones into an "other" category. Another common approach for categorical variables with many categories is target (or impact) encoding, which is implemented in several variants in the category_encoders package.