More on Pipelines

We already saw how pipelines can make our lives easier in chapter TODO. However, when using model evaluation tools such as cross_validate and GridSearchCV, using pipelines becomes essential for obtaining valid results. In addition, using pipelines inside GridSearchCV enables a variety of powerful use cases. We'll explore both of these in this chapter.

Data leakage: a common error

Let’s start with an error that’s commonly made when using cross-validation: leaking information from the validation parts of the data. This error has been made not only countless times by beginning data scientists, but also in several published scientific research articles. When doing any preprocessing, it is essential that the preprocessing happens within cross-validation, not outside of it. While we haven’t seen the details of feature selection yet, it provides an excellent example, so we’ll quickly go over it.

Automatic univariate feature selection

When working with high-dimensional datasets, it can be beneficial to work with only a subset of the features. This reduces the computational burden, increases interpretability, and in some cases can even improve generalization performance. There are several methods for automating this process, which we will discuss in depth in chapter TODO. One of the simplest methods of automatic feature selection is using univariate statistics to rank features. Univariate means we look at only one feature at a time and evaluate its relationship with the target, often with a simple statistical measure such as an F-test or t-test. We can then rank all the features by the strength of their response (or alternatively by how significant their association with the target is) and select the ones deemed most important. A version of this is implemented in the SelectPercentile transformer in scikit-learn, which allows you to keep a fixed percentage of the existing features. This can be a quick and easy way to subselect features from a very wide dataset and is commonly used. Here is a quick example on the breast cancer dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# load the dataset and split it into training and test set
X, y  = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape)
(426, 30)
# Create a standard pipeline out of scaler and classifier
pipe_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
# Fit and evaluate as a baseline
pipe_knn.fit(X_train, y_train)
pipe_knn.score(X_test, y_test)
0.958041958041958
# create a pipeline subselecting 20% of the features according to univariate statistics
# Order of scaling and selection does not matter in this case
pipe_select = make_pipeline(StandardScaler(), SelectPercentile(percentile=20), KNeighborsClassifier())
# Fit the pipeline
pipe_select.fit(X_train, y_train)
# slice off the classifier, look at shape of transformed data:
pipe_select[:-1].transform(X_train).shape
(426, 6)

As expected, of the 30 original features, SelectPercentile only kept 20%, meaning 6. Now let’s evaluate the whole pipeline:

pipe_select.score(X_test, y_test)
0.958041958041958

The performance using only 20% of the features is actually identical to the performance when using all the features, but the resulting model might be much more interpretable. We can see which features were selected by TODO.
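For example, here is one possible way to inspect the selection (a minimal sketch, using the get_support method that scikit-learn's feature selectors provide, and the lower-cased step name that make_pipeline assigns):

# get_support() on the fitted SelectPercentile step returns a boolean mask
# over the original features; the step is addressed by its auto-generated name
import numpy as np
mask = pipe_select['selectpercentile'].get_support()
# map the mask to the feature names of the breast cancer dataset
feature_names = load_breast_cancer().feature_names
print(feature_names[mask])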

Now that we have familiarized ourselves with how SelectPercentile works (at least in general terms), let’s look at the error mentioned above.

# TODO hide
import numpy as np
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 10000))
y = rng.normal(size=(100,)) > 0

Say someone gave you a binary classification dataset like this:

print(X.shape, y.shape)
# count appearances of 0 and 1 in y
print(np.bincount(y))
(100, 10000) (100,)
[53 47]

It’s very wide, meaning it has many features compared to the number of samples. This is quite common in sensor networks or biomedical data, for example. Given the small size of the dataset, we might want to use cross-validation to assess performance, instead of using a single train-test split. One might start like this:

# select most informative 5% of features
select = SelectPercentile(percentile=5)
select.fit(X, y)
X_selected = select.transform(X)
print(X_selected.shape)
(100, 500)

Now the dataset seems much more manageable at 500 features (which is arguably still a lot), and we can evaluate our model with cross_val_score:

from sklearn.model_selection import cross_val_score
# run cross-validation with the subselected features
cross_val_score(KNeighborsClassifier(), X_selected, y)
array([1., 1., 1., 1., 1.])

It looks like it’s our lucky day: we created a model that classifies our dataset perfectly across all folds. From this evaluation, we might be quite certain we found a good model. However, we made a mistake: we applied the feature selection procedure outside of the cross-validation. We should apply it inside the cross-validation instead. In scikit-learn, we can easily do that using a pipeline (as we did above).

pipe = make_pipeline(SelectPercentile(percentile=5), KNeighborsClassifier())
# run cross-validation on the original dataset using the pipeline
cross_val_score(pipe, X, y)
array([0.45, 0.5 , 0.5 , 0.5 , 0.7 ])

If we use the proper evaluation technique, our results change drastically: our model performs around chance level for a balanced dataset such as this one; in other words, we might conclude that the model didn’t learn anything. Where does this dramatic difference come from? When we called fit on SelectPercentile before the cross-validation, it had access to the full dataset, which includes the training and test parts of each split. This means it could extract information from all parts of the data, even those that we meant to use as validation sets during cross-validation. This is a classic example of information leakage, and a good reason to always use pipelines!

To make the difference in the computation a bit more apparent, here is a more explicit version of both computations, without using cross_val_score or Pipeline (we’re using KFold here, which generates the indices for K-fold cross-validation; we’ll see it in more detail in TODO):

Preprocessing before cross-validation:

from sklearn.model_selection import KFold

# BAD!
# fit the feature selection on the full dataset,
# which includes the cross-validation test parts
select = SelectPercentile(percentile=5)
select.fit(X, y)
X_sel = select.transform(X)
scores = []
for train, test in KFold().split(X, y):
    knn = KNeighborsClassifier().fit(X_sel[train], y[train])
    score = knn.score(X_sel[test], y[test])
    scores.append(score)

Preprocessing within cross-validation:

# GOOD!
scores = []
select = SelectPercentile(percentile=5)
for train, test in KFold().split(X, y):
    # fit the feature selection on the training part of each split only
    select.fit(X[train], y[train])
    X_sel_train = select.transform(X[train])
    knn = KNeighborsClassifier().fit(X_sel_train, y[train])
    X_sel_test = select.transform(X[test])
    score = knn.score(X_sel_test, y[test])
    scores.append(score)

These two versions are respectively equivalent to:

# BAD: equivalent to preprocessing before cross-validation
select = SelectPercentile(percentile=5)
X_selected = select.fit_transform(X, y)
scores = cross_val_score(KNeighborsClassifier(), X_selected, y)

# GOOD: equivalent to preprocessing within cross-validation
pipe = make_pipeline(SelectPercentile(percentile=5),
                     KNeighborsClassifier())
scores = cross_val_score(pipe, X, y)

If we want to estimate the generalization capability of our model, only the second version will give us the correct answer, and only its result will reflect how well the model will perform on new data. As a matter of fact, the data in X and y was generated completely at random, and there is no relationship between the two. The first procedure allowed SelectPercentile to find some of the completely random features that happened to be related to the target by looking at the full dataset, including the validation part of each split. This is where information leaked. In the second procedure, the feature selection could only select features based on the properties of the training part of each split. Features that have an accidental relationship with the target on the training parts do not necessarily contain any information on the test parts, and so the performance of the model is correctly estimated to be at chance level.

Hopefully this will convince you to use Pipeline in all your work, in particular when using cross-validation. However, if we want to use a Pipeline within GridSearchCV (which you definitely should!), we have to adjust our code a bit.

Pipeline and GridSearchCV

Remember that when using GridSearchCV to tune hyper-parameters, we pass the estimator together with a dictionary of parameter values. If we pass a Pipeline as the estimator, we need to ensure that the parameters we want to tune are applied to the correct step of the pipeline; in principle, several steps of the pipeline could have hyper-parameters with identical names. The way to specify a hyper-parameter within a Pipeline is to address it by the name of the pipeline step, followed by a double underscore (known as a ‘dunder’ in Python), followed by the name of the hyper-parameter. So if we created a pipeline with make_pipeline and want to tune the n_neighbors parameter of KNeighborsClassifier, we need to use kneighborsclassifier__n_neighbors as the hyper-parameter name; remember, when using make_pipeline, the name assigned to each step is the lower-cased class name. Tuning the n_neighbors parameter on the breast cancer dataset could therefore look like this:

from sklearn.model_selection import GridSearchCV

# Load the dataset
X, y  = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# create a pipeline
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
# create the search grid.
# Pipeline hyper-parameters are specified as <step name>__<hyper-parameter name>
param_grid = {'kneighborsclassifier__n_neighbors': range(1, 10)}
# Instantiate grid-search
grid = GridSearchCV(knn_pipe, param_grid, cv=10)
# run the grid-search and report results
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))
{'kneighborsclassifier__n_neighbors': 8}
0.965034965034965
You can always check the available hyper-parameters of any model by calling its get_params method:

```python
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_pipe.get_params()
```
```
{'memory': None,
 'steps': [('standardscaler', StandardScaler()),
  ('kneighborsclassifier', KNeighborsClassifier())],
 'verbose': False,
 'standardscaler': StandardScaler(),
 'kneighborsclassifier': KNeighborsClassifier(),
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'kneighborsclassifier__algorithm': 'auto',
 'kneighborsclassifier__leaf_size': 30,
 'kneighborsclassifier__metric': 'minkowski',
 'kneighborsclassifier__metric_params': None,
 'kneighborsclassifier__n_jobs': None,
 'kneighborsclassifier__n_neighbors': 5,
 'kneighborsclassifier__p': 2,
 'kneighborsclassifier__weights': 'uniform'}
```

Having a Pipeline inside GridSearchCV also allows us to tune hyper-parameters of the preprocessing steps. Say we want to tune how many features SelectPercentile keeps; we can do that as follows:

# create a pipeline
select_pipe = make_pipeline(StandardScaler(), SelectPercentile(), KNeighborsClassifier())
# create the search grid.
# Pipeline hyper-parameters are specified as <step name>__<hyper-parameter name>
param_grid = {'kneighborsclassifier__n_neighbors': range(1, 10),
              'selectpercentile__percentile': [1, 2, 5, 10, 50, 100]}
# Instantiate grid-search
grid = GridSearchCV(select_pipe, param_grid, cv=10)
# run the grid-search and report results
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))
{'kneighborsclassifier__n_neighbors': 8, 'selectpercentile__percentile': 100}
0.965034965034965

As you know, when specifying multiple hyper-parameters, GridSearchCV tries out all possible combinations, so 9 * 6 = 54 different combinations were tried in this code. The result is that keeping all features leads to the best accuracy; this is not very surprising, as our motivation for removing features is usually not to improve accuracy, and if we do feature selection at all, we might be interested in trading off the simplicity of the model against generalization ability.
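If you want to verify the number of candidates, the fitted grid-search stores one entry per tried combination in its cv_results_ attribute; a quick sanity-check sketch:

# each candidate hyper-parameter combination has one entry in cv_results_
print(len(grid.cv_results_['params']))
# 9 values for n_neighbors times 6 values for percentile = 54 candidates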

Setting Estimators with GridSearchCV

We can even go one step further and select what preprocessing to include or what model to apply. As a simple example, if we’re unsure whether MinMaxScaler or StandardScaler is more appropriate for our dataset, we could just have GridSearchCV figure that out for us. After declaring a Pipeline object, each step becomes a hyper-parameter to which we can assign an estimator of our choice. It might be more natural in this case to name the steps of our pipeline manually, though you don’t have to.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
# declare a two step pipeline, explicitly giving names to both steps.
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
# The name of the first step is 'scaler' and we can assign different
# estimators to this step, such as MinMaxScaler or StandardScaler
# There is a special value 'passthrough' which skips the step
param_grid = {'scaler': [MinMaxScaler(), StandardScaler(), 'passthrough'],
              # we named the second step knn, so we have to use that name here
              'knn__n_neighbors': range(1, 10)}
# instantiate and run as before:
grid = GridSearchCV(pipe, param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))
{'knn__n_neighbors': 8, 'scaler': StandardScaler()}
0.965034965034965

In this case, we didn’t gain much, but this is a useful tool for automating model selection. However, keep in mind that every option you add multiplies your runtime, as all possible combinations are tried. We’ll revisit this in chapter TODO.

Searching Lists of Grids

There is a little-known but very useful feature of GridSearchCV that I want to mention at this point: GridSearchCV can search not only over grids, but also over lists of grids, which are specified as lists of dictionaries. This comes in handy when searching over different preprocessing steps or models that have different hyper-parameters. For example, say we wanted to tune whether MinMaxScaler should scale between 0 and 1 or between -1 and 1, while also considering the case of using StandardScaler. We can’t just add scaler__feature_range to the param_grid dictionary, because StandardScaler doesn’t have a feature_range parameter. Instead, we can create a list of two grids: one that always uses MinMaxScaler and one that always uses StandardScaler. This is a bit of a contrived example, but once we know more models and transformers there will be plenty of cases where this comes in handy.

The param_grid could then be specified as follows:

param_grid = [ # list of two dicts
    # first dict always uses MinMaxScaler
    {'scaler': [MinMaxScaler()],
     # two options for feature_range, addressed via the step name 'scaler'
     'scaler__feature_range': [(0, 1), (-1, 1)]},
    # second dict always uses StandardScaler
    # there are no other options that we're tuning
    {'scaler': [StandardScaler()]}
]

There are a couple of points to note here: first, the values for scaler always need to be a list, even if it’s a list with a single element. So we can’t specify 'scaler': MinMaxScaler(). Second, I left out the tuning of n_neighbors here. If we want to tune n_neighbors as well as select the preprocessing, we need to specify the range in each of the grids, like so:

param_grid = [
    {'scaler': [MinMaxScaler()],
     'scaler__feature_range': [(0, 1), (-1, 1)],
     'knn__n_neighbors': range(1, 10)},

    {'scaler': [StandardScaler()],
     'knn__n_neighbors': range(1, 10)}
]
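Such a list is passed to GridSearchCV in exactly the same way as a single dictionary. Here is a minimal sketch (the variable name list_grid is just for illustration), reusing the pipe with the 'scaler' and 'knn' steps and the train-test split from above:

# a list of grids is handled by GridSearchCV just like a single grid
list_grid = GridSearchCV(pipe, param_grid, cv=10)
list_grid.fit(X_train, y_train)
print(list_grid.best_params_)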

This usage of GridSearchCV is a bit more advanced and it doesn’t come up that often, but it’s good to have in your back pocket.

Accessing attributes in a grid-searched pipeline

Finally, I want to walk through how you can access any attribute of your model when it is inside a pipeline inside a grid search. We have seen all the parts of this already, but it’s a bit involved, so I want to unpack it here. We fit a grid object above, which contained a Pipeline consisting of a 'scaler' step and a 'knn' step. Now let’s say we want to find out what the mean of the training data was (again, this is a bit contrived, but it will come in handy later for model inspection). As we learned in chapter TODO, we can access the model fitted on the whole training data using the best_estimator_ attribute of GridSearchCV:

grid
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'knn__n_neighbors': range(1, 10),
                         'scaler': [MinMaxScaler(), StandardScaler(),
                                    'passthrough']})
grid.best_estimator_
Pipeline(steps=[('scaler', StandardScaler()),
                ('knn', KNeighborsClassifier(n_neighbors=8))])

As you can see (and might have expected), grid.best_estimator_ is a pipeline. So if we want to access the scaler, we need to extract the step we’re interested in, for example using []:

grid.best_estimator_['scaler']
StandardScaler()

It’s not immediately obvious from the representation in Jupyter, but this is the scaler that was fitted on the whole training dataset. Now if we want to access the mean_ we can just do so:

# suppress scientific notation, only show two decimal points
np.set_printoptions(suppress=True, precision=2)
grid.best_estimator_['scaler'].mean_
array([ 14.12,  19.2 ,  91.89, 654.92,   0.1 ,   0.1 ,   0.09,   0.05,
         0.18,   0.06,   0.4 ,   1.21,   2.86,  40.13,   0.01,   0.03,
         0.03,   0.01,   0.02,   0.  ,  16.21,  25.51, 106.89, 873.72,
         0.13,   0.25,   0.27,   0.11,   0.29,   0.08])

TODO ColumnTransformer also?

Summary

In this chapter we saw the importance of using pipelines to avoid information leakage, in particular when using cross-validation. We also saw how you can combine Pipeline and GridSearchCV to tune your whole workflow with minimal code. Understanding Pipeline and how it interacts with model validation is critical for working with scikit-learn. Now you know all of the most important building blocks of scikit-learn, and you have all the tools you need to start using the different models it implements.