import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
plt.rcParams["savefig.dpi"] = 300
plt.rcParams["savefig.bbox"] = 'tight'
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter('ignore', (FutureWarning, ConvergenceWarning))
Feature Importances and Model Interpretation¶
Model Interpretation and Feature Selection¶
03/04/20
Andreas C. Müller
Alright, everybody. Today we will talk about feature selection and model interpretation, look at how to do this in general, and how to do it with scikit-learn. We're also going to look into a couple of other Python libraries that help with these two things.
Model Interpretation (post-hoc)?¶
Not inference!¶
Not causality!¶
still useful(?)¶
Possibly useful for feature selection, feature engineering, model debugging, and explaining predictions. Alternatively: use models that are easy to interpret! Or: derive a simpler, explainable model from a non-explainable one (model compression).
Caveat: if your model is bad, interpreting it makes no sense!
Types of explanations¶
Explain model globally¶
How does the output depend on the input?
Often: some form of marginals
Explain model locally¶
Why did it classify this point this way?
Explanation could look like a “global” one but be different for each point.
“What is the minimum change to classify it differently?”
Marginals: how does the prediction change as a function of a particular feature?
Explaining the model \(\neq\) explaining the data¶
model inspection only tells you about the model
the model might not accurately reflect the data
“Features important to the model”?¶
Naive:
coef_ for linear models (abs value or norm for multi-class)
feature_importances_ for tree-based models
Use with care!
Linear Model coefficients¶
Relative importance only meaningful after scaling
Correlation among features might make coefficients completely uninterpretable
L1 regularization will pick one at random from a correlated group
Any penalty will invalidate the usual interpretation of linear coefficients
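A minimal sketch (not from the slides) of the scaling point above, on synthetic data: the raw coefficient of a feature on a much larger scale looks tiny, while after standardization the coefficients become comparable.

```python
# Sketch: coefficients are only comparable after features are on the same scale.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X[:, 0] *= 100  # put feature 0 on a much larger scale than the others

raw = Ridge().fit(X, y)
scaled = make_pipeline(StandardScaler(), Ridge()).fit(X, y)

print("raw coef_:   ", np.round(raw.coef_, 2))         # feature 0 looks unimportant
print("scaled coef_:", np.round(scaled[-1].coef_, 2))   # magnitudes now comparable
```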
Drop Feature Importance¶
.smallest[
```python
from sklearn.model_selection import cross_val_score

def drop_feature_importance(est, X, y):
    base_score = np.mean(cross_val_score(est, X, y))
    scores = []
    for feature in range(X.shape[1]):
        mask = np.ones(X.shape[1], 'bool')
        mask[feature] = False
        X_new = X[:, mask]
        this_score = np.mean(cross_val_score(est, X_new, y))
        scores.append(base_score - this_score)
    return np.array(scores)
```]
.smallest[
- Doesn't really explain model (refits for each feature)
- Can't deal with correlated features well
- Very slow
- Can be used for feature selection
]
On its own this is not very useful; we'll see it again later.
It rebuilds many models, so it's not really explaining this particular model!
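A hypothetical usage sketch of the function above; the dataset and estimator are placeholders, not part of the original notebook. Note that the estimator is refit once per feature (times the number of CV folds), which is why this is slow.

```python
# Hypothetical usage of drop_feature_importance defined above.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
scores = drop_feature_importance(LogisticRegression(max_iter=1000), X, y)
print(scores)  # drop in mean CV accuracy when each feature is left out
```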
+++
## Permutation importance
Idea: measure marginal influence of one feature
$$I_i^\text{perm} = \text{Acc}(f, X, y) - \mathbb{E}_{x_i}\left[\text{Acc}(f(x_i, X_{-i}), y)\right]$$
```python
def permutation_importance(est, X, y, n_repeat=100):
    baseline_score = est.score(X, y)
    scores = np.zeros((X.shape[1], n_repeat))
    for f_idx in range(X.shape[1]):
        for repeat in range(n_repeat):
            X_new = X.copy()
            # permute a single column, leaving all other features intact
            X_new[:, f_idx] = np.random.permutation(X_new[:, f_idx])
            feature_score = est.score(X_new, y)
            scores[f_idx, repeat] = baseline_score - feature_score
    return scores
```
--
Applied on validation set given trained estimator.
Also kinda slow.
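scikit-learn (0.22+) ships an implementation in sklearn.inspection that mirrors the hand-rolled version above; a brief sketch with placeholder data:

```python
# Sketch using scikit-learn's built-in permutation importance.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)  # mean drop in score per permuted feature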
LIME¶
.smaller[
Build sparse linear local model around each data point.
Explain prediction for each point locally.
Paper: “Why Should I Trust You?” Explaining the Predictions of Any Classifier
Implementation: ELI5, https://github.com/marcotcr/lime ]
.center[ ]
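A rough usage sketch of the lime package's tabular explainer (assuming lime is installed; argument names may differ slightly between versions):

```python
# Sketch: local sparse linear explanation of one prediction with lime.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(random_state=0).fit(iris.data, iris.target)

explainer = LimeTabularExplainer(iris.data, feature_names=iris.feature_names,
                                 class_names=iris.target_names, mode='classification')
exp = explainer.explain_instance(iris.data[0], rf.predict_proba, num_features=2)
print(exp.as_list())  # (feature condition, weight) pairs for this single prediction
```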
SHAP¶
Built around the idea of Shapley values
very roughly: computes drop-out importance for every subset of features
intractable; sampling approximations exist
fast variants for linear and tree-based models
awesome vis and tools: https://github.com/slundberg/shap
allows local / per sample explanations
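A rough sketch of the shap package's tree explainer (exact API and plots vary between shap versions):

```python
# Sketch: per-sample SHAP values for a tree ensemble, summarized globally.
import shap
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(random_state=0).fit(iris.data, iris.target)

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(iris.data)  # one array per class for classifiers
# aggregate the local explanations into a global importance plot
shap.summary_plot(shap_values, iris.data, feature_names=iris.feature_names)
```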
Case Study¶
Models on lots of data¶
.smallest[
lasso = LassoCV().fit(X_train, y_train)
lasso.score(X_test, y_test)
0.545
ridge = RidgeCV().fit(X_train, y_train)
ridge.score(X_test, y_test)
0.545
lr = LinearRegression().fit(X_train, y_train)
lr.score(X_test, y_test)
0.545
param_grid = {'max_leaf_nodes': range(5, 40, 5)}
grid = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=10, n_jobs=3)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)
0.545
rf = RandomForestRegressor(min_samples_leaf=5).fit(X_train, y_train)
rf.score(X_test, y_test)
0.542 ]
Coefficients and default feature importance¶
tree and RF have no directionality in importances
lasso (and to some degree tree) selected randomly among correlated
RF assigns high importance to random features
Permutation Importances¶
.center[ ]
SHAP values¶
.center[ ]
More model inspection¶
Partial Dependence Plots¶
Marginal dependence of the prediction on one (or two) features
Idea: get marginal predictions given the feature.
How? "integrate out" the other features using validation data
Fast methods available for tree-based models (don't require validation data)
Nonsensical for linear models.
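A conceptual sketch (not scikit-learn's fast tree-based algorithm) of the brute-force computation described above, for one numeric feature:

```python
# Brute-force partial dependence: fix one feature to each grid value,
# keep the other features as observed, and average the predictions.
import numpy as np

def partial_dependence_1d(model, X, feature, grid_resolution=50):
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), grid_resolution)
    averaged = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value                      # intervene on one feature
        averaged.append(model.predict(X_mod).mean())   # "integrate out" the rest
    return grid, np.array(averaged)
```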
Partial Dependence¶
.tiny[
from sklearn.inspection import plot_partial_dependence
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target,random_state=0)
gbrt = GradientBoostingRegressor().fit(X_train, y_train)
fig, axs = plot_partial_dependence(gbrt, X_train, np.argsort(gbrt.feature_importances_)[-6:], feature_names=boston.feature_names)
]
–
.center[ ]
But there's also something else that's quite interesting, called partial dependence plots, which is actually possible for all trees. Unfortunately, in scikit-learn it's only available for gradient boosting. So the idea here is to not only see which features are important, but also how they influence the target. After you fit your model, and I'm using the Boston dataset with a gradient boosting regressor here, there's a function called plot_partial_dependence, which gets the model, the training data, and then the features for which I want to make the partial dependence plots.
So here, I'm looking at the six most important features, where importance is measured by the tree-based feature importances. So I sort those and take the six most important ones.
And this is what the partial dependence plots look like. This is the most important feature, the second most important feature, and so on. It's sort of the marginal contribution of each feature. Basically, in each tree you sum out the contribution of all the other features, and you look only at the contribution of this particular feature. In general, this would be hard to compute, but for tree-based models you can compute it in a very efficient way, because you can basically sum over the whole space by traversing the tree.
Here, with increased room size the price increases, and you can see how it increases, with a big step function here. And you can see that for increased LSTAT, the price decreases. For the others there isn't any drastic effect. There seems to be some threshold for NOX, and something funky going on with DIS.
The question was: I said that for random forests you can't find out the direction in which a feature influences the prediction. But that was for the feature importances. The feature importance tells you the impurity decrease, and those are just numbers, so they don't give you a direction. For both gradient boosting and random forests you can look at partial dependence plots, and they give you a non-parametric model of how a single feature influences the prediction. Unfortunately, for random forests it's not available in scikit-learn, but you could do the same thing, because in the end the way they predict is quite similar.
Bivariate Partial Dependence Plots¶
.smaller[
plot_partial_dependence(
gbrt, X_train, [np.argsort(gbrt.feature_importances_)[-2:]],
feature_names=boston.feature_names, n_jobs=3, grid_resolution=50)
] .center[ ]
You can also do this bivariately. So here, looking at the two most important features in a bivariate plot, I'm not giving it a list of features, but a list containing a list of features. And the two most important features are LSTAT and RM.
You can see there's no very complex interaction; this is the effect of these two features, taking all the other features into account.
This is something you can do easily for a linear model, where you would just get straight lines everywhere, but for more complicated models it is very hard to do, and way more tricky for neural networks.
Partial Dependence for Classification¶
.tiny-code[
from sklearn.inspection import plot_partial_dependence
for i in range(3):
fig, axs = plot_partial_dependence(gbrt, X_train, range(4), n_cols=4,
feature_names=iris.feature_names, grid_resolution=50, label=i)
] .center[ ]
We can do the same for classification. Here are partial dependence plots for the iris dataset.
As I said, for classification we're usually using one-vs-rest. So here, we have three classifiers.
One classifier for setosa, one for versicolor, and one for virginica.
You can see that for the first two features the curves are basically flat. But for petal length, the decision function is high at small petal lengths for setosa, at medium petal lengths for versicolor, and at large petal lengths for virginica. These are all put through the logistic function and then normalized.
PDP Caveats¶
.left-column[ ]¶
.right-column[ ]
Ice Box¶
.smaller[
like partial dependence plots, without the `.mean(axis=0)` (see the sketch after the links below)
]
.center[ ] .tiny[
https://pdpbox.readthedocs.io/en/latest/
https://github.com/AustinRochford/PyCEbox
https://github.com/scikit-learn/scikit-learn/pull/16164 ]
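A rough sketch of what these libraries compute: the same grid evaluation as partial dependence, but keeping one curve per sample instead of taking the mean.

```python
# Sketch of ICE curves: per-sample predictions over a grid for one feature;
# averaging the curves (axis=0) recovers the partial dependence.
import numpy as np

def ice_curves(model, X, feature, grid_resolution=50):
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), grid_resolution)
    curves = np.empty((X.shape[0], grid_resolution))
    for i, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value
        curves[:, i] = model.predict(X_mod)  # one prediction per sample
    return grid, curves

# hypothetical plotting, e.g. with the `gb` model fit further below:
# grid, curves = ice_curves(gb, X, feature=1)
# plt.plot(grid, curves.T, c='k', alpha=.1, linewidth=.3)   # one thin line per sample
# plt.plot(grid, curves.mean(axis=0), c='r', linewidth=2)   # the partial dependence
```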
# generate data from original 2d linear model
from sklearn.preprocessing import scale
# 35 13 50?
rng = np.random.RandomState(13)
n_samples = 100000
n_informative = 2
n_correlated_per_inf = 2
n_noise = 4
noise_std = .0001
noise_correlated_std = .51
noise_y = .3
X_original = rng.uniform(-1, 1, size=(n_samples, n_informative))
#coef = rng.normal(size=n_informative)
coef = np.array([-3.2, 1.4])
y = np.dot(X_original, coef) + rng.normal(scale=noise_y, size=n_samples)
correlated_transform = np.zeros((n_correlated_per_inf * n_informative, n_informative))
for i in range(n_informative):
correlated_transform[i * n_correlated_per_inf: (i + 1) * n_correlated_per_inf, i] = rng.normal(size=n_correlated_per_inf)
X_original += rng.normal(scale=np.array([1, 1]) * noise_correlated_std, size=X_original.shape)
X_correlated = np.dot(X_original, correlated_transform.T)
X = np.hstack([X_correlated, np.zeros((n_samples, n_noise))])
X += rng.normal(scale=noise_std, size=X.shape)
X = scale(X)
X.shape
(100000, 8)
plt.imshow(np.cov(X, rowvar=False), cmap='bwr_r')
plt.title("Synthetic data Covariance")
plt.xlabel("feature index")
plt.ylabel("feature index")
plt.colorbar()
plt.savefig("images/covariance.png")
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for i, ax in enumerate(axes.ravel()):
ax.plot(X[::10, i], y[::10], '.', alpha=.1)
ax.set_xlabel("feature {}".format(i))
ax.set_ylabel("target y")
plt.savefig("images/toy_data_scatter.png")
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.linear_model import LassoCV, RidgeCV, LinearRegression
for i in range(X.shape[1]):
lr = LinearRegression().fit(X_train[:, [i]], y_train)
print(lr.score(X_test[:, [i]], y_test))
0.4532506867871262
0.4532514261726276
0.09275569004419326
0.09275599504713594
-0.00020773972150145426
-0.00017276124646903313
-0.00018329268681771538
-0.00020494411292082404
lasso = LassoCV().fit(X_train, y_train)
lasso.score(X_test, y_test)
0.5453241219700229
ridge = RidgeCV().fit(X_train, y_train)
ridge.score(X_test, y_test)
0.5453306487434062
lr = LinearRegression().fit(X_train, y_train)
lr.score(X_test, y_test)
0.5453299378456281
from sklearn.decomposition import PCA
pca = PCA(n_components=.99).fit(X_train)
X_train_pca = pca.transform(X_train)
lr_pca = LinearRegression().fit(X_train_pca, y_train)
inverse_lr_pca_coef = pca.inverse_transform(lr_pca.coef_)
lr_pca.score(pca.transform(X_test), y_test)
0.5453305950074661
pca.n_components_
6
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
param_grid = {'max_leaf_nodes': range(5, 40, 5)}
grid = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=10, n_jobs=3)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)
0.5452326112689287
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(min_samples_leaf=5).fit(X_train, y_train)
rf.score(X_test, y_test)
0.5426645543931425
from sklearn.ensemble import ExtraTreesRegressor
et = ExtraTreesRegressor().fit(X_train, y_train)
et.score(X_test, y_test)
0.5216260871052119
def plot_importance(some_dict):
plt.figure(figsize=(10, 4))
df = pd.DataFrame(some_dict)
ax = plt.gca()
df.plot.bar(ax=ax, width=.9)
ax.set_ylim(-1.5, 1.5)
ax.set_xlim(-.5, len(df) - .5)
ax.set_xlabel("feature index")
ax.set_ylabel("importance value")
plt.vlines(np.arange(.5, len(df) -1), -1.5, 1.5, linewidth=.5)
tree = grid.best_estimator_
plot_importance({'lasso': lasso.coef_, 'ridge': ridge.coef_, 'lr': lr.coef_, 'tree': tree.feature_importances_, 'rf':rf.feature_importances_})
plt.title("Coefficients and entropy improvement on large data")
plt.savefig("images/standard_importances.png")
from sklearn.model_selection import cross_val_score
def drop_feature_importance(est, X, y):
base_score = np.mean(cross_val_score(est, X, y))
scores = []
for feature in range(X.shape[1]):
mask = np.ones(X.shape[1], 'bool')
mask[feature] = False
X_new = X[:, mask]
this_score = np.mean(cross_val_score(est, X_new, y))
scores.append(base_score - this_score)
return np.array(scores)
from sklearn.inspection import permutation_importance
perm_ridge_test = permutation_importance(ridge, X_test, y_test)['importances_mean']
perm_lasso_test = permutation_importance(lasso, X_test, y_test)['importances_mean']
perm_tree_test = permutation_importance(tree, X_test, y_test)['importances_mean']
perm_rf_test = permutation_importance(rf, X_test, y_test)['importances_mean']
perm_lr_test = permutation_importance(lr, X_test, y_test)['importances_mean']
tree = grid.best_estimator_
plot_importance({'lasso': perm_lasso_test, 'ridge': perm_ridge_test, 'lr': perm_lr_test,'tree': perm_tree_test, 'rf':perm_rf_test})
plt.title("Permutation importance on test set (large training data)")
plt.savefig("images/permutation_importance_big.png")
import shap
def shap_linear(model, X_train, X_test):
linear_explainer = shap.LinearExplainer(model, X_train)
shap_values = linear_explainer.shap_values(X_test)
s = shap_values.mean(axis=0)
s /= np.linalg.norm(s)
return s
shap_ridge = shap_linear(ridge, X_train, X_test)
shap_lasso = shap_linear(lasso, X_train, X_test)
def shap_trees(model, X_train, X_test, approximate=False, tree_limit=None):
tree_explainer = shap.TreeExplainer(model, X_train)
shap_values = tree_explainer.shap_values(X_test, approximate=approximate, tree_limit=tree_limit)
s = shap_values.mean(axis=0)
s /= np.linalg.norm(s)
return s
shap_tree = shap_trees(tree, X_train, X_test)
Passing 75000 background samples may lead to slow runtimes. Consider using shap.sample(data, 100) to create a smaller background data set. 100%|===================| 24983/25000 [09:37<00:00]
# limit trees cause I'm in a hurry
shap_forest = shap_trees(rf, X_train, X_test, approximate=True, tree_limit=None)
plot_importance({'lasso': shap_lasso, 'ridge': shap_ridge, 'tree': shap_tree, 'rf':shap_forest})
plt.title("SHAP values on test set (large training data)")
plt.savefig("images/shap_big.png")
# X_test is actually big, not small, but that would be confusing naming, right?
X_train_small, X_test_small, y_train_small, y_test_small = train_test_split(X, y, train_size=0.001, random_state=0)
X_train_small.shape
(100, 8)
lasso_small = LassoCV().fit(X_train_small, y_train_small)
lasso_small.score(X_test_small, y_test_small)
0.5361580595868876
ridge_small = RidgeCV().fit(X_train_small, y_train_small)
ridge_small.score(X_test_small, y_test_small)
0.5336853626423791
lr_small = LinearRegression().fit(X_train_small, y_train_small)
lr_small.score(X_test_small, y_test_small)
0.5262746934049003
from sklearn.decomposition import PCA
pca_small = PCA(n_components=.99).fit(X_train_small)
X_train_pca = pca_small.transform(X_train_small)
lr_pca_small = LinearRegression().fit(X_train_pca, y_train_small)
inverse_lr_pca_coef = pca.inverse_transform(lr_pca_small.coef_)
lr_pca_small.score(pca_small.transform(X_test_small), y_test_small)
0.5303117477435796
param_grid = {'max_leaf_nodes': range(2, 20)}
grid_tree_small = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=10, n_jobs=3)
grid_tree_small.fit(X_train_small, y_train_small)
grid_tree_small.score(X_test_small, y_test_small)
0.34452144133533447
rf_small = RandomForestRegressor(min_samples_leaf=5).fit(X_train_small, y_train_small)
rf_small.score(X_test_small, y_test_small)
0.5053610963883473
tree_small = grid_tree_small.best_estimator_
plot_importance({'lasso': lasso_small.coef_, 'ridge': ridge_small.coef_, 'lr': lr_small.coef_, 'tree': tree_small.feature_importances_, 'rf':rf_small.feature_importances_})
plt.title("Coefficients and entropy improvement on 100 samples")
plt.savefig("images/standard_importances_small.png")
perm_ridge_test_small = permutation_importance(ridge_small, X_test_small, y_test_small)['importances_mean']
perm_lasso_test_small = permutation_importance(lasso_small, X_test_small, y_test_small)['importances_mean']
perm_tree_test_small = permutation_importance(tree_small, X_test_small, y_test_small)['importances_mean']
perm_rf_test_small = permutation_importance(rf_small, X_test_small, y_test_small)['importances_mean']
plot_importance({'lasso': perm_lasso_test_small, 'ridge': perm_ridge_test_small, 'tree': perm_tree_test_small, 'rf':perm_rf_test_small})
plt.title("Permutation importance on test set (100 training samples)")
plt.savefig("images/permutation_importance_small.png")
shap_ridge_small = shap_linear(ridge_small, X_train_small, X_test_small)
shap_lasso_small = shap_linear(lasso_small, X_train_small, X_test_small)
/home/andy/anaconda3/envs/py37/lib/python3.7/site-packages/shap/explainers/linear.py:47: UserWarning: The default value for feature_dependence has been changed to "independent"! warnings.warn('The default value for feature_dependence has been changed to "independent"!')
shap_tree_small = shap_trees(tree_small, X_train_small, X_test_small)
from sklearn.tree import plot_tree
plt.figure(figsize=(10, 4))
plot_tree(grid_tree_small.best_estimator_);
dt = DecisionTreeRegressor(max_leaf_nodes=3).fit(X_train_small[:, [0, 2]], y_train_small)
dt.score(X_test_small[:, [0, 2]], y_test_small)
0.34452144133533447
best_tree = grid_tree_small.best_estimator_
drop_tree_small = drop_feature_importance(best_tree, X_train_small, y_train_small)
perm_tree_small = permutation_importance(best_tree, X_test_small, y_test_small)['importances_mean']
plot_importance({'lasso': np.abs(lasso_small.coef_), 'ridge': np.abs(ridge_small.coef_), 'tree': grid_tree_small.best_estimator_.feature_importances_,
'perm_tree': perm_tree_small, 'drop_tree': drop_tree_small})
plt.ylabel("feature importance (abs of coef for linear models)")
plt.title("Tree model with 100 samples")
plt.savefig("images/tree_less_data_all.png")
def shap_trees(model, X_train, X_test):
tree_explainer = shap.TreeExplainer(model, X_train)
shap_values = tree_explainer.shap_values(X_test)
s = shap_values.mean(axis=0)
s /= np.linalg.norm(s)
return s
shap_tree_small = shap_trees(best_tree, X_train_small, X_test_small)
Random Forest¶
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor().fit(X_train, y_train)
rf.score(X_test, y_test)
plot_importance({'lasso': np.abs(lasso.coef_), 'ridge': np.abs(ridge.coef_), 'tree': grid.best_estimator_.feature_importances_, 'forest': rf.feature_importances_})
plt.ylabel("feature importance (abs of coef for linear models)")
plt.title("Tree model with 'enough' data")
plt.savefig("images/forest_enough_data.png")
rf_small = RandomForestRegressor().fit(X_train_small, y_train_small)
rf_small.score(X_test_small, y_test_small)
0.44742420319303305
plot_importance({'lasso': np.abs(lasso_small.coef_), 'ridge': np.abs(ridge_small.coef_), 'tree': grid_tree_small.best_estimator_.feature_importances_,
'perm_tree': perm_tree_small, 'drop_tree': drop_tree_small, 'forest_small': rf_small.feature_importances_})
plt.ylabel("feature importance (abs of coef for linear models)")
plt.title("Forest model with 100 samples")
plt.savefig("images/forest_less_data_all.png")
perm_forest_small = permutation_importance(rf, X_test_small, y_test_small)['importances_mean']
shap_forest_small = shap_trees(rf, X_train_small, X_test_small)
plot_importance({'lasso': np.abs(lasso_small.coef_), 'ridge': np.abs(ridge_small.coef_), 'tree': grid_tree_small.best_estimator_.feature_importances_,
'perm_tree': perm_tree_small, 'drop_tree': drop_tree_small, 'forest_small': rf_small.feature_importances_,
'perm_forest': perm_forest_small, 'shap_forest': shap_forest_small})
plt.ylabel("feature importance (abs of coef for linear models)")
plt.title("Forest model with 100 samples")
plt.savefig("images/forest_less_data_all.png")
Partial Dependence¶
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
boston.data, boston.target, random_state=0)
gbrt = GradientBoostingRegressor().fit(X_train, y_train)
gbrt.score(X_test, y_test)
0.8153923574592779
from sklearn.ensemble.partial_dependence import plot_partial_dependence
fig, axs = plot_partial_dependence(gbrt, X_train, np.argsort(gbrt.feature_importances_)[-6:],
feature_names=boston.feature_names,
n_jobs=3, grid_resolution=50)
plt.tight_layout()
fig, axs = plot_partial_dependence(gbrt, X_train, [np.argsort(gbrt.feature_importances_)[-2:]],
feature_names=boston.feature_names,
n_jobs=3, grid_resolution=50)
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, stratify=iris.target, random_state=0)
gbrt_iris = GradientBoostingClassifier().fit(X_train, y_train)
for i in range(3):
fig, axs = plot_partial_dependence(gbrt_iris, X_train, range(4), n_cols=4,
feature_names=iris.feature_names, grid_resolution=50, label=i,
figsize=(8, 2))
fig.suptitle(iris.target_names[i])
for ax in axs: ax.set_xticks(())
for ax in axs[1:]: ax.set_ylabel("")
plt.tight_layout()
gbrt_iris.score(X_test, y_test)
0.9736842105263158
ICEBox¶
from sklearn.datasets import make_blobs
X = np.random.normal(size=(2000, 3))
w = np.array([0, .5, .1])
y = np.dot(X, w) + np.random.normal(scale=0.3, size=(2000,))
mask = X[:, 0] > 0
#X[mask, 1] -= 2
y[mask] = 1 - y[mask]
plt.plot(X[:, 0], y, 'o')
plt.scatter(X[:, 1], y, alpha=.5, s=4)
plt.savefig("images/pdp_failure_data.png")
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import plot_partial_dependence
fig, axes = plt.subplots(2, 1)
axes[0].hist(y, bins='auto')
gb = HistGradientBoostingRegressor().fit(X, y)
pdp = plot_partial_dependence(gb, X, [1], ax=axes[1])
plt.xlabel("feature")
plt.savefig("images/pdp_failure.png")
from pycebox.ice import ice, ice_plot
ice_df = ice(pd.DataFrame(X), 1, gb.predict, num_grid_points=100)
ice_plot(ice_df, frac_to_plot=1, plot_pdp=True,
c='k', alpha=0.1, linewidth=0.3)
plt.ylabel("partial dependence")
plt.xlabel("feature")
plt.savefig("images/ice_cross.png")
from pdpbox import pdp
feature_isolate = pdp.pdp_isolate(gb, pd.DataFrame(X), [0, 1, 2], 1, num_grid_points=100)
fig, axes = pdp.pdp_plot(feature_isolate, 0, plot_lines=True, frac_to_plot=1)
from pdpbox import pdp
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
boston.data, boston.target, random_state=0)
gbrt = GradientBoostingRegressor().fit(X_train, y_train)
gbrt.score(X_test, y_test)
X_train_df = pd.DataFrame(X_train, columns=boston.feature_names)
features_of_interest = boston.feature_names[np.argsort(gbrt.feature_importances_)[-6:]]
for feature in features_of_interest:
feature_isolate = pdp.pdp_isolate(gbrt, X_train_df, X_train_df.columns, feature)
fig, axes = pdp.pdp_plot(feature_isolate, feature, plot_lines=True, frac_to_plot=1)
feature = 'LSTAT'
feature_isolate = pdp.pdp_isolate(gbrt, X_train_df, X_train_df.columns, feature)
fig, axes = pdp.pdp_plot(feature_isolate, feature, plot_lines=True, frac_to_plot=1)
fig.savefig("images/ice_lstat.png")
# yet another library
from pycebox.ice import ice, ice_plot
fig, axes = plt.subplots(2, 3, figsize=(16, 8))
for feature, ax in zip(features_of_interest, axes.ravel()):
ice_df = ice(X_train_df, feature, gbrt.predict, num_grid_points=100)
ice_plot(ice_df, frac_to_plot=10, plot_pdp=True,
c='k', alpha=0.1, linewidth=0.3, ax=ax)
ax.set_title(feature)
plt.savefig("images/boston_ice.png")
Feature Selection¶
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
boston = load_boston()
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import scale
X_train_scaled = scale(X_train)
cov = np.cov(X_train_scaled, rowvar=False)
plt.figure(figsize=(8, 8), dpi=100)
plt.imshow(cov)
plt.xticks(range(X.shape[1]), boston.feature_names)
plt.yticks(range(X.shape[1]), boston.feature_names);
from scipy.cluster import hierarchy
order = np.array(hierarchy.dendrogram(hierarchy.ward(cov), no_plot=True)['ivl'], dtype="int")
plt.figure(figsize=(8, 8), dpi=100)
plt.imshow(cov[order, :][:, order])
plt.xticks(range(X.shape[1]), boston.feature_names[order])
plt.yticks(range(X.shape[1]), boston.feature_names[order]);
Supervised feature selection¶
from sklearn.feature_selection import f_regression
f_values, p_values = f_regression(X, y)
fig, ax = plt.subplots(2, 1)
ax[0].set_title("F values")
ax[0].plot(f_values, 'o')
ax[1].set_title("p values")
ax[1].plot(p_values, 'o')
ax[1].set_yscale("log")
ax[1].set_xticks(range(X.shape[1]))
ax[1].set_xticklabels(boston.feature_names, rotation=50);
fig.tight_layout()
from sklearn.feature_selection import SelectKBest, SelectPercentile, SelectFpr
from sklearn.linear_model import RidgeCV
select = SelectKBest(k=2, score_func=f_regression)
select.fit(X_train, y_train)
print(X_train.shape)
print(select.transform(X_train).shape)
(379, 13)
(379, 2)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
all_features = make_pipeline(StandardScaler(), RidgeCV())
select_2 = make_pipeline(StandardScaler(), SelectKBest(k=2, score_func=f_regression), RidgeCV())
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(all_features, X_train, y_train, cv=10))
0.71795885107509
np.mean(cross_val_score(select_2, X_train, y_train, cv=10))
0.6243625749168433
from sklearn.feature_selection import mutual_info_regression
scores = mutual_info_regression(X_train, y_train, discrete_features=[3])
fig = plt.figure(figsize=(8, 2))
line_f, = plt.plot(f_values, 'o', c='r')
plt.ylabel("F value")
ax2 = plt.twinx()
line_s, = ax2.plot(scores, 'o', alpha=.7)
ax2.set_ylabel("MI score")
plt.xticks(range(X.shape[1]), boston.feature_names)
plt.legend([line_s, line_f], ["Mutual info scores", "F values"], loc=(0, 1))
<matplotlib.legend.Legend at 0x7ffa18176da0>
from sklearn.linear_model import Lasso
X_train_scaled = scale(X_train)
lasso = LassoCV().fit(X_train_scaled, y_train)
print(lasso.coef_)
[-0. 0. -0. 0. -0. 2.52933025
-0. -0. -0. -0.22763148 -1.70088382 0.13186059
-3.60565498]
fig = plt.figure(figsize=(8, 2))
line_f, = plt.plot(f_values, 'o', c='r')
plt.ylabel("F value")
ax2 = plt.twinx()
ax2.set_ylabel("lasso coefficients")
line_s, = ax2.plot(np.abs(lasso.coef_), 'o', alpha=.7)
plt.xticks(range(X.shape[1]), boston.feature_names)
plt.legend([line_s, line_f], ["Lasso coefficients abs", "F values"], loc=(0, 1))
<matplotlib.legend.Legend at 0x7ffa1a139e80>
from sklearn.linear_model import Lasso
X_train_scaled = scale(X_train)
lasso = Lasso().fit(X_train_scaled, y_train)
print(lasso.coef_)
[-0. 0. -0. 0. -0. 2.52933025
-0. -0. -0. -0.22763148 -1.70088382 0.13186059
-3.60565498]
fig = plt.figure(figsize=(8, 2))
line_f, = plt.plot(f_values, 'o', c='r')
plt.ylabel("F value")
ax2 = plt.twinx()
ax2.set_ylabel("lasso coefficients")
line_s, = ax2.plot(np.abs(lasso.coef_), 'o', alpha=.7)
plt.xticks(range(X.shape[1]), boston.feature_names)
plt.legend([line_s, line_f], ["Lasso coefficients abs", "F values"], loc=(0, 1))
<matplotlib.legend.Legend at 0x7ffa1a07e978>
X_train.shape
(379, 13)
from sklearn.feature_selection import SelectFromModel
select_lassocv = SelectFromModel(LassoCV())
select_lassocv.fit(X_train, y_train)
print(select_lassocv.transform(X_train).shape)
(379, 11)
pipe_lassocv = make_pipeline(StandardScaler(), select_lassocv, RidgeCV())
np.mean(cross_val_score(pipe_lassocv, X_train, y_train, cv=10))
0.7171231551882247
np.mean(cross_val_score(all_features, X_train, y_train, cv=10))
0.71798347520832284
# could grid-search alpha in lasso
select_lasso = SelectFromModel(Lasso())
pipe_lasso = make_pipeline(StandardScaler(), select_lasso, RidgeCV())
np.mean(cross_val_score(pipe_lasso, X_train, y_train, cv=10))
0.67051240477576868
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
# create ranking among all features by selecting only one
rfe = RFE(LinearRegression(), n_features_to_select=1)
rfe.fit(X_train_scaled, y_train)
rfe.ranking_
array([ 9, 8, 13, 11, 5, 2, 12, 4, 7, 6, 3, 10, 1])
lr = LinearRegression().fit(X_train_scaled, y_train)
fig = plt.figure(figsize=(8, 2))
line_f, = plt.plot(np.abs(lr.coef_), 'o', c='r')
plt.ylabel("LR coefficients abs")
ax2 = plt.twinx()
ax2.set_ylabel("1 - RFE ranking")
line_s, = ax2.plot(1 - rfe.ranking_, 'o', alpha=.7)
plt.xticks(range(X.shape[1]), boston.feature_names)
plt.legend([line_s, line_f], ["RFE ranking", "LR coefficients"], loc=(0, 1))
<matplotlib.legend.Legend at 0x7f064678fb38>
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV
rfe = RFECV(LinearRegression(), cv=10)
rfe.fit(X_train_scaled, y_train)
print(rfe.support_)
print(boston.feature_names[rfe.support_])
[ True True False True True True False True True True True True
True]
['CRIM' 'ZN' 'CHAS' 'NOX' 'RM' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
pipe_rfe_ridgecv = make_pipeline(StandardScaler(), RFECV(LinearRegression(), cv=10), RidgeCV())
np.mean(cross_val_score(pipe_rfe_ridgecv, X_train, y_train, cv=10))
0.71019583375843598
from sklearn.preprocessing import PolynomialFeatures
pipe_rfe_ridgecv = make_pipeline(StandardScaler(), PolynomialFeatures(), RFECV(LinearRegression(), cv=10), RidgeCV())
np.mean(cross_val_score(pipe_rfe_ridgecv, X_train, y_train, cv=10))
0.82031507795494429
pipe_rfe_ridgecv.fit(X_train, y_train)
print(pipe_rfe_ridgecv.named_steps['rfecv'].support_)
[False True True True False True True True False True True False
True True False True True True True False False False True True
True False False True True False True False False False False True
True True False True False False False True True True True True
True True False True False False True False False False True False
False False True True True True True False False False True True
False True True False False False False True True False True True
True True False False True True True True True True True True
True False True False True False False False True]
from mlxtend.feature_selection import SequentialFeatureSelector
sfs = SequentialFeatureSelector(LinearRegression(), forward=False, k_features=7)
sfs.fit(X_train_scaled, y_train)
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-75-2d5340964795> in <module>()
----> 1 from mlxtend.feature_selection import SequentialFeatureSelector
2 sfs = SequentialFeatureSelector(LinearRegression(), forward=False, k_features=7)
3 sfs.fit(X_train_scaled, y_train)
ModuleNotFoundError: No module named 'mlxtend'
print(sfs.k_feature_idx_)
print(boston.feature_names[np.array(sfs.k_feature_idx_)])
sfs.k_score_