
Introduction to Machine learning with scikit-learn

Trees and Forests

Andreas C. Müller

Columbia University, scikit-learn

https://github.com/amueller/ml-training-intro

1 / 25


Why Trees?

2 / 25
  • Very powerful modeling method – non-linear!
  • Doesn’t care about scaling or distribution of data!
  • “Interpretable”
  • Basis of very powerful models!

Decision Trees for Classification

3 / 25

Idea: series of binary questions

4 / 25

Building Trees



Continuous features:

  • “questions” are thresholds on single features.
  • Minimize impurity
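
A minimal sketch of this exhaustive search for a single feature (illustrative only, not scikit-learn's actual implementation; x and y are assumed 1-D numpy arrays):

import numpy as np

def gini(y):
    # Gini impurity of a set of labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def best_threshold(x, y):
    # Try every midpoint between consecutive sorted feature values as a "question"
    # and keep the threshold with the lowest weighted impurity of the two children.
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_imp = None, np.inf
    for t in (x[:-1] + x[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp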

5 / 25


Criteria (for classification)

  • Gini Index:

$$H_\text{gini}(X_m) = \sum_{k\in\mathcal{Y}} p_{mk} (1 - p_{mk})$$

  • Cross-Entropy:

$$H_\text{CE}(X_m) = -\sum_{k\in\mathcal{Y}} p_{mk} \log(p_{mk})$$

where $X_m$ are the observations in node $m$, $\mathcal{Y}$ is the set of classes, and $p_{mk}$ is the distribution over classes in node $m$.
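
A quick numerical check of the two criteria (a sketch; p is an assumed class distribution for one node):

import numpy as np

p = np.array([0.8, 0.1, 0.1])           # assumed class distribution p_mk in node m
gini = np.sum(p * (1 - p))              # H_gini = 0.34
cross_entropy = -np.sum(p * np.log(p))  # H_CE is about 0.639
print(gini, cross_entropy)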

6 / 25

Prediction

7 / 25
  • Traverse tree based on feature tests
    • Predict most common class in leaf
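
A sketch of what this traversal looks like on a fitted scikit-learn tree, using its tree_ arrays (illustrative only; clf is assumed to be a fitted DecisionTreeClassifier):

import numpy as np

def predict_one(clf, x):
    # Walk from the root, applying the feature test at each internal node.
    t = clf.tree_
    node = 0
    while t.children_left[node] != -1:                # -1 marks a leaf
        if x[t.feature[node]] <= t.threshold[node]:
            node = t.children_left[node]
        else:
            node = t.children_right[node]
    return clf.classes_[np.argmax(t.value[node])]     # most common class in the leaf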

Regression trees

$$\text{Prediction: } \bar{y}_m = \frac{1}{N_m} \sum_{i \in N_m} y_i $$

Mean Squared Error: $$ H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} (y_i - \bar{y}_m)^2 $$

Mean Absolute Error: $$ H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} |y_i - \bar{y}_m| $$
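
For example, the criterion can be picked when building the regressor; recent scikit-learn releases name them "squared_error" and "absolute_error" (older releases use "mse" / "mae"):

from sklearn.tree import DecisionTreeRegressor

# criterion="squared_error" minimizes the MSE within each node,
# criterion="absolute_error" the MAE
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3)
# reg.fit(X_train, y_train)  # each leaf then predicts the mean of its training targets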

8 / 25
  • Without regularization / pruning:
    • Each leaf often contains a single point to be “pure”

Visualizing trees with sklearn

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
                                                    stratify=cancer.target,
                                                    random_state=0)
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X_train, y_train)

# tree visualization: export the fitted tree as graphviz dot source
tree_dot = export_graphviz(tree, out_file=None, feature_names=cancer.feature_names)
print(tree_dot)
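
To render the dot source (assuming the graphviz Python package and its binaries are installed), something like:

import graphviz
graphviz.Source(tree_dot)  # displays the rendered tree inline in a Jupyter notebook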
9 / 25

Easier visualization

PR #9251 (or a gist)

from tree_plotting import plot_tree
tree_dot = plot_tree(tree, feature_names=cancer.feature_names)
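
Since scikit-learn 0.21, essentially this functionality ships as sklearn.tree.plot_tree (matplotlib-based), so the external helper is no longer needed:

from sklearn.tree import plot_tree
plot_tree(tree, feature_names=cancer.feature_names, filled=True)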

10 / 25

Parameter Tuning

  • Pre-pruning and post-pruning (not in sklearn yet)

  • Limit tree size (pick one, maybe two; see the sketch below):

    • max_depth

    • max_leaf_nodes

    • min_samples_split

    • min_impurity_decrease

    • ...
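
A sketch of what picking one of these looks like (the values here are arbitrary examples):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=4)                    # limit depth
# tree = DecisionTreeClassifier(max_leaf_nodes=8)             # limit number of leaves
# tree = DecisionTreeClassifier(min_samples_split=50)         # don't split small nodes
# tree = DecisionTreeClassifier(min_impurity_decrease=0.01)   # require a minimum impurity gain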

11 / 25

No pruning

12 / 25

max_depth = 4

13 / 25

max_leaf_nodes = 8

14 / 25

min_samples_split = 50

15 / 25
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': range(1, 7)}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid=param_grid,
                    cv=10)
grid.fit(X_train, y_train)
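
After fitting, the chosen depth and a generalization estimate can be read off (usage sketch):

print(grid.best_params_)            # e.g. {'max_depth': 4}
print(grid.best_score_)             # mean cross-validation accuracy for that depth
print(grid.score(X_test, y_test))   # accuracy on the held-out test set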

16 / 25
from sklearn.model_selection import GridSearchCV
param_grid = {'max_leaf_nodes': range(2, 20)}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid=param_grid, cv=10)
grid.fit(X_train, y_train)

17 / 25

Extrapolation

18 / 25

Extrapolation

19 / 25

Extrapolation
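
A sketch that makes the point with toy data (assumed, not from the slides): a tree regressor predicts a constant outside the range of its training data.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X.ravel())

reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
X_new = np.linspace(-2, 10, 50).reshape(-1, 1)
pred = reg.predict(X_new)
# predictions for X_new < 0 and X_new > 5 are flat: trees cannot extrapolate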

20 / 25

Instability

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# two different train/test splits of the same data can give very different trees
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=6)
tree.fit(X_train, y_train)

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=1)
tree = DecisionTreeClassifier(max_leaf_nodes=6)
tree.fit(X_train, y_train)

21 / 25

Random Forests

22 / 25
  • Smarter bagging for trees!

Randomize in two ways

  • For each tree:
    • Pick bootstrap sample of data
  • For each split:
    • Pick random sample of features
  • More trees are always better (see the sketch below)
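
A sketch of how those two sources of randomness appear in the scikit-learn API (parameter values are illustrative):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees; more is better, just slower
    bootstrap=True,        # each tree is trained on a bootstrap sample of the data
    max_features='sqrt',   # each split considers a random subset of the features
    random_state=0)
# rf.fit(X_train, y_train)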

23 / 25

Tuning Random Forests

  • Main parameter: max_features (see the sketch below)

    • around sqrt(n_features) for classification
    • around n_features for regression
  • n_estimators > 100

  • Pre-pruning might help, and definitely helps with model size!
  • max_depth, max_leaf_nodes, min_samples_split again
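
A sketch of tuning max_features directly (the grid values are arbitrary examples):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'max_features': [0.1, 0.25, 0.5, 'sqrt']}
grid = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                    param_grid=param_grid, cv=5)
# grid.fit(X_train, y_train); grid.best_params_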
24 / 25

Variable Importance

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=1)
rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
rf.feature_importances_
plt.barh(range(4), rf.feature_importances_)
plt.yticks(range(4), iris.feature_names);

array([ 0.126, 0.033, 0.445, 0.396])

25 / 25
