class: center, middle ![:scale 40%](images/sklearn_logo.png) ### Introduction to Machine learning with scikit-learn # Trees and Forests Andreas C. Müller Columbia University, scikit-learn .smaller[https://github.com/amueller/ml-training-intro] ??? FIXME bullet points! animation searching for split in tree? categorical data split? predition animation tree? min impurity decrease missing extra tree: only one thereshold per feature? --- class: spacious, middle # Why Trees? ??? - Very powerful modeling method – non-linear! - Doesn’t care about scaling of distribution of data! - “Interpretable” - Basis of very powerful models! - add stacking classifier from PR? --- class: centre,middle # Decision Trees for Classification ??? --- # Idea: series of binary questions .center[ ![:scale 70%](images/trees_img_1.png) ] ??? --- # Building Trees .left-column[ ![:scale 100%](images/trees_img_2.png)
Continuous features: - “questions” are thresholds on single features. - Minimize impurity ] .right-column[ ![:scale 100%](images/trees_img_3.png) ![:scale 100%](images/trees_img_4.png) ] ??? Does the layout look okay?? --- # Criteria (for classification) - Gini Index: `$$H_\text{gini}(X_m) = \sum_{k\in\mathcal{Y}} p_{mk} (1 - p_{mk})$$` - Cross-Entropy: `$$H_\text{CE}(X_m) = -\sum_{k\in\mathcal{Y}} p_{mk} \log(p_{mk})$$` ![:scale 5%](images/trees_img_7.png) observations in node m ![:scale 3%](images/trees_img_8.png) classes ![:scale 5%](images/trees_img_9.png) distribution over classes in node m ??? --- # Prediction .center[ ![:scale 80%](images/trees_img_10.png) ] ??? - Traverse tree based on feature tests - Predict most common class in leaf --- # Regression trees `$$\text{Prediction: } \bar{y}_m = \frac{1}{N_m} \sum_{i \in N_m} y_i $$` Mean Squared Error: `$$ H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} (y_i - \bar{y}_m)^2 $$` Mean Absolute Error: `$$ H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} |y_i - \bar{y}_m| $$` ??? - Without regularization / pruning: - Each leaf often contains a single point to be “pure” --- # Visualizing trees with sklearn .smaller[ ```python from sklearn.datasets import load_breast_cancer cancer = load_breast_cancer() X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target, stratify=cancer.target, random_state=0) from sklearn.tree import DecisionTreeClassifier, export_graphviz tree = DecisionTreeClassifier(max_depth=2) tree.fit(X_train, y_train) # tree visualization tree_dot = export_graphviz(tree, out_file=None, feature_names=cancer.feature_names) print(tree_dot)``` ] ??? --- # Easier visualization [PR #9251](https://github.com/scikit-learn/scikit-learn/pull/9251) (or [a gist](https://gist.github.com/amueller/1f8d3c03305642ab3bad677d9b443b80) ) ```python from tree_plotting import plot_tree tree_dot = plot_tree(tree, feature_names=cancer.feature_names) ``` .center[ ![:scale 80%](images/mpl_tree_plot.png) ] ??? --- class:spacious # Parameter Tuning - Pre-pruning and post-pruning (not in sklearn yet) - Limit tree size (pick one, maybe two): - max_depth - max_leaf_nodes - min_samples_split - min_impurity_decrease - ... ??? --- class:spacious # No pruning .center[ ![:scale 100%](images/trees_img_13.png) ] ??? --- class:spacious # max_depth = 4 .center[ ![:scale 100%](images/trees_img_14.png) ] ??? --- # max_leaf_nodes = 8 .center[ ![:scale 50%](images/trees_img_15.png) ] ??? --- class:spacious # min_samples_split = 50 .center[ ![:scale 70%](images/trees_img_16.png) ] ??? --- .smaller[ ```python from sklearn.model_selection import GridSearchCV param_grid = {'max_depth':range(1, 7)} grid = GridSearchCV(DecisionTreeClassifier(random_state=0),param_grid=param_grid, cv=10) grid.fit(X_train, y_train) ``` ] .center[ ![:scale 70%](images/trees_img_17.png) ] ??? --- .smaller[ ```python from sklearn.model_selection import GridSearchCV param_grid = {'max_leaf_nodes':range(2, 20)} grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid=param_grid, cv=10) grid.fit(X_train, y_train) ``` ] .center[ ![:scale 70%](images/trees_img_18.png) ] ??? --- #Extrapolation .center[ ![:scale 80%](images/ram_prices.png) ] ??? --- #Extrapolation .center[ ![:scale 80%](images/ram_prices_train.png) ] ??? --- #Extrapolation .center[ ![:scale 80%](images/ram_prices_test.png) ] ??? --- # Instability .left-column[ .tiny-code[ ```python X_train, X_test, y_train, y_test = train_test_split( iris.data, iris.target, stratify=iris.target, random_state=0) tree = DecisionTreeClassifier(max_leaf_nodes=6) tree.fit(X_train, y_train) ``` ] ![:scale 70%](images/trees_img_21.png) ] .right-column[ .tiny-code[```python X_train, X_test, y_train, y_test = train_test_split( iris.data, iris.target, stratify=iris.target, random_state=1) tree = DecisionTreeClassifier(max_leaf_nodes=6) tree.fit(X_train, y_train) ``` ] .center[![:scale 70%](images/trees_img_22.png) ]] ???. --- class: spacious # Random Forests .center[![:scale 90%](images/trees_img_28.png)] ??? - Smarter bagging for trees! --- # Randomize in two ways .left-column[ - For each tree: - Pick bootstrap sample of data - For each split: - Pick random sample of features - More trees are always better ] .right-column[ ![:scale 100%](images/trees_img_26.png) ![:scale 100%](images/trees_img_29.png) ] ??? --- class:some-space # Tuning Random Forests - Main parameter: max_features - around sqrt(n_features) for classification - Around n_features for regression - n_estimators > 100 - Prepruning might help, definitely helps with model size! - max_depth, max_leaf_nodes, min_samples_split again ??? --- # Variable Importance .smaller[ ```python X_train, X_test, y_train, y_test = train_test_split( iris.data, iris.target, stratify=iris.target, random_state=1) rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train) rf.feature_importances_ plt.barh(range(4), rf.feature_importances_) plt.yticks(range(4), iris.feature_names); ``` ```array([ 0.126, 0.033, 0.445, 0.396])``` ] .center[ ![:scale 40%](images/trees_img_32.png) ]