class: center, middle ### W4995 Applied Machine Learning # Trees, Forests & Ensembles 02/18/19 Andreas C. Müller ??? FIXME bullet points! animation searching for split in tree? categorical data split? predition animation tree? min impurity decrease missing extra tree: only one thereshold per feature? FIXME: saw 1d tree explanation somewhere that was really neat? FIXME: short? Not if I actually explain the bullet points. FIXME: bias variance plot replace by two kinds of error (approximation, generalization) FIXME: length actually good. so more of a note than fixme lol FIXME: test curve mislabeled warm start FIXME explain "values" in figure in slide notes FIXME plots sometimes say values and sometimes samples, make sure it's consistent! FIXME some plots say entropy but actually use gini?! wtf, redo plots FIXME add graphviz back in as it's a bit higher quality? FIXME entropy calculation for tree, maybe do it once explicitly? FIXME Go over finding a single split in some toy data! 1d regression? --- class: spacious, middle # Why Trees? ??? Trees are really very powerful models both for classification and regression, they're very commonly used. Trees are very popular in the industry. Only beaten by deep neural networks. Tree-based models are one of the main tools in the toolbox of most machine learning practitioners. Trees really don't care about the scale of the data so you don't really have to do a lot of preprocessing. Trees don't care about the distribution of your data. There are versions of trees that can deal with categorical variables and with missing values. By combining multiple trees you can build stronger models. - Very powerful modeling method – non-linear! - Doesn’t care about scaling of distribution of data! - “Interpretable” - Basis of very powerful models! - add stacking classifier from PR? --- class: centre,middle # Decision Trees for Classification ??? --- # Idea: series of binary questions .center[ ![:scale 70%](images/tree_illustation.png) ] ??? Binary classification trees are binary trees, where each node corresponds to a question asked about the data. This example is a fictional task where you want to distinguish hawks, penguins, dolphins, and bears. You ask a series of binary questions that will lead to which animal it might be. Depending on what the answer is for a particular question, you could ask another question. This way you can narrow it down for possible outcomes. Usually, when we are doing classification we have continuous features and so the questions are thresholds on single features. --- # Building Trees .left-column[ ![:scale 100%](images/tree_building_iteration_1.png)
Continuous features: - “questions” are thresholds on single features. - Minimize impurity ] .right-column[ ![:scale 100%](images/tree_building_iteration_2.png) ![:scale 100%](images/tree_building_iteration_9.png) ] ??? Here is an example of a tree with depth one, that's basically just thresholding a single feature. In this example, the question being asked is, is X1 less than or equal to 0.0596. The boundary between the 2 regions is the decision boundary. The decision for each of the region would be the majority class on it. This decision tree of depth one would classify everything that is below the horizontal line as class red since there are more red points (32>2) and everything above as blue since there are more blue points (48>18). The way this threshold was chosen, the way that tree of depth one was learned was to find the best feature and the best threshold to minimize impurity. Which means it tries to make the resulting subsets, the top half and the bottom half here, to be as pure as possible (to consist mostly of one class) Basically, we're searching over all possible features and all possible thresholds. Possible thresholds are all values of the feature that we're observing. So I would start iterating over the first feature and for the first feature, I would look at all possible thresholds and see if this is a good question to ask, i.e. is this a good way to split the data set into two classes. If you search over all possible features, then the best split is found. Once we find this split, we can then basically apply this algorithm recursively to the two areas that the split created. For top area here, we can ask what the best feature is and what the best threshold for this feature is to make the resulting regions as pure as possible. Here in the bottom, it asks is X0 less or equal than minus 0.4? The model will predict accordingly, either blue or red. If only one class remain after a split, the tree will not add any splits since the regions have become pure. Each node corresponds to a split and the leaves correspond to regions. All tree building algorithms start greedily building a tree, so they don't really go back and revisit because it would be too computationally expensive. Does the layout look okay?? --- # Criteria (for classification) - Gini Index: `$$H_\text{gini}(X_m) = \sum_{k\in\mathcal{Y}} p_{mk} (1 - p_{mk})$$` - Cross-Entropy: `$$H_\text{CE}(X_m) = -\sum_{k\in\mathcal{Y}} p_{mk} \log(p_{mk})$$` $X_m$ observations in node m $\mathcal{Y}$ classes $p_{m\cdot}$ distribution over classes in node m ??? For classification, there are two common criteria to preserve the purity of the leaf. One is called the Gini Index and the other one is Cross-Entropy. Gini Index works well in practices. Cross-Entropy is the entropy of the distribution over classes. You want to minimize the entropy over the classes, you want to have a very spike distribution in the classes, you want mostly to be one class and less of the others so that the trees are very certain about what the class is for each leaf. --- # Prediction .center[ ![:scale 80%](images/tree_prediction.png) ] ??? To make a prediction, I traverse the tree I ask all the questions in the node that I encounter and then I can predict the majority class in the leaf. One thing that's nice about this is this prediction is really, really fast since this tree is not very deep. - Traverse tree based on feature tests - Predict most common class in leaf --- # Regression trees `$$\text{Prediction: } \bar{y}_m = \frac{1}{N_m} \sum_{i \in N_m} y_i $$` Mean Squared Error: `$$ H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} (y_i - \bar{y}_m)^2 $$` Mean Absolute Error: `$$ H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} |y_i - \bar{y}_m| $$` ??? You can do the same thing also for regression. In regression, the prediction is usually the mean over the target targets in a leaf. The impurity criteria that I minimized is usually the mean squared error. Basically, you look at how good of a prediction is the mean if I split like this. For each possible split, you can compute the mean for the two resulting nodes and you compute how well the mean in this leaf predicts. You can use the mean absolute error, if you want it to be more robust outliers then you don’t penalize with the square norm, you only penalize with the L1 norm. One thing about the regression trees is that they have more tendency to get very deep if you don't restrict them. Because usually in regression, all targets are distinct, they’re distinct float numbers. So if you want all leaves to be pure, then all of these will have only one node in them, unless you have a special case where you have equal flow numbers as a target, which is not common. Another thing that's really nice about trees is that you can visualize them and you can look at them, and you can possibly even explain to your boss what they mean. - Without regularization / pruning: - Each leaf often contains a single point to be “pure” --- # Visualizing trees with sklearn .smaller[ ```python from sklearn.datasets import load_breast_cancer cancer = load_breast_cancer() X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target, stratify=cancer.target, random_state=0) from sklearn.tree import DecisionTreeClassifier, export_graphviz tree = DecisionTreeClassifier(max_depth=2) tree.fit(X_train, y_train) ``` ] ??? The first way to visualize tree in scikit-learn is a little bit tricky. The second one is a one I hacked. I'm going to use the breast cancer data center, which does the classification of breast tissues. So trees can do much more complicated things than multiclass. For the ease of visualization, I'm going to do binary. I can build the decision tree classifier, and there's a method in scikit-learn called export _graphviz, which is basically a file format and the layout engine for showing graphs. So you can take the tree, the estimator that’s the tree. You can also give it the feature names if you want. You can either write this to a file, or you can get back to string in this dot language. And that's what it looks like. This is a tree of depth 2 and I gave it the feature names. --- # Visualizing trees with sklearn ```python from sklearn.tree import plot_tree tree_dot = plot_tree(tree, feature_names=cancer.feature_names) ``` .center[ ![:scale 80%](images/mpl_tree_plot.png) ] ??? The other thing that I like to use which is easier is this pull request or there’s a gist that you download, it’s a single file called the tree_plotting. And that uses math plot to plot the trees. Here, at the top you can see the test, then you can see the criteria, then the number of samples in this node and then which class the samples belong to. In the original breast cancer dataset, the top node corresponds to the whole dataset. The tree gets purer and purer as you move down the tree. --- class:spacious # Parameter Tuning - Pre-pruning and post-pruning (not in sklearn yet) - Limit tree size (pick one, maybe two): - max_depth - max_leaf_nodes - min_samples_split - min_impurity_decrease - ... ??? I want to talk a little bit about the different parameters that you can choose. Most of the tree growing algorithms are greedy. There are two ways to restrict the growth of the tree after you train them. There are pre-pruning and post-pruning. In pre-pruning, you restrict while you're growing. In post-pruning, you build the whole tree (greedily) and then you possibly merge notes. So you possibly undo sort of bad decisions that you made, but you don't restructure the tree. In scikit-learn right now, there's only pre-pruning. The most commonly used criteria are the maximum depth of the tree, the maximum number of leaf nodes, the minimum samples in a node split and minimum decrease impurity. --- class:spacious # No pruning .center[ ![:scale 100%](images/no_pruning.png) ] ??? Here’s the full tree for the same dataset. --- class:spacious # max_depth = 4 .center[ ![:scale 100%](images/max_depth_4.png) ] ??? Maximum depth of the tree restricted to 4. This is like a very simple way. If you are actually building just a single decision tree, this might not be the best way because for example, here you split away the single point which might not be great and other leaves are very mixed. So you can't make a good prediction here. Since it puts the same depth limit on everywhere in the tree, this is not very fine grained. What it does is it means that your predictions will be very fast because you never have to go very deep in the tree. It also makes the tree size very small. If you built a lot of trees, you might get issues with the RAM. If you have a lot of very deep trees, they will at some point be bigger than your RAM, and you'll run into troubles, or if they're very deep, it might be very slow to predict at some point. So this is a way to restrict how much memory and how much prediction time you're willing to give. --- # max_leaf_nodes = 8 .center[ ![:scale 50%](images/max_leaf_nodes_8.png) ] ??? If you use max_leaf_nodes, it will always put the one that has the greatest impurity decrease first. You can see that in different parts of the trees, there's different depth. It will prioritize the ones that decrease the impurity the most. --- class:spacious # min_samples_split = 50 .center[ ![:scale 70%](images/min_samples_split_50.png) ] ??? Min_samples_split is quite simple, it just says only split nodes that have this many samples on them. What I don't like about this criterion set here, for example, it splits off a single point, which I don't think is really great, because it's not going to improve the prediction very much. So both the max leaf nodes and the minimum impurity decrease will allow you to basically get rid of these very small leaves that don't really add anything to the prediction. Impurity decrease will only make splits that decrease the impurity by X amount. If you want to tune this, you pick one of these and then you just tune this one parameter. --- .smaller[ ```python from sklearn.model_selection import GridSearchCV param_grid = {'max_depth':range(1, 7)} grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid=param_grid, cv=10) grid.fit(X_train, y_train) ``` ] .center[ ![:scale 70%](images/grid_max_depth.png) ] ??? So for example, if you want to tune max depth and restricted the depth of the tree. So run a grid search over max depth and you can do the parameter curves as we did before and you can see that maybe depth of five is good on this dataset. Again, you can see the typical curve with the more complexity you allow in your classifier, the better you get the training set, but possibly you overfit on a test set. So, here, you could say maybe it's overfitting a little bit, but it's not very visible. --- .smaller[ ```python from sklearn.model_selection import GridSearchCV param_grid = {'max_leaf_nodes':range(2, 20)} grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid=param_grid, cv=10) grid.fit(X_train, y_train) ``` ] .center[ ![:scale 70%](images/grid_max_leaf_nodes.png) ] ??? Here is if you tune a maximum number of leaf nodes, you can see it’s slightly more overfitting happening but generally, there's not a big difference on this toy dataset between building the full tree or having a smaller tree. Obviously, if you have a small tree, predictions are going to be faster and need less memory to store and you can explain it better. So the thing with being able to explain is, if you have a tree up depth 14, then it's not really explainable anymore, no human can look at all of this and say “Oh, I understand exactly what's going on”. If you have 8 leaves, it's a very good chance you can draw this, and you can think about it, and you can understand that. I wouldn't really say trees are interpretable but small trees are interpretable. So that's also the reason why you might want to restrict the rows of the tree. So you can easily communicate and easily explained. --- class:spacious # Relation to Nearest Neighbors - Predict average of neighbors – either by k, by epsilon ball or by leaf. - Trees are much faster to predict. - Both can’t extrapolate ??? FIXME plots --- #Extrapolation .center[ ![:scale 80%](images/ram_prices.png) ] ??? Trees are actually quite similar to nearest neighbors in some sense because both nearest neighbors and trees pick the average of their neighbors or delete the neighbors’ votes (depending on classification or regression) In KNearest Neighbors or epsilon nearest neighbors, where you have a radius around the data point in which you look for neighbors. In trees, you look at the same leaf, and you basically let the other point in the leaf vote what the output should be. Trees are consistent if you give them enough data and they will eventually learn anything because eventually, they're just like nearest neighbors. Nearest neighbors are very slow to predict because you need to compute the distances to all training points while trees very fast to predict because you just need to traverse the binary tree. One thing that you should keep in mind is that both can't extrapolate. --- #Extrapolation .center[ ![:scale 80%](images/ram_prices_train.png) ] ??? --- #Extrapolation .center[ ![:scale 80%](images/ram_prices_test.png) ] ??? This is the RAM prices in dollar per megabyte historically. This is an instance of Moore's law. On log scale, it’s pretty linear because of the ram prices half every 18 months or something. Let's say I want to build the prediction model for this. And I want to build it up to 2000, and I want to make predictions about the future. Now I'm going to compare a linear model in a tree-based model. The linear model, I kind of cheated a little bit because I did the log transform so that linear model can actually model it. I didn't restrict the tree model so it'll basically learn the training data set perfectly. Now the question is, if I want to do prediction, what are we going to predict? You probably are not going to be surprised by the linear prediction but you might be by the tree prediction. The tree predicts just a constant and the constant is basically the last point observed. This is the same thing that the nearest neighbor base model would do. So if you want to predict here, basically, you traverse the tree and you find basically the closest data point to it, the one or the other data point that is in the same leaf, and the ones that will share the leaf with are the ones on the very right. And then it will computer mean of these data points. And the mean is going to be just the value if the leaf is pure. If I did like coarser split level, it will tune the mean of all of these, but it’ll always compute the mean of some data points on the training set. So it'll never be able to predict these values that are not seen in a training set. In practice, this is usually not really a big issue. For example, if you want to do like extrapolation like this, you could instead of trying to predict the value, you could use the tree to predict the difference to the last day or to the last year or something like this, and then it would work. But it's definitely something you should keep in mind. Basically, outside of where you’ve observed training data, it'll just be a constant prediction. Basically falling back to what it was last seen. That might be a little bit surprising sometimes. Generally, extrapolation in machine learning is hard. And if you need to extrapolate, there's a chance you're not going to be able to do with any model. But in particular, for tree-based and nearest neighbor based models, you should keep in mind that they can't do that. --- # Instability .left-column[ .tiny-code[ ```python X_train, X_test, y_train, y_test = train_test_split( iris.data, iris.target, stratify=iris.target, random_state=0) tree = DecisionTreeClassifier(max_leaf_nodes=6) tree.fit(X_train, y_train) ``` ] ![:scale 70%](images/instability_1.png) ] .right-column[ .tiny-code[```python X_train, X_test, y_train, y_test = train_test_split( iris.data, iris.target, stratify=iris.target, random_state=1) tree = DecisionTreeClassifier(max_leaf_nodes=6) tree.fit(X_train, y_train) ``` ] .center[![:scale 70%](images/instability_2.png) ]] ??? Another weakness of trees that I want to mention is that they are very unstable. Using the iris dataset, let's say I want a small tree. So I say I want a maximum of six leaves nodes and I split my Iris dataset, and I build the tree, and then I split the same data set again, and take a different random state. Hopefully, I would assume that if it's the same dataset, and I split randomly, then the models should be similar. However, if you look at these models, even in the root note, the very first decision it made is different. One compares petal length to 2.45 while the other compares pedal width to 0.8. And the number of points that are split off are different too. So there's not really an easy correspondence between these two trees. So this is sort of a caveat about interoperability. You can interpret the tree but maybe if the dataset was only slightly different, the tree might look completely different. So this tree structure is really something that's very unstable. --- # Feature importance .left-column[ .tiny-code[```python X_train, X_test, y_train, y_test = train_test_split( iris.data, iris.target, stratify=iris.target, random_state=0) tree = DecisionTreeClassifier(max_leaf_nodes=6) tree.fit(X_train,y_train) ```] .center[![:scale 60%](images/instability_1.png) ] ] .right-column[ .tiny-code[ ```python tree.feature_importances_ array([0.0, 0.0, 0.414, 0.586]) ``` ] ![:scale 80%](images/tree_importances.png) ] ??? Another interesting feature of trees that I want to mention is feature importance. While looking at the trees, a nice way to inspect the tree if it's small, if your trees very big, it's very hard to understand what's happening in the tree. So what you can get out is how important each feature is for the tree. This is similar to the coefficients in the linear model, where you can see that “this feature influences the model in this way”. And the way this is done is basically each time a particular feature was chosen in a tree, you accumulate how much it decreased the impurity. It'll give more importance to features that were used often and where to use of it, than decreased impurity. Problem with the importance is you have the same issues with instability. Because the tree structures unstable, the features picked is unstable and so the importance will be unstable. Similar to L1 norm, if you have too many correlated features, it might pick one or the other. If you have very correlated features, basically, the feature importance could be anything. Any linear combination of the two is completely equivalent. So keep that in mind when looking at feature importance. But they’re a very nice way to summarize what the tree does if you're in reasonably low dimensions and your tree is big. Obviously, if you're in super high dimensions, you might not be able to look at them, you might be able to look at the most important features. The question was if use a feature multiple times, do I add up the degrees in purities and the answer is yes. Question is can I explain the difference between extrapolating and generalizing. In generalizing, usually you make this IID assumption that you draw data from the same distribution, and you have some samples from the distribution from which you learn, and other samples from a distribution which you want to predict. And if you're able to predict well, then it is generalizing. Here in the extrapolation example, the distribution that I want to try to predict on was actually different from the distribution I learned on because they’re completely disjoint. The years I trained on and the years that I'm testing on are completely disjoint and the target range is also completely disjoint. So this is not an IID task. In particular, even if it was IID, extrapolation sort of means looking at things outside the training set. One more comment about feature importance is that they give you the importance of a feature, but not the direction. If you look at the linear model and you say that a coefficient is large, or a coefficient is very small, this means for regression, if it's larger than it has a positive influence on target, or if it's classification, this feature has a positive influence to being a member of this class. For trees, you don't have the directionality, here you only get this is important only to make a decision. And the relationship with the class might not be monotonous. So you only have a magnitude you don't have a direction in the importance. - Unstable Tree $\rightarrow$ Unstable feature importances. - Might take one or multiple from a group of correlated features. --- # Categorical Data - Can split on categorical data directly - Intuitive way to split: split in two subsets - 2 ^ n_values many possibilities - Possible to do in linear time exactly for gini index and binary classification. - Heuristics done in practice for multi-class. - Not in sklearn :( ??? You can use categorical data in trees, and you can do this in R up to a couple of hundred different values. In scikit-learns, unfortunately, it’s not there yet. The cool thing about it working with categorical data is that it allows you to split into any subsets of the categories. So if you have, let's say have one categorical variable with five levels, or five different values, you could split it in a tree in all possible ways. So in three, two, four, one and with all of the possible combinations. Usually, if you would want to do that you will need to try out two to the power number of values many possibilities for the splits, two to the power five in this example. That might not be feasible. But actually, for some criteria, for example, for Gini index you can do this efficiently in linear time. So in scikit-learn right now, you have to use one hot encoding, which is not powerful because one hot encoding can only split one feature at a time. So you would need to grow a much deeper tree to split categorical variable in all possible ways. There's also I think in R there's a bunch of tree implementations that work with missing values and there are several different strategies to deal with missing values. But it's also not entirely clear which strategy to deal with a missing value in trees is the best. So one strategy is, if you have a missing value, you can basically go down both sides at once, if the question is about the value that's missing, just go down both things. But that's kind of slow. The other thing that it's actually just another value you can split on and you can say, “if I split on a missing value, I have a separate branch just for missing values.” But you can easily emulate this by replacing it with a binary feature which says missing or not missing. --- class:spacious # Predicting probabilities - Fraction of class in leaf. - Without pruning: Always 100% certain! - Even with pruning might be too certain. ??? You can also predict probabilities with trees. Although, I would recommend against it. The issue with trees is that they are very good at overfitting, they can fit anything if I grow them deep enough. If I grow them to the full depth, they will always be 100% certain. Because the probability is the fraction of the number of points in the leaves that belong to this class. So if I have a leaf with like 10 data points, and 7 of them belong to class A, then it will say, with 70% probability a point here is class A. But if you split until everything's pure, it will always be 100% certain, which is pretty meaningless. So you can do some pruning, and you can restrict the rows of the tree, but it still might be certain, and so the probability estimates that come out of a tree might not be very good. FIXME PLOT --- class:spacious # Conditional Inference Trees - Select “best” split with correcting for multiple-hypothesis testing. - More “fair” to categorical variables. - Only in R so far (party) ??? --- # Different splitting methods .center[![:scale 80%](images/splits_kinect.png)] .smaller[(taken from Shotton et. al. Real-Time Human Pose Recognition ..)] ??? Another reason why trees are very popular is since they're very flexible. This is actually from a paper about the Kinect version one. Which allows you to use your body as a game controller and builds like a 3d image of you and follows your hand. They used decision trees in a quite clever way to find where different parts of the body are. Here you have a depth image of a person. The questions that are asked are not looking at single pixels but if you want to decide for a particular pixel, you can ask questions about pixels that are somewhere else. So you can ask, how deep is the pixel five above me? How deep is the pixel five below me? And so on. And it can also do comparisons in saying, what's the difference between the value here and there? And so basically, you can come up with more complex ways to ask questions. And this leads to more powerful tree models. That's quite common in computer vision to ask either about comparing pixels or comparing regions of pixels or something like that. Usually, by default, you just threshold each feature but you can come up with arbitrary things. I also know models where each node is like….linear model, for example, and the test is what the response of the linear model is? Shortcomings of the trees are they very unstable, they can’t extrapolate and they tend to overfit a lot. - Could use anything as split candidate! - Linear models used if extrapolation is needed. - Computer vision: pixel comparisons - Kinect (first generation): depth comparison --- class: centre,middle # Ensemble Models ??? One way to get rid of these problems is to actually use multiple trees and Ensemble. Ensemble is a general term that captures any way you combine multiple models into one. There's like a lot of different kinds of ensemble models that work in many different ways. We're going to talk particularly about random forest. But that's only one association of this more general framework. --- # Poor man’s ensembles - Build different models - Average the result - Owen Zhang (long time kaggle 1st): build XGBoosting models with different random seeds. - More models are better – if they are not correlated. - Also works with neural networks - You can average any models as long as they provide calibrated (“good”) probabilities. - Scikit-learn: VotingClassifier hard and soft voting ??? If you do computation, we'll see poor man’s ensembles which is just build the same model multiple times with different random seeds and average them. Owen Zhang who used to be in the first place in kaggle for five years, he just builds XGBoost models, which are off the tree based models with different random seeds, and then he just averaged all of them. And if you averaged multiple good models, it will get better. This is ideal for competitions. These are not usually done in practice because it makes the model hard to understand, it takes more RAM, it takes more time to predict because you have to do it like number of models many times. In scikit-learn, there’s this voting classifier which just takes the list of models of different types or be the same model type with different visualizations. And it will train all of them and average the results. You can either do soft voting or hard voting. Soft voting means averaging the probabilities and take the arg max. Hard voting means everything makes a prediction and then you take the thing that was most commonly predicted. --- # VotingClassifier .smaller[ ```python voting = VotingClassifier( [('logreg', LogisticRegression(C=100)), ('tree', DecisionTreeClassifier(max_depth=3, random_state=0))], voting='soft') voting.fit(X_train, y_train) lr, tree = voting.estimators_ voting.score(X_test, y_test), lr.score(X_test, y_test), tree.score(X_test, y_test) ``` ``` 0.88 0.84 0.80 ``` ] .center[![:scale 80%](images/voting_classifier.png)] ??? It looks like this. Here is the logistic regression model on this 2D data set. This is the decision tree based model and the voting classifier weights - probability estimates by most of them. And if you want, you can do that. Again, mostly done in competitions. The easiest way to do ensembles is to just average models. Problem is, if they overfit in the same way, you're not going to get rid of the overfitting. So what you really want to do is you want to make sure you have models that are quite different from each other and you averaged the models that are different. --- class: spacious # Bagging (Bootstrap AGGregation) .left-column[ - Generic way to build “slightly different” models - BaggingClassifier, BaggingRegressor ] .right-column[ .center[![:scale 100%](images/bootstrap_sample.png)] ] ??? One way to do this is called Bagging or Bootstrap Aggregation. And the idea in bagging is, imagine that these are your data points, you have five data points and so you want to make sure that you built similar but slightly different models. And you want to do more than just change the random seed. One way is to take bootstrap samples of your dataset. Bootstrap samples mean sampling with replacement, number of samples many points. So here in this example, from these five data points, I would sample a new dataset that again has five data points (sample with replacement). On average this will mean that each dataset will have about 66% of the original data points, and the other data points will be just duplicates. So you leave out some data, duplicate some other data, and you end up with a slightly different dataset. You can do this obviously, as often as you want. If you average these models, if the model was prone to overfitting, then the average might not overfit as much. - Draw bootstrap samples from dataset - (as many as there are in the dataset, with repetition) --- class: spacious # Bias and Variance .center[![:scale 40%](images/bias_vs_variance.png)] .left[ .smaller[http://scott.fortmann-roe.com/docs/BiasVariance.html] ] ??? What you often hear in this context is talking about bias and variance, which is not entirely the same thing as underfitting and overfitting but quite related to it. The idea is that the errors you make on the test set are basically combinations of the error you make in modeling plus the uncertainty in the modeling. The idea is that you can have models that are very consistent, but off. So that would be a high bias low variance model. So this would be, for example, a low dimensional linear model that is mixed inconsistent predictions, but it might not be able to capture the target very well since it's not powerful enough. You want ideally a model that has low bias and low variance. So that's sort of captures the data concept very well and also has very little variance in the predictions it makes. A high variance model is one that on average captures the predictions but has very high variance in the prediction it makes. The idea of bagging is that you can take multiple high variance models and if you average them, then it will reduce the variance of the average. If the predictions are not very correlated, then there's like a very nice, easy proof to say that if I average multiple high variance estimators, I get a low variance estimator. They need to be sort of good enough, on average, if they're all really, really bad, and if you average then it will still be pretty bad. --- class: spacious # Bias and Variance in Ensembles - Breiman showed that generalization depends on strength of the individual classifiers and (inversely) on their correlation - Uncorrelating them might help, even at the expense of strength ??? There’s this proof that shows that if you have decorrelated errors, then averaging will help. Basically, the correlation between the errors that you make shows up there. And so the idea is to make the errors as different as possible. Maybe even go further than the bagging in trying to make sure the different models that you build are as different from each other as possible, while still being reasonably good models. This leads to the idea of random forests, which basically tries to randomize trees, even more than you would do with bagging, and then average results of these trees. FIXME worst slide! --- class: spacious # Random Forests .center[![:scale 90%](images/random_forest.png)] ??? Here example of a random forest on a 2-dimensional dataset. - Smarter bagging for trees! --- # Randomize in two ways .left-column[ - For each tree: - Pick bootstrap sample of data - For each split: - Pick random sample of features - More trees are always better ] .right-column[ ![:scale 100%](images/bootstrap_sample.png) ![:scale 100%](images/feature_sample.png) ] ??? The way that random forest work is they randomize tree building in two ways. As with bagging if you do a bootstrap sample of the dataset, so each tree you take a bootstrap sample of the data set and then for each split in the tree, you take a sampling without replacement of the features. So let's say you have this representation of the dataset again. And for each node where you want to make a splitting decision before you want to scan all over the features and all the thresholds, you select just the subset of the number of features and just look for splits there. And this is done for each node independently. In the end, the tree will use all the features probably, if it's deep enough, eventually it might use all the features but you randomize the tree building process in a way that hopefully de-correlated the error of the different trees. And so this adds another hyperparameter in the tree building. --- class:some-space # Tuning Random Forests - Main parameter: max_features - around sqrt(n_features) for classification - Around n_features for regression - n_estimators > 100 - Prepruning might help, definitely helps with model size! - max_depth, max_leaf_nodes, min_samples_split again ??? Which are max features, which is the number of features that you want to look at each split. So the heuristic is usually for classification you want something like the square root of the number of features, whereas, for regression, you usually want something that is around the same size as the number of features. This basically, controls the variance of the model because if you set this to number of features, it will be just sort of the old decision tree. If you set this to one, it will pick a feature at random, and then it needs to split on that feature, basically. So this will be like a very, very random tree that probably would grow very deep because it can really make good decisions. This is the main parameter that you need to tune, although random forest is actually very robust to these parameter settings. Usually, if you leave this to the default, it will be reasonably well. By default, in scikit-learn, the number of trees in the forest is way too low. Usually, you want something like 100, or 500 but scikit-learn only gives you like 10. Sometimes people find that pre-pruning techniques like maximum depth or maximum leaf nodes help. Generally, the idea doesn't really matter that much, how good the vigil trees are as long as you have enough of them. But limiting the size of the tree will definitely help with model size. That said, if you randomize a lot, these trees will grow very deep. Because if you're unlucky, you'll always just split on bad feature so you will always not really get much further in getting pure leaves and so you need to do a lot of splits. So setting something like max depth will allow you to decrease RAM size and decrease prediction time. --- class:spacious # Extremely Randomized Trees - More randomness! - Randomly draw threshold for each feature! - Doesn’t use bootstrap - Faster because no sorting / searching - Can have smoother boundaries ??? Alright, so there's a slightly different variant of random forest that's also kind of interesting, which is Extremely Randomized Trees. They work very similarly except for the fact that you draw randomly a threshold for each feature. So here, you don't search over the thresholds, you just pick one at random. And usually, you don't use bootstrap. Since you don’t need to do sorting these are faster to build. You probably get smoother decision boundaries since the thresholds are drawn at random. Sometimes they aren’t as bad as random forest but they're definitely not commonly used as the default random forest. One of the nice things about extremely randomized trees and the random forests is they work without any preprocessing. If you set number of estimators large enough, they will just work. They will not be the best model but they will always work. --- # Warm-Starts .smaller[ ```python train_scores = [] test_scores = [] rf = RandomForestClassifier(warm_start=True) estimator_range = range(1, 100, 5) for n_estimators in estimator_range: rf.n_estimators = n_estimators rf.fit(X_train, y_train) train_scores.append(rf.score(X_train, y_train)) test_scores.append(rf.score(X_test, y_test)) ``` ] .center[ ![:scale 39%](images/warm_start_forest.png) ] ??? The more trees you use, the better it is since you decrease the variance more. Don't ever grid search the number of trees, because the higher will be better and you're just wasting your time. But there are diminishing returns. So maybe 100 trees are good enough and if you built 500 trees, again, you're wasting your time and you might not want to do that. In scikit-learn, when you use Warm-Starts, you basically iteratively add more and more trees to the forest and you can stop whenever you think it's good enough. Basically, the fit method will keep all the previous trees and add on more trees. I go from 1 feature to 100 features in steps of five. I set the number estimator. So I started with one tree and then increased by five every time. I fit the model. And then I look at training and test score. This is much better than grid searching it because in grid search I would do the same work over and over again, while here, I'm adding more and more trees and find out what is it good enough. Usually, I would double the search like 50, 100, and 200 and so on and find where it stabilizes, and then only I will stop adding more trees. If you don't want to do that, just picking large enough number and stick to that number. Then you don't need to fiddle around as much but you might be wasting computational resources that could be used otherwise for like parameter tuning, for example. --- # Out-of-bag estimates - Each tree only uses ~66% of data - Can evaluate it on the rest! - Make predictions for out-of-bag, average, score. - Each prediction is an average over different subset of trees .smaller[ ```python train_scores = [] test_scores = [] oob_scores = [] feature_range = range(1, 64, 5) for max_features in feature_range: rf = RandomForestClassifier(max_features=max_features, oob_score=True, n_estimators=200, random_state=0) rf.fit(X_train, y_train) train_scores.append(rf.score(X_train, y_train)) test_scores.append(rf.score(X_test, y_test)) oob_scores.append(rf.oob_score_) ``` ] ??? Another thing is looking at out of back estimates. This is something that's specific to random forest. That said, if you do bootstrap sampling, usually you get like 66% of your data for each estimator. That means there's about 33% of your data that you haven't used for each individual tree. So you can use that data as a test set for that particular tree. The out of back estimate in a random forest is for each tree, I individually predict on the 33% on which I haven't trained. And then for each data point, I look at all the trees that held out this particular data point and let them vote. So then for each data point, I only have a subset of the trees that makes a prediction but if I have enough trees overall, it'll still be fine. Let's say, I used 300 estimators, then basically, on average, for each data point, I will have 100 estimators that didn't use those data points. And so it can use this data point as a test set for this 100 estimators and can let them vote on what the prediction should be. And this way, basically, I get a test set score for free. In fact, it's like a bootstrap estimate of a test set score. And so basically, I don't need to use a test set, I can just use the out of bag estimate. So here, for example, using this to find the best number of features. I look for max number of features in a particular range in a random forest classifier. In this case, for illustration purposes, I use both the out of bag estimate and the test set scores to see how they compare. --- .center[ ![:scale 90%](images/oob_estimates.png) ] ??? You can see that since I didn’t prune the trees in any way, the training scores is always one because it's always perfect and you perfectly overfit everything. The test sets scores and the out of bag scores are at least somewhat similar in trend and also in magnitude because the out of bag estimate uses only one-third of the trees, it might be slightly worse. But basically, I got this for free without spending any additional data or any additional computation. And so both of these probably have a habit of uncertainty. I have no uncertainty estimate but theoretically, this is sort of an unbiased estimate of the test set accuracy. --- # Variable Importance .smaller[ ```python X_train, X_test, y_train, y_test = train_test_split( iris.data, iris.target, stratify=iris.target, random_state=1) rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train) rf.feature_importances_ plt.barh(range(4), rf.feature_importances_) plt.yticks(range(4), iris.feature_names); ``` ```array([ 0.126, 0.033, 0.445, 0.396])``` ] .center[ ![:scale 40%](images/forest_importances.png) ] ??? As for the trees, for the random forest, I can get variable importance. And these are more useful now because they're more robust. Again, they don't have directionality. But now, if you have two correlated features, probably if you build enough trees, and because of the way they’re randomized, both of these features will be picked some of the time. And so if you average over all of the trees, then both will have the same amount of importance. So this will basically just give you like a smoother estimate of the feature importance that will be more robust and will not fail on correlated features. --- class: center, middle # Questions ? ??? Question: Is there a statistical test where the variable is important? It's probably the wrong question to ask. Because these are like greedily trained and so trying to extract any statistical meaning out of that is probably not the best idea. So if you're trying to understand your dataset or your data generating process, I probably wouldn't use tree-based models. These are really good for predicting no matter what your input is. Question: How is the feature importance calculated? Whenever a particular feature was used in the tree, you look at the decrease in impurity, and you aggregate these and at the end, you normalize it sum to one. ---