class: center, middle ### W4995 Applied Machine Learning # Model evaluation 02/24/20 Andreas C. Müller ??? FIXME boston FIXME explain scorer interface vs metrics interface, plotting has scorer interface FIXME ROC curve slide is bad. Illustration of why random is .5 FIXME format string FIXME demonstrate that AUC / average precision are rancing metrics FIXME add calibration FIXME remove regression? FIXME top right of pr curve actually maximizes f1 FIXME no interpolation on pr curve? FIXME add plotting FIXME macro vs weighted average example FIXME roc auc not good for imbalanced!? show example!! FIXME remove regression, go into more depth on cost-sensitive? FIXME relate to calibration! FIXME How to pick thresholds in multi-class? FIXME Add log-loss to metrics, and show why I don’t like it (.4, .6, 0) is better than (.36, .34, .3) FIXME explain why ROC AUC doesn't depend on class imbalance by calculating example FIXME write down example on slide about recall being zero and AUC being 1? FIXME show example of number nonzero for scorer API --- class: centre,middle # Metrics for Binary Classification ??? --- # Review : confusion matrix .center[ ![:scale 55%](images/confusion_matrix.png) ] .larger[ $$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$ ] ??? Diagonal divided by everything. --- class: center, middle # Positive & Negative are arbitrary ## Though often the minority class is considered positive --- # .smaller[ ```python from sklearn.metrics import confusion_matrix, plot_confusion_matrix data = load_breast_cancer() X_train, X_test, y_train, y_test = train_test_split( data.data, data.target, stratify=data.target, random_state=0) lr = LogisticRegression().fit(X_train, y_train) y_pred = lr.predict(X_test) print(confusion_matrix(y_test, y_pred)) print(lr.score(X_test, y_test)) plot_confusion_matrix(lr, X_test, y_test, cmap='gray_r') ``` ``` [[48 5] [ 4 86]] 0.94 ``` ] ![:scale 20%](images/plot_confusion_matrix.png) ??? confusion_matrix and accuracy_score take y_true, y_pred Note that plot_confusion_matrix takes the estimator. We might change that. We'll talk a bit more about that interface later. --- # Problems with Accuracy Data with 90% negatives: .smaller[ ```python from sklearn.metrics import accuracy_score for y_pred in [y_pred_1, y_pred_2, y_pred_3]: print(accuracy_score(y_true, y_pred)) ``` ``` 0.9 0.9 0.9 ``` ] -- .center[ ![:scale 70%](images/problems_with_accuracy.png) ] ??? - Imbalanced classes lead to hard-to-interpret accuracy. --- class:split-40 # Precision, Recall, f-score .left-column[ `$$ \large\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}$$`
`$$\large\text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}$$`
`$$\large\text{F} = 2 \frac{\text{precision} \cdot\text{recall}}{\text{precision}+\text{recall}}$$`] .right-column[
Positive Predicted Value (PPV)
Sensitivity, coverage, true positive rate.
Harmonic mean of precision and recall ] ??? All depend on definition of positive and negative. --- class: center, spacious # The Zoo ![:scale 100%](images/zoo.png) https://en.wikipedia.org/wiki/Precision_and_recall ??? --- class: compact # Normalizing the confusion matrix .smallest[ ```python confusion_matrix(y_true, y_pred) ``` ![:scale 30%](images/confusion_matrix.png) ] .left-column[ .smallest[ ```python confusion_matrix(y_true, y_pred, normalize='true') ``` ![:scale 80%](images/confusion_matrix_norm_true.png) ] ] .right-column[ .smallest[ ```python confusion_matrix(y_true, y_pred, normalize='pred') ``` ![:scale 80%](images/confusion_matrix_norm_pred.png) ] ] --- class:compact .smallest[ ```python classification_report(y_true, y_pred) ``` ] .left-column[ ![:scale 38%](images/confusion_matrix_col.png)] .smallest[ .right-column[ ``` precision recall f1-score support 0 0.90 1.00 0.95 90 1 0.00 0.00 0.00 10 accuracy 0.90 100 macro avg 0.45 0.50 0.47 100 weighted avg 0.81 0.90 0.85 100 precision recall f1-score support 0 1.00 0.89 0.94 90 1 0.50 1.00 0.67 10 accuracy 0.90 100 macro avg 0.75 0.94 0.80 100 weighted avg 0.95 0.90 0.91 100 precision recall f1-score support 0 0.94 0.94 0.94 90 1 0.50 0.50 0.50 10 accuracy 0.90 100 macro avg 0.72 0.72 0.72 100 weighted avg 0.90 0.90 0.90 100 ``` ]] ??? --- # Averaging strategies `$$\text{macro }\frac{1}{\left|L\right|} \sum_{l \in L} R(y_l, \hat{y}_l)$$` `$$\text{weighted } \frac{1}{n} \sum_{l \in L} n_l R(y_l, \hat{y}_l)$$` .smaller[ ```python print("Weighted average: ", recall_score(y_test, y_pred_1, average="weighted")) print("Macro average: ", recall_score(y_test, y_pred_1, average="macro")) ``` ``` Weighted average: 0.90 Macro average: 0.50 ``` ]
???
A common way to approach multiclass evaluation is to average binary metrics over the classes. You can do this for recall, precision and the f1-score (and in principle for other metrics too), and scikit-learn offers several averaging strategies: macro, weighted, micro and samples. Micro and samples only matter for multi-label prediction, where each sample can have several labels; we haven't covered multi-label prediction and won't, so the two interesting options are macro and weighted.

For macro averaging, you treat each label l in turn as the positive class (y_l and ŷ_l are the true and predicted indicators for label l, and every other label counts as negative), compute the binary metric, say recall, for that one-vs-rest problem, and then take the plain average over the labels. For weighted averaging you do the same, but weight each class by its number of samples n_l and divide by the total number of samples; that is what classification_report shows in its last row. A nice side effect of averaging is that precision or recall on its own becomes informative again: macro-average recall is also known as balanced accuracy, and macro-average precision is a useful summary by itself, because every class contributes to it.

The choice between macro and weighted comes down to whether each class should count the same or each sample should count the same. Say recall is 1 for the majority class and 0 for the minority class. Macro-average recall is then 0.5, the plain average of the two, while weighted-average recall equals the proportion of the majority class, so with 90% of the data in that class you get 0.9 (see the sketch below). Weighted averaging lets the larger classes dominate; if you actually care about the small classes, macro is usually the better choice, since every class gets the same weight. Which one is right depends on your application. On a roughly balanced dataset like digits there would be hardly any difference between the two, and the same averaging options exist for the other metrics as well.
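To make the 0.5 vs. 0.9 example concrete, here is a minimal sketch (not part of the original lecture code; the 90/10 class split and the always-majority predictions are made up for illustration):

```python
# Toy example: 90 samples of class 0, 10 of class 1, and a "classifier"
# that always predicts the majority class 0.
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

# macro: plain mean of per-class recalls -> (1.0 + 0.0) / 2 = 0.5
print(recall_score(y_true, y_pred, average="macro"))     # 0.5
# weighted: recalls weighted by class size -> 0.9 * 1.0 + 0.1 * 0.0 = 0.9
print(recall_score(y_true, y_pred, average="weighted"))  # 0.9
```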
--- class: spacious # Balanced Accuracy .smaller[ ```python balanced_accuracy_score(y_t, y_p) == recall_score(y_t, y_p, average='macro') ``` ] $$\text{balanced_accuracy} = \frac{1}{2}\left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right )$$ - Always 0.5 for chance predictions - Equal to accuracy for balanced datasets --- # Mammography Data .smallest[ .left-column[ ```python from sklearn.datasets import fetch_openml # mammography https://www.openml.org/d/310 data = fetch_openml('mammography', as_frame=True) X, y = data.data, data.target X.shape ``` (11183, 6) ```python y.value_counts() ``` ``` -1 10923 1 260 ``` ] .right-column[ .center[ ![:scale 100%](images/mammography_data.png) ] ] .reset-column[ ```python # make y boolean # this allows sklearn to determine the positive class more easily X_train, X_test, y_train, y_test = train_test_split(X, y == '1', random_state=0) ``` ] ] ??? I use this mammography data set, which is very imbalanced. This is a data set that has many samples, only six features and it's very imbalanced. The datasets are about mammography data, and whether there are calcium deposits in the breast. They are often mistaken for cancer, which is why it's good to detect them. Since its rigidly low dimensional, we can do a scatter plot. And we can see that these are much skewed distributions and there's really a lot more of one class than the other. --- .smallest[ .left-column[ ```python svc = make_pipeline(StandardScaler(), SVC(C=100, gamma=0.1)) svc.fit(X_train, y_train) svc.score(X_test, y_test) ``` ``` 0.986 ``` ] .right-column[ ```python rf = RandomForestClassifier() rf.fit(X_train, y_train) rf.score(X_test, y_test) ``` ``` 0.989 ``` ] ] -- .reset-column[ ] .smallest[ .left-column[ ``` precision recall f1-score support False 0.99 1.00 0.99 2732 True 0.81 0.53 0.64 64 accuracy 0.99 2796 macro avg 0.90 0.76 0.82 2796 weighted avg 0.98 0.99 0.99 2796 ``` ] .right-column[ ``` precision recall f1-score support False 0.99 1.00 0.99 2732 True 0.90 0.56 0.69 64 accuracy 0.99 2796 macro avg 0.94 0.78 0.84 2796 weighted avg 0.99 0.99 0.99 2796 ``` ] ] --- class:spacious # Goal setting! - What do I want? What do I care about? - Can I assign costs to the confusion matrix? - What guarantees do we want to give? ??? (precision, recall, or something else) (i.e. a false positive costs me 10 dollars; a false negative, 100 dollars) --- # Changing Thresholds .tiny-code[ ```python y_pred = rf.predict(X_test) print(classification_report(y_test, y_pred)) ``` ``` precision recall f1-score support False 0.99 1.00 0.99 2732 True 0.90 0.56 0.69 64 accuracy 0.99 2796 macro avg 0.94 0.78 0.84 2796 weighted avg 0.99 0.99 0.99 2796 ``` ```python y_pred = rf.predict_proba(X_test)[:, 1] > .30 print(classification_report(y_test, y_pred)) ``` ``` precision recall f1-score support False 0.99 0.99 0.99 2732 True 0.71 0.64 0.67 64 accuracy 0.99 2796 macro avg 0.85 0.82 0.83 2796 weighted avg 0.99 0.99 0.99 2796 ``` ] ??? --- class: compact # Precision-Recall curve .smaller[ ```python svc = make_pipeline(StandardScaler(), SVC(C=100, gamma=0.1)) svc.fit(X_train, y_train) plot_precision_recall_curve(svc, X_test, y_test, name='SVC') ``` ] .center[ ![:scale 65%](images/precision_recall_curve.png) ] ??? --- # Side note: Scikit-learn plotting API .smaller[ ```python prc = plot_precision_recall_curve(svc, X_test, y_test, name='SVC') # plot pops up here prc ``` ```
<sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x...>
```

```python
vars(prc)
```

```
{'precision': array([0.023, 0.023, 0.023, ..., 1.   , 1.   , 1.   ]),
 'recall': array([1.   , 0.984, 0.984, ..., 0.031, 0.016, 0.   ]),
 'average_precision': 0.6743452407177641,
 'estimator_name': 'Pipeline',
 'line_': <matplotlib.lines.Line2D at 0x...>,
 'ax_': <matplotlib.axes._subplots.AxesSubplot at 0x...>,
 'figure_': <matplotlib.figure.Figure at 0x...>
} ``` ```python # plot again without recomputing prc.plot() ``` ] --- class: compact # ROC Curve .center[ ## (Receiver Operating Characteristic) ] .smaller[ ```python plot_roc_curve(svc, X_test, y_test, name='SVC') ``` ] .center[ ![:scale 65%](images/roc_curve_svc.png) ] ??? --- .left-column[ # PR-Curve ![:scale 100%](images/precision_recall_curve.png) ] .right-column[ # ROC-Curve ![:scale 100%](images/roc_curve_svc.png) ] .reset-column[ - Share one axis (though transposed!?) - Interpolation is meaningful on ROC curve but not PR curve ] --- # Comparing RF and SVC .smallest[ ```python pr_svc = plot_precision_recall_curve(svc, X_test, y_test, name='SVC') # if we used computed before, we could just call pr_svc.plot() # using ax=plt.gca() will plot into the existing axes instead of creating new ones pr_rf = plot_precision_recall_curve(rf, X_test, y_test, ax=plt.gca()) ``` ] .center[ ![:scale 60%](images/rf_vs_svc.png) ] ??? We’re going to compare random forest and support vector machine. You can see they're sort of similar. But in some areas, they are different. For example, you can see that the random forest is little bit more stable for very high precision but in some places the SVM is better than random forest and vice versa. So which of the two is better classifier really depends on which area you're interested in. By comparing these two curves, again, sort of a very fine grain thing that you can do manually, if you want to do this for many models, and want to pick the best model. This is maybe not really feasible. So you might want to summarize this in a single number. So one number you could think of is, while I know I definitely want to have a recall of 90% so I'm going to put my threshold here and compare at that threshold. If you don't really have a particular goal yet, you can also consider all thresholds at once, and basically compute the area under this curve. It's not exactly the same but that's sort of what average precision does. --- # Average Precision .center[ ![:scale 100%](images/avg_precision.png) ] ??? Average precision basically does a step function integral of this thing. So for each possible threshold, you look at the precision at this threshold times the change in recall. So this is like fitting a step function under the curve and then computing the integral. This allows you to basically take all possible thresholds into account at once. Q: How do I adjust the precision-recall ratio for the random forest? I mean, it says predict_proba. So I can alter threshold that at different values if I want. For a support vector machine, which doesn't have predict_proba, I can also adjust the threshold the decision function. Average precision is a pretty good metric to basically compare models if there are imbalanced classes. Related to area under the precision-recall curve (with step interpolation) --- # Area Under ROC Curve (aka AUC) .center[ ![:scale 65%](images/roc_curve_svc_auc.png) ] - Always .5 for random/constant prediction ??? And so when you look at the area under the curve, which is called ROC AUC, it's always 0.5 for random or constant predictions. So it's very easy to say, how much better than random are you if you look at ROC AUC. I encourage you to read the paper in the link. One thing is that the ROC curve is usually much smoother and it makes more sense to interpolate it, where the precision-recall curve makes less sense to interpolate us. But the precision-recall curve can sometimes pick up on more fine-grained differences. 
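As a quick numerical check of the "always .5 for random/constant prediction" bullet, here is a synthetic sketch (the class sizes and random scores are made up, not taken from the lecture data): scores that carry no information about the label give an ROC AUC near 0.5 no matter how imbalanced the classes are, and a constant score gives exactly 0.5, since every threshold then produces the same point on the curve.

```python
# Synthetic check: uninformative scores give ROC AUC ~ 0.5 even with
# heavy class imbalance; a constant score gives exactly 0.5.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y = np.array([0] * 950 + [1] * 50)                 # heavily imbalanced labels
print(roc_auc_score(y, rng.uniform(size=len(y))))  # close to 0.5
print(roc_auc_score(y, np.zeros(len(y))))          # exactly 0.5
```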
Even though AUC stands for area under the curve, if you see it in literature always means the area under the receiver operating curve. So AUC always means ROC AUC. AUC is a ranking metric. So it takes all possible thresholds into account, which means that it's independent of the default thresholds. So again, you can have something with very low accuracy but still with perfect ROC AUC. --- # Keep in mind: ranking vs predictions ```python dec = svc.decision_function(X_test) np.all((dec > 0) == svc.predict(X_test)) ``` ``` True ``` ```python print(f1_score(y_test, dec > 0)) print(average_precision_score(y_test, dec)) ``` ``` 0.641 0.635 ``` -- ```python dec_new = dec - 10 print(f1_score(y_test, dec_new > 0)) print(average_precision_score(y_test, dec_new)) ``` ``` 0.0 0.635 ``` --- # Threshold and ranking metrics .left-column[ .tiny[ ```python from sklearn.metrics import f1_score f1_rf = f1_score(y_test, rf.predict(X_test)) print(f"f1_score of random forest: {f1_rf:.3f}") f1_svc = f1_score(y_test, svc.predict(X_test)) print(f"f1_score of svc: {f1_svc:.3f}") ``` ``` f1_score of random forest: 0.709 f1_score of svc: 0.715 ``` ```python from sklearn.metrics import balanced_accuracy_score ba_rf = balanced_accuracy_score(y_test, rf.predict(X_test)) print(f"Balanced accuracy of random forest: {ba_rf:.3f}") ba_svc = balanced_accuracy_score(y_test, svc.predict(X_test)) print(f"Balanced accuracy of svc: {ba_svc:.3f}") ``` ``` Balanced accuracy of random forest: 0.765 Balanced accuracy of svc: 0.764 ``` ] ] .right-column[ .tiny[ ```python from sklearn.metrics import average_precision_score ap_rf = average_precision_score( y_test, rf.predict_proba(X_test)[:, 1]) print(f"Average precision of random forest: {ap_rf:.3f}") ap_svc = average_precision_score( y_test, svc.decision_function(X_test)) print(f"Average precision of svc: {ap_svc:.3f}") ``` ``` Average precision of random forest: 0.682 Average precision of svc: 0.693 ``` ```python from sklearn.metrics import roc_auc_score auc_rf = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]) print(f"Area under ROC curve of random forest: {auc_rf:.3f}") auc_svc = roc_auc_score(y_test, svc.decision_function(X_test)) print(f"Area under ROC curve of svc: {auc_svc:.3f}") ``` ``` Area under ROC curve of random forest: 0.936 Area under ROC curve of svc: 0.817 ``` ]] ??? Previously, I just have sort of a simple comparison between f1 and average precision. I'm comparing these two models. So if I look at f1 score, it's going to compare basically these two points and the SVC is slightly is better. But F1 score only looks at the default threshold. If I want to look at the whole RC curve, I can use average precision. Even if I look at all possible thresholds, SVC is still better. Average precision is sort of a very sensitive metric that allows you to basically make good decisions even if the classes are very imbalanced and that also takes all possible thresholds into account. One thing that you should keep in mind if you use this is that this does not give you a particular threshold. For example, if you use the default threshold, your accuracy might be zero. This only looks at ranking. The curve and the area under the curve, don't depend on where the default threshold is. So if you predict everything, like for some weird reason, if you predict everything as the minority class, your accuracy will be really bad. 
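Here is a small synthetic sketch of exactly that situation (the decision values are made up, not from the mammography models): every positive is ranked above every negative, but all decision values fall below the default threshold of 0, so the hard predictions miss every positive while the ranking metrics are perfect.

```python
# Made-up decision values: positives rank above all negatives, but every
# value is below the default threshold of 0.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)
scores = np.concatenate([np.linspace(-20, -11, 95),   # negatives: lowest scores
                         np.linspace(-10, -6, 5)])    # positives: highest scores

y_pred = (scores > 0).astype(int)               # all zeros at threshold 0
print(recall_score(y_true, y_pred))             # 0.0
print(roc_auc_score(y_true, scores))            # 1.0: the ranking is perfect
print(average_precision_score(y_true, scores))  # 1.0
```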
But if they are ranked in the right order, so that all the positive classes are ranked higher than the negative classes, then you might still get an area under the curve that's very close to one. So this only looks at ranking. AP only considers ranking! --- class: spacious # Further reading on evaluation curves - [The Relationship Between Precision-Recall and ROC Curves](https://www.biostat.wisc.edu/~page/rocpr.pdf) - [Precision-Recall-Gain Curves: PR Analysis Done Right](https://papers.nips.cc/paper/5867-precision-recall-gain-curves-pr-analysis-done-right.pdf) --- # Summary of metrics for binary classification Threshold-based: - (balanced) accuracy - precision, recall, f1 Ranking: - average precision - ROC AUC ??? So let's briefly summarize the metrics for binary classification. So there are basically two approaches. One looks just at the predictions or look at soft predictions and take multiple thresholds into account. So the ones that I called threshold base, which is for single threshold, the ones that we talked about in our accuracy, which is just a fraction of correctly predicted, precision, recall, and f1. So the issue with accuracy is obviously that it can't distinguish between things that are quite different, for example, for imbalanced data sets, the classifier always predicts majority class often has like very high accuracy but tells you nothing. So in particular, for imbalanced classes, accuracy is a pretty bad measure. Precision and recall together are pretty good measures, though you always need to look at both numbers. One way to look at both numbers at once is the f1 score, though, using the harmonic mean is a little bit arbitrary. But still, only f1 score is a somewhat reasonable score if you only look at the actual predicted classes. For the ranking based losses, I think both average precision and ROC AUC are pretty good choices, ROC AUC, I like it because I know what 0.5 means, while for average precision, it's a little bit more tricky to see what the different orders of magnitude mean, or the different scales mean, but it can be more fine-grained measure. And so if you want to do cross-validation, my go-to is usually ROC AUC. In particular, for imbalance binary datasets, I would usually use ROC AUC. But then make sure that your threshold is sensible, but anything other than accuracy is okay. Just don’t look at only precision or only recall, if you do a grid search for just precision, you will probably get garbage results. add log loss? --- class: centre,middle # Multi-class classification ??? The next thing I want to talk about is multiclass classification. 
--- class:split-40 #Confusion Matrix .left-column[ .tiny-code[ ```python from sklearn.datasets import load_digits from sklearn.metrics import accuracy_score digits = load_digits() # data is between 0 and 16 X_train, X_test, y_train, y_test = train_test_split( digits.data / 16., digits.target, random_state=0) lr = LogisticRegression().fit(X_train, y_train) pred = lr.predict(X_test) print("Accuracy: {:.3f}".format(accuracy_score(y_test, pred))) plot_confusion_matrix(lr, X_test, y_test, cmap='gray_r') ``` ``` Accuracy: 0.964 ``` ![:scale 70%](images/confusion_matrix_digits.png) ] ] .right-column[ .tiny-code[ ```python print(classification_report(y_test, pred)) ``` ``` precision recall f1-score support 0 1.00 1.00 1.00 37 1 0.91 0.93 0.92 43 2 0.98 1.00 0.99 44 3 1.00 0.96 0.98 45 4 0.95 0.97 0.96 38 5 0.96 0.96 0.96 48 6 0.98 0.98 0.98 52 7 0.98 0.96 0.97 48 8 0.96 0.90 0.92 48 9 0.92 0.98 0.95 47 accuracy 0.96 450 macro avg 0.96 0.96 0.96 450 weighted avg 0.96 0.96 0.96 450 ``` ] ] ??? So again, for multiclass classification, you look at the confusion matrix. It's even more telling in a way than it was for binary classification but it's also pretty big. So here, I'm using the digital data set, which has 10 classes, and here's the confusion matrix. The diagonal is everything that's correct, the off-diagonal are mistakes. And you can see very fine-grained which classes were mistaken for which other classes. So this is very nice if you want to look in a very detailed view of the performance of the classifier. But again, you can't really use it for model selection, or not easily. You can also again, look at the classification report, which will give you precision-recall and F score for each for the classes, again, very, very fine grained. So you can see, for example, that Class 0, was predicted while Class 8 is the hardest. It might make more sense to look at in precision average classes, recall over the classes or f1 score for the classes. --- # Multi-class ROC AUC - Hand & Till, 2001, one vs one `$$ \frac{1}{c(c-1)}\sum_{j=1}^{c}\sum_{k \neq j}^{c} AUC(j,k)$$` - Provost & Domingo, 2000, one vs rest `$$ \frac{1}{c}\sum_{j=1}^{c}p(j) AUC(j,\text{rest}_j)$$` ??? You can do that similar for ROC AUC. It will soon be available in scikit-learn. There are basically two versions of doing multiclass ROC AUC. One is called Hand & Till while the other is called Provost & Domingo. And the first one is basically one versus one where you iterate over all the classes, and then you iterate over all the other classes, then you look at the AUC of one class versus the other class. And this one is basically one versus rest where you look at the AUC of one class versus all the other classes. H&T is not weighted whereas, on P&D, p(j) means number of samples in j, and P&D is weighted. You can also do weighted OvO or unweighted OvR. There are at least 4 different ways to do multi-class AUC. It’s not clear to me how they behave in practice. If you weighted then you count the samples are the same weight. If you do unweight, you give the different classes the same weight. So this is definitely a pretty good metric if you want a ranking metric for multi-class classification. FIXME unify notation with slide on averaging --- # Summary of metrics for multiclass classification Threshold-based: - accuracy - precision, recall, f1 (macro average, weighted) Ranking: - OVR ROC AUC - OVO ROC AUC ??? 
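As a sketch of how the two ranking summaries above can be computed in practice (this assumes a scikit-learn version whose roc_auc_score accepts the multi_class argument, 0.22 or later; the digits model mirrors the confusion-matrix slide):

```python
# Hedged sketch: multi-class ROC AUC on digits via roc_auc_score.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data / 16., digits.target, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
probs = lr.predict_proba(X_test)   # shape (n_samples, n_classes)

# 'ovo' with a macro average follows the one-vs-one (Hand & Till) idea;
# 'ovr' with a prevalence-weighted average follows the one-vs-rest
# (Provost & Domingo) idea.
print(roc_auc_score(y_test, probs, multi_class='ovo', average='macro'))
print(roc_auc_score(y_test, probs, multi_class='ovr', average='weighted'))
```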
The threshold-based metrics are essentially the same as in the binary case, except that looking at precision alone or recall alone now actually makes sense, as long as you average over the classes. So you can run a grid search on macro-average recall and get reasonable results. For ranking, you could in theory also build OvR or OvO versions of the precision-recall curve, but I haven't seen that done in practice or in any papers, so let's leave it aside. Both ROC AUC variants come in weighted and unweighted forms, and both are again good choices for imbalanced problems. The catch is that you now have to pick multiple thresholds, and it's not clear (at least to me) how to do that well. Say you optimize OvO ROC AUC and get a classifier with a high score; which thresholds do you use to actually make predictions? I don't have a good answer, because you would need something like n - 1 different thresholds for n classes to turn the scores into a decision. For that reason, for multi-class problems I would probably avoid the ranking metrics unless I really have to, since they are hard to turn into decisions, and instead use something like macro-average precision, macro-average recall or macro-average f1.

---

class: center, middle

# Metrics for regression models

???

---

class:spacious

# Built-in standard metrics

- $\text{R}^2$ : easy to understand scale
- MSE : easy to relate to the data
- Mean absolute error, median absolute error: more robust

???

The default regression metric in scikit-learn is R squared; the other obvious choice is mean squared error. The nice thing about R squared is the scale: the maximum is one, close to one is good, close to zero is bad, and negative is really bad. MSE is easier to relate to the data, since its scale is tied to the scale of the target. You can also use mean absolute error or median absolute error, which are more robust: squaring gives outliers a very large influence, so without squaring, outliers matter less, which is why people often prefer mean absolute error. Keep in mind that scorers in scikit-learn follow "higher is better", so error metrics are used with a neg_ prefix (neg_mean_squared_error, neg_mean_absolute_error, and so on), while metrics where higher is already better need no prefix. These are essentially the only regression metrics that ship with scikit-learn.

---

class:spacious

# Absolute vs Relative: MAPE

`$$ \text{MAPE} = \frac{100}{n} \sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right|$$`

.center[
![:scale 50%](images/mape.png)
]

???

Another metric worth knowing, which comes up in particular in forecasting, is MAPE, the mean absolute percentage error. It measures relative error: for each individual data point, the residual is divided by the true value. That means that when a data point is large, we care less about how precisely we predict it; in effect, we allow larger error bars around high values. The plot shows this against a single feature, LSTAT, just to get a reasonable x-axis: the true values are in blue and the predictions, made from all features, in orange. In the bottom right we under-predict by about 10, which gives a MAPE contribution of 36%. In the top left we under-predict by even more, but the MAPE contribution is smaller.
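A rough numeric sketch of that effect (the numbers are made up, not read off the plot): the same absolute error of 10 is a large relative error for a small true value and a small one for a large true value.

```python
# Hand-rolled MAPE on made-up values (newer scikit-learn versions also ship
# mean_absolute_percentage_error, which returns a fraction rather than a %).
import numpy as np

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return 100 * np.mean(np.abs((y_true - y_pred) / y_true))

print(mape([30], [20]))     # ~33%: error of 10 on a small true value
print(mape([300], [290]))   # ~3%:  same error of 10 on a large true value
```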
So even though the absolute error on the top is bigger than the absolute error on the bottom, the relative error in the top is much smaller than the relative error in the bottom. This is a very commonly used measure in forecasting. Be careful when to use it because sometimes it might be not super intuitive. In particular, if something is not defined, if anything is zero, and if something is very close to zero, you need to perfectly predict it otherwise, you have unbound error. So this is usually used for things that are greater than one. --- # Prediction plots .wide-left-column[ .smaller[ ```python from sklearn.linear_model import Ridge from sklearn.datasets import load_boston boston = load_boston() X_train, X_test, y_train, y_test = train_test_split( boston.data, boston.target) ridge = Ridge(normalize=True) ridge.fit(X_train, y_train) pred = ridge.predict(X_test) plt.plot(pred, y_test, 'o') ``` ]] .narrow-right-column[ ![:scale 100%](images/regression_boston.png) ] ??? I think one of the best debugging tools is prediction plots, where you plot the predicted versus the true outcome. This is on the Boston dataset here, I just do a ridge regression model. And so here, I plot the predicted outcome by the ridge regression on the test data set versus the true outcome. And so ideally, it should be the same, they it should be on the diagonal. But you can see that there's like, a couple things wrong with it. So for example, there are a couple of things where the true prediction is 50K, but I under predict severely. There’s a lot of underprediction going on. In particular, for low values, I’m predicting too high and for high values, I’m predicting too low. Looking at this you can see different trends and see if there are any trends or not. --- class:split-40 # Residual Plots .left-column[![:scale 50%](images/regression_boston.png)
![:scale 60%](images/regression_boston_2.png)] .right-column[ ![:scale 100%](images/regression_boston_3.png)] ??? A different way to look at this is to basically rotate that 45 degrees. So here we are looking at the true versus true minus predicted. This allows you even more easily to see if there are trends, and you can see that basically, we're over predicting for low values and we're under predicting for high values. You can also look at the histogram of the residuals and you can see that they’re mostly centered around zero, which is good so there's probably no systematic bias but you can see that there are some values where we underpredict by quite a lot. These are nice because you can do these plots no matter what dimension your data set is. --- # Target vs Feature ![:scale 100%](images/target_vs_feature.png) ??? If you want to look at in more detail, you can look at per feature plots. --- # Residual vs Feature ![:scale 100%](images/residual_vs_feature.png) ??? Here, I'm plotting y test minus the prediction against each of the features and see if there are particular aspects of the feature that are not captured. For example, you can see here that for a low LSTAT, we are underpredicting more than for a high LSTAT. That shows that there might be trends in these features that we're not capturing. The X-axis is features. For each data point in the test dataset, I look at the value of this feature, and I look at the residual, which is the true value minus the prediction. Ideally, this should all be sort of horizontal because the errors that the model makes should be independent of all of the features. If the error that the model makes is not independent of the feature that means there's some aspect of the feature that the model didn't capture. This is basically the dependence of the target on the feature that is not captured by the prediction. And you can meditate that and then at some point it'll make sense. These are all more and more fine grain plots to understand the errors that you're making. --- # Picking metrics - Accuracy rarely what you want - Problems are rarely balanced - Find the right criterion for the task - OR pick a substitude, but at least think about it - Emphasis on recall or precision? - Which classes are the important ones? ??? In summary, for both multiclass and binary, you should really think about what your metrics are. And unless your problem is balanced, and even maybe then don't use accuracy. Accuracy is really bad. Problems are rarely balanced, in a real world and usually there heavily unbalanced unless someone artificially balances them. So it's really important to think about what is the right criterion for a particular task, and then you can optimize that criterion. So as I said, that can be like having costs associated with the confusion matrix, emphasizing precision or recall, setting specific goals such as have recall of at least X percent, or have precision of at least X percent and for multi-class deciding whether are all the classes the same importance, or all the samples the same importance, what are the important classes, even for binary, which one is the positive class? So whenever you talk about precision and recall, it’s important to think about which one is the positive class and how will changing the positive class change these two values? So and then finally, obviously, am I going to use the default threshold? Why am I going to use the default threshold? Do I use a ranking based metric? 
Or is there a reason why I would want to use a ranking based metric? --- # Using metrics in cross-validation .smaller[ ```python cancer = load_breast_cancer() X, y = cancer.data, cancer.target # default scoring for classification is accuracy rf = RandomForestClassifier(random_state=0) print("default scoring ", cross_val_score(rf, X, y)) # providing scoring="accuracy" doesn't change the results explicit_accuracy = cross_val_score(rf, X, y, scoring="accuracy") print("explicit accuracy scoring ", explicit_accuracy) ap = cross_val_score(rf, X, y, scoring="average_precision") print("average precision", ap) ``` ``` default scoring [0.93 0.947 0.991 0.974 0.973] explicit accuracy scoring [0.93 0.947 0.991 0.974 0.973] average precision [0.992 0.973 0.999 0.995 0.999] ``` ] ??? So once you pick them, it's very easy to use any of these metrics in scikit-learn. So they're all functions inside the metrics module, but usually, you're going to want to use them in for cross-validation, or grid search. And then it's even easier. There's argument called scoring. It takes a string or a callable. And so here, if you give it a string, it'll just do the right thing automatically. By default its accuracy. So if you don't do anything, it's the same as if you provide accuracy. But you can also set ROC AUC, and it's going to use ROC AUC. So the thing is ROC AUC actually needs probabilities right or a decision function, so behind the scenes, it does the right thing and gets out the decision function for the positive class to compute the ROC AUC correctly. Using better a metric than accuracy is as simple as saying scoring equal to something. And it’s exactly the same as in grid search CV. So if you want to do a grid search CV or randomized search CV, you can set scoring equal to ROC AUC or average precision or recall macro. Same for GridSearchCV
Will make GridSearchCV.score use your metric!

---

# Multiple Metrics

```python
from sklearn.model_selection import cross_validate
res = cross_validate(RandomForestClassifier(), X, y,
                     scoring=["accuracy", "average_precision", "recall_macro"],
                     return_train_score=True, cv=5)
pd.DataFrame(res)
```

.tiny[
|   | fit_time | score_time | test_accuracy | train_accuracy | test_average_precision | train_average_precision | test_recall_macro | train_recall_macro |
|---|----------|------------|---------------|----------------|------------------------|--------------------------|-------------------|--------------------|
| 0 | 0.171312 | 0.016973 | 0.921053 | 1.0 | 0.992115 | 1.0 | 0.918277 | 1.0 |
| 1 | 0.133090 | 0.017563 | 0.938596 | 1.0 | 0.972293 | 1.0 | 0.923190 | 1.0 |
| 2 | 0.110232 | 0.015200 | 0.982456 | 1.0 | 0.999098 | 1.0 | 0.981151 | 1.0 |
| 3 | 0.110211 | 0.015076 | 0.956140 | 1.0 | 0.992924 | 1.0 | 0.950397 | 1.0 |
| 4 | 0.124239 | 0.019509 | 0.973451 | 1.0 | 0.998858 | 1.0 | 0.974011 | 1.0 |
] --- # Built-in scoring ```python from sklearn.metrics import SCORERS print("\n".join(sorted(SCORERS.keys()))) ``` .smaller[ ``` accuracy adjusted_mutual_info_score adjusted_rand_score average_precision balanced_accuracy completeness_score explained_variance f1 f1_macro f1_micro f1_samples f1_weighted fowlkes_mallows_score homogeneity_score jaccard jaccard_macro jaccard_micro jaccard_samples jaccard_weighted max_error mutual_info_score neg_brier_score neg_log_loss neg_mean_absolute_error neg_mean_gamma_deviance neg_mean_poisson_deviance neg_mean_squared_error neg_mean_squared_log_error neg_median_absolute_error neg_root_mean_squared_error normalized_mutual_info_score precision precision_macro precision_micro precision_samples precision_weighted r2 recall recall_macro recall_micro recall_samples recall_weighted roc_auc roc_auc_ovo roc_auc_ovo_weighted roc_auc_ovr roc_auc_ovr_weighted v_measure_score ``` ] ??? Here's a list of all the built-in scores, you can look at the documentation, there's really a lot of them. So these are for classification or regression or clustering. If you look in the documentation, you'll actually see which ones are for classification, which one is binary only, which ones are multiclass, which ones are regression and which ones are for clustering. The thing in scikit-learn, whatever you use for scoring greater needs to be better the way it's written. So you can’t use mean_squared_error for doing a grid search, because the grid search assumes greater is better so you have to use negative mean squared error. So for regression or log loss, you use mean squared error. For classification, you need to use the neg log loss one. --- # The Scorer interface (vs the metrics interface) ```python # metric function interface: y_probs = rf.predict_proba(X_test) ap_rf = average_precision_score(y_test, y_probs[:, 1]) y_pred = rf.predict(X_test) ba_rf = balanced_accuracy_score(y_test, y_pred) # scorer interface: from sklearn.metrics import get_scorer ap_scorer = get_scorer('average_precision') ap_rf = ap_scorer(rf, X_test, y_test) ab_scorer = get_scorer('balanced_accuracy') ba_rf = ab_scorer(rf, X_test, y_test) ``` --- class:spacious # Providing you your own callable - Takes estimator, X, y - Returns score – higher is better (always!) ```python def accuracy_scoring(est, X, y): return (est.predict(X) == y).mean() ``` ??? You can also provide your own metric, for example, if you want to do multiclass ROC AUC, you can provide a callable as scoring instead of a string. For any of the built-in ones, you can just provide a string. In this case, I’ve re-implemented accuracy. And the arguments for this needs to be an estimator, x - which is the test data and y - which is the test data true labels or whatever data you want to score. To re-implement accuracy, you have to call predict on the test data, check whether it's equal to y, and then compute the mean. This is actually a very powerful framework so you can do a lot of things with it. --- class: compact # You can access the model! 
.tiny-code[ ```python def nonzero(est, X, y): return np.sum(est.coef_ != 0) param_grid = {'alpha': np.logspace(-5, 0, 10)} # scoring can be string, single scorer, list or dict grid = GridSearchCV(Lasso(), param_grid, return_train_score=True, scoring={'r2': 'r2', 'num_nonzero': nonzero}, refit='r2') grid.fit(X_train, y_train) a = results.plot('param_alpha', 'mean_train_r2') b = results.plot('param_alpha', 'mean_test_r2', ax=plt.gca()) ax2 = plt.gca().twinx() results.plot('param_alpha', 'mean_train_num_nonzero', ax=ax2, c='k') ``` ] .center[ ![:scale 50%](images/lasso_alpha_triazine.png) ] ??? The main thing that I want to illustrate here is that you can have access to the model. So you can write a function for grid search that can do anything with the model it wants. So you can do deep introspection into the model and use that for your model selection. And then I can set up set scoring and then my callable and then I can run grid search as I usually would. Only now it's select a bigger C because bigger C give fewer support vectors. Question is why is the default accuracy? I guess one reason is, we can't change it anymore. Because everybody assumes it is and if it would be anything else, people would be very confused because that's sort of the most natural metric that people think about, like the number of correctly classified samples. Otherwise, in the first lecture, I would have to explain ROC AUC to you if it was ROC AUC and it would be harder to understand for people. Question is if I don't have probabilities can I compute ROC and yes, it does not depend probabilities at all. It's just I go through all possible thresholds. It's really just a ranking. I go through all possible thresholds and look at if I've used this threshold and look at what's the precision? What's the recall? Or what's the false positive rate, what's the true positive rate, and you only need a ranking. Basically, I only need to be able to sort the points and then for each possible threshold, I need to call the precision and recall. And so as long as it can sort the points, it doesn't matter. I can use any monotonous transformation of the decision function and it will still be the same because it only considers the ranking. --- class: center, middle # Questions ?